In this series of three we’ll go into Experiment Driven Development and A/B Testing. EDD is fact-based development: based on evidence, not intuition.

We've previously explored the foundations of Experiment Driven Design and A/B Testing. Today we'll dig into A/B Testing with Python because analysis can be easily automated with existing open source python libraries. In this post we will explore their usage with an example.

To orient the reader, we will state a few definitions to anchor the example:

- Control Group: current user interface.
- Experiment Group: rearranged point annotation button.
- H
_{0}: The mean of the annotation time for the control group is the same as the mean of the annotation time for the experiment group; there is no effect from rearranging the point annotation button. - H
_{1}: The mean of the annotation time for the control group is different from the mean of the annotation time for the experiment group; there is an observable effect from rearranging the point annotation button.

Let us assume that we ran an A/B Test feature experiment for two weeks. The UI modifications consisted of rearranging a button used in the process of drawing polygons around objects

Let us assume these were recorded annotation times per image in minutes, for users of each variant (it can be represented as a python list):

A. Control group (original arrangement):`variant_a = `

``

B. Experiment group (rearranged button):`variant_b = `

``

**1.** Run the t-test from the scipy.stats module of scipy (a mathematical, scientific and engineering library).`import scipy.stats as statst, p = stats.ttest_ind(variant_a, variant_b, equal_var=False)`

``

**2.** Calculate the degrees of freedom according to Welch’s t-test definition which is the one implemented in stats.ttest_ind# For illustrative details see Wikipedia

`s1 = np.std(variant_a)s2 = np.std(variant_b)n1 = len(variant_a)n2 = len(variant_b)df = np.floor(((((s1 ** 2) / n1) + ((s2 ** 2) / n2)) ** 2) / (((s1 ** 4) / ((n1 ** 2) * (n1 - 1))) + ((s2 ** 4) / ((n2 ** 2) * (n2 - 1)))))`

``

**3.** Now, using the same scipy.stats library, get the t-critical value for 95% or an alpha of 0.05 (1 - confidence level) from the **t** distribution’s **ppf **(percent point function) function and evaluate the t statistic from the previous step. If it falls in the range then H_{0 }cannot be rejected, if it is outside, then we can reject H_{0} in favor of H_{1}:`alpha = 0.05t_critical_value = stats.t.ppf(1 - (alpha/2), df)null_hypothesis = bool(t_critical_value >= t_value >= -t_critical_value)`

``

**4.** The **confidence interval** of variant_b (experiment) will help us visualize the difference between the two variants. If the mean of the control group doesn’t fall inside of this interval then the means of the two groups are significantly apart from each other, which suggests that the results are statistically significant.`s = np.std(variant_b)x = np.mean(variant_b)n = len(variant_b)rho = (t_critical_value * s) / np.sqrt(n)conf_int = x - rho, x + rho`

**5. Statistical power** is the probability that the test correctly rejects the null hypothesis, in other words, the probability of a true positive result. This is only useful when the null hypothesis is rejected. A low value of power could be an indication that the sample size is not big enough yet to validate the results. To calculate the statistical power we use the class *TTestIndPower* from the module *statsmodels.stats.power* of the statsmodel library.from statsmodels.stats.power import TTestIndPower# Effect size based on *Cohen’s d* formula.

`x1 = np.mean(variant_a)x2 = np.mean(variant_b)s1 = np.std(variant_a)s2 = np.std(variant_b)n1 = len(variant_a)n2 = len(variant_b)s = np.sqrt((((n1 - 1) * (s1 ** 2)) + ((n2 - 1) * (s2 ** 2))) / (n1 + n2 - 2))effect = np.abs((x1 - x2) / s)power = TTestIndPower().power(effect, nobs1=n1, ratio=n2 / n1, df=(n1 + n2 - 2), alpha=alpha)`

**6.** Plot the sample distributions with confidence intervals as a visual aid using *matplotlib* library.`import matplotlib.pyplot as plt# Controlfig, ax = plt.subplots(figsize=(12,6))xA = np.linspace(40, x1 + 3*s1, 100)yA = stats.norm(loc=x1, scale=s1).pdf(xA)ax.plot(xA, yA, c='red', label='Variant A Distribution')ax.axvline(x=x1, c='red', alpha=0.5, linestyle='--', label='Variant A')# ExperimentalxB = np.linspace(40, x2 + 3*s2, 100)yB = stats.norm(loc=x2, scale=s2).pdf(xB) ax.plot(xB, yB, c='blue', label='Variant B Distribution')ax.axvline(x=x2, c='blue', alpha=0.5, linestyle='--', label='Variant B')# Confidence intervalax.axvline(conf_int, c='green', linestyle='--', alpha=0.5)ax.axvline(conf_int, c='green', linestyle='--', alpha=0.5, label='Confidence Interval')plt.xlabel('Annotation Time')plt.ylabel('Percent of Tasks per Annotation Time')plt.title('Annotation Time Distributions')plt.legend()plt.show()`

Author

Rafael Alfaro

RESOURCES