
Part 3: A/B Testing with Python

We’ve previously explored the foundations of Experiment Driven Design and A/B Testing. Today we’ll dig into A/B Testing with Python, since the analysis can be easily automated with existing open source Python libraries. In this post we will explore their usage with an example.

To orient the reader, we will state a few definitions to anchor the example:

Control Group: current user interface.
Experiment Group: rearranged point annotation button.

H0: The mean of the annotation time for the control group is the same as the mean of the annotation time for the experiment group; there is no effect from rearranging the point annotation button.

H1: The mean of the annotation time for the control group is different from the mean of the annotation time for the experiment group; there is an observable effect from rearranging the point annotation button.

Let us assume that we ran an A/B Test feature experiment for two weeks. The UI modifications consisted of rearranging a button used in the process of drawing polygons around objects. Let us also assume that the following annotation times per image, in minutes, were recorded for users of each variant (each group can be represented as a Python list):

A. Control group (original arrangement):

variant_a = [150, 195, 120, 160, 97, 20, 100, 121, 250, 300, 80, 75, 100, 196, 147, 120, 100, 190, 57, 100, 157, 186, 91, 190, 210, 222, 192, 243, 99, 151]

B. Experiment group (rearranged button):

variant_b = [120, 110, 96, 99, 87, 55, 43, 83, 200, 100, 125, 140, 75, 91, 141, 121, 250, 35, 94, 65, 85, 67, 93, 161, 35, 34, 111, 124, 85, 103]
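
Before any formal testing, a quick look at the sample means already hints at a difference between the groups; the values below follow directly from the lists above:

print(sum(variant_a) / len(variant_a))  # roughly 147.3 minutes
print(sum(variant_b) / len(variant_b))  # roughly 100.9 minutes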

1. Run the t-test from the scipy.stats module of scipy (a mathematical, scientific and engineering library). Passing equal_var=False selects Welch’s t-test, which does not assume equal population variances.

import scipy.stats as stats

# Returns the t-statistic and the two-sided p-value
t_value, p_value = stats.ttest_ind(variant_a, variant_b, equal_var=False)
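
Printing the statistic and p-value gives a first read on the result; the exact values follow from the data above:

print(f"t = {t_value:.3f}, p = {p_value:.3f}")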

2. Calculate the degrees of freedom according to Welch’s t-test definition, which is the one implemented in stats.ttest_ind when equal_var=False:

import numpy as np

# Welch–Satterthwaite formula; for the derivation see Wikipedia
s1 = np.std(variant_a, ddof=1)  # sample standard deviations (ddof=1), matching scipy
s2 = np.std(variant_b, ddof=1)
n1 = len(variant_a)
n2 = len(variant_b)

df = np.floor(((((s1 ** 2) / n1) + ((s2 ** 2) / n2)) ** 2) /
               (((s1 ** 4) / ((n1 ** 2) * (n1 - 1))) + ((s2 ** 4) / ((n2 ** 2) * (n2 - 1)))))
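
As a sanity check, newer SciPy releases (to our knowledge, 1.11 and later; worth verifying against your installed version) expose the Welch degrees of freedom directly on the result object returned by ttest_ind:

# Illustrative check only: res.df holds the (un-floored) Welch degrees of freedom
res = stats.ttest_ind(variant_a, variant_b, equal_var=False)
print(res.df, df)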

3. Now, using the same scipy.stats library, get the t-critical value for a 95% confidence level, i.e. an alpha of 0.05 (1 - confidence level), from the t distribution’s ppf (percent point function), and evaluate the t-statistic from step 1. If it falls in the range [-t_critical, t_critical] then H0 cannot be rejected; if it falls outside, then we can reject H0 in favor of H1:

alpha = 0.05
t_critical_value = stats.t.ppf(1 - (alpha / 2), df)  # two-sided, so alpha is split in half
null_hypothesis = bool(t_critical_value >= t_value >= -t_critical_value)  # True: cannot reject H0
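
Equivalently, the p-value from step 1 can be compared directly against alpha; for a two-sided test both formulations lead to the same decision:

null_hypothesis = bool(p_value > alpha)  # True: H0 cannot be rejected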

4. The confidence interval of variant_b (experiment) will help us visualize the difference between the two variants. If the mean of the control group doesn’t fall inside this interval, then the means of the two groups are significantly apart from each other, which suggests that the results are statistically significant.

s = np.std(variant_b, ddof=1)  # sample standard deviation
x = np.mean(variant_b)
n = len(variant_b)
rho = (t_critical_value * s) / np.sqrt(n)  # margin of error
conf_int = x - rho, x + rho
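
To make that comparison concrete, here is a small illustrative check (the control_mean name is ours) of whether the control group’s mean lands inside the interval:

control_mean = np.mean(variant_a)
print(conf_int)
print(conf_int[0] <= control_mean <= conf_int[1])  # False suggests a significant difference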
5. Statistical power is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true; in other words, the probability of a true positive result. A low value of power could be an indication that the sample size is not yet big enough to validate the results. To calculate the statistical power we use the class TTestIndPower from the module statsmodels.stats.power of the statsmodels library.

from statsmodels.stats.power import TTestIndPower

# Effect size based on Cohen’s d formula.

x1 = np.mean(variant_a)
x2 = np.mean(variant_b)
s1 = np.std(variant_a, ddof=1)  # sample standard deviations, as in step 2
s2 = np.std(variant_b, ddof=1)
n1 = len(variant_a)
n2 = len(variant_b)

s = np.sqrt((((n1 - 1) * (s1 ** 2)) + ((n2 - 1) * (s2 ** 2))) / (n1 + n2 - 2))
effect = np.abs((x1 - x2) / s)

power = TTestIndPower().power(effect, nobs1=n1, ratio=n2 / n1, df=(n1 + n2 - 2), alpha=alpha)
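
If the resulting power is low, the same class can be used in reverse: solve_power estimates how many observations per group are needed to reach a target power. A minimal sketch, assuming the conventional 0.8 target:

# Solve for nobs1 (the control group size) given effect size, alpha and target power
required_n = TTestIndPower().solve_power(effect_size=effect, power=0.8, alpha=alpha, ratio=1.0)
print(np.ceil(required_n))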

6. Plot the sample distributions and the confidence interval as a visual aid using the matplotlib library.

import matplotlib.pyplot as plt

# Control: approximate each sample with a normal distribution for visualization
fig, ax = plt.subplots(figsize=(12, 6))
xA = np.linspace(40, x1 + 3 * s1, 100)
yA = stats.norm(loc=x1, scale=s1).pdf(xA)
ax.plot(xA, yA, c='red', label='Variant A Distribution')
ax.axvline(x=x1, c='red', alpha=0.5, linestyle='--', label='Variant A Mean')

# Experiment
xB = np.linspace(40, x2 + 3 * s2, 100)
yB = stats.norm(loc=x2, scale=s2).pdf(xB)
ax.plot(xB, yB, c='blue', label='Variant B Distribution')
ax.axvline(x=x2, c='blue', alpha=0.5, linestyle='--', label='Variant B Mean')

# Confidence interval of variant B
ax.axvline(conf_int[0], c='green', linestyle='--', alpha=0.5)
ax.axvline(conf_int[1], c='green', linestyle='--', alpha=0.5, label='Confidence Interval')

plt.xlabel('Annotation Time (minutes)')
plt.ylabel('Probability Density')
plt.title('Annotation Time Distributions')
plt.legend()
plt.show()

