
Part 3: A/B Testing with Python

We’ve previously explored the foundations of Experiment Driven Design and A/B Testing. Today we’ll dig into A/B Testing with Python, since the analysis can be easily automated with existing open source Python libraries. In this post we will explore their usage with an example.

To orient the reader, we will state a few definitions to anchor the example:

Control Group: current user interface.
Experiment Group: rearranged point annotation button.

H0: The mean of the annotation time for the control group is the same as the mean of the annotation time for the experiment group; there is no effect from rearranging the point annotation button.

H1: The mean of the annotation time for the control group is different from the mean of the annotation time for the experiment group; there is an observable effect from rearranging the point annotation button.

Let us assume that we ran an A/B Test feature experiment for two weeks. The UI modifications consisted of rearranging a button used in the process of drawing polygons around objects. Let us also assume that the following annotation times per image, in minutes, were recorded for users of each variant (each group can be represented as a Python list):

A. Control group (original arrangement):

variant_a = [150, 195, 120, 160, 97, 20, 100, 121, 250, 300, 80, 75, 100, 196, 147, 120, 100, 190, 57, 100, 157, 186, 91, 190, 210, 222, 192, 243, 99, 151]

B. Experiment group (rearranged button):

variant_b = [120, 110, 96, 99, 87, 55, 43, 83, 200, 100, 125, 140, 75, 91, 141, 121, 250, 35, 94, 65, 85, 67, 93, 161, 35, 34, 111, 124, 85, 103]
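
Before any formal testing, a quick look at the sample means already hints at a difference between the groups; the values below follow directly from the lists above:

print(sum(variant_a) / len(variant_a))  # roughly 147.3 minutes
print(sum(variant_b) / len(variant_b))  # roughly 100.9 minutes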

1. Run the t-test from the scipy.stats module of scipy (a mathematical, scientific and engineering library). Passing equal_var=False selects Welch’s t-test, which does not assume equal population variances.

import scipy.stats as stats

# Returns the t-statistic and the two-sided p-value
t_value, p_value = stats.ttest_ind(variant_a, variant_b, equal_var=False)
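
Printing the statistic and p-value gives a first read on the result; the exact values follow from the data above:

print(f"t = {t_value:.3f}, p = {p_value:.3f}")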

2. Calculate the degrees of freedom according to Welch’s t-test definition, which is the one implemented in stats.ttest_ind when equal_var=False:

import numpy as np

# Welch–Satterthwaite formula; for the derivation see Wikipedia
s1 = np.std(variant_a, ddof=1)  # sample standard deviations (ddof=1), matching scipy
s2 = np.std(variant_b, ddof=1)
n1 = len(variant_a)
n2 = len(variant_b)

df = np.floor(((((s1 ** 2) / n1) + ((s2 ** 2) / n2)) ** 2) /
               (((s1 ** 4) / ((n1 ** 2) * (n1 - 1))) + ((s2 ** 4) / ((n2 ** 2) * (n2 - 1)))))
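
As a sanity check, newer SciPy releases (to our knowledge, 1.11 and later; worth verifying against your installed version) expose the Welch degrees of freedom directly on the result object returned by ttest_ind:

# Illustrative check only: res.df holds the (un-floored) Welch degrees of freedom
res = stats.ttest_ind(variant_a, variant_b, equal_var=False)
print(res.df, df)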

3. Now, using the same scipy.stats library, get the t-critical value for a 95% confidence level, i.e. an alpha of 0.05 (1 - confidence level), from the t distribution’s ppf (percent point function), and evaluate the t-statistic from step 1. If it falls in the range [-t_critical, t_critical] then H0 cannot be rejected; if it falls outside, then we can reject H0 in favor of H1:

alpha = 0.05
t_critical_value = stats.t.ppf(1 - (alpha / 2), df)  # two-sided, so alpha is split in half
null_hypothesis = bool(t_critical_value >= t_value >= -t_critical_value)  # True: cannot reject H0
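
Equivalently, the p-value from step 1 can be compared directly against alpha; for a two-sided test both formulations lead to the same decision:

null_hypothesis = bool(p_value > alpha)  # True: H0 cannot be rejected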

4. The confidence interval of variant_b (experiment) will help us visualize the difference between the two variants. If the mean of the control group doesn’t fall inside this interval, then the means of the two groups are significantly apart from each other, which suggests that the results are statistically significant.

s = np.std(variant_b, ddof=1)  # sample standard deviation
x = np.mean(variant_b)
n = len(variant_b)
rho = (t_critical_value * s) / np.sqrt(n)  # margin of error
conf_int = x - rho, x + rho
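
To make that comparison concrete, here is a small illustrative check (the control_mean name is ours) of whether the control group’s mean lands inside the interval:

control_mean = np.mean(variant_a)
print(conf_int)
print(conf_int[0] <= control_mean <= conf_int[1])  # False suggests a significant difference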
5. Statistical power is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true; in other words, the probability of a true positive result. A low value of power could be an indication that the sample size is not yet big enough to validate the results. To calculate the statistical power we use the class TTestIndPower from the module statsmodels.stats.power of the statsmodels library.

from statsmodels.stats.power import TTestIndPower

# Effect size based on Cohen’s d formula.

x1 = np.mean(variant_a)
x2 = np.mean(variant_b)
s1 = np.std(variant_a, ddof=1)  # sample standard deviations, as in step 2
s2 = np.std(variant_b, ddof=1)
n1 = len(variant_a)
n2 = len(variant_b)

s = np.sqrt((((n1 - 1) * (s1 ** 2)) + ((n2 - 1) * (s2 ** 2))) / (n1 + n2 - 2))
effect = np.abs((x1 - x2) / s)

power = TTestIndPower().power(effect, nobs1=n1, ratio=n2 / n1, df=(n1 + n2 - 2), alpha=alpha)
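
If the resulting power is low, the same class can be used in reverse: solve_power estimates how many observations per group are needed to reach a target power. A minimal sketch, assuming the conventional 0.8 target:

# Solve for nobs1 (the control group size) given effect size, alpha and target power
required_n = TTestIndPower().solve_power(effect_size=effect, power=0.8, alpha=alpha, ratio=1.0)
print(np.ceil(required_n))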

6. Plot the sample distributions and the confidence interval as a visual aid using the matplotlib library.

import matplotlib.pyplot as plt

# Control: approximate each sample with a normal distribution for visualization
fig, ax = plt.subplots(figsize=(12, 6))
xA = np.linspace(40, x1 + 3 * s1, 100)
yA = stats.norm(loc=x1, scale=s1).pdf(xA)
ax.plot(xA, yA, c='red', label='Variant A Distribution')
ax.axvline(x=x1, c='red', alpha=0.5, linestyle='--', label='Variant A Mean')

# Experiment
xB = np.linspace(40, x2 + 3 * s2, 100)
yB = stats.norm(loc=x2, scale=s2).pdf(xB)
ax.plot(xB, yB, c='blue', label='Variant B Distribution')
ax.axvline(x=x2, c='blue', alpha=0.5, linestyle='--', label='Variant B Mean')

# Confidence interval of variant B
ax.axvline(conf_int[0], c='green', linestyle='--', alpha=0.5)
ax.axvline(conf_int[1], c='green', linestyle='--', alpha=0.5, label='Confidence Interval')

plt.xlabel('Annotation Time (minutes)')
plt.ylabel('Probability Density')
plt.title('Annotation Time Distributions')
plt.legend()
plt.show()

