After last week’s intro to Experiment Driven Development at Sama, we’ll go further into A/B Testing today. A/B Testing is a randomized experiment method that compares how two populations behave in a controlled environment and determines whether the difference in a defined set of target metrics is statistically significant, that is, whether the experiment really yields better results than the alternative.

We say that a baseline sample (variation A), which normally refers to some existing system, is compared against an experimental treatment (variation B). In software development, samples drawn from the two populations are used to analyze the metrics associated with the usage of features. We generally want to make sure that any new feature has a positive impact: improved usability, less time to finish a process, etc. We should aim to have data to back up our claims that a feature has benefits. Otherwise, it would be fair to question why we want to deploy it.

One way of evaluating an A/B experiment is a t-test, which works well when we expect the data to be approximately normally distributed and which does not require us to know the population standard deviation. Population distributions are not always normal, of course, and the t-test can be replaced by another appropriate hypothesis test, depending on the distribution of the data (e.g. Kolmogorov-Smirnov). In our case, we want to determine whether the means of the two data samples are significantly different from each other at a given confidence level (90%, 95% and 99% are commonly used values).

The framework of the experiment revolves around the definition of a hypothesis for an A/B Test as follows:

H_{0}: The mean of the baseline metric is the same as the mean of the experiment metric; there is no effect from the treatment (variation) of the experiment, thus the two samples come from the same population.

H_{1}: The mean of the baseline metric is different from the mean of the experiment metric; there is an observable effect from the treatment (variation) of the experiment and the two samples come from different populations.
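Written compactly, and assuming we denote the unknown population means of the metric under the baseline and the treatment by \mu_A and \mu_B (symbols introduced here for illustration only), the pair of hypotheses describes a two-sided test:

$$
H_{0}: \mu_A = \mu_B \qquad\qquad H_{1}: \mu_A \neq \mu_B
$$

The test is two-sided because H_{1} only states that the means differ, not in which direction.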

We then define the timeline for the feature/process experiment, run the experiment and collect the observations from the samples of the two variants. We should have an idea of how long the experiment needs to run to collect enough observations (we treat that as out of scope for this piece, however).

In order to perform the t-test evaluation, we need, for each sample, the size (N), the mean (X) and the standard deviation (s), from which we calculate the statistic t and the degrees of freedom (v). We use Welch’s t-test for independent samples, which allows unequal variances and unequal sample sizes.
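For reference, the standard form of Welch’s statistic and of the Welch-Satterthwaite approximation for the degrees of freedom can be written as below. The subscripts A and B for the baseline and the treatment are a labelling choice made here, and the sign of t simply depends on which mean is subtracted from which:

$$
t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\dfrac{s_A^2}{N_A} + \dfrac{s_B^2}{N_B}}}
\qquad
v = \frac{\left(\dfrac{s_A^2}{N_A} + \dfrac{s_B^2}{N_B}\right)^2}{\dfrac{\left(s_A^2 / N_A\right)^2}{N_A - 1} + \dfrac{\left(s_B^2 / N_B\right)^2}{N_B - 1}}
$$

Note that v is generally not an integer; when using a printed table it is common to round it down to the nearest tabulated value, which is the conservative choice.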

Once the statistic and degrees of freedom are calculated, along with the significance level α (1 - confidence level), we can evaluate our hypothesis against the t-distribution table to determine whether or not the populations are different.

If the absolute value of our **t** statistic is greater than the critical value from the table, given the degrees of freedom and the significance level, we can reject the null hypothesis, as there is enough evidence to conclude that the samples come from populations with different means. Otherwise, we fail to reject the null hypothesis. Rejecting it would mean our new feature is significantly different from the control and we can release it, provided the direction of the difference is in line with our objectives (e.g. a lower mean is good when we want to lower durations).
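To make the procedure concrete, here is a minimal sketch in plain Python (standard library only). The function names `summarize`, `welch_t` and `reject_null` and the sample durations are made up for illustration. The critical value 2.571 is the usual two-tailed table entry for α = 0.05 and 5 degrees of freedom, used here on the assumption that v is rounded down to the nearest tabulated value. The decision compares the absolute value of t against that critical value.

```python
import math
import statistics


def summarize(sample):
    """Return (N, mean, sample standard deviation) for one variant."""
    return len(sample), statistics.mean(sample), statistics.stdev(sample)


def welch_t(n_a, mean_a, sd_a, n_b, mean_b, sd_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    se2_a = sd_a ** 2 / n_a  # squared standard error of the mean, variant A
    se2_b = sd_b ** 2 / n_b  # squared standard error of the mean, variant B
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    v = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, v


def reject_null(t, critical_value):
    """Two-sided decision: reject H0 when |t| exceeds the table value."""
    return abs(t) > critical_value


# Made-up process durations (minutes), for illustration only.
baseline = [12, 14, 11, 13, 15]   # variation A
treatment = [10, 9, 11, 10, 10]   # variation B

t_stat, dof = welch_t(*summarize(baseline), *summarize(treatment))

# Two-tailed critical value at alpha = 0.05 for 5 degrees of freedom,
# read from a standard t-table (v rounded down, which is conservative).
CRITICAL_VALUE = 2.571

decision = reject_null(t_stat, CRITICAL_VALUE)
print(f"t = {t_stat:.3f}, v = {dof:.2f}, reject H0: {decision}")
```

With these illustrative numbers, t is about 3.87 and v about 5.54, so the null hypothesis is rejected at the 95% confidence level. Since the treatment mean (10) is lower than the baseline mean (13), the difference points in the desired direction if the goal is to shorten durations.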

There are other, more nuanced circumstances that we can address with a similar framework. For instance, we may want to test two variants of a new feature side by side. We may also want to use a slightly different statistical tool than a t-test, depending on what our interests are. What is important is the test-driven culture that should be fostered within organizations, so that a data-driven approach is used to justify the release of new features.

Next up: A/B Testing with Python.