Frequentist A/B testing cheatsheet

Published on November 17, 2021
Updated on May 2, 2023


Hypothesis testing, or statistical A/B testing, is a method for comparing two groups or treatments to determine if there is a statistically significant difference between them. The goal of A/B testing is to evaluate the effectiveness of a change or intervention.

In this notebook, we will focus on classical frequentist hypothesis testing, although other approaches exist, such as Bayesian methods. More specifically, we’ll test for differences between two sample means (continuous metrics) and two sample proportions.


Common distributions

Some of the most common distributions encountered in hypothesis testing are:
# Setup
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 4, figsize=(16,3))

# Normal distribution
from scipy.stats import norm
normal_x = np.arange(-5, 5, 0.01)
ax[0].plot(normal_x, norm.pdf(normal_x, 0, 1))

# Student-t distribution
from scipy.stats import t
student_x = np.arange(-5, 5, 0.01)
df = 2
ax[1].plot(student_x, t.pdf(student_x, df))

# Binomial distribution
from scipy.stats import binom
n = 100
p = .5
binom_x = np.arange(n*p-n*p/2, n*p+n*p/2, 1)
ax[2].bar(binom_x, binom.pmf(binom_x, n, p))

# Poisson distribution
from scipy.stats import poisson
poisson_x = np.arange(0, 20, 1)
ax[3].bar(poisson_x, poisson.pmf(poisson_x, mu=3))

# Aesthetics
for axe in ax:
    axe.set_yticks([])

  • Normal distribution: continuous probability distribution that is symmetric and bell-shaped. It is generally used to model the distribution of sample means for continuous data, when the sample size is large and the population standard deviation is known or can be estimated.
  • Student’s t-distribution: similar to the normal distribution but has thicker tails, making it more appropriate when the sample size is small (< 30 observations) or the population standard deviation is unknown.
  • Binomial distribution: discrete probability distribution that is used to model the distribution of binary outcomes, such as clicks, conversions, or success/failure events. For example, it is often used to calculate the difference in conversion rates or click-through rates between two groups.
  • Poisson distribution: discrete probability distribution that is used to model the distribution of rare events, such as the number of purchases or sign-ups. It is often used in A/B testing to analyze count data, such as the number of conversions or clicks, when the rate of occurrence is low.


Here are the most important concepts used in hypothesis testing:

  • Hypotheses: a hypothesis is a statement about the population being tested. In A/B testing, there are two hypotheses: the null hypothesis $H_0$ and the alternative hypothesis $H_1$. The null hypothesis usually states that there is no difference between the two groups, while the alternative hypothesis states that there is a difference.
  • Test statistic: summary statistic that is calculated from the data and is used to determine the likelihood of the null hypothesis being true. It is usually the standardized difference between means or proportions.
  • P-value: probability of obtaining a test statistic as extreme as or more extreme than the observed one, assuming that the null hypothesis is true. It is used to decide whether the null hypothesis should be rejected. A p-value less than the significance level $\alpha$ indicates that the results are statistically significant.
  • Significance level $\alpha$: probability of rejecting the null hypothesis when it is actually true. It is usually set at 0.05, which means accepting a 5% chance of rejecting the null hypothesis when it is actually true.
  • Beta value $\beta$: probability of failing to reject the null hypothesis when the alternative hypothesis is actually true. In other words, beta represents the likelihood of not detecting a true effect in the sample. It is often set at 20%.
  • Power ($1 - \beta$): probability of rejecting the null hypothesis when it is actually false. It depends on several factors, such as the sample size, effect size, and significance level.
  • Type I error aka $\alpha$ error: occurs when the null hypothesis is rejected when it is actually true.
  • Type II error aka $\beta$ error: occurs when the null hypothesis is not rejected although it should be, because there really is a difference.
  • Sensitivity aka Recall aka True positive rate: proportion of actual positive cases that are correctly identified as positive.
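These concepts tie together when sizing an experiment: for a chosen significance level and power, the required sample size per group can be approximated with the normal distribution. A minimal sketch using SciPy (the effect size of 0.2 is an assumed example value, not from any real experiment):

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05       # significance level
power = 0.80       # 1 - beta
effect_size = 0.2  # assumed standardized difference in means (Cohen's d)

# Normal-approximation sample size per group for a two-sided test
z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96
z_beta = norm.ppf(power)           # ~0.84
n_per_group = 2 * (z_alpha + z_beta)**2 / effect_size**2
print(round(n_per_group))  # ~392 observations per group
```

The exact numbers depend on the test used (a t-based calculation gives a slightly larger sample size), but the approximation shows how power, significance level, and effect size interact.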

Here is a graphical representation of some of these concepts:

Source: vanbelle.org


Here are the most important statistics and their formulas, for both continuous and proportion data:

|  | Formula for continuous data | Formula for proportions |
|---|---|---|
| Sample size | $n$ | $n$ |
| Sample mean / proportion | $\frac{\sum x}{n}$ | $\hat p$ |
| Sample variance | $\frac{\sum{(x - \bar x)^2}}{n-1}$ | $\hat p (1 - \hat p)$ |
| Sample standard deviation | $\sqrt{\frac{\sum{(x - \bar x)^2}}{n-1}}$ | $\sqrt{\hat p (1 - \hat p)}$ |
| Standard Error of the Mean / of the Proportion | $s/\sqrt{n}$ | $s/\sqrt{n}$ |
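As a sanity check, these formulas match the NumPy and SciPy built-ins. A small sketch on simulated data:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=50)
n = len(x)

# Manual formulas from the table
var_manual = np.sum((x - np.mean(x))**2) / (n - 1)
sem_manual = np.sqrt(var_manual) / np.sqrt(n)

# Library equivalents: ddof=1 gives the sample (Bessel-corrected) variance
assert np.isclose(var_manual, np.var(x, ddof=1))
assert np.isclose(sem_manual, st.sem(x))
```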

Typically in A/B tests, we compare the means or proportions between two samples (as opposed to comparing a sample to a general population). Here are the basics to calculate statistical significance:

|  | Standard Error (SE) | Test statistic | Confidence interval of the difference |
|---|---|---|---|
| Difference in sample means | $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ | $\frac{\bar{x}_2 - \bar{x}_1}{SE}$ | $(\bar{x}_2 - \bar{x}_1) \pm t \cdot SE$ |
| Difference in sample proportions | $\sqrt{\frac{\hat p_1 (1-\hat p_1)}{n_1} + \frac{\hat p_2 (1-\hat p_2)}{n_2}}$ | $\frac{\hat p_2 - \hat p_1}{SE}$ | $(\hat p_2 - \hat p_1) \pm z \cdot SE$ |
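For example, the confidence interval of a difference in proportions follows directly from these formulas. A minimal sketch, with made-up counts for illustration:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical counts: k successes out of n trials in each group
n1, k1 = 1000, 150
n2, k2 = 800, 140
p1, p2 = k1 / n1, k2 / n2

# Standard error of the difference in proportions
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# 95% confidence interval of the difference
z = norm.ppf(0.975)  # ~1.96
lower, upper = (p2 - p1) - z * se, (p2 - p1) + z * se
print(f"[{lower:.4f}, {upper:.4f}]")  # interval contains 0 -> not significant at 95%
```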

As a side note, keep in mind that plotting the confidence intervals of each sample mean or proportion does not necessarily reflect the statistical significance of the difference:

It is sometimes claimed that if two independent statistics have overlapping confidence intervals, then they are not significantly different. This is certainly true if there is substantial overlap. However, the overlap can be surprisingly large and the means still significantly different. Confidence intervals associated with statistics can overlap as much as 29% and the statistics can still be significantly different.

– Gerald van Belle, Statistical Rules of Thumb

Implementation in Python

Continuous data

Let’s use the formulas above to test for a statistically significant difference between two sample means. We’ll first compute the formulas manually, then check our results with SciPy’s out-of-the-box functions.

  1. Create sample data, with two normally distributed samples, and plot the distributions:

    # Import libraries
    import numpy as np
    import scipy.stats as st
    import seaborn as sns
    import matplotlib.pyplot as plt
    # Create two normally distributed samples
    h1 = np.random.normal(loc=10, scale=2, size=100)
    h2 = np.random.normal(loc=10.1, scale=2, size=80)
    # Plot distributions
    fig, ax = plt.subplots(figsize=(8,4))
    sns.histplot(h1, binwidth=1, color='steelblue')
    sns.histplot(h2, binwidth=1, color='green')
    ax.axvline(np.mean(h1), linestyle='--', color='darkblue')
    ax.axvline(np.mean(h2), linestyle='--', color='darkgreen')

  2. Manually calculate the results, applying the formulas from the previous section:

    # Sample sizes
    n1 = len(h1); n2 = len(h2)
    # Means
    x1 = np.mean(h1); x2 = np.mean(h2)
    # Standard deviations
    s1 = np.std(h1, ddof=1); s2 = np.std(h2, ddof=1)
    # t-statistic
    t = (x2 - x1) / np.sqrt(s1**2/n1 + s2**2/n2)
    # Print results
    print("Difference in means: {:.4f}".format(x2 - x1))
    print("t-score: {:.4f}".format(t))
    print("p-value: {:.4f}".format(st.t.sf(abs(t), df=n1+n2-2) *2))   # Multiply by 2 for two-tailed test
    Difference in means: 0.5219
    t-score: 1.5606
    p-value: 0.1204

    With a p-value above 0.05, there is no significant difference with 95% confidence between the means of the groups.

  3. Check the results with SciPy, with a simple one-liner:

    # Check results with scipy
    t_test = st.ttest_ind(h2, h1, alternative='two-sided', equal_var=False)
    print("t-score: {:.4f}\np-value: {:.4f}".format(t_test[0], t_test[1]))
    t-score: 1.5606
    p-value: 0.1206

    As expected, we get virtually the same results as previously.
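One detail worth noting: the small gap between the manual p-value (0.1204) and SciPy’s (0.1206) comes from the degrees of freedom. The manual calculation used n1 + n2 - 2, while `ttest_ind` with `equal_var=False` applies the Welch–Satterthwaite approximation, which can be sketched as follows (the sample statistics below are the parameters used to simulate the samples, not the exact draws):

```python
# Welch-Satterthwaite degrees of freedom (with assumed sample statistics)
s1, s2 = 2.0, 2.0  # sample standard deviations
n1, n2 = 100, 80   # sample sizes

v1, v2 = s1**2 / n1, s2**2 / n2
df_welch = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
print(round(df_welch, 1))  # lower than the naive n1 + n2 - 2 = 178
```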


Let’s now calculate in Python a difference in proportions between two groups, just like we did for continuous metrics.

  1. Generate sample data:

    # Import libraries
    import numpy as np
    import scipy.stats as st
    # Create two binomial samples
    n1 = 1000; n2 = 800
    k1 = 150; k2 = 140
    # Compute proportions
    p1 = k1/n1
    p2 = k2/n2
    p = (k1+k2)/(n1+n2)   # Pooled proportion (used in the pooled variant of the z-test)

  2. Manually calculate the results, applying the formulas:

    # Standard Error of the Proportion
    sep = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
    # z-statistic
    z = (p2 - p1) / sep
    print("z-score: {:.4f}".format(z))
    print("p-value: {:.4f}".format(st.norm.sf(abs(z))*2))   # Multiply by 2 for two-tailed test
    z-score: 1.4246
    p-value: 0.1543

    Since the p-value is above 0.05, we cannot conclude that there is a significant difference.

  3. Double-check the results with the SciPy function:

    # Check with scipy
    prop_t_test = st.ttest_ind_from_stats(
        p2, np.sqrt(p2*(1-p2)), n2,
        p1, np.sqrt(p1*(1-p1)), n1,
        equal_var=False)
    print("t-score: {:.4f}\np-value: {:.4f}".format(prop_t_test[0], prop_t_test[1]))
    t-score: 1.4246
    p-value: 0.1545

    And as expected, we get nearly identical results (the p-value differs slightly because the t-distribution is used instead of the normal).
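Note that the pooled proportion p computed in step 1 was not needed for the unpooled test above. It is used in the pooled variant of the z-test, which is equivalent to a chi-squared test on the 2x2 contingency table, as this sketch with the same counts shows:

```python
import numpy as np
import scipy.stats as st

# Same counts as above
n1, k1 = 1000, 150
n2, k2 = 800, 140
p1, p2 = k1 / n1, k2 / n2
p = (k1 + k2) / (n1 + n2)  # pooled proportion

# Pooled two-proportion z-test
se_pooled = np.sqrt(p * (1 - p) * (1/n1 + 1/n2))
z = (p2 - p1) / se_pooled
p_value = st.norm.sf(abs(z)) * 2

# Equivalent chi-squared test: z**2 == chi2 (without continuity correction)
chi2, chi2_p, _, _ = st.chi2_contingency(
    [[k1, n1 - k1], [k2, n2 - k2]], correction=False)
print(f"z: {z:.4f}, chi2: {chi2:.4f}, p-values: {p_value:.4f} / {chi2_p:.4f}")
```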

Evolution of significance

You should never act on results observed before the end of an experiment (this is known as data peeking), but an interesting view to build is the evolution of the difference between groups over time (with a 95% confidence interval), as well as the evolution of the p-value.


The example plots above show that, after about 15 days in the experiment, the difference between Control and Target groups becomes clearly significant, and the p-value stays well below the 0.05 threshold.
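Such a view can be simulated. Here is a minimal sketch (all numbers — daily sample size, group means, effect — are made up for illustration) that recomputes the p-value on cumulative data day after day:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)
days, per_day = 30, 50

# Simulated daily observations for each group (assumed true effect of 0.5)
control = rng.normal(10.0, 2, size=(days, per_day))
target = rng.normal(10.5, 2, size=(days, per_day))

# Recompute the Welch t-test p-value on cumulative data after each day
pvalues = [
    st.ttest_ind(target[:d].ravel(), control[:d].ravel(), equal_var=False)[1]
    for d in range(1, days + 1)
]
# With enough data, the p-value settles below the 0.05 threshold
```

Plotting `pvalues` against the day index gives a curve like the one described above: noisy at first, then stabilizing once the sample is large enough for the true effect to dominate.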