Delta Method for A/B testing

Published on
April 27, 2023


When running an experiment, sometimes the randomisation unit is different from the analysis unit. In this case, the assumption of independence between each observation may not hold anymore.

Since the independent and identically distributed (i.i.d.) assumption is violated, “the naïve variance calculation will likely underestimate variance, leading to false detection of changes that are actually within normal variation” (source). For this reason, a z-test (or t-test) cannot be applied without adjustment.

The Delta method provides a technique to estimate the correct variance when the randomisation and analysis units don’t match. After adjusting the variance with the Delta method, standard tests can be used.

Let’s take an example

Imagine you’re running an A/B test for an app, where you want to test a new feature on a subset of users that should increase their conversion rate, i.e. the proportion of sessions that end up in a conversion.

You start by selecting at random two groups of users, for Control and Target. Then you roll out your new feature to users in the target group. And you want to measure the effect on the conversion rate.

However, the sessions of a given user are not independent observations: whether one session converts is clearly correlated with whether that user’s other sessions convert. Users have different behaviours, and some users have a consistently higher or lower conversion rate than others.

Note that the more observations there are per randomisation unit (e.g. sessions per user), the more the variance is distorted. In our example, the distortion is far worse if the experiment runs for two months, during which users may generate 30 sessions, than if it runs for one week, during which they will have ~3 sessions.
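To make the distortion concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: per-user conversion rates drawn from a Beta(2, 5) distribution, session counts drawn from a Poisson distribution, and the helper names `simulate_ratio` and `variance_inflation`. It compares the naive binomial variance of the pooled conversion rate with the variance actually observed across many simulated experiments.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_ratio(n_users, mean_sessions, rng):
    """Return (pooled conversion rate, naive variance) for one simulated group."""
    # Hypothetical data-generating process: every user has their own conversion rate
    p_user = rng.beta(2, 5, size=n_users)
    sessions = rng.poisson(mean_sessions, size=n_users) + 1  # at least one session
    conversions = rng.binomial(sessions, p_user)
    rate = conversions.sum() / sessions.sum()
    # Naive variance: treats every session as an independent observation
    naive_var = rate * (1 - rate) / sessions.sum()
    return rate, naive_var

def variance_inflation(mean_sessions, n_users=500, n_sims=1000):
    """Ratio of the observed variance of the rate to the average naive variance."""
    results = [simulate_ratio(n_users, mean_sessions, rng) for _ in range(n_sims)]
    rates, naive_vars = map(np.array, zip(*results))
    return rates.var(ddof=1) / naive_vars.mean()

ratio_short = variance_inflation(mean_sessions=3)   # roughly a one-week test
ratio_long = variance_inflation(mean_sessions=30)   # roughly a two-month test
print(f"~3 sessions/user:  observed variance is {ratio_short:.1f}x the naive estimate")
print(f"~30 sessions/user: observed variance is {ratio_long:.1f}x the naive estimate")
```

Under these assumptions the inflation factor should be above 1 in both cases and noticeably larger for the longer experiment. The exact numbers depend on how much users differ from each other.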

Formula and implementation

The Delta method estimates the true variance based on the Central Limit Theorem and using a Taylor expansion. The full formula is the following:
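Written out so that it matches the Python expression term by term, the estimate for the variance of a ratio of means takes the form below. The notation is an assumption: $\bar{X}$ and $\bar{Y}$ are the sample means, $s_X^2$ and $s_Y^2$ the sample variances, $s_{XY}$ the sample covariance, and $n$ the number of randomisation units.

```latex
\widehat{\operatorname{Var}}\left(\frac{\bar{X}}{\bar{Y}}\right)
\approx \frac{1}{n}\left[
    \frac{s_X^2}{\bar{Y}^2}
    + \frac{\bar{X}^2}{\bar{Y}^4}\, s_Y^2
    - 2\,\frac{\bar{X}}{\bar{Y}^3}\, s_{XY}
\right]
```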

In Python, this can be written as:

import numpy as np
delta_variance = \
    (np.var(X, ddof=1) / np.mean(Y)**2 + \
    np.var(Y, ddof=1)*(np.mean(X)**2 / np.mean(Y)**4) - \
    2*np.cov(X, Y, ddof=1)[0][1]*(np.mean(X)/np.mean(Y)**3)) / len(Y)

where X and Y are lists (or any kind of iterable like pd.Series):

  • X is the ratio numerator, e.g. number of conversions per user
  • Y is the ratio denominator, e.g. number of sessions per user
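A quick sanity check on the expression: when every user has exactly one session, Y is constant, so its variance and its covariance with X are zero and the Delta variance should collapse to the ordinary variance of a mean, var(X) / n. The sketch below verifies this on made-up data; the helper name `ratio_delta_variance` is hypothetical and simply wraps the expression above.

```python
import numpy as np

def ratio_delta_variance(X, Y):
    """Delta-method variance of mean(X) / mean(Y), same expression as above."""
    return (
        np.var(X, ddof=1) / np.mean(Y) ** 2
        + np.var(Y, ddof=1) * (np.mean(X) ** 2 / np.mean(Y) ** 4)
        - 2 * np.cov(X, Y, ddof=1)[0][1] * (np.mean(X) / np.mean(Y) ** 3)
    ) / len(Y)

# Made-up data: 8 users, one session each, 4 of them converted
X = np.array([0, 1, 1, 0, 1, 0, 0, 1])
Y = np.ones_like(X)

delta = ratio_delta_variance(X, Y)
plain = np.var(X, ddof=1) / len(X)  # usual variance of a sample mean
print(delta, plain)
```

The two values agree here, which is the case where randomisation and analysis units coincide and no adjustment is needed.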

Implementing our example

There are 3 simple steps, provided that you have the data in the correct format.

  1. As experiment data, we have a DataFrame of users that were randomly assigned to Control or Target. We recorded their sessions in the app, and the number of sessions that generated a conversion. The table contains one row per user, with their group, total number of sessions, total number of conversions, and conversion_rate calculated as conversions over sessions:

    | group   | user_id          | sessions | conversions | conversion_rate |
    | Control | b0cc6b25669f1cfb |      150 |          62 |        0.413333 |
    | Target  | 1cc2f0c081cff495 |       20 |          11 |        0.550000 |
    | Control | 0dfa929aa7cea87a |       31 |           6 |        0.193548 |
    | Target  | 0dfa929aa7cea87a |       39 |           9 |        0.230769 |
    | Control | e1916d7a661d210f |        3 |           2 |        0.666667 |

    We check the summary statistics for each group:

    # Summary stats
    (
        df
        .groupby('group')
        .agg({'user_id': 'count', 'sessions': 'sum', 'conversions': 'sum'})
        .assign(conversion_rate=lambda x: (x['conversions']/x['sessions']).round(3))
    )
    And we can plot the distribution of the per-user variances, computed as conversion_rate * (1 - conversion_rate):

    # Plot distributions
    import seaborn as sns
    import matplotlib.pyplot as plt
    sns.histplot(
        df.assign(variance=lambda x: x['conversion_rate'] * (1-x['conversion_rate'])),
        x='variance', hue='group', bins=50, kde=True
    )
    plt.show()
    Nope, the variances don’t look normally distributed
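If the raw data has one row per session rather than one row per user, the per-user table used in this step has to be built first. The following is a minimal sketch under assumptions: the session-level frame `sessions_log` and its `converted` column are hypothetical names, and the result is indexed by group so that selections such as `df.loc['Control', 'conversions']` return all users of one group.

```python
import pandas as pd

# Hypothetical session-level log: one row per session
sessions_log = pd.DataFrame({
    'group':     ['Control', 'Control', 'Control', 'Target', 'Target', 'Target', 'Target'],
    'user_id':   ['u1', 'u1', 'u2', 'u3', 'u3', 'u3', 'u4'],
    'converted': [1, 0, 0, 1, 1, 0, 0],
})

df = (
    sessions_log
    .groupby(['group', 'user_id'])
    # One row per user: count the sessions and sum the converting ones
    .agg(sessions=('converted', 'size'), conversions=('converted', 'sum'))
    .assign(conversion_rate=lambda x: x['conversions'] / x['sessions'])
    .reset_index()
    .set_index('group')  # lets df.loc['Control', ...] select a whole group
)
print(df)
```

Aggregating to the randomisation unit first is what makes the per-user numerator and denominator available to the Delta formula.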

  2. We now calculate the Delta-estimated variances for control and target groups. First we write the function:

    # Function to get the Delta adjusted variance of the ratio X/Y
    def delta_var(x, y):
        mean_x = np.mean(x)
        mean_y = np.mean(y)
        var_x = np.var(x, ddof=1)
        var_y = np.var(y, ddof=1)
        cov_xy = np.cov(x,y,ddof=1)[0][1]
        delta_variance = \
            (var_x/mean_y**2 + var_y*(mean_x**2/mean_y**4) - \
            2*cov_xy*(mean_x/mean_y**3)) / len(y)
        return delta_variance

    Then we apply the function to each group, and get the estimated variances:

    # Compute estimated variances
    # (df.loc['Control', ...] assumes df is indexed by group)
    var_delta_c = delta_var(
        x=df.loc['Control', 'conversions'],
        y=df.loc['Control', 'sessions']
    )
    var_delta_t = delta_var(
        x=df.loc['Target', 'conversions'],
        y=df.loc['Target', 'sessions']
    )
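To get a feel for how much the adjustment matters, the Delta variance of a group can be compared with the naive binomial variance computed over all sessions. The sketch below does this on synthetic per-user data; the Beta and Poisson parameters are assumptions chosen only for illustration, and `delta_var` is repeated so that the block runs on its own.

```python
import numpy as np

def delta_var(x, y):
    """Delta-adjusted variance of the ratio mean(x) / mean(y)."""
    mean_x, mean_y = np.mean(x), np.mean(y)
    var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0][1]
    return (var_x / mean_y**2 + var_y * (mean_x**2 / mean_y**4)
            - 2 * cov_xy * (mean_x / mean_y**3)) / len(y)

rng = np.random.default_rng(0)

# Synthetic group (assumed parameters): users differ in their conversion rate
n_users = 2000
p_user = rng.beta(2, 5, size=n_users)
sessions = rng.poisson(10, size=n_users) + 1
conversions = rng.binomial(sessions, p_user)

rate = conversions.sum() / sessions.sum()
naive_var = rate * (1 - rate) / sessions.sum()    # sessions treated as i.i.d.
adjusted_var = delta_var(conversions, sessions)   # users treated as i.i.d.

print(f"naive: {naive_var:.2e}  delta: {adjusted_var:.2e}  "
      f"ratio: {adjusted_var / naive_var:.2f}")
```

When users are heterogeneous, as in this synthetic data, the Delta variance comes out larger than the naive one, which is why the unadjusted test tends to flag differences too easily.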

  3. We can finally apply a usual z-test or t-test with the adjusted variances:

    # Compute z-test
    import pandas as pd
    from scipy import stats
    mean_c = df.loc['Control', 'conversions'].sum() / df.loc['Control', 'sessions'].sum()
    mean_t = df.loc['Target', 'conversions'].sum() / df.loc['Target', 'sessions'].sum()
    mean_diff = mean_t - mean_c
    var_delta = var_delta_c + var_delta_t
    delta_z_score = mean_diff / np.sqrt(var_delta)
    delta_p_value = stats.norm.sf(abs(delta_z_score)) * 2
    # Collect the results in a one-column table
    pd.DataFrame({
        'Control mean': mean_c.round(3),
        'Target mean': mean_t.round(3),
        'Difference': mean_diff.round(3),
        'Control variance': var_delta_c,
        'Target variance': var_delta_t,
        'z-score': delta_z_score.round(4),
        'p-value': delta_p_value.round(4),
    }, index=['value']).transpose()

    The p-value is 0.19, so the difference is not statistically significant at the 5% level (95% confidence).
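The same adjusted variance can also be turned into a confidence interval for the difference in conversion rates, which is often easier to communicate than a p-value. This is a sketch only: the helper name `diff_confidence_interval` is hypothetical, and the inputs below are made-up numbers, not the results of the example above.

```python
import numpy as np
from scipy import stats

def diff_confidence_interval(mean_diff, var_delta, alpha=0.05):
    """Two-sided normal confidence interval for a difference in ratios.

    var_delta is the sum of the Delta-adjusted variances of both groups.
    """
    z_crit = stats.norm.ppf(1 - alpha / 2)
    half_width = z_crit * np.sqrt(var_delta)
    return mean_diff - half_width, mean_diff + half_width

# Made-up inputs for illustration
low, high = diff_confidence_interval(mean_diff=0.012, var_delta=0.0001)
print(f"95% CI for the difference: [{low:.4f}, {high:.4f}]")
```

An interval that contains zero corresponds to a two-sided z-test that is not significant at the same level.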
