
Bootstrapping for A/B testing

Category
Statistics
Published on
May 1, 2023

Introduction

When running an experiment, the randomisation unit sometimes differs from the analysis unit. In that case, the assumption of independence between observations may no longer hold.

Because the independent and identically distributed (i.i.d.) assumption is violated, a standard statistical test cannot be run on the raw data.

Several options are possible. One is to estimate the true variance with the Delta method, explained in a previous post:

Delta Method for A/B testing

Another option is to perform bootstrapping, which we discuss in this article.

Python implementation

  1. As experiment data, we have a DataFrame of users who were randomly assigned to the Control or Target group. We recorded their sessions in the app and the number of sessions that generated a conversion.
  2. The table contains one row per user, with their group, total number of sessions, total number of conversions, and conversion_rate calculated as conversions over sessions:

    | group   | user_id          | sessions | conversions | conversion_rate |
    |:--------|:-----------------|---------:|------------:|----------------:|
    | Control | b0cc6b25669f1cfb |      150 |          62 |        0.413333 |
    | Target  | 1cc2f0c081cff495 |       20 |          11 |        0.550000 |
    | Control | 0dfa929aa7cea87a |       31 |           6 |        0.193548 |
    | Target  | 0dfa929aa7cea87a |       39 |           9 |        0.230769 |
    | Control | e1916d7a661d210f |        3 |           2 |        0.666667 |

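    The snippets below assume a DataFrame df with this schema. As a minimal sketch, one way to simulate such data (purely hypothetical values; only the column names come from the table above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_users = 981  # close to the 488 + 493 users in the post

# Hypothetical simulated experiment data matching the table's schema
df = pd.DataFrame({
    'group': rng.choice(['Control', 'Target'], size=n_users),
    'user_id': [f'{i:016x}' for i in range(n_users)],  # fake hex ids
    'sessions': rng.poisson(40, size=n_users) + 1,     # at least 1 session
})
df['conversions'] = rng.binomial(df['sessions'], 0.2)  # ~20% conversion rate
df['conversion_rate'] = df['conversions'] / df['sessions']
```
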
    We check summary statistics, with the conversion rate of each group:
    # Summary stats, with the user count renamed to 'users'
    df_summary = (
        df
        .groupby(['group'])
        .agg(users=('user_id', 'count'),
             sessions=('sessions', 'sum'),
             conversions=('conversions', 'sum'))
        .assign(conversion_rate=lambda x: x['conversions'] / x['sessions'])
    )
    df_summary
    | group   |   users |   sessions |   conversions |   conversion_rate |
    |:--------|--------:|-----------:|--------------:|------------------:|
    | Control |     488 |      37689 |          7662 |            0.2032 |
    | Target  |     493 |      45106 |          8134 |            0.1803 |

  3. Define a function to calculate the statistic we’re interested in, i.e. the difference in conversion rates between groups.
  4. Note: another option would be to look at the difference in the unweighted average of per-user conversion rates between groups.

    # Function to get the statistic
    def calculate_statistic(data):
        conv_rates = (
            data
            .groupby('group')
            .agg({'conversions': 'sum', 'sessions': 'sum'})
            .assign(conv_rate=lambda x: x['conversions']/x['sessions'])
            ['conv_rate']
        )
        return conv_rates['Target'] - conv_rates['Control']
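
    As a sketch of the alternative mentioned in the note, an unweighted per-user version could look like this (the function name is ours, not from the post):

```python
import pandas as pd

# Hypothetical variant: unweighted mean of per-user conversion rates,
# so that heavy users do not dominate their group's rate
def calculate_statistic_unweighted(data):
    rates = (
        data
        .assign(conversion_rate=lambda x: x['conversions'] / x['sessions'])
        .groupby('group')['conversion_rate']
        .mean()
    )
    return rates['Target'] - rates['Control']
```
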

  5. Calculate the statistic on the observed data, before any resampling. In this example, the conversion rate is 2.29 percentage points lower in the Target group than in Control, consistent with the summary stats in step 2.
  6. # Actual observed difference
    observed_statistic = calculate_statistic(df)
    observed_statistic
    -0.0229

  7. Perform bootstrapping by resampling the data with replacement within each group. In this example we draw 200 samples, each time resampling all users with replacement (actually 99% of them, to work around a pandas bug).
  8. For each sample, the statistic of difference in conversion rates between groups is returned and appended to an array.

    # Bootstrap n times over all users, with replacement
    import numpy as np

    n_bootstrap = 200
    bootstrap_statistics = []

    for i in range(n_bootstrap):
        df_boot = (
            df
            .reset_index()
            .groupby('group')
            .apply(lambda x: x.sample(frac=.99, replace=True))  # frac<1 works around a pandas bug
            .reset_index(drop=True)
        )
        bootstrap_statistics.append(calculate_statistic(df_boot))

    bootstrap_statistics = np.array(bootstrap_statistics)
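
    If the DataFrame loop above is slow, an equivalent replicate can be drawn directly from NumPy arrays, resampling 100% of users with replacement within each group. This is a sketch; the function name and signature are ours:

```python
import numpy as np

def bootstrap_replicate(sessions_c, conversions_c, sessions_t, conversions_t, rng):
    """One bootstrap replicate of the difference in conversion rates,
    resampling users with replacement within each group."""
    # Draw row positions with replacement, one full-size sample per group
    ic = rng.integers(0, len(sessions_c), size=len(sessions_c))
    it = rng.integers(0, len(sessions_t), size=len(sessions_t))
    rate_c = conversions_c[ic].sum() / sessions_c[ic].sum()
    rate_t = conversions_t[it].sum() / sessions_t[it].sum()
    return rate_t - rate_c
```
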

  9. Finally, compute the p-value. The bootstrap distribution is centred on the observed difference, so we first shift it to zero to approximate the sampling distribution under the null hypothesis of no difference. The p-value is then the share of shifted bootstrap statistics whose absolute value (because we’re running a two-tailed test) is at least as large as the observed difference.
  10. This is the very definition of the p-value: if there were no true difference and we sampled the data repeatedly, how often would we get a value at least as extreme as the observed difference?

    # Calculate p-value: centre the bootstrap distribution on the null
    null_statistics = bootstrap_statistics - observed_statistic
    p_value = (np.abs(null_statistics) >= np.abs(observed_statistic)).sum() / n_bootstrap
    print("p-value: {:.3f}".format(p_value))

    If the p-value is above the chosen significance level (e.g. 0.05), the observed difference is not statistically significant.
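
    A complementary summary is a percentile bootstrap confidence interval on the same array of bootstrap statistics: if it excludes zero, the difference is significant at that level. A sketch, using simulated stand-in values for the bootstrap_statistics computed above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the bootstrap_statistics array computed in the loop above
bootstrap_statistics = rng.normal(-0.0229, 0.01, size=200)

# 95% percentile bootstrap confidence interval for the difference
ci_low, ci_high = np.percentile(bootstrap_statistics, [2.5, 97.5])
print("95% CI: [{:.4f}, {:.4f}]".format(ci_low, ci_high))
```
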
