
# Bootstrapping for A/B testing

Category: Statistics
Published on May 1, 2023

# Introduction

When running an experiment, the randomisation unit sometimes differs from the analysis unit: for example, users are randomised into groups, but the metric is computed over sessions. In that case, the assumption of independence between observations may no longer hold.

Because the independent and identically distributed (i.i.d.) assumption is violated, running a standard test on the raw data is no longer valid.

Several options are possible. One is to estimate the true variance with the Delta method, explained in a previous post:

*Delta Method for A/B testing*

Another option is to perform bootstrapping, which we discuss in this article.

# Python implementation

1. As experiment data, we have a DataFrame of users that were randomly assigned to the Control or Target group. We recorded their sessions in the app, and the number of sessions that generated a conversion.
2. The table contains one row per `user_id` and `group`, with the total number of `sessions`, total number of `conversions`, and `conversion_rate` calculated as conversions over sessions:

```
| group   | user_id          | sessions | conversions | conversion_rate |
|:--------|:-----------------|---------:|------------:|----------------:|
| Control | b0cc6b25669f1cfb |      150 |          62 |        0.413333 |
| Target  | 1cc2f0c081cff495 |       20 |          11 |        0.550000 |
| Control | 0dfa929aa7cea87a |       31 |           6 |        0.193548 |
| Target  | 0dfa929aa7cea87a |       39 |           9 |        0.230769 |
| Control | e1916d7a661d210f |        3 |           2 |        0.666667 |
```
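The snippets below assume such a DataFrame named `df` already exists. For readers following along without the original dataset, a minimal synthetic stand-in can be generated (the group sizes, user ids, and conversion probability here are made up, not the real experiment data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_users = 1000

# Hypothetical synthetic data matching the table schema above
df = pd.DataFrame({
    "group": rng.choice(["Control", "Target"], size=n_users),
    "user_id": [f"{i:016x}" for i in range(n_users)],
    "sessions": rng.integers(1, 200, size=n_users),
})
# Each session converts independently with probability 0.2
df["conversions"] = rng.binomial(df["sessions"], 0.2)
df["conversion_rate"] = df["conversions"] / df["sessions"]
```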

We check summary statistics, with the conversion rate of each group:
```python
# Summary stats
df_summary = (
    df
    .assign(conversion_rate=lambda x: x['conversions'] / x['sessions'])
    .groupby(['group'])
    .agg({'user_id': 'count', 'sessions': 'sum', 'conversions': 'sum'})
    .assign(conversion_rate=lambda x: x['conversions'] / x['sessions'])
)
df_summary
```
| group   | users | sessions | conversions | conversion_rate |
|:--------|------:|---------:|------------:|----------------:|
| Control |   488 |    37689 |        7662 |          0.2032 |
| Target  |   493 |    45106 |        8134 |          0.1803 |

3. Define a function to calculate the statistic we’re interested in, i.e. the difference in conversion rates between groups.
4. Note: another option would be to look at the difference in the average (unweighted) per-user conversion rates between groups.

```python
# Function to get the statistic
def calculate_statistic(data):
    conv_rates = (
        data
        .groupby('group')
        .agg({'conversions': 'sum', 'sessions': 'sum'})
        .assign(conv_rate=lambda x: x['conversions'] / x['sessions'])
        ['conv_rate']
    )
    return conv_rates['Target'] - conv_rates['Control']
```
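The unweighted variant mentioned in step 4 could be sketched as follows. This is a hypothetical alternative, not the statistic used in the rest of this post; the tiny inline DataFrame exists only to demonstrate it:

```python
import pandas as pd

# Hypothetical alternative: unweighted mean of per-user conversion rates
def calculate_statistic_unweighted(data):
    rates = (
        data
        .assign(conversion_rate=lambda x: x['conversions'] / x['sessions'])
        .groupby('group')['conversion_rate']
        .mean()  # each user counts equally, regardless of session volume
    )
    return rates['Target'] - rates['Control']

# Toy data: Control user rates are 0.2 and 0.2, Target rates are 0.3 and 0.5
toy = pd.DataFrame({
    'group': ['Control', 'Target', 'Control', 'Target'],
    'conversions': [2, 3, 1, 4],
    'sessions': [10, 10, 5, 8],
})
print(calculate_statistic_unweighted(toy))  # 0.4 - 0.2 = 0.2
```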

5. Calculate the statistic for the observed data, before any resampling. In this example, the Target group's conversion rate is 2.29 percentage points lower than Control's, consistent with the summary stats above.
6. The observed statistic:

```python
# Actual observed difference
observed_statistic = calculate_statistic(df)
observed_statistic
```
``-0.0229``

7. Perform bootstrapping by resampling the data with replacement within each group. In this example we draw 200 bootstrap samples, each time selecting (nearly) all users with replacement: 99% rather than 100%, to work around a pandas bug.
8. For each sample, the statistic of the difference in conversion rates between groups is computed and appended to an array.

```python
import numpy as np

# Bootstrap n times over all users, with replacement
n_bootstrap = 200
bootstrap_statistics = []

for i in range(n_bootstrap):
    df_boot = (
        df
        .reset_index()
        .groupby('group')
        .apply(lambda x: x.sample(frac=.99, replace=True))
        .reset_index(drop=True)
    )
    bootstrap_statistics.append(calculate_statistic(df_boot))

bootstrap_statistics = np.array(bootstrap_statistics)
```
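As an aside, the same array of bootstrap statistics also gives a confidence interval directly, via percentiles of the bootstrap distribution. A minimal self-contained sketch (with simulated values standing in for the array built above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated stand-in for the bootstrap_statistics array built above
bootstrap_statistics = rng.normal(loc=-0.0229, scale=0.01, size=200)

# 95% percentile confidence interval for the difference in conversion rates
ci_low, ci_high = np.percentile(bootstrap_statistics, [2.5, 97.5])
print("95% CI: [{:.4f}, {:.4f}]".format(ci_low, ci_high))
```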

9. Finally, compute the p-value as the share of bootstrap samples where the absolute difference in conversion rates (absolute, because we’re running a two-tailed test) is at least as large as the absolute observed difference.
10. This is the very definition of the p-value: if we sampled the data repeatedly, how often would we see a value at least as extreme as the observed difference?

```python
# Calculate p-value
p_value = (np.abs(bootstrap_statistics) >= np.abs(observed_statistic)).sum() / n_bootstrap
print("p-value: {:.3f}".format(p_value))
```
``p-value: 0.465``

The result happens to be not statistically significant, with a p-value close to 0.5.
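One last practical note: resampling through `groupby().apply()` inside a Python loop can get slow on large datasets. A faster sketch, resampling row indices per group with numpy and sampling 100% of users, might look like this (the tiny dataset here is hypothetical, only to make the example runnable):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Tiny hypothetical dataset with the same columns as above
df = pd.DataFrame({
    'group': ['Control'] * 50 + ['Target'] * 50,
    'sessions': rng.integers(1, 50, size=100),
})
df['conversions'] = rng.binomial(df['sessions'], 0.2)

def bootstrap_diffs(df, n_bootstrap=200):
    """Resample users with replacement within each group via numpy indexing."""
    diffs = np.empty(n_bootstrap)
    # Pre-extract each group's (conversions, sessions) pairs as a plain array
    groups = {
        g: sub[['conversions', 'sessions']].to_numpy()
        for g, sub in df.groupby('group')
    }
    for i in range(n_bootstrap):
        rates = {}
        for g, arr in groups.items():
            idx = rng.integers(0, len(arr), size=len(arr))  # with replacement
            conv, sess = arr[idx].sum(axis=0)
            rates[g] = conv / sess
        diffs[i] = rates['Target'] - rates['Control']
    return diffs

diffs = bootstrap_diffs(df)
```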