Introduction
When running an experiment, the randomisation unit is sometimes different from the analysis unit: for example, users are randomised into groups while the metric is computed per session, so sessions from the same user are correlated. In this case, the assumption of independence between observations may no longer hold.
Since the independent and identically distributed (i.i.d.) assumption is violated, a standard statistical test on the raw data is no longer valid.
Several options are possible. One is to estimate the true variance with the Delta method, explained in a previous post.
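For context, here is a minimal sketch of what such a Delta-method variance estimate could look like for a ratio metric like conversions per session (illustrative only, not the implementation from that post):
# Rough sketch of the Delta-method variance for a ratio metric (illustrative,
# not the implementation from the previous post).
# x: per-user sessions, y: per-user conversions (numpy arrays or pandas Series)
import numpy as np

def delta_method_variance(x, y):
    n = len(x)
    mean_x, mean_y = x.mean(), y.mean()
    var_x, var_y = x.var(ddof=1), y.var(ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    ratio = mean_y / mean_x
    # Var(sum(y)/sum(x)) ~ (var_y - 2*ratio*cov_xy + ratio^2*var_x) / (n * mean_x^2)
    return (var_y - 2 * ratio * cov_xy + ratio ** 2 * var_x) / (n * mean_x ** 2)
The per-group variances obtained this way can then be plugged into a standard z-test for the difference in rates.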
Another option is to perform bootstrapping, which we discuss in this article.
Python implementation
- As experiment data, we have a DataFrame of users that were randomly assigned to the Control or Target group. We recorded their sessions in the app, and the number of sessions that generated a conversion.
- Define a function to calculate the statistic that we’re interested in, i.e. the difference in conversion rates between groups.
- Calculate the statistic for the observed data, not yet resampled. In this example, there is a difference of -2.29 percentage points in the Target group vs Control, as seen from the summary stats in step 1.
- Perform bootstrapping by resampling the data with replacement within each group. In this example we sample 200 times, and select all users with replacement every time (actually 99% of users, to work around a pandas sampling bug).
- Finally, compute the p-value as the share of bootstrap samples where the absolute difference in conversion rates (absolute, because we're running a two-tailed test) is greater than or equal to the absolute observed difference.
The table contains one row per user, with their group, total number of sessions, total number of conversions, and conversion_rate calculated as conversions over sessions:
| group | user_id | sessions | conversions | conversion_rate |
|:--------|:-----------------|---------:|------------:|----------------:|
| Control | b0cc6b25669f1cfb | 150 | 62 | 0.413333 |
| Target | 1cc2f0c081cff495 | 20 | 11 | 0.550000 |
| Control | 0dfa929aa7cea87a | 31 | 6 | 0.193548 |
| Target | 0dfa929aa7cea87a | 39 | 9 | 0.230769 |
| Control | e1916d7a661d210f | 3 | 2 | 0.666667 |
# Summary stats per group: number of users, total sessions, total conversions,
# and the pooled conversion rate (conversions / sessions)
import numpy as np
import pandas as pd

df_summary = (
    df
    .groupby(['group'])
    .agg({'user_id': 'count', 'sessions': 'sum', 'conversions': 'sum'})
    .rename(columns={'user_id': 'users'})
    .assign(conversion_rate=lambda x: x['conversions'] / x['sessions'])
)
df_summary
| group   | users | sessions | conversions | conversion_rate |
|:--------|------:|---------:|------------:|----------------:|
| Control |   488 |    37689 |        7662 |          0.2032 |
| Target  |   493 |    45106 |        8134 |          0.1803 |
Note: another possible option would be to look at the difference in the average (unweighted) per-user conversion rates between groups.
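As an illustrative sketch (this alternative is not used in the rest of the post), that statistic could be computed like this:
# Sketch of the alternative, unweighted statistic: average the per-user
# conversion rates within each group, then take the Target - Control difference
def calculate_unweighted_statistic(data):
    rates = (
        data
        .assign(conversion_rate=lambda x: x['conversions'] / x['sessions'])
        .groupby('group')['conversion_rate']
        .mean()
    )
    return rates['Target'] - rates['Control']
This weights every user equally, whereas the pooled rate used below gives more weight to users with many sessions.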
# Function to get the statistic: pooled conversion rate per group,
# then the difference Target - Control
def calculate_statistic(data):
    conv_rates = (
        data
        .groupby('group')
        .agg({'conversions': 'sum', 'sessions': 'sum'})
        .assign(conv_rate=lambda x: x['conversions'] / x['sessions'])
        ['conv_rate']
    )
    return conv_rates['Target'] - conv_rates['Control']
# Actual observed difference
observed_statistic = calculate_statistic(df)
observed_statistic
-0.0229
For each bootstrap sample, the difference in conversion rates between groups is calculated and appended to an array.
# Bootstrap n times over all users, with replacement, within each group
n_bootstrap = 200
bootstrap_statistics = []
for i in range(n_bootstrap):
    df_boot = (
        df
        .reset_index()
        .groupby('group')
        # frac=.99 instead of 1 to work around a pandas sampling bug
        .apply(lambda x: x.sample(frac=.99, replace=True))
        .reset_index(drop=True)
    )
    bootstrap_statistics.append(calculate_statistic(df_boot))
bootstrap_statistics = np.array(bootstrap_statistics)
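As a side note, if your pandas version is 1.1 or later, the resampling step could also be written with GroupBy.sample; this is just a sketch, and whether it is affected by the sampling issue mentioned above would need checking:
# Alternative sketch: resample all users within each group, with replacement,
# using GroupBy.sample (available in pandas >= 1.1)
df_boot = (
    df
    .groupby('group')
    .sample(frac=1, replace=True)
    .reset_index(drop=True)
)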
This is the very definition of the p-value: if we sampled the data repeatedly, how often would we get a value at least as extreme as the observed difference?
# Calculate p-value: share of bootstrap samples at least as extreme as the observed difference
p_value = (np.abs(bootstrap_statistics) >= np.abs(observed_statistic)).sum() / n_bootstrap
print("p-value: {:.3f}".format(p_value))
p-value: 0.465
The result happens to be non-significant, with a p-value close to 0.5.