# Introduction

When running an experiment, sometimes the **randomisation unit is different from the analysis unit**. In that case, the assumption of independence between observations may no longer hold.

Since the independent and identically distributed (i.i.d.) assumption is violated, a standard statistical test on the raw data is no longer valid.
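To see why this matters, here is a quick simulation (all parameters made up for illustration): each user has their own conversion propensity, so sessions from the same user are correlated. A session-level standard error that treats every session as independent understates the uncertainty, while a user-level (cluster-aware) resampling approach captures it:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical clustered data: each user has their own conversion rate,
# so sessions from the same user are correlated (not i.i.d.).
n_users = 500
user_p = rng.beta(2, 8, size=n_users)         # per-user conversion propensity
sessions = rng.poisson(40, size=n_users) + 1  # sessions per user
conversions = rng.binomial(sessions, user_p)  # conversions per user

# Naive session-level SE: treats every session as an independent Bernoulli trial
p_hat = conversions.sum() / sessions.sum()
se_naive = np.sqrt(p_hat * (1 - p_hat) / sessions.sum())

# Cluster-aware SE: resample whole users with replacement
boot = []
for _ in range(1000):
    idx = rng.integers(0, n_users, n_users)
    boot.append(conversions[idx].sum() / sessions[idx].sum())
se_cluster = np.std(boot)

print(f"naive SE: {se_naive:.4f}, cluster-aware SE: {se_cluster:.4f}")
# With heterogeneous users, the cluster-aware SE is noticeably larger.
```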

Several options are possible. One is to estimate the true variance with the Delta method, which was explained in a previous post.

Another option is to perform bootstrapping, which we discuss in this article.

# Python implementation

The analysis proceeds in five steps:

1. **As experiment data, we have a DataFrame of users** that were randomly assigned to the Control or Target group. We recorded their sessions in the app, and the number of sessions that generated a conversion.
2. **Define a function to calculate the statistic** that we're interested in, i.e. the difference in conversion rates between groups.
3. **Calculate the statistic for the observed data**, not yet resampled. In this example, there is a -2.29% difference in the Target group vs Control, as seen from the summary stats in step 1.
4. **Perform bootstrapping by resampling the data with replacement** within each group. In this example we sample 200 times, and select all users with replacement every time (actually 99%, because of a bug in pandas).
5. **Finally, compute the p-value** as the share of bootstrap samples where the *absolute* difference (because we're running a two-tailed test) in conversion rates is greater than the observed difference.

The table contains **one row per user**, with their `group`, total number of `sessions`, total number of `conversions`, and `conversion_rate` calculated as conversions over sessions:

```
| group | user_id | sessions | conversions | conversion_rate |
|:--------|:-----------------|---------:|------------:|----------------:|
| Control | b0cc6b25669f1cfb | 150 | 62 | 0.413333 |
| Target | 1cc2f0c081cff495 | 20 | 11 | 0.550000 |
| Control | 0dfa929aa7cea87a | 31 | 6 | 0.193548 |
| Target | 0dfa929aa7cea87a | 39 | 9 | 0.230769 |
| Control | e1916d7a661d210f | 3 | 2 | 0.666667 |
```
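If you want to run the snippets below without the original dataset, a minimal synthetic stand-in with the same schema can be generated (all values here are made up, not the article's data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the experiment data: one row per user,
# with a group assignment, session count, and conversion count.
n = 1000
sessions = rng.poisson(40, size=n) + 1
df = pd.DataFrame({
    'user_id': [f'user_{i:04d}' for i in range(n)],
    'group': rng.choice(['Control', 'Target'], size=n),
    'sessions': sessions,
    'conversions': rng.binomial(sessions, 0.2),
})
df['conversion_rate'] = df['conversions'] / df['sessions']
print(df.head())
```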

```
# Summary stats: one row per group
df_summary = (
    df
    .groupby('group')
    .agg(users=('user_id', 'count'),
         sessions=('sessions', 'sum'),
         conversions=('conversions', 'sum'))
    .assign(conversion_rate=lambda x: x['conversions'] / x['sessions'])
)
df_summary
```

| group   | users | sessions | conversions | conversion_rate |
|:--------|------:|---------:|------------:|----------------:|
| Control |   488 |    37689 |        7662 |          0.2032 |
| Target  |   493 |    45106 |        8134 |          0.1803 |

Note: another possible option would be to look at the difference in *average (unweighted) user conversion rates* between groups.
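That alternative statistic could be sketched as follows (a hypothetical variant, not the one used in the rest of the article): it averages each user's own `conversion_rate`, so every user counts equally regardless of how many sessions they had.

```python
import pandas as pd

# Unweighted variant: average per-user conversion rates within each group,
# so heavy users don't dominate the group-level rate.
def calculate_statistic_unweighted(data):
    rates = data.groupby('group')['conversion_rate'].mean()
    return rates['Target'] - rates['Control']

# Tiny made-up example
example = pd.DataFrame({
    'group': ['Control', 'Control', 'Target', 'Target'],
    'conversion_rate': [0.2, 0.4, 0.1, 0.3],
})
print(calculate_statistic_unweighted(example))  # (0.1+0.3)/2 - (0.2+0.4)/2 = -0.1
```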

```
# Function to get the statistic: difference in conversion rates
def calculate_statistic(data):
    conv_rates = (
        data
        .groupby('group')
        .agg({'conversions': 'sum', 'sessions': 'sum'})
        .assign(conv_rate=lambda x: x['conversions'] / x['sessions'])
        ['conv_rate']
    )
    return conv_rates['Target'] - conv_rates['Control']
```

```
# Actual observed difference
observed_statistic = calculate_statistic(df)
observed_statistic
```

`-0.0229`

For each bootstrap sample, the difference in conversion rates between groups is computed and appended to an array.

```
# Bootstrap n times over all users, with replacement
import numpy as np

n_bootstrap = 200
bootstrap_statistics = []
for i in range(n_bootstrap):
    df_boot = (
        df
        .reset_index()
        .groupby('group')
        .apply(lambda x: x.sample(frac=.99, replace=True))
        .reset_index(drop=True)
    )
    bootstrap_statistics.append(calculate_statistic(df_boot))
bootstrap_statistics = np.array(bootstrap_statistics)
```
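The same array of bootstrap statistics can also give an interval estimate. A common companion to the p-value (a sketch not taken from the original post) is the percentile bootstrap confidence interval:

```python
import numpy as np

# Percentile bootstrap confidence interval from resampled statistics
def percentile_ci(stats, alpha=0.05):
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Stand-in for the article's bootstrap_statistics array (made-up values)
demo_stats = np.random.default_rng(1).normal(-0.02, 0.03, size=200)
lo, hi = percentile_ci(demo_stats)
print(f"95% CI: [{lo:.4f}, {hi:.4f}]")
```

If the interval covers zero, that agrees with a non-significant two-tailed test at the same level.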

**This is the very definition of p-value**: if we sampled the data repeatedly, how often would we get a more extreme value than the observed difference?

```
# Calculate p-value
p_value = (np.abs(bootstrap_statistics) >= np.abs(observed_statistic)).sum() / n_bootstrap
print("p-value: {:.3f}".format(p_value))
```

`p-value: 0.465`

The result happens to be not statistically significant, with a p-value close to 0.5.