# Reduce required sample size with CUPED

Category
Statistics
Published on
April 24, 2023

# Introduction

When conducting a statistical experiment, there are three levers to reach the necessary statistical power:

↗️ Increase effect size

↗️ Increase sample size

↘️ Decrease variance

Usually, the minimum detectable effect size and the variance are fixed, and we have to play with the sample size to reach the desired statistical power. However, gathering enough samples can be difficult or take a long time for some experiments.

CUPED is a simple and effective methodology, introduced by Microsoft in 2013 and now widely used across the industry by big tech companies such as Microsoft, Netflix, the BBC, and Booking.com.

🎯
Its purpose is to reduce the variance of an experiment metric. Reducing the variance can help reach statistical significance of an experiment with a smaller sample size, hence reducing the experiment duration.

The original paper claims to have reduced the variance by half on Bing experiments, hence dividing the necessary sample size by approximately two.
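To see why halving the variance roughly halves the required sample size, recall that for a two-sample test the number of observations needed per group scales linearly with the variance. Below is a minimal sketch of this relationship; the significance level, power, effect size, and variance values are illustrative assumptions, not figures from the paper:

```python
from scipy.stats import norm

def sample_size_per_group(variance, mde, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sample test:
    n = 2 * (z_{1-alpha/2} + z_{power})^2 * variance / mde^2
    """
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * z**2 * variance / mde**2

# Illustrative numbers: detect a lift of 3 on a metric with variance 2500
n_raw = sample_size_per_group(variance=2500, mde=3)
# Same test after CUPED halves the variance
n_cuped = sample_size_per_group(variance=1250, mde=3)

print(round(n_raw), round(n_cuped))  # halving the variance halves the required n
```

Since n is proportional to the variance, any relative variance reduction translates directly into the same relative reduction in sample size, and thus in experiment duration.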

There are two methods for implementing CUPED: stratification and covariate-based.

Stratification involves the use of categorical pre-experiment data like country or browser type. We will focus on the other method, the covariate approach, which uses any continuous metric calculated for each unit before they are exposed to the experiment.

💡
The fundamental principle is that the share of a metric's variance that pre-experiment data can explain is unrelated to any effect of the experiment. Therefore, it can be removed. In other words, we remove the within-group variance and keep only the between-group variance.

# Applying CUPED

1. Get the data for the metric before the experiment. In theory, you could use any strongly correlated metric, but in practice it is more convenient to use the same metric measured before the experiment, if available.

Finding the ideal duration for the pre-experiment period is subject to debate. A shorter window doesn’t capture enough variance and a longer window captures noise. For longer experiments, a longer pre-experiment window is needed to ensure that the same users are observed both during and prior to the experiment.

2. With this pre-experiment metric, calculate a coefficient $\theta$ based on the covariance between values before ($X$) and during ($Y$) the experiment:
$\theta = \frac{cov(Y,X)}{var(X)}$
3. For each experiment observation, calculate the adjusted value $\hat{Y}_i$ such that:
$\hat{Y}_i = Y_i - \theta X_i$

Each observation can for instance be a user, a device, a cookie, etc.

4. Apply the statistical test between control and treatment groups on the new adjusted values.
5. The CUPED-adjusted variance should be lower than the raw variance, hence should yield more significant results for the same sample size and effect size.

More precisely, the expected reduction in variance can be calculated with:

$var(\hat{Y}) = var(Y-\theta X) = var(Y)(1- corr(X,Y)^2)$
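This identity is easy to verify numerically. Below is a quick sketch on synthetic correlated data; the distribution parameters are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(100, 50, size=100_000)    # pre-experiment metric X
y = x + rng.normal(0, 40, size=100_000)  # experiment metric Y, correlated with X

# CUPED coefficient and adjusted values
theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
y_adj = y - theta * x

# Left-hand side: variance of the adjusted metric
lhs = np.var(y_adj, ddof=1)
# Right-hand side: var(Y) * (1 - corr(X, Y)^2)
rhs = np.var(y, ddof=1) * (1 - np.corrcoef(x, y)[0, 1] ** 2)

print(lhs, rhs)  # the two values match up to floating-point error
```

Note that with $\theta = cov(Y,X)/var(X)$, the identity holds exactly in-sample, as long as the same degrees-of-freedom convention (`ddof`) is used throughout.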

# Python implementation

Below is an example Python implementation on synthetic data that demonstrates a drastic reduction in variance.

1. We start by generating synthetic data, with a control and a treatment group. Post-experiment, the treatment group has an average uplift of +3.

```python
# Load libraries
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

# Generate pre-experiment data
# for control (x_c) and treatment (x_t) groups
x_c = list(np.random.normal(loc=100, scale=50, size=1000))
x_t = list(np.random.normal(loc=100, scale=50, size=1000))

# Generate post-experiment data
# for control (y_c) and treatment (y_t) groups
eps_sigma = 40
treatment_lift = 3
y_c = [i + np.random.normal(loc=0, scale=eps_sigma) for i in x_c]
y_t = [i + np.random.normal(loc=0, scale=eps_sigma) + treatment_lift for i in x_t]
```

2. We can plot the distributions of each group in the post-experiment phase. Visually, there is no hint of a significant difference between distributions.

```python
# Plot raw distributions
sns.histplot([y_c, y_t], kde=True, alpha=0.3)
```

3. We can apply a t-test on the unadjusted data. We get a non-significant difference, with a p-value around 0.19. The variance of the groups is above 2500.

```python
# Show test results before CUPED
pvalue = st.ttest_ind(y_c, y_t).pvalue
print("Non-adjusted p-value: {:.3f}".format(pvalue))
print("Non-adjusted variance (treatment group): {:.0f}".format(np.var(y_t)))
```

```
Non-adjusted p-value: 0.194
Non-adjusted variance (treatment group): 2553
```

4. Now we apply the CUPED formula and compute the adjusted values for each observation. The CUPED coefficient $\theta$ is around 1, as expected, since the post-experiment values were generated as the pre-experiment values plus noise.

```python
# Compute CUPED adjusted values
theta = np.cov([y_c+y_t, x_c+x_t])[0, 1] / np.var(x_c+x_t, ddof=1)
y_c_adj = [y - x * theta for x, y in zip(x_c, y_c)]
y_t_adj = [y - x * theta for x, y in zip(x_t, y_t)]

print("Theta: {:.3f}".format(theta))
```

```
Theta: 1.002
```

5. We can plot the distributions before and after the CUPED adjustment. It is visually obvious that the variance has been strongly reduced by CUPED.

```python
# Plot CUPED adjusted distributions
sns.histplot([y_c, y_t, y_c_adj, y_t_adj], kde=True, alpha=0.3)
```

6. Finally, we re-apply a t-test on the adjusted data. This time, the p-value is below 0.0001, thanks to a strong reduction in variance, from around 2500 to 98.

```python
# Show test results for CUPED-adjusted values
pvalue_adj = st.ttest_ind(y_c_adj, y_t_adj).pvalue
print("CUPED adjusted p-value: {:.4f}".format(pvalue_adj))
print("CUPED adjusted variance (treatment group): {:.0f}".format(np.var(y_t_adj)))
```

```
CUPED adjusted p-value: 0.0000
CUPED adjusted variance (treatment group): 98
```