Aggregate small cohorts for A/B testing

Published on May 15, 2023


In statistical hypothesis testing, it is sometimes necessary to aggregate cohorts that were not started on the same day.

For example, if you want to call 500 customers at random and see if this has an impact on their mid-term retention, you may not be able to call them all on day 1. If it takes 20 days to call all the 500 customers in the target group, you would end up with 20 small cohorts.

However, if certain conditions are met, you can aggregate the small cohorts and conduct a global analysis.


  1. For each cohort, the target and control groups must be drawn at random. For example, if you can call 25 customers every day, pick 50 customers at random and split them into two groups. If you are targeting a specific subset of customers, such as those who have placed a certain number of orders, the control group must be drawn from the same subset. To make sure the samples were indeed drawn at random, it is good practice to check for pre-existing biases between the groups.
  2. When analyzing customer retention, the starting point is different for each cohort, so cohorts must be aligned on a common relative time axis. For example, if you want to analyze retention at D+30 days, Day 0 is the day of the call for each cohort.
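As a sketch of the first condition, assuming a pandas DataFrame of one day's callable customers (the column names `user_id` and `past_orders` are hypothetical), the daily random split and a simple pre-bias check could look like:

```python
import numpy as np
import pandas as pd

# Hypothetical pool of 50 eligible customers for one day's cohort.
rng = np.random.default_rng(42)
eligible = pd.DataFrame({
    'user_id': range(50),
    'past_orders': rng.integers(1, 10, size=50),
})

# Shuffle, then split 50/50 into target (called) and control (not called).
shuffled = eligible.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled['group'] = np.where(shuffled.index < len(shuffled) // 2, 'target', 'control')

# Pre-bias check: before any call is made, the groups should look alike,
# e.g. similar average past order counts.
print(shuffled.groupby('group')['past_orders'].mean())
```

In practice you would repeat this split every day on that day's eligible pool, and run the pre-bias check on the aggregated groups as well.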

If these conditions are met, you can aggregate the two groups and apply a standard test.


Say we have 20 cohorts of 50 users, with a 50% split in Control and Target groups. When the experiment period is over, we can look at the D+15 days retention of each cohort.
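The snippet below assumes a DataFrame `df` with one row per user and columns `date`, `group`, `user_id`, and `orders_d0` … `orders_d14` (the user's order count on each day after the call). For illustration only, a synthetic frame with that shape (pure simulation, not real data) could be built like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range('2023-01-01', periods=20, freq='D')

frames = []
for d in dates:
    for group in ('control', 'target'):
        # 25 users per group per daily cohort; orders_d0..orders_d14 hold
        # each user's order count on day i after the call.
        users = pd.DataFrame({'user_id': rng.integers(1_000_000, 10_000_000, size=25)})
        users['date'] = d.date().isoformat()
        users['group'] = group
        for i in range(15):
            users[f'orders_d{i}'] = rng.poisson(0.2, size=25)
        frames.append(users)

df = pd.concat(frames, ignore_index=True)
```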

In the graph below, we have plotted the 95% confidence interval of the daily retention for each group, up to 15 days after the call. A first visual analysis shows no obvious difference between the groups’ retention.
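The interval used here is the standard normal approximation to a binomial proportion: with $\hat p$ the observed retention and $n$ the group size,

```latex
\hat p \;\pm\; 1.96\,\sqrt{\frac{\hat p\,(1-\hat p)}{n}}
```

where 1.96 is the z-value for a 95% confidence level.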

Python code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

n_rows = 4
n_cols = 5
fig, ax = plt.subplots(n_rows, n_cols, figsize=(14, 24))

for k, v in enumerate(df['date'].unique()[:n_rows * n_cols]):

    df_ret_day = (
        df
        # keep one cohort (one call date) at a time
        .loc[lambda x: x['date'] == v]
        # one row per (user, day after the call)
        .melt(
            id_vars=['group', 'user_id'],
            value_vars=["orders_d" + str(i) for i in range(15)],
            value_name='orders',
        )
        .assign(
            days=lambda x: x['variable'].str.replace('orders_d', '').astype(int),
            active=lambda x: x['orders'] > 0,
        )
        .groupby(['group', 'days'])
        .agg({'user_id': 'count', 'active': 'sum', 'orders': 'sum'})
        .assign(
            retention=lambda x: x['active'] / x['user_id'],
            moe=lambda x: np.sqrt(x['retention'] * (1 - x['retention']) / x['user_id']) * 1.96,
            ci_low=lambda x: x['retention'] - x['moe'],
            ci_hi=lambda x: x['retention'] + x['moe'],
        )
        .rename(columns={'user_id': 'users'})
        .reset_index('days')  # keep 'group' as the index for .loc below
    )

    axe = ax[k // n_cols, k % n_cols]
    axe.fill_between(
        x=df_ret_day.loc['control', 'days'],
        y1=df_ret_day.loc['control', 'ci_low'],
        y2=df_ret_day.loc['control', 'ci_hi'],
        alpha=0.5,
    )
    axe.fill_between(
        x=df_ret_day.loc['target', 'days'],
        y1=df_ret_day.loc['target', 'ci_low'],
        y2=df_ret_day.loc['target', 'ci_hi'],
        alpha=0.5,
    )
    axe.set(title=v, xticklabels=[], yticklabels=[], xlabel=None, ylabel=None)
Customer daily retention per cohort and group (control / target).

Customer retention per aggregated group.

Since we have respected the above conditions, we can now aggregate all cohorts into larger Control and Target groups.

Then we can plot the retention up to D+15 and see whether the results are more significant. In this example they are not: the graph shows no clear difference between the groups, and a proper statistical test confirms it.
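As a minimal sketch of the final test, treating "active at D+15" as a binary indicator per user and using a pooled two-proportion z-test (equivalent in spirit, for large samples, to the t-test mentioned above), on synthetic aggregated groups:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(7)

# Synthetic aggregated groups of 500 users each, with the same true
# retention rate, standing in for the real 'active at D+15' indicator.
control = rng.random(500) < 0.30
target = rng.random(500) < 0.30

p1, p2 = control.mean(), target.mean()
n1, n2 = len(control), len(target)

# Pooled two-proportion z-test.
p = (control.sum() + target.sum()) / (n1 + n2)
se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided

print(f'z = {z:.2f}, p-value = {p_value:.3f}')
```

A p-value above 0.05 would mean no detectable effect of the call at the chosen confidence level; with real data you would replace the two synthetic arrays by the aggregated groups' D+15 activity indicators.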