Assign users to experiment groups with hash functions

In the case of statistical experiments, some hash functions that produce integers can be used for easily pseudo-randomly and deterministically assign users to groups.

The process works as follows:

  1. Hash a unique identifier of each user, for example their UID –> it will generate a large integer
  2. Optionally get the absolute value of the hash, because it can be negative
  3. Apply a modulo to the integer, corresponding to the number of experimentation groups that you want

Hash in Python

⚠️ Warning: in Python, using the built-in hash() function will produce different outputs across sessions if you don’t explicitly specify the same hash seed value every time.

# Set hash seed
%env PYTHONHASHSEED=0
env: PYTHONHASHSEED=0
# Import libraries
import pandas as pd

# Sample dataframe of UIDs
df = pd.DataFrame({'uid':[
    'd4212ed3-4399-4f5b-87ab-faec4f6558f8',
    'a9780c08-9714-4da2-a53c-6102d5f47cf7',
    '26c6853f-cae9-4a06-849d-7df46d744e5a',
    'efc1583b-946a-449f-a811-87d241e57bc4',
    '60d568e2-052b-4ca3-83da-34ee00ef5c6b'
]})
df
uid
0 d4212ed3-4399-4f5b-87ab-faec4f6558f8
1 a9780c08-9714-4da2-a53c-6102d5f47cf7
2 26c6853f-cae9-4a06-849d-7df46d744e5a
3 efc1583b-946a-449f-a811-87d241e57bc4
4 60d568e2-052b-4ca3-83da-34ee00ef5c6b
# Step 1a: compute hash value for each user UID
df['uid_hash'] = df['uid'].apply(hash)
df
uid uid_hash
0 d4212ed3-4399-4f5b-87ab-faec4f6558f8 8654918514890274169
1 a9780c08-9714-4da2-a53c-6102d5f47cf7 1340609583511541598
2 26c6853f-cae9-4a06-849d-7df46d744e5a -1679195277781215170
3 efc1583b-946a-449f-a811-87d241e57bc4 -2786820932382908882
4 60d568e2-052b-4ca3-83da-34ee00ef5c6b 3180348019046944236
# Step 1b: get the absolute value of the hash
df['abs_hash'] = df['uid_hash'].apply(abs)
df
uid uid_hash abs_hash
0 d4212ed3-4399-4f5b-87ab-faec4f6558f8 8654918514890274169 8654918514890274169
1 a9780c08-9714-4da2-a53c-6102d5f47cf7 1340609583511541598 1340609583511541598
2 26c6853f-cae9-4a06-849d-7df46d744e5a -1679195277781215170 1679195277781215170
3 efc1583b-946a-449f-a811-87d241e57bc4 -2786820932382908882 2786820932382908882
4 60d568e2-052b-4ca3-83da-34ee00ef5c6b 3180348019046944236 3180348019046944236
# Step 3: apply a modulo, to reduce each hash
# In this example, we want 5 groups
groups = 5
df['user_group'] = df['abs_hash'].apply(lambda x: x % groups)
df
uid uid_hash abs_hash user_group
0 d4212ed3-4399-4f5b-87ab-faec4f6558f8 8654918514890274169 8654918514890274169 4
1 a9780c08-9714-4da2-a53c-6102d5f47cf7 1340609583511541598 1340609583511541598 3
2 26c6853f-cae9-4a06-849d-7df46d744e5a -1679195277781215170 1679195277781215170 0
3 efc1583b-946a-449f-a811-87d241e57bc4 -2786820932382908882 2786820932382908882 2
4 60d568e2-052b-4ca3-83da-34ee00ef5c6b 3180348019046944236 3180348019046944236 1

Hash in Redshift

Redshift provides several hash functions, with one algorithm named FNV outputing integers with FNV_HASH().

-- Redshift hash with FNV_HASH() function
SELECT
    uid,
    FNV_HASH(uid) AS uid_hash,
    ABS(FNV_HASH(uid)) AS abs_hash,
    MOD(ABS(FNV_HASH(uid)), 5) AS user_group
FROM users
uid uid_hash abs_hash user_group
e5bc5b34-7591-440b-afe5-abd09ce9e7fc 3227803941034506429 3227803941034506429 4
45537e99-fb26-4f91-bc05-54609a1aaea4 -5122133721459384104 5122133721459384104 4
dd654604-c782-4bb7-b8b1-f4d29544901d -6617981390633479988 6617981390633479988 3
cdd58d6c-cc32-4d08-9bc2-5dbd9794a850 -8431386626058726789 8431386626058726789 4
312fa481-ade1-4049-a0a2-1f8fd8127038 -6575192304779364999 6575192304779364999 4
92a78700-1776-4bdd-9a30-3f1e2a5215ad -3502909049117672277 3502909049117672277 2
ff9bf307-46ed-4eaa-8768-56d9bd243638 -3896326129685120733 3896326129685120733 3
df0feb24-46fb-4413-a6dd-bc720a8404e0 -1574156297740572233 1574156297740572233 3
c88e885f-7e07-4164-bbef-8b3e06b01c03 -2193445785569549438 2193445785569549438 3
e6840bd5-a7f8-4bca-b3c4-dd0d063bd805 7232213398192042271 7232213398192042271 1

Hash in Biguery

BigQuery provides a deterministic fingerprinting based on the FarmHash library with the FARM_FINGERPRINT() function

-- BigQuery provides FARM_FINGERPRINT() for hash
SELECT
    uid,
    FARM_FINGERPRINT(uid) AS uid_hash,
    ABS(FARM_FINGERPRINT(uid)) AS abs_hash,
    MOD(ABS(FARM_FINGERPRINT(uid)), 5) AS user_group
FROM users
uid uid_hash abs_hash user_group
e5bc5b34-7591-440b-afe5-abd09ce9e7fc -7787564224466934590 7787564224466934590 0
45537e99-fb26-4f91-bc05-54609a1aaea4 5608318025082862745 5608318025082862745 0
dd654604-c782-4bb7-b8b1-f4d29544901d 5934209085968642182 5934209085968642182 2
cdd58d6c-cc32-4d08-9bc2-5dbd9794a850 -616326417315919032 616326417315919032 2
312fa481-ade1-4049-a0a2-1f8fd8127038 8906102320423035382 8906102320423035382 2
92a78700-1776-4bdd-9a30-3f1e2a5215ad -6944503593848300382 6944503593848300382 2
ff9bf307-46ed-4eaa-8768-56d9bd243638 941176159813038931 941176159813038931 1
df0feb24-46fb-4413-a6dd-bc720a8404e0 -7011198657730055701 7011198657730055701 1
c88e885f-7e07-4164-bbef-8b3e06b01c03 -7735882866151441807 7735882866151441807 2
e6840bd5-a7f8-4bca-b3c4-dd0d063bd805 7843601242992569940 7843601242992569940 0