Assign users to experiment groups with hash functions
In the case of statistical experiments, some hash functions that produce integers can be used for easily pseudo-randomly and deterministically assign users to groups.
The process works as follows:
- Hash a unique identifier of each user, for example their UID –> it will generate a large integer
- Optionally get the absolute value of the hash, because it can be negative
- Apply a modulo to the integer, corresponding to the number of experimentation groups that you want
Hash in Python
⚠️ Warning: in Python, using the built-in hash()
function will produce different outputs across sessions if you don’t explicitly specify the same hash seed value every time.
# Set hash seed
%env PYTHONHASHSEED=0
# Import libraries
import pandas as pd
# Sample dataframe of UIDs
df = pd.DataFrame({'uid':[
'd4212ed3-4399-4f5b-87ab-faec4f6558f8',
'a9780c08-9714-4da2-a53c-6102d5f47cf7',
'26c6853f-cae9-4a06-849d-7df46d744e5a',
'efc1583b-946a-449f-a811-87d241e57bc4',
'60d568e2-052b-4ca3-83da-34ee00ef5c6b'
]})
df
|
uid |
0 |
d4212ed3-4399-4f5b-87ab-faec4f6558f8 |
1 |
a9780c08-9714-4da2-a53c-6102d5f47cf7 |
2 |
26c6853f-cae9-4a06-849d-7df46d744e5a |
3 |
efc1583b-946a-449f-a811-87d241e57bc4 |
4 |
60d568e2-052b-4ca3-83da-34ee00ef5c6b |
# Step 1a: compute hash value for each user UID
df['uid_hash'] = df['uid'].apply(hash)
df
|
uid |
uid_hash |
0 |
d4212ed3-4399-4f5b-87ab-faec4f6558f8 |
8654918514890274169 |
1 |
a9780c08-9714-4da2-a53c-6102d5f47cf7 |
1340609583511541598 |
2 |
26c6853f-cae9-4a06-849d-7df46d744e5a |
-1679195277781215170 |
3 |
efc1583b-946a-449f-a811-87d241e57bc4 |
-2786820932382908882 |
4 |
60d568e2-052b-4ca3-83da-34ee00ef5c6b |
3180348019046944236 |
# Step 1b: get the absolute value of the hash
df['abs_hash'] = df['uid_hash'].apply(abs)
df
|
uid |
uid_hash |
abs_hash |
0 |
d4212ed3-4399-4f5b-87ab-faec4f6558f8 |
8654918514890274169 |
8654918514890274169 |
1 |
a9780c08-9714-4da2-a53c-6102d5f47cf7 |
1340609583511541598 |
1340609583511541598 |
2 |
26c6853f-cae9-4a06-849d-7df46d744e5a |
-1679195277781215170 |
1679195277781215170 |
3 |
efc1583b-946a-449f-a811-87d241e57bc4 |
-2786820932382908882 |
2786820932382908882 |
4 |
60d568e2-052b-4ca3-83da-34ee00ef5c6b |
3180348019046944236 |
3180348019046944236 |
# Step 3: apply a modulo, to reduce each hash
# In this example, we want 5 groups
groups = 5
df['user_group'] = df['abs_hash'].apply(lambda x: x % groups)
df
|
uid |
uid_hash |
abs_hash |
user_group |
0 |
d4212ed3-4399-4f5b-87ab-faec4f6558f8 |
8654918514890274169 |
8654918514890274169 |
4 |
1 |
a9780c08-9714-4da2-a53c-6102d5f47cf7 |
1340609583511541598 |
1340609583511541598 |
3 |
2 |
26c6853f-cae9-4a06-849d-7df46d744e5a |
-1679195277781215170 |
1679195277781215170 |
0 |
3 |
efc1583b-946a-449f-a811-87d241e57bc4 |
-2786820932382908882 |
2786820932382908882 |
2 |
4 |
60d568e2-052b-4ca3-83da-34ee00ef5c6b |
3180348019046944236 |
3180348019046944236 |
1 |
Hash in Redshift
Redshift provides several hash functions, with one algorithm named FNV outputing integers with FNV_HASH()
.
-- Redshift hash with FNV_HASH() function
SELECT
uid,
FNV_HASH(uid) AS uid_hash,
ABS(FNV_HASH(uid)) AS abs_hash,
MOD(ABS(FNV_HASH(uid)), 5) AS user_group
FROM users
uid |
uid_hash |
abs_hash |
user_group |
e5bc5b34-7591-440b-afe5-abd09ce9e7fc |
3227803941034506429 |
3227803941034506429 |
4 |
45537e99-fb26-4f91-bc05-54609a1aaea4 |
-5122133721459384104 |
5122133721459384104 |
4 |
dd654604-c782-4bb7-b8b1-f4d29544901d |
-6617981390633479988 |
6617981390633479988 |
3 |
cdd58d6c-cc32-4d08-9bc2-5dbd9794a850 |
-8431386626058726789 |
8431386626058726789 |
4 |
312fa481-ade1-4049-a0a2-1f8fd8127038 |
-6575192304779364999 |
6575192304779364999 |
4 |
92a78700-1776-4bdd-9a30-3f1e2a5215ad |
-3502909049117672277 |
3502909049117672277 |
2 |
ff9bf307-46ed-4eaa-8768-56d9bd243638 |
-3896326129685120733 |
3896326129685120733 |
3 |
df0feb24-46fb-4413-a6dd-bc720a8404e0 |
-1574156297740572233 |
1574156297740572233 |
3 |
c88e885f-7e07-4164-bbef-8b3e06b01c03 |
-2193445785569549438 |
2193445785569549438 |
3 |
e6840bd5-a7f8-4bca-b3c4-dd0d063bd805 |
7232213398192042271 |
7232213398192042271 |
1 |
Hash in Biguery
BigQuery provides a deterministic fingerprinting based on the FarmHash library with the FARM_FINGERPRINT()
function
-- BigQuery provides FARM_FINGERPRINT() for hash
SELECT
uid,
FARM_FINGERPRINT(uid) AS uid_hash,
ABS(FARM_FINGERPRINT(uid)) AS abs_hash,
MOD(ABS(FARM_FINGERPRINT(uid)), 5) AS user_group
FROM users
uid |
uid_hash |
abs_hash |
user_group |
e5bc5b34-7591-440b-afe5-abd09ce9e7fc |
-7787564224466934590 |
7787564224466934590 |
0 |
45537e99-fb26-4f91-bc05-54609a1aaea4 |
5608318025082862745 |
5608318025082862745 |
0 |
dd654604-c782-4bb7-b8b1-f4d29544901d |
5934209085968642182 |
5934209085968642182 |
2 |
cdd58d6c-cc32-4d08-9bc2-5dbd9794a850 |
-616326417315919032 |
616326417315919032 |
2 |
312fa481-ade1-4049-a0a2-1f8fd8127038 |
8906102320423035382 |
8906102320423035382 |
2 |
92a78700-1776-4bdd-9a30-3f1e2a5215ad |
-6944503593848300382 |
6944503593848300382 |
2 |
ff9bf307-46ed-4eaa-8768-56d9bd243638 |
941176159813038931 |
941176159813038931 |
1 |
df0feb24-46fb-4413-a6dd-bc720a8404e0 |
-7011198657730055701 |
7011198657730055701 |
1 |
c88e885f-7e07-4164-bbef-8b3e06b01c03 |
-7735882866151441807 |
7735882866151441807 |
2 |
e6840bd5-a7f8-4bca-b3c4-dd0d063bd805 |
7843601242992569940 |
7843601242992569940 |
0 |