Category
Data Science
Published on
June 13, 2021
‣
Introduction
In the case of statistical experiments, some hash functions that produce integers can be used for easily pseudo-randomly and deterministically assign users to groups.
The process works as follows:
- Hash a unique identifier of each user, for example their UID –> it will generate a large integer
- Optionally get the absolute value of the hash, because it can be negative
- Apply a modulo to the integer, corresponding to the number of experimentation groups that you want
Hash in Python
⚠️ Warning: in Python, using the built-in hash()
function will produce different outputs across sessions if you don’t explicitly specify the same hash seed value every time.
# Set hash seed
%env PYTHONHASHSEED=0
# Import libraries
import pandas as pd
# Sample dataframe of UIDs
df = pd.DataFrame({'uid':[
'd4212ed3-4399-4f5b-87ab-faec4f6558f8',
'a9780c08-9714-4da2-a53c-6102d5f47cf7',
'26c6853f-cae9-4a06-849d-7df46d744e5a',
'efc1583b-946a-449f-a811-87d241e57bc4',
'60d568e2-052b-4ca3-83da-34ee00ef5c6b'
]})
df
uid | |
0 | d4212ed3-4399-4f5b-87ab-faec4f6558f8 |
1 | a9780c08-9714-4da2-a53c-6102d5f47cf7 |
2 | 26c6853f-cae9-4a06-849d-7df46d744e5a |
3 | efc1583b-946a-449f-a811-87d241e57bc4 |
4 | 60d568e2-052b-4ca3-83da-34ee00ef5c6b |
# Step 1a: compute hash value for each user UID
df['uid_hash'] = df['uid'].apply(hash)
df
uid | uid_hash | |
0 | d4212ed3-4399-4f5b-87ab-faec4f6558f8 | 8654918514890274169 |
1 | a9780c08-9714-4da2-a53c-6102d5f47cf7 | 1340609583511541598 |
2 | 26c6853f-cae9-4a06-849d-7df46d744e5a | -1679195277781215170 |
3 | efc1583b-946a-449f-a811-87d241e57bc4 | -2786820932382908882 |
4 | 60d568e2-052b-4ca3-83da-34ee00ef5c6b | 3180348019046944236 |
# Step 1b: get the absolute value of the hash
df['abs_hash'] = df['uid_hash'].apply(abs)
df
uid | uid_hash | abs_hash | |
0 | d4212ed3-4399-4f5b-87ab-faec4f6558f8 | 8654918514890274169 | 8654918514890274169 |
1 | a9780c08-9714-4da2-a53c-6102d5f47cf7 | 1340609583511541598 | 1340609583511541598 |
2 | 26c6853f-cae9-4a06-849d-7df46d744e5a | -1679195277781215170 | 1679195277781215170 |
3 | efc1583b-946a-449f-a811-87d241e57bc4 | -2786820932382908882 | 2786820932382908882 |
4 | 60d568e2-052b-4ca3-83da-34ee00ef5c6b | 3180348019046944236 | 3180348019046944236 |
# Step 3: apply a modulo, to reduce each hash
# In this example, we want 5 groups
groups = 5
df['user_group'] = df['abs_hash'].apply(lambda x: x % groups)
df
uid | uid_hash | abs_hash | user_group | |
0 | d4212ed3-4399-4f5b-87ab-faec4f6558f8 | 8654918514890274169 | 8654918514890274169 | 4 |
1 | a9780c08-9714-4da2-a53c-6102d5f47cf7 | 1340609583511541598 | 1340609583511541598 | 3 |
2 | 26c6853f-cae9-4a06-849d-7df46d744e5a | -1679195277781215170 | 1679195277781215170 | 0 |
3 | efc1583b-946a-449f-a811-87d241e57bc4 | -2786820932382908882 | 2786820932382908882 | 2 |
4 | 60d568e2-052b-4ca3-83da-34ee00ef5c6b | 3180348019046944236 | 3180348019046944236 | 1 |
Hash in Redshift
Redshift provides several hash functions, with one algorithm named FNV outputing integers with FNV_HASH()
.
-- Redshift hash with FNV_HASH() function
SELECT
uid,
FNV_HASH(uid) AS uid_hash,
ABS(FNV_HASH(uid)) AS abs_hash,
MOD(ABS(FNV_HASH(uid)), 5) AS user_group
FROM users
uid | uid_hash | abs_hash | user_group |
e5bc5b34-7591-440b-afe5-abd09ce9e7fc | 3227803941034506429 | 3227803941034506429 | 4 |
45537e99-fb26-4f91-bc05-54609a1aaea4 | -5122133721459384104 | 5122133721459384104 | 4 |
dd654604-c782-4bb7-b8b1-f4d29544901d | -6617981390633479988 | 6617981390633479988 | 3 |
cdd58d6c-cc32-4d08-9bc2-5dbd9794a850 | -8431386626058726789 | 8431386626058726789 | 4 |
312fa481-ade1-4049-a0a2-1f8fd8127038 | -6575192304779364999 | 6575192304779364999 | 4 |
92a78700-1776-4bdd-9a30-3f1e2a5215ad | -3502909049117672277 | 3502909049117672277 | 2 |
ff9bf307-46ed-4eaa-8768-56d9bd243638 | -3896326129685120733 | 3896326129685120733 | 3 |
df0feb24-46fb-4413-a6dd-bc720a8404e0 | -1574156297740572233 | 1574156297740572233 | 3 |
c88e885f-7e07-4164-bbef-8b3e06b01c03 | -2193445785569549438 | 2193445785569549438 | 3 |
e6840bd5-a7f8-4bca-b3c4-dd0d063bd805 | 7232213398192042271 | 7232213398192042271 | 1 |
Hash in BigQuery
BigQuery provides a deterministic fingerprinting based on the FarmHash library with the FARM_FINGERPRINT()
function
-- BigQuery provides FARM_FINGERPRINT() for hash
SELECT
uid,
FARM_FINGERPRINT(uid) AS uid_hash,
ABS(FARM_FINGERPRINT(uid)) AS abs_hash,
MOD(ABS(FARM_FINGERPRINT(uid)), 5) AS user_group
FROM users
uid | uid_hash | abs_hash | user_group |
e5bc5b34-7591-440b-afe5-abd09ce9e7fc | -7787564224466934590 | 7787564224466934590 | 0 |
45537e99-fb26-4f91-bc05-54609a1aaea4 | 5608318025082862745 | 5608318025082862745 | 0 |
dd654604-c782-4bb7-b8b1-f4d29544901d | 5934209085968642182 | 5934209085968642182 | 2 |
cdd58d6c-cc32-4d08-9bc2-5dbd9794a850 | -616326417315919032 | 616326417315919032 | 2 |
312fa481-ade1-4049-a0a2-1f8fd8127038 | 8906102320423035382 | 8906102320423035382 | 2 |
92a78700-1776-4bdd-9a30-3f1e2a5215ad | -6944503593848300382 | 6944503593848300382 | 2 |
ff9bf307-46ed-4eaa-8768-56d9bd243638 | 941176159813038931 | 941176159813038931 | 1 |
df0feb24-46fb-4413-a6dd-bc720a8404e0 | -7011198657730055701 | 7011198657730055701 | 1 |
c88e885f-7e07-4164-bbef-8b3e06b01c03 | -7735882866151441807 | 7735882866151441807 | 2 |
e6840bd5-a7f8-4bca-b3c4-dd0d063bd805 | 7843601242992569940 | 7843601242992569940 | 0 |