Sessions
Let’s assume that we have a simple table events
with two columns, user_id
and timestamp
, where each event writes a new row.
-- Show source data
SELECT *
FROM `training.events`
LIMIT 10
| user_id | timestamp |
|---------|---------------------------|
| 1 | 2020-10-02 15:36:02+00:00 |
| 1 | 2020-10-02 15:36:05+00:00 |
| 1 | 2020-10-02 15:36:06+00:00 |
| 1 | 2020-10-02 15:36:10+00:00 |
| 1 | 2020-10-02 15:36:14+00:00 |
| 2 | 2020-10-06 17:05:31+00:00 |
| 2 | 2020-10-06 17:06:10+00:00 |
| 2 | 2020-10-06 17:06:14+00:00 |
| 2 | 2020-10-06 17:07:35+00:00 |
| 2 | 2020-10-06 17:07:42+00:00 |
To aggregate events by users sessions, you first need to decide what time interval will mark the end of a session. In this example, we will set the gap at 120 seconds.
Defining a session can then be made in three steps:
- Get timestamp previous event, partitioned by user
- If the previous timestamp is Null or greater than 120 seconds, mark a new session (1) else not (0)
- Sum this column over each user, which will be this user session ID
-- Step 1: get timestamp of previous event
WITH sub1 AS (
SELECT
user_id,
timestamp,
LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp) AS prev_timestamp
FROM `training.events`
),
-- Step 2: calculate start of new session
-- Either first event of a user, or previous event was more than 120 seconds ago
sub2 AS (
SELECT
*,
CASE
WHEN prev_timestamp IS NULL THEN 1
WHEN TIMESTAMP_DIFF(timestamp, prev_timestamp, SECOND) > 120 THEN 1
ELSE 0
END AS is_new_session
FROM sub1
)
-- Step 3: calculate ID of session
SELECT
user_id,
timestamp,
SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY timestamp) AS user_session_id
FROM sub2
| user_id | timestamp | user_session_id |
|---------|---------------------------|-----------------|
| 366 | 2020-10-01 10:23:26+00:00 | 1 |
| 366 | 2020-10-01 10:23:28+00:00 | 1 |
| 366 | 2020-10-01 10:23:29+00:00 | 1 |
| 366 | 2020-10-01 10:23:31+00:00 | 1 |
| 366 | 2020-10-01 10:23:32+00:00 | 1 |
| ... | ... | ... |
| 475 | 2020-10-02 21:48:47+00:00 | 4 |
| 475 | 2020-10-02 21:48:51+00:00 | 4 |
| 475 | 2020-10-02 21:48:54+00:00 | 4 |
| 475 | 2020-10-02 21:48:57+00:00 | 4 |
| 475 | 2020-10-02 21:48:59+00:00 | 4 |
Streaks
Calculating date streaks (visits over consecutive days) is similar from sessions, but the interval between current and previous date should be calculated in days, not timestamps:
-- BigQuery
DATE_DIFF(date, LAG(date) OVER (PARTITION BY user_id ORDER BY timestamp), day) > 1
-- PostgreSQL
date - LAG(date) OVER (PARTITION BY user_id ORDER BY timestamp) > 1
Another possible solution is this nice trick: group streaks by substracting the rolling total of rows (by user) to each day. If a new day is added at every row, they will always be the same. Otherwise there will be an offset, that can be used as value for grouping streaks:
-- BigQuery
DATE_ADD(date, INTERVAL -RANK() OVER (PARTITION BY user_id ORDER BY date) DAY) AS streak_group
-- PostgreSQL
date - RANK() OVER (ORDER BY date) * INTERVAL '1 DAY'