Compute sessions and streaks in SQL

Sessions

Let’s assume that we have a simple table events with two columns, user_id and timestamp, where each event writes a new row.

-- Show source data
SELECT *
FROM `training.events`
LIMIT 10
| user_id | timestamp                 |
|---------|---------------------------|
| 1       | 2020-10-02 15:36:02+00:00 |
| 1       | 2020-10-02 15:36:05+00:00 |
| 1       | 2020-10-02 15:36:06+00:00 |
| 1       | 2020-10-02 15:36:10+00:00 |
| 1       | 2020-10-02 15:36:14+00:00 |
| 2       | 2020-10-06 17:05:31+00:00 |
| 2       | 2020-10-06 17:06:10+00:00 |
| 2       | 2020-10-06 17:06:14+00:00 |
| 2       | 2020-10-06 17:07:35+00:00 |
| 2       | 2020-10-06 17:07:42+00:00 |

To aggregate events by users sessions, you first need to decide what time interval will mark the end of a session. In this example, we will set the gap at 120 seconds.

Defining a session can then be made in three steps:

  1. Get timestamp previous event, partitioned by user
  2. If the previous timestamp is Null or greater than 120 seconds, mark a new session (1) else not (0)
  3. Sum this column over each user, which will be this user session ID

-- Step 1: get timestamp of previous event
WITH sub1 AS (
    SELECT 
        user_id,
        timestamp,
        LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp) AS prev_timestamp
    FROM `training.events`
),

-- Step 2: calculate start of new session
-- Either first event of a user, or previous event was more than 120 seconds ago
sub2 AS (
    SELECT
        *,
        CASE 
            WHEN prev_timestamp IS NULL THEN 1
            WHEN TIMESTAMP_DIFF(timestamp, prev_timestamp, SECOND) > 120 THEN 1
            ELSE 0
        END AS is_new_session
    FROM sub1
)

-- Step 3: calculate ID of session
SELECT
    user_id,
    timestamp,
    SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY timestamp) AS user_session_id
FROM sub2
| user_id | timestamp                 | user_session_id |
|---------|---------------------------|-----------------|
| 366     | 2020-10-01 10:23:26+00:00 | 1               |
| 366     | 2020-10-01 10:23:28+00:00 | 1               |
| 366     | 2020-10-01 10:23:29+00:00 | 1               |
| 366     | 2020-10-01 10:23:31+00:00 | 1               |
| 366     | 2020-10-01 10:23:32+00:00 | 1               |
| ...     | ...                       | ...             |
| 475     | 2020-10-02 21:48:47+00:00 | 4               |
| 475     | 2020-10-02 21:48:51+00:00 | 4               |
| 475     | 2020-10-02 21:48:54+00:00 | 4               |
| 475     | 2020-10-02 21:48:57+00:00 | 4               |
| 475     | 2020-10-02 21:48:59+00:00 | 4               |

Streaks

Calculating date streaks (visits over consecutive days) is similar from sessions, but the interval between current and previous date should be calculated in days, not timestamps:

-- BigQuery
DATE_DIFF(date, LAG(date) OVER (PARTITION BY user_id ORDER BY timestamp), day) > 1

-- PostgreSQL
date - LAG(date) OVER (PARTITION BY user_id ORDER BY timestamp) > 1

Another possible solution is this nice trick: group streaks by substracting the rolling total of rows (by user) to each day. If a new day is added at every row, they will always be the same. Otherwise there will be an offset, that can be used as value for grouping streaks:

-- BigQuery
DATE_ADD(date, INTERVAL -RANK() OVER (PARTITION BY user_id ORDER BY date) DAY) AS streak_group

-- PostgreSQL
date - RANK() OVER (ORDER BY date) * INTERVAL '1 DAY'