
Full calculation for run timestamps

tip

The following information is provided for educational purposes only; there should never be a need to alter this logic, as the package manages it all for you.

Timestamps used in our packages

When calculating the timestamps required for our incremental sessionization logic, by default we use collector_tstamp; historically this was the field the events table was most likely to be partitioned on. Nowadays load_tstamp is increasingly common, and users may have changed the partition field of their table. To accommodate this, and to make use of more efficient filtering, you can alter the field we use to identify new events via the snowplow__session_timestamp variable. Note that where possible this should match your partition key; the choice will impact the calculation when looking for which events to process.
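As an illustration, here is a minimal sketch of how this variable might feed into the new-event filter; the variable name comes from the package, but the surrounding SQL is simplified and not the package's exact code:

{# Simplified sketch only; the package's macros generate the real filter. #}
{% set session_timestamp = var('snowplow__session_timestamp', 'collector_tstamp') %}

select *
from {{ source('atomic', 'events') }}
-- Filtering on the same field as your partition key lets the warehouse prune partitions.
-- lower_limit and upper_limit come from the run limit calculation described below.
where {{ session_timestamp }} >= {{ lower_limit }}
  and {{ session_timestamp }} <= {{ upper_limit }}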

Calculation for run timestamps

The details provided above cover how we calculate the first step of the date range to process on a run, based on the state of models in the package and any new data. This is not the full story, as in a given run the date range of the events available in the events this run table will differ from this range. What follows are the full details of how this is calculated and processed.

Step 1: Calculate Base New Event Limits

The first step is to identify the model limits for the run. This is accomplished by calling the get_incremental_manifest_status macro to find the min and max last success in the manifest, and then using the get_run_limits macro to identify which state the package is in and calculate the timestamp range for this run, as detailed above.

This range is then printed to the console and stored in the snowplow_<package_name>_base_new_event_limits table for later use.
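Downstream models can then read this range back; a minimal sketch, assuming the column names mirror the lower and upper limits used in the filters below:

-- Sketch only; the column names are assumed to match the lower/upper limits shown below.
select
    lower_limit,
    upper_limit
from {{ ref('snowplow_<package_name>_base_new_event_limits') }}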

Step 2: Update sessions lifecycle manifest

Next, we process the sessions with new events into the sessions lifecycle manifest table. Here we apply a number of filters to the events table before calculating the identifiers and the minimum and maximum snowplow__session_timestamp for each session.

The filters that are applied are:

Filter to events within the limits

To minimize the table scan and ensure consistent processing of data, we filter to events with a timestamp bounded by the limits defined in Step 1.

and {{ session_timestamp }} >= {{ lower_limit }}
and {{ session_timestamp }} <= {{ upper_limit }}

If snowplow__derived_tstamp_partitioned is set to true and the package is running on BigQuery, this filter is also applied to the derived_tstamp field.
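In that case the additional filter is along these lines (a sketch; the exact bounds the package generates may differ slightly):

and derived_tstamp >= {{ lower_limit }}
and derived_tstamp <= {{ upper_limit }}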

Filter late arriving data

To avoid scanning too far back in the events table for a session, and to avoid recalculating sessions far in the past, we filter out late arriving data based on the device created and sent timestamps. How late these events can be sent is defined by the snowplow__days_late_allowed variable, which has a default of 3.

and dvce_sent_tstamp <= {{ snowplow_utils.timestamp_add('day', days_late_allowed, 'dvce_created_tstamp') }}

Filter app IDs

Based on the snowplow__app_id variable, we filter to only the app IDs provided.

and {{ snowplow_utils.app_id_filter(app_ids) }}

Filter out quarantined sessions

We also exclude any session identifiers listed in the quarantine manifest table to avoid processing long-running sessions. For a session to be quarantined, it must have events spanning longer than the value of your snowplow__max_session_days variable.

where session_identifier is not null
and not exists (select 1 from {{ ref(quarantined_sessions) }} as a where a.session_identifier = e.session_identifier)

Once this is all calculated, we merge these records with the existing manifest, taking the least of the start timestamps and the greatest of the end timestamps. In the case of a session running over the snowplow__max_session_days limit, the end timestamp is recalculated based on this value instead.
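A simplified sketch of that merge, assuming illustrative CTE and column names (the package's actual SQL is generated by its macros):

-- Sketch only; new_sessions and existing_manifest are illustrative names.
with merged as (
    select
        s.session_identifier,
        least(coalesce(m.start_tstamp, s.start_tstamp), s.start_tstamp) as start_tstamp,
        greatest(coalesce(m.end_tstamp, s.end_tstamp), s.end_tstamp) as end_tstamp
    from new_sessions as s
    left join existing_manifest as m
        on m.session_identifier = s.session_identifier
)

select
    session_identifier,
    start_tstamp,
    -- Sessions exceeding snowplow__max_session_days have their end timestamp capped
    least(end_tstamp, {{ snowplow_utils.timestamp_add('day', var('snowplow__max_session_days'), 'start_tstamp') }}) as end_tstamp
from merged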

Step 3: Building sessions this run

We next identify which sessions need to be included in the current run, based on our event run limits and the start timestamp of each session. Using the return_base_new_event_limits macro we get the upper and lower limits as calculated in Step 1, but also the lower limit minus snowplow__max_session_days, which gives the earliest a session could start and still have events included in this run (the session start limit).

We then include in the run any session that:

  • Starts after the session start limit
  • Starts before or at the upper limit for the run
  • Ends after the lower limit for the run

The base sessions this run table then contains any sessions that satisfy these conditions from the lifecycle manifest, but with an end_tstamp capped at the upper limit.
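Putting these conditions together, a minimal sketch of the selection (the lifecycle manifest model name and columns here are illustrative):

-- Sketch only; model and column names are illustrative.
select
    session_identifier,
    start_tstamp,
    -- Sessions still open at the upper limit are capped so a later run can process the remainder
    least(end_tstamp, {{ upper_limit }}) as end_tstamp
from {{ ref('snowplow_<package_name>_base_sessions_lifecycle_manifest') }}
where start_tstamp >= {{ session_start_limit }}  -- lower limit minus snowplow__max_session_days
  and start_tstamp <= {{ upper_limit }}
  and end_tstamp >= {{ lower_limit }}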

Step 4: Build events this run

Finally, to build the events this run table we query the events table and filter to events that satisfy the following conditions:

  • Session identifier is within the base sessions this run table created in Step 3
  • Timestamp is between the min start and max end of sessions in the base sessions this run table created in Step 3
  • Timestamp is greater than or equal to the start timestamp for that specific session identifier in base sessions this run table created in Step 3
  • (Optional) App ID is in the list provided

This is then deduplicated, and other transformations are applied, such as adding any entities, SDEs, or custom SQL.
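For illustration, a simplified sketch of those conditions (the model, column, and field names here are illustrative, not the package's exact SQL):

-- Sketch only; session_identifier stands in for whichever session field the package uses
-- (e.g. domain_sessionid), and session_timestamp for the snowplow__session_timestamp field.
select e.*
from {{ source('atomic', 'events') }} as e
inner join {{ ref('snowplow_<package_name>_base_sessions_this_run') }} as s
    on e.session_identifier = s.session_identifier
    -- Only events at or after that specific session's start
    and e.{{ session_timestamp }} >= s.start_tstamp
where e.{{ session_timestamp }} >= (select min(start_tstamp) from {{ ref('snowplow_<package_name>_base_sessions_this_run') }})
  and e.{{ session_timestamp }} <= (select max(end_tstamp) from {{ ref('snowplow_<package_name>_base_sessions_this_run') }})
  -- Optional app ID filter
  and {{ snowplow_utils.app_id_filter(var('snowplow__app_id', [])) }}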

tip

The result of all of this is that the range of timestamps in the events this run table will be larger than the range printed in the log and stored in the base new event limits table, but this is required to ensure we correctly process complete sessions and are able to handle late arriving data.
