Skip to main content

Manifest Tables

Each of our packages has a set of manifest tables that manage the Incremental Sessionization Logic logic of our package, as well as quarantining long running sessions.

danger

These manifest tables are critical to the package and as such are protected from full refreshes, i.e. being dropped, when running in production by default. In development refreshes are enabled.

The allow_refresh() macro defines the protection behavior. As dbt recommends, target names are used here to differentiate between your prod and dev environment. By default, this macro assumes your dev target is named dev. This can be changed by setting the snowplow__dev_target_name var in your dbt_project.yml file.

To full refresh any of the manifest models in production as part of a --full-refresh, set the snowplow__allow_refresh to true at run time.

Alternatively, you can amend the behavior of this macro entirely by overwriting it. See the Overwriting Macros section for more details.

Incremental Manifestโ€‹

The majority of our packages have an incremental manifest table; by default this is in your _snowplow_manifest suffixed schema, and will have the name snowplow_<package_name>_incremental_manifest. This table exists to track the state of each of the models in the package, including any custom models that have been tagged.

This table has 2 columns:

  • model: The name of the model
  • last_success: The timestamp of the max last event processed by the model, based on the field defined in the snowplow__session_timestamp variable

As this is the source of truth for processing for the package, it is highly recommended to never alter or edit this table directly as this can lead to unexpected outcomes. The manifest table is updated as part of an on-run-end hook, which calls the snowplow_incremental_post_hook() macro.

Sessions Lifecycle Manifestโ€‹

The majority of our packages have a session lifecycle manifest table; by default this is in your _snowplow_manifest suffixed schema, and will have the name snowplow_<package_name>_base_sessions_lifecycle_manifest. This table exists to track the start and end timestamps of any given session, allowing us to avoid a full table scan each run.

This table has 4 columns:

  • session_identifier: The unique identifier for the session
  • user_identifier: The unique user identifier for the session. Note that if multiple user identifiers exist for a given session we only take one.
  • start_tstamp: The timestamp of the first event to process for the session (note due to late arriving data, this may not actually be the true first event in rare cases)
  • end_tstamp: The timestamp of the last event to process for the session (note this may be capped with the snowplow__max_session_days variable)

Timestamps based on the field defined in the snowplow__session_timestamp variable. This table is used to calculate which sessions are to be processed in a run, and also helps inform the run limits.

Quarantine Tableโ€‹

Many of our packages have a quarantine table; by default this is in your _snowplow_manifest suffixed schema, and will have the name snowplow_<package_name>_base_quarantined_sessions. This table exists to track sessions that have gone beyond the snowplow__max_session_days limit and avoid re-processing them again.

This table has 1 column:

  • session_identifier: The unique identifier for the session to not process in any further run.

This table is updated on the post-hook of the base sessions this run table. If there are any additional sessions you identify you want to remove from processing, you can manually add them to this table but ensure they do not already exist.