Release Version · Actively Maintained · Snowplow Personal and Academic License

Snowplow Unified Digital Model

The package source code can be found in the snowplow/dbt-snowplow-unified repo, and the docs for the model design here.

The package contains a fully incremental model that transforms raw web and mobile event data generated by the Snowplow JavaScript tracker and the Snowplow iOS or Android trackers into a series of derived tables of varying levels of aggregation.

The Snowplow Unified Digital Model aggregates Snowplow's out-of-the-box page view, screen view, and page ping events to create a set of derived tables (views, sessions, and users) that contain many useful dimensions as well as calculated measures such as time engaged and scroll depth.

Figure: Unified Digital Model data flow

Overview

This model consists of a series of modules, each producing a table which serves as the input to the next module. The 'standard' modules are:

  • Base: Performs the incremental logic, outputting the table snowplow_unified_base_events_this_run, which contains a de-duplicated data set of all events required for the current run of the model. From this table it also generates the snowplow_unified_events_this_run table, which coalesces multiple fields to combine web and mobile fields coming from different contexts and self-describing events (SDEs).
  • Views: Aggregates event-level data to the page view / screen view level (view_id), outputting the table snowplow_unified_views.
  • Sessions: Aggregates event-level data to the session level (session_identifier), outputting the table snowplow_unified_sessions. Includes other events but requires at least one page_view, screen_view, or page_ping event in the session.
  • Users: Aggregates session-level data to the user level (user_identifier), outputting the table snowplow_unified_users.
  • User Mapping: Provides a mapping between the user identifiers user_identifier and user_id, outputting the table snowplow_unified_user_mapping. This can be used for session stitching.

Supported Entities

While using any entity in our packages is possible thanks to modeling entities, a large set of common web and mobile entities are built into the processing of the package to add to your derived tables. Note these are in addition to those required to run the package.

| Entity | Type | Enabled via Variable |
| --- | --- | --- |
| YAUAA | web | snowplow__enable_yauaa |
| IAB | web | snowplow__enable_iab |
| UA | web | snowplow__enable_ua |
| Browser | web | snowplow__enable_browser_context, snowplow__enable_browser_context_2 (depending on schema versions tracked; when both are enabled the values are coalesced) |
| Mobile | mobile | snowplow__enable_mobile_context |
| Geolocation | mobile | snowplow__enable_geolocation_context |
| Application | mobile | snowplow__enable_application_context |
| Screen | mobile | snowplow__enable_screen_context |
| Deep Links | mobile | snowplow__enable_deep_link_context |
| Screen Summary | mobile | snowplow__enable_screen_summary_context |

Optional Modules

| Module | Enabled via Variable |
| --- | --- |
| Consent Reporting | snowplow__enable_consent |
| Core Web Vitals | snowplow__enable_cwv |
| App Errors | snowplow__enable_app_errors |
| Conversions | snowplow__enable_conversions |

Engaged vs. Absolute Time

At the view and session levels we provide two measures of time: absolute, which is how long a user had the page open, and engaged, which is how much of that time the user was active on the page. Engaged time is often a strong predictor of a customer conversion, such as a purchase or a sign-up, whatever that may be in your domain.

Calculating absolute time is simple: it is the difference between the derived_tstamp of the first and last (page view or page ping) events within that page view or session.
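As a minimal sketch of the absolute time calculation described above (not the package's actual SQL), the following takes the derived_tstamp values for one page view or session and returns the span between the earliest and latest. The function name and the sample timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime


def absolute_time_in_s(derived_tstamps):
    """Seconds between the earliest and latest event timestamp.

    `derived_tstamps` is assumed to hold the derived_tstamp of every
    page view / page ping event in a single page view or session.
    """
    if not derived_tstamps:
        return 0  # assumption: no events means no measurable time
    return int((max(derived_tstamps) - min(derived_tstamps)).total_seconds())


# Hypothetical events for one page view: a view, then two pings,
# the second arriving after a long gap away from the tab.
events = [
    datetime(2024, 1, 1, 12, 0, 0),
    datetime(2024, 1, 1, 12, 0, 10),
    datetime(2024, 1, 1, 12, 10, 30),
]
absolute_s = absolute_time_in_s(events)
```

Note that the gap between the second and third event counts in full here, which is exactly what engaged time is designed to exclude.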

Web Calculation

The calculation for engaged time on web is more complicated. It is derived from page pings, which means that if the user isn't active on your content, the engaged time does not increase. Consider a single page view of someone reading an article: partway through, the reader sees something they don't understand, so they open a new tab to look it up. They land on a Wikipedia page, go down a rabbit hole, and 10 minutes later return to your site to finish the article. In this case there will be a 10-minute gap in the page pings in the events data.

To adjust for these gaps we calculate engaged time as the time to trigger each ping (your heartbeat) times the number of pings (ignoring the first one), and add to that the time delay to the first ping (your minimum visit length). The formula is:

$$engaged\_time = t_{heartbeat} \times (n_{distinct\_pings} - 1) + t_{min\_visit\_length}$$

The figure below illustrates this for a single page view.

Figure: Page views and pings showing gaps to highlight the difference between absolute and engaged time
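As a worked example of the engaged time formula above, here is a small sketch for a single page view. The heartbeat of 10 seconds and minimum visit length of 5 seconds are assumed values for illustration; use whatever your tracker is configured with. The function name is hypothetical, and returning 0 when there are no pings is an assumption rather than documented package behavior.

```python
def engaged_time_in_s(n_distinct_pings, heartbeat_s=10, min_visit_length_s=5):
    """Engaged time for one page view, per the formula above.

    heartbeat * (distinct pings - 1) + minimum visit length.
    The defaults are assumed values, not read from any configuration.
    """
    if n_distinct_pings <= 0:
        return 0  # assumption: no pings means no measured engagement
    return heartbeat_s * (n_distinct_pings - 1) + min_visit_length_s


# A reader who triggers 4 pings, leaves the tab for 10 minutes,
# then triggers 3 more: the gap adds nothing, only the 7 pings count.
engaged_s = engaged_time_in_s(4 + 3)
```

With these assumed settings the reader above accrues 10 × 6 + 5 = 65 seconds of engaged time, even though the absolute time for the same page view would be over 10 minutes.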

At a session level, this calculation is slightly more involved, as it needs to happen per page view and account for stray page pings, but the underlying idea is the same.

Mobile Calculation

For mobile, we use the screen_summary entity from the mobile trackers to calculate engaged time. Check out the mobile engagement demo for a live view of this.

Stray Page Pings (Web only)

Stray Page Pings are pings within a session that do not have a corresponding page_view event within the same session. The most common cause of these is someone returning to a tab after their session has timed out but not refreshing the page. The page_view event exists in some other session, but there is no guarantee that both these sessions will be processed in the same run, which could lead to different results. Depending on your site content and user behavior the prevalence of sessions with stray page pings could vary greatly. For example with long-form content we have seen around 10% of all sessions contain only stray page pings (i.e. no page_view events).
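To make the definition concrete, the sketch below flags stray pings in a small, made-up list of events. It assumes each event is a dict with event_name, session_identifier, and view_id keys; this event shape and the function name are assumptions for illustration, not the package's implementation.

```python
def find_stray_pings(events):
    """Return page_ping events with no matching page_view in the same session."""
    # Collect (session, view) pairs that have a page_view event.
    viewed = {
        (e["session_identifier"], e["view_id"])
        for e in events
        if e["event_name"] == "page_view"
    }
    return [
        e
        for e in events
        if e["event_name"] == "page_ping"
        and (e["session_identifier"], e["view_id"]) not in viewed
    ]


# Hypothetical data: the user loads a page in session s1, then returns to
# the same tab after the session has timed out, so pings land in s2.
events = [
    {"event_name": "page_view", "session_identifier": "s1", "view_id": "v1"},
    {"event_name": "page_ping", "session_identifier": "s1", "view_id": "v1"},
    {"event_name": "page_ping", "session_identifier": "s2", "view_id": "v1"},
]
stray = find_stray_pings(events)
```

Here only the ping in s2 is stray: its page_view exists, but in a different session.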

We take different approaches to adjust for these stray pings at the page view and sessions levels, which can lead to differences between the two tables, but each is as accurate as we can currently make it.

Sessions

As all our processing ensures full sessions are reprocessed, our sessions level table includes all stray page ping events, as well as all other view and ping events. We adjust the start time down based on your minimum visit length if the session starts with a page ping, and we include sessions that contain only (stray) pings. We also count page views based on the number of unique page_view_ids you have (from the web_page context) rather than using absolute page_view events to include these stray pings, and account for stray pings in the engaged time. Overall this is a more accurate view of a session and treats the stray pings as if they had a corresponding page_view event in the same session, even when they did not.

As a result, the values in the sessions table may not match what you would get by recalculating them directly from the page views table. This is because we discard stray pings during page view processing, as discussed below, so the values (views, engaged_time_in_s, and absolute_time_in_s) in the sessions table may be higher, but are more accurate at a session level.
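The session-level adjustments described above can be sketched as follows, for a single session that begins with stray pings. The event shape (event_name, view_id, and a tstamp_s field in seconds), the function names, and the 5-second minimum visit length are all assumptions for illustration.

```python
def session_view_count(session_events):
    """Count views by distinct view_id, so stray pings still count as a view."""
    return len({e["view_id"] for e in session_events})


def page_view_event_count(session_events):
    """Count only explicit page_view events, ignoring stray pings."""
    return sum(1 for e in session_events if e["event_name"] == "page_view")


def session_start_s(session_events, min_visit_length_s=5):
    """Shift the start time earlier if the session opens with a page ping."""
    first = min(session_events, key=lambda e: e["tstamp_s"])
    if first["event_name"] == "page_ping":
        return first["tstamp_s"] - min_visit_length_s
    return first["tstamp_s"]


# Hypothetical session: two stray pings for view v1, then a new page view v2.
session = [
    {"event_name": "page_ping", "view_id": "v1", "tstamp_s": 1000},
    {"event_name": "page_ping", "view_id": "v1", "tstamp_s": 1010},
    {"event_name": "page_view", "view_id": "v2", "tstamp_s": 1020},
]
views_in_session = session_view_count(session)
views_from_page_view_events = page_view_event_count(session)
start_s = session_start_s(session)
```

In this example the session-level count is 2 views while counting page_view events alone gives 1, which is the kind of difference you may see between the sessions and page views tables.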

Figure: Stray page ping sessionisation

Page Views

For page views, because we cannot guarantee the sessions with the page_view event and all subsequent page_ping events are processed within the same run, we choose to discard all stray page pings. Without doing this it could be possible that you would get different results from different run configurations.

Without enforcing within-session view

Figure: Stray page ping page views

With enforcing within-session view

Figure: Stray page ping page views

Info: Currently we do not process these discarded stray page pings in any way, meaning that engaged time and scroll depth in these cases may understate the true value. Due to session-level reprocessing this remains a complicated issue to resolve, but please let us know if you would like to help solve this!