Understanding the structure of Snowplow data
Overview
In order to analyze Snowplow data, it is important to understand how it is structured. We have tried to make the structure of Snowplow data as simple, logical, and easy-to-query as possible.
Each line represents one event
Each line in the Snowplow events table represents a single event, be that a page view, add to basket, play video, like etc.
Structured data
Snowplow data is structured: individual fields are stored in their own columns, making writing sophisticated queries on the data easy, and making it straightforward for analysts to plugin any kind of analysis tool into their Snowplow data to compose and execute queries.
Extensible schema
Snowplow started life as a web analytics data warehousing platform, and has a basic schema suitable for performing web analytics, with a wide range of web-specific dimensions (related to page URLs, browsers, operating systems, devices, IP addresses, cookie IDs) and web-specific events (page views, page pings, transactions). All of these fields can be found in the atomic.events
table, which is a "fat" (many columns) table.
As Snowplow has evolved into a general purpose event analytics platform, we've enabled Snowplow users to define additional event types (we call these self describing events) and define their own entities (we call these custom entities) so that they can extend the schema to suit their own businesses.
For Snowplow users running Amazon Redshift, each type of self-describing event and each type of entity will be stored in their own dedicated tables. These additional tables can be joined back to the core atomic.events
table, by joining on the root_id
field in the self-describing event / entity table with the event_id
in the atomic.events
table, and the root_tstamp
and collector_tstamp
field in the respective tables. For users on other warehouses these will be additional columns in the atomic.events
table.
Single table
All the events are effectively stored in a single table, making running queries across the data very easy. Even if you're running Snowplow with Redshift and have extended the schema as described above, you can still query the data as if it were in a single fat table. This is because:
- The joins from the additional tables to the core
atomic.events
table are one-to-one. - The field joined on is the distribution key for both tables, so queries are as fast as if the data were in a single table.
Immutable log
The Snowplow data table is designed to be immutable: the data in each line should not change over time. Data points that we would expect to change over time (e.g. what cohort a particular user belongs to, how we classify a particular visitor) can be derived from Snowplow data. However, our recommendation is that these derived fields should be defined and calculated at analysis time, stored in a separate table and joined to the Snowplow events table when performing any analysis.
Out-of-the-box fields
The sections below go over the standard fields found in all Snowplow events grouped into categories. Some are platform independent (=comes out-of-the-box for both web and mobile trackers) and some will only get populated for a specific platform as indicated in the tables below ( ✅ | ❌ ).
As we are moving away from the wide table structure and shift the canonical fields into specific, tematically grouped entities (contexts, self-describing events) in places we will include a link to the specific tracking documentation so that you can find more information of the out-of-the-box options.
We also provide a Source
for each field, however sometimes a field may have multiple sources e.g. a value originally set by a tracker but is overwritten by a later enrichment; in this case we have tended to classify this as the earliest source and adding more details where useful.
Common fields (platform and event independent)
Web-specific fields
Event-specific fields (some are platform dependent)
Common fields (platform and event independent)
Application fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
app_id | text | Application ID | Yes | 'angry-birds' | Tracking | ✅ | ✅ |
platform | text | Platform, limited to specific values | Yes | 'web' | Tracking | ✅ | ✅ |
The application ID is used to distinguish different applications that are being tracked by the same Snowplow stack, e.g. production
versus dev
.
The platform ID is used to distinguish the same app running on different platforms, e.g. iOS
vs web
.
Date / time fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
collector_tstamp | timestamp | Timestamp for the event recorded by the collector | Yes | '2013-11-26 00:02:05' | Pipeline | ✅ | ✅ |
dvce_created_tstamp | timestamp | Timestamp for the event recorded on the client device | No | '2013-11-26 00:03:57.885' | Tracking | ✅ | ✅ |
dvce_sent_tstamp | timestamp | When the event was actually sent by the client device | No | '2013-11-26 00:03:58.032' | Tracking | ✅ | ✅ |
etl_tstamp | timestamp | Timestamp for when the event was validated and enriched. Note: the name is historical and does not mean that the event is loaded at this point (this is further downstream). | No | '2017-01-26 00:01:25.292' | Pipeline | ✅ | ✅ |
os_timezone | text | Client operating system timezone | No | 'Europe/London' | Tracking: Timezone Plugin | ✅ | ✅ |
derived_tstamp | timestamp | Timestamp making allowance for inaccurate device clock | No | '2013-11-26 00:02:04' | Default Enrichment | ✅ | ✅ |
true_tstamp | timestamp | User-set "true timestamp" for the event | No | '2013-11-26 00:02:04' | Tracking | ✅ | ✅ |
load_tstamp | timestamp | Timestamp for when the data was loaded into the warehouse, best choice for incremental processing | No | '2013-11-26 00:02:04' | Pipeline | ✅ | ✅ |
Event / transaction fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
event | text | The type of event recorded | Yes | 'page_view' | Tracking | ✅ | ✅ |
event_id | text | A UUID for each event | Yes | 'c6ef3124-b53a-4b13-a233-0088f79dcbcb' | Tracking (or enrichment if empty) | ✅ | ✅ |
txn_id | int | Transaction ID set client-side, used to de-dupe records | No | 421828 | Tracking (Deprecated) | ✅ | ✅ |
event_fingerprint | text | Hash client-set event fields, used to de-dupe records | No | AADCE520E20C2899F4CED228A79A3083 | Event Fingerprint Enrichment | ✅ | ✅ |
A complete list of event types is given here.
Snowplow version fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
v_tracker | text | Tracker version | Yes | 'js-3.0.0' | Tracking | ✅ | ✅ |
v_collector | text | Collector version | Yes | 'ssc-2.1.0-kinesis' | Pipeline | ✅ | ✅ |
v_etl | text | ETL version | Yes | 'snowplow-micro-1.1.0-common-1.4.2' | Default Enrichment | ✅ | ✅ |
name_tracker | text | Tracker namespace | No | 'sp1' | Tracking | ✅ | ✅ |
Some Snowplow Trackers allow the user to name each specific Tracker instance. name_tracker
corresponds to this name, and can be used to distinguish which tracker generated which events.
User-related fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
user_id | text | Unique ID set by business | No | 'c94f860b-1266-4dad-ae57-3a36a414a521' | Tracking | ✅ | ✅ |
domain_userid | text | User ID set by Snowplow using 1st party client-set cookie, (only used for Web events!) | No | '4b0dfa75-9a8c-46a1-9691-01add9db4200' | Tracking | ✅ | ❌ |
network_userid | text | User ID set by Snowplow using server-set cookie, which may be 1st or 3rd party, depending on collector configuration. | No | 'ecdff4d0-9175-40ac-a8bb-325c49733607' | Tracking or Pipeline | ✅ | ✅ |
user_ipaddress | text | User IP address, can be overwritten with the IP Anonymization Enrichment | No | '92.231.54.234' | Tracking or Pipeline | ✅ | ✅ |
domain_sessionidx | int | A visit / session index | No | 3 | Tracking | ✅ | ❌ |
domain_sessionid | text | A visit / session identifier | No | 'c6ef3124-b53a-4b13-a233-0088f79dcbcb' | Tracking | ✅ | ❌ |
domain_sessionidx
is the number of the current user session. For example, an event occurring during a user's first session would have domain_sessionidx
set to 1. The JavaScript Tracker calculates this field by storing a visit count in a first-party cookie. Whenever the Tracker fires an event, if more than 30 minutes have elapsed since the last event, the visitor count is increased by 1. (Whenever an event is fired, a "session cookie" is created and set to expire in 30 minutes. This is how the Tracker can tell whether the visit count should be incremented.) Thirty minutes is the default value and can be changed using the sessionCookieTimeout
configuration option in the tracker.
It is worth to call out here that the domain_userid
is regarded as the most reliable session based identifier for most use cases, it is therefore treated as the primary user_identifier
field in our data models that rely on sessionisation (e.g. the Unified data model). These are only populated for web events, the mobile equivalent is embedded into an ootb entity called Session Context Entity which are then extracted (session__user_id
) then coalesced with the web based domain_userid by default and being referred to as user_identifier
within the data modeling package.
Device and operating system fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
useragent | text | Raw useragent | No | Tracking or Pipeline | ✅ | ✅ | |
dvce_type | text | Type of device | No | 'Computer' | Deprecated | ✅ | ✅ |
dvce_ismobile | boolean | Is the device mobile? | No | 1 | Deprecated | ✅ | ✅ |
dvce_screenheight | int | Screen height in pixels | No | 1024 | Tracking | ✅ | ✅ |
dvce_screenwidth | int | Screen width in pixels | No | 1900 | Tracking | ✅ | ✅ |
os_name | text | Name of operating system | No | 'Android' | Deprecated | ✅ | ✅ |
os_family | text | Operating system family | No | 'Linux' | Deprecated | ✅ | ✅ |
os_manufacturer | text | Company responsible for OS | No | 'Apple' | Deprecated | ✅ | ✅ |
For more information on this topic please check out the relevant Tracking Documentation.
Location fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
geo_country | text | ISO 3166-1 code for the country the visitor is located in | No | 'GB', 'US' | IP Enrichment | ✅ | ✅ |
geo_region | text | ISO-3166-2 code for country region the visitor is in | No | 'I9', 'TX' | IP Enrichment | ✅ | ✅ |
geo_city | text | City the visitor is in | No | 'New York', 'London' | IP Enrichment | ✅ | ✅ |
geo_zipcode | text | Postcode the visitor is in | No | '94109' | IP Enrichment | ✅ | ✅ |
geo_latitude | text | Visitor location latitude | No | 37.443604 | IP Enrichment | ✅ | ✅ |
geo_longitude | text | Visitor location longitude | No | -122.4124 | IP Enrichment | ✅ | ✅ |
geo_region_name | text | Visitor region name | No | 'Florida' | IP Enrichment | ✅ | ✅ |
geo_timezone | text | Visitor timezone name | No | 'Europe/London' | IP Enrichment | ✅ | ✅ |
For more information on this topic please check out the relevant Tracking Documentation.
IP address-based fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
ip_isp | text | Visitor's ISP | No | 'FDN Communications' | IP Enrichment | ✅ | ✅ |
ip_organization | text | Organization associated with the visitor's IP address - defaults to ISP name if none is found | No | 'Bouygues Telecom' | IP Enrichment | ✅ | ✅ |
ip_domain | text | Second level domain name associated with the visitor's IP address | No | 'nuvox.net' | IP Enrichment | ✅ | ✅ |
ip_netspeed | text | Visitor's connection type | No | 'Cable/DSL' | IP Enrichment | ✅ | ✅ |
Metadata fields
Fields containing information about the event type.
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
event_vendor | text | Who defined the event | Yes | 'com.acme' | Default Enrich (event schema) | ✅ | ✅ |
event_name | text | Event name | Yes | 'link_click' | Default Enrich (event schema) | ✅ | ✅ |
event_format | text | Format for event | Yes | 'jsonschema' | Default Enrich (event schema) | ✅ | ✅ |
event_version | text | Version of event schema | Yes | '1-0-2' | Default Enrich (event schema) | ✅ | ✅ |
Marketing / traffic source fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
mkt_medium | text | Type of traffic source | No | 'cpc', 'affiliate', 'organic', 'social' | Campaign Attribution Enrichment | ✅ | ✅ |
mkt_source | text | The company / website where the traffic came from | No | 'Google', 'Facebook' | Campaign Attribution Enrichment | ✅ | ✅ |
mkt_term | text | Any keywords associated with the referrer | No | 'new age tarot decks' | Campaign Attribution Enrichment | ✅ | ✅ |
mkt_content | text | The content of the ad. (Or an ID so that it can be looked up.) | No | 13894723 | Campaign Attribution Enrichment | ✅ | ✅ |
mkt_campaign | text | The campaign ID | No | 'diageo-123' | Campaign Attribution Enrichment | ✅ | ✅ |
mkt_clickid | text | The click ID | No | 'ac3d8e459' | Campaign Attribution Enrichment | ✅ | ✅ |
mkt_network | text | The ad network to which the click ID belongs | No | 'DoubleClick' | Campaign Attribution Enrichment | ✅ | ✅ |
Platform-specific fields
Web-specific fields
Page Fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
page_url | text | The page URL | No | 'http://www.example.com' | Tracking | ✅ | ❌ |
page_urlscheme | text | Scheme aka protocol | No | 'https' | Default Enrichment (url) | ✅ | ❌ |
page_urlhost | text | Host aka domain | No | '“www.snowplowanalytics.com' | Default Enrichment (url) | ✅ | ❌ |
page_urlport | int | Port if specified, scheme dependent if not (443 for https, 80 for http) | No | 80 | Default Enrichment (url) | ✅ | ❌ |
page_urlpath | text | Path to page | No | '/product/index.html' | Default Enrichment (url) | ✅ | ❌ |
page_urlquery | text | Querystring | No | 'id=GTM-DLRG' | Default Enrichment (url) | ✅ | ❌ |
page_urlfragment | text | Fragment aka anchor | No | '4-conclusion' | Default Enrichment (url) | ✅ | ❌ |
page_referrer | text | URL of the referrer | No | 'http://www.referrer.com' | Tracking | ✅ | ❌ |
page_title | text | Web page title | No | 'Snowplow Docs - Understanding the structure of Snowplow data' | Tracking | ✅ | ❌ |
refr_urlscheme | text | Referrer scheme | No | 'http' | Default Enrichment (referer) | ✅ | ❌ |
refr_urlhost | text | Referrer host | No | 'www.bing.com' | Default Enrichment (referer) | ✅ | ❌ |
refr_urlport | int | Referrer port | No | 80 | Default Enrichment (referer) | ✅ | ❌ |
refr_urlpath | text | Referrer page path | No | '/images/search' | Default Enrichment (referer) | ✅ | ❌ |
refr_urlquery | text | Referrer URL querystring | No | 'q=psychic+oracle+cards' | Default Enrichment (referer) | ✅ | ❌ |
refr_urlfragment | text | Referrer URL fragment | No | Default Enrichment (referer) | ✅ | ❌ | |
refr_medium | text | Type of referrer | No | 'search', 'internal' | Referrer Parser Enrichment | ✅ | ❌ |
refr_source | text | Name of referrer if recognized | No | 'Bing images' | Referrer Parser Enrichment | ✅ | ❌ |
refr_term | text | Keywords if source is a search engine | No | 'psychic oracle cards' | Referrer Parser Enrichment | ✅ | ❌ |
refr_domain_userid | text | The Snowplow domain_userid of the referring website | No | 'bc2e92ec6c204a14' | Default Enrichment (cross_domain) | ✅ | ❌ |
refr_dvce_tstamp | timestamp | The time of attaching the domain_userid to the inbound link | No | '2013-11-26 00:02:05' | Default Enrichment (cross_domain ) | ✅ | ❌ |
Document fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
doc_charset | text | The page’s character encoding | No | , 'UTF-8' | Tracking | ✅ | ❌ |
doc_width | int | The page's width in pixels | No | 1024 | Tracking | ✅ | ❌ |
doc_height | int | The page's height in pixels | No | 3000 | Tracking | ✅ | ❌ |
Browser fields
Field | Type | Description | Reqd? | Example | Source | Web | Mobile |
---|---|---|---|---|---|---|---|
user_fingerprint | int | A user fingerprint generated by looking at the individual browser features | No | 2161814971 | Tracking (Deprecated) | ❌ | ❌ |
br_name | text | Browser name | No | 'Firefox 12' | Deprecated | ❌ | ❌ |
br_version | text | Browser version | No | '12.0' | Deprecated | ❌ | ❌ |
br_family | text | Browser family | No | 'Firefox' | Deprecated | ❌ | ❌ |
br_type | text | Browser type | No | 'Browser' | Deprecated | ❌ | ❌ |
br_renderengine | text | Browser rendering engine | No | 'GECKO' | Deprecated | ❌ | ❌ |
br_lang | text | Language the browser is set to | No | 'en-GB' | Tracking | ✅ | ❌ |
br_features_pdf | boolean | Whether the browser recognizes PDFs | No | 1 | Deprecated | ✅ | ❌ |
br_features_flash | boolean | Whether Flash is installed | No | 1 | Deprecated | ✅ | ❌ |
br_features_java | boolean | Whether Java is installed | No | 1 | Deprecated | ✅ | ❌ |
br_features_director | boolean | Whether Adobe Shockwave is installed | No | 1 | Deprecated | ✅ | ❌ |
br_features_quicktime | boolean | Whether QuickTime is installed | No | 1 | Deprecated | ✅ | ❌ |
br_features_realplayer | boolean | Whether RealPlayer is installed | No | 1 | Deprecated | ✅ | ❌ |
br_features_windowsmedia | boolean | Whether mplayer2 is installed | No | 1 | Deprecated | ✅ | ❌ |
br_features_gears | boolean | Whether Google Gears is installed | No | 1 | Deprecated | ✅ | ❌ |
br_features_silverlight | boolean | Whether Microsoft Silverlight is installed | No | 1 | Deprecated | ✅ | ❌ |
br_cookies | boolean | Whether cookies are enabled | No | 1 | Tracking | ✅ | ❌ |
br_colordepth | int | Bit depth of the browser color palette | No | 24 | Tracking | ✅ | ❌ |
br_viewheight | int | Viewport height | No | 1000 | Tracking | ✅ | ❌ |
br_viewwidth | int | Viewport width | No | 1000 | Tracking | ✅ | ❌ |
For more information on this topic please check out the relevant Tracking Documentation.
Event-specific fields
Snowplow includes specific fields to capture data associated with specific events.
Note that to date, all event types have been defined by Snowplow. Also note that event_vendor
values follow the Java package naming convention.
Snowplow currently supports the following event types:
Event type | Value of event field in model |
---|---|
Page views | 'page_view' |
Page pings | 'page_ping' |
E-commerce transactions | 'transaction' and 'transaction_item' |
Structured events | 'struct' |
Self-describing events | 'unstruct' (for legacy reasons) |
Details of which fields are available for which events are given below. In some cases these events will store values in the atomic.events
columns, which are listed below.
Page views
There are currently no fields that are specific to page_view
events: all the fields that are required are part of the standard fields available for any web-based event e.g. page_urlscheme
, page_title
.
Page pings
There are four additional fields included with page pings that indicate how a user has scrolled over a web page since the last page ping:
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
pp_xoffset_min | integer | Minimum page x offset seen in the last ping period | No | 10 |
pp_xoffset_max | integer | Maximum page x offset seen in the last ping period | No | 100 |
pp_yoffset_min | integer | Minimum page y offset seen in the last ping period | No | 5 |
pp_yoffset_max | integer | Maximum page y offset seen in the last ping period | No | 200 |
E-commerce transactions
There are a large number of fields specifically for transaction events.
Fields that start tr_
relate to the transaction as a whole. Fields that start ti_
refer to the specific item included in the transaction. (E.g. a product in the basket.) Single transactions typically span multiple lines of data: there will be a single line where event
= transaction
, where the tr_
fields are set, and multiple lines (one for each product included) where event
= transaction_item
and the ti_
fields are set.
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
tr_orderid | text | Order ID | Yes | '#134' |
tr_affiliation | text | Transaction affiliation (e.g. store where sale took place) | No | 'web' |
tr_total | decimal | Total transaction value | Yes | 12.99 |
tr_tax | decimal | Total tax included in transaction value | No | 3.00 |
tr_shipping | decimal | Delivery cost charged | No | 0.00 |
tr_total_base * | decimal | Total in base currency | No | 12.99 |
tr_tax_base * | decimal | Total tax in base currency | No | 3.00 |
tr_shipping_base * | decimal | Delivery cost in base currency | No | 0.00 |
tr_city | text | Delivery address, city | No | 'London' |
tr_state | text | Delivery address, state | No | 'Washington' |
tr_country | text | Delivery address, country | No | 'France' |
tr_currency | text | Currency | No | 'USD' |
ti_orderid | text | Order ID | Yes | '#134' |
ti_sku | text | Product SKU | Yes | 'pbz00123' |
ti_name | text | Product name | No | 'Cone pendulum' |
ti_category | text | Product category | No | 'New Age' |
ti_price | decimal | Product unit price | Yes | 9.99 |
ti_price_base * | decimal | Price in base currency | No | 9.99 |
ti_quantity | integer | Number of product in transaction | Yes | 2 |
ti_currency | text | Currency | No | 'EUR' |
base_currency * | text | Reporting currency | No | 'GBP' |
* Set exclusively by the Currency conversion enrichment.
For more information on this topic please check out the relevant Tracking Documentation.
Structured events
Structured events allow you to send your own custom data, as long as it fits in the following 5 fields:
Field | Type | Description | Reqd? | Example |
---|---|---|---|---|
se_category | text | Category of event | Yes* | 'ecomm', 'video' |
se_action | text | Action performed / event name | Yes* | 'add-to-basket', 'play-video' |
se_label | text | The object of the action e.g. the ID of the video played or SKU of the product added-to-basket | No | 'pbz00123' |
se_property | text | A property associated with the object of the action | No | 'HD', 'large' |
se_value | decimal | A value associated with the event / action e.g. the value of goods added-to-basket | No | 9.99 |
* These fields are only required for struct
events.
For more information on this topic please check out the relevant Tracking Documentation.
Self-describing events
Self-describing events can contain their own set of fields, defined by their schema.
For each type of self-describing event, there will be a dedicated column (or table, in case of Redshift and Postgres) that holds the event-specific fields.
See querying data for more details on the structure and how to query it in different warehouses. You might also want to check how schema definitions translate to the warehouse.
For more information on this topic please check out the relevant Tracking Documentation.
Entities
Entities (also known as contexts) provide extra information about the event, such as data describing a product or a user.
For each type of entity, there will be a dedicated column (or table, in case of Redshift and Postgres) that holds entity-specific fields. Note that an event can have any number of entities attached, including multiple entities of the same type. For this reason, the data inside the entity columns is an array.
See querying data for more details on the structure and how to query it in different warehouses. You might also want to check how schema definitions translate to the warehouse.
For more information on this topic please check out the relevant Tracking Documentation.
Out-of-the-box self-describing events and entities
These are also a variety of self-describing events and entities defined by Snowplow. You can find their schemas here.