Skip to main content

Understanding the structure of Snowplow data

Overview​

In order to analyze Snowplow data, it is important to understand how it is structured. We have tried to make the structure of Snowplow data as simple, logical, and easy-to-query as possible.

Each line represents one event​

Each line in the Snowplow events table represents a single event, be that a page view, add to basket, play video, like etc.

Structured data​

Snowplow data is structured: individual fields are stored in their own columns, making writing sophisticated queries on the data easy, and making it straightforward for analysts to plugin any kind of analysis tool into their Snowplow data to compose and execute queries.

Extensible schema​

Snowplow started life as a web analytics data warehousing platform, and has a basic schema suitable for performing web analytics, with a wide range of web-specific dimensions (related to page URLs, browsers, operating systems, devices, IP addresses, cookie IDs) and web-specfic events (page views, page pings, transactions). All of these fields can be found in the atomic.events table, which is a "fat" (many columns) table.

As Snowplow has evolved into a general purpose event analytics platform, we've enabled Snowplow users to define additional event types (we call these custom unstructured events) and define their own entities (we call these custom contexts) so that they can extend the schema to suit their own businesses.

note

Currently, custom events and custom contexts are not available in BDP Cloud.

For Snowplow users running Amazon Redshift, each custom unstructured event and custom context will be stored in its own dedicated table, again with one line per event. These additional tables can be joined back to the core atomic.events table, by joining on th e root_id field in the custom unstructured event / custom context table with the event_id in the atomic.events table.

Single table​

All the events are effectively stored in a single table, making running queries across the data very easy. Even if you're running Snowplow with Redshift and have extended the schema as described above, you can still query the data as if it were in a single fat table. This is because:

  • The joins from the additional tables to the core atomic.events table are one-to-one.
  • The field joined on is the distribution key for both tables, so queries are as fast as if the data were in a single table.

Immutable log​

The Snowplow data table is designed to be immutable: the data in each line should not change over time. Data points that we would expect to change over time (e.g. what cohort a particular user belongs to, how we classify a particular visitor) can be derived from Snowplow data. However, our recommendation is that these derived fields should be defined and calculated at analysis time, stored in a separate table and joined to the Snowplow events table when performing any analysis.

Understanding the individual fields​

Common fields (platform and event independent)​

Application fields​

FieldTypeDescriptionReqd?Example
app_idtextApplication IDYes'angry-birds'
platformtextPlatformYes'web'

The application ID is used to distinguish different applications that are being tracked by the same Snowplow stack, e.g. production versus dev.

The platform ID is used to distinguish the same app running on different platforms, e.g. iOS vs web.

Date / time fields​

FieldTypeDescriptionReqd?Example
collector_tstamptimestampTimestamp for the event recorded by the collectorYes'2013-11-26 00:02:05'
dvce_created_tstamptimestampTimestamp for the event recorded on the client deviceNo'2013-11-26 00:03:57.885'
dvce_sent_tstamptimestampWhen the event was actually sent by the client deviceNo'2013-11-26 00:03:58.032'
etl_tstamptimestampTimestamp for when the event was validated and enriched. Note: the name is historical and does not mean that the event is loaded at this point (this is further downstream).No'2017-01-26 00:01:25.292'
os_timezone (not available in BDP Cloud)textClient operating system timezoneNo'Europe/London'
derived_tstamptimestampTimestamp making allowance for innaccurate device clockNo'2013-11-26 00:02:04'
true_tstamptimestampUser-set "true timestamp" for the eventNo'2013-11-26 00:02:04'

Event / transaction fields​

FieldTypeDescriptionReqd?Example
eventtextThe type of event recordedYes'page_view'
event_idtextA UUID for each eventYes'c6ef3124-b53a-4b13-a233-0088f79dcbcb'
txn_idintTransaction ID set client-side, used to de-dupe recordsNo421828
event_fingerprinttextHash client-set event fieldsNoAADCE520E20C2899F4CED228A79A3083

A complete list of event types is given here.

Snowplow version fields​

FieldTypeDescriptionReqd?Example
v_trackertextTracker versionYes'js-3.0.0'
v_collectortextCollector versionYes'ssc-2.1.0-kinesis'
v_etltextETL versionYes'snowplow-micro-1.1.0-common-1.4.2'
name_trackertextTracker namespaceNo'sp1'
etl_tags (not available in BDP Cloud)textJSON of tags for this ETL runNo"['prod']"

Some Snowplow Trackers allow the user to name each specific Tracker instance. name_tracker corresponds to this name, and can be used to distinguish which tracker generated which events.

FieldTypeDescriptionReqd?Example
user_idtextUnique ID set by businessNo'jon.doe@email.com'
domain_useridtextUser ID set by Snowplow using 1st party cookieNo'bc2e92ec6c204a14'
network_useridtextUser ID set by Snowplow using 3rd party cookieNo'ecdff4d0-9175-40ac-a8bb-325c49733607'
user_ipaddresstextUser IP addressNo'92.231.54.234'
domain_sessionidxintA visit / session indexNo3
domain_sessionidtextA visit / session identifierNo'c6ef3124-b53a-4b13-a233-0088f79dcbcb'

domain_sessionidx is the number of the current user session. For example, an event occurring during a user's first session would have domain_sessionidx set to 1. The JavaScript Tracker calculates this field by storing a visit count in a first-party cookie. Whenever the Tracker fires an event, if more than 30 minutes have elapsed since the last event, the visitor count is increased by 1. (Whenever an event is fired, a "session cookie" is created and set to expire in 30 minutes. This is how the Tracker can tell whether the visit count should be incremented.) Thirty minutes is the default value and can be changed using the setSessionCookieTimeout method.

Device and operating system fields​

FieldTypeDescriptionReqd?Example
useragenttextRaw useragentYes
dvce_type (not available in BDP Cloud)textType of deviceNo'Computer'
dvce_ismobile (not available in BDP Cloud)booleanIs the device mobile?No1
dvce_screenheightintScreen height in pixelsNo1024
dvce_screenwidthintScreen width in pixelsNo1900
os_name (not available in BDP Cloud)textName of operating systemNo'Android'
os_family (not available in BDP Cloud)textOperating system familyNo'Linux'
os_manufacturer (not available in BDP Cloud)textCompany responsible for OSNo'Apple'

Location fields​

FieldTypeDescriptionReqd?Example
geo_countrytextISO 3166-1 code for the country the visitor is located inNo'GB', 'US'
geo_regiontextISO-3166-2 code for country region the visitor is inNo'I9', 'TX'
geo_citytextCity the visitor is inNo'New York', 'London'
geo_zipcodetextPostcode the visitor is inNo'94109'
geo_latitudetextVisitor location latitudeNo37.443604
geo_longitudetextVisitor location longitudeNo-122.4124
geo_region_nametextVisitor region nameNo'Florida'
geo_timezonetextVisitor timezone nameNo'Europe/London'

IP address-based fields​

FieldTypeDescriptionReqd?Example
ip_isptextVisitor's ISPNo'FDN Communications'
ip_organizationtextOrganization associated with the visitor's IP address - defaults to ISP name if none is foundNo'Bouygues Telecom'
ip_domaintextSecond level domain name associated with the visitor's IP addressNo'nuvox.net'
ip_netspeedtextVisitor's connection typeNo'Cable/DSL'

Metadata fields​

Fields containing information about the event type.

FieldTypeDescriptionReqd?Example
event_vendortextWho defined the eventNo'com.acme'
event_nametextEvent nameNo'link_click'
event_formattextFormat for eventNo'jsonschema'
event_versiontextVersion of event schemaNo'1-0-2'

Platform-specific fields​

Web-specific fields​

FieldTypeDescriptionReqd?Example
Page fields
page_urltextThe page URLYes'http://www.example.com'
page_urlschemetextScheme aka protocolYes'https'
page_urlhosttextHost aka domainYes'β€œwww.snowplowanalytics.com'
page_urlportintPort if specified, 80 if not80
page_urlpathtextPath to pageNo'/product/index.html'
page_urlquerytextQuerystringNo'id=GTM-DLRG'
page_urlfragmenttextFragment aka anchorNo'4-conclusion'
page_referrertextURL of the referrerNo'http://www.referrer.com'
page_titletextWeb page titleNo'Snowplow Docs - Understanding the structure of Snowplow data'
refr_urlschemetextReferer schemeNo'http'
refr_urlhosttextReferer hostNo'www.bing.com'
refr_urlportintReferer portNo80
refr_urlpathtextReferer page pathNo'/images/search'
refr_urlquerytextReferer URL querystringNo'q=psychic+oracle+cards'
refr_urlfragmenttextReferer URL fragmentNo
refr_mediumtextType of refererNo'search', 'internal'
refr_sourcetextName of referer if recognisedNo'Bing images'
refr_termtextKeywords if source is a search engineNo'psychic oracle cards'
refr_domain_useridtextThe Snowplow domain_userid of the referring websiteNo'bc2e92ec6c204a14'
refr_dvce_tstamptimestampThe time of attaching the domain_userid to the inbound linkNo'2013-11-26 00:02:05'
Document fields
doc_charsettextThe page’s character encodingNo, 'UTF-8'
doc_widthintThe page's width in pixelsNo1024
doc_heightintThe page's height in pixelsNo3000
Marketing / traffic source fields
mkt_mediumtextType of traffic sourceNo'cpc', 'affiliate', 'organic', 'social'
mkt_sourcetextThe company / website where the traffic came fromNo'Google', 'Facebook'
mkt_termtextAny keywords associated with the referrerNo'new age tarot decks'
mkt_contenttextThe content of the ad. (Or an ID so that it can be looked up.)No13894723
mkt_campaigntextThe campaign IDNo'diageo-123'
mkt_clickidtextThe click IDNo'ac3d8e459'
mkt_networktextThe ad network to which the click ID belongsNo'DoubleClick'
Browser fields
user_fingerprint (not available in BDP Cloud)intA user fingerprint generated by looking at the individual browser featuresNo2161814971
connection_type (not available in BDP Cloud)textType of internet connectionNoNo
cookie (not available in BDP Cloud)booleanDoes the browser support persistent cookies?No1
br_name (not available in BDP Cloud)textBrowser nameNo'Firefox 12'
br_version (not available in BDP Cloud)textBrowser versionNo'12.0'
br_family (not available in BDP Cloud)textBrowser familyNo'Firefox'
br_type (not available in BDP Cloud)textBrowser typeNo'Browser'
br_renderengine (not available in BDP Cloud)textBrowser rendering engineNo'GECKO'
br_langtextLanguage the browser is set toNo'en-GB'
br_features_pdf (not available in BDP Cloud)booleanWhether the browser recognizes PDFsNo1
br_features_flash (not available in BDP Cloud)booleanWhether Flash is installedNo1
br_features_java (not available in BDP Cloud)booleanWhether Java is installedNo1
br_features_director (not available in BDP Cloud)booleanWhether Adobe Shockwave is installedNo1
br_features_quicktime (not available in BDP Cloud)booleanWhether QuickTime is installedNo1
br_features_realplayer (not available in BDP Cloud)booleanWhether RealPlayer is installedNo1
br_features_windowsmedia (not available in BDP Cloud)booleanWhether mplayer2 is installedNo1
br_features_gears (not available in BDP Cloud)booleanWhether Google Gears is installedNo1
br_features_silverlight (not available in BDP Cloud)booleanWhether Microsoft Silverlight is installedNo1
br_cookiesbooleanWhether cookies are enabledNo1
br_colordepthintBit depth of the browser color paletteNo24
br_viewheightintViewport heightNo1000
br_viewwidthintViewport widthNo1000

Event-specific fields​

Snowplow includes specific fields to capture data associated with specific events.

Note that to date, all event types have been defined by Snowplow. Also note that event_vendor values follow the Java package naming convention.

Snowplow currently supports (or will support in the near future) the following event types:

Event typeValue of event field in model
Page views'page_view'
Page pings'page_ping'
Ecommerce transactions'transaction' and 'transaction_item'
Custom structured events'struct'
Custom unstructured events'unstruct'

Details of which fields are available for which events are given below.

Page views​

There are currently no fields that are specific to page_view events: all the fields that are required are part of the standard fields available for any web-based event e.g. page_urlscheme, page_title.

Page pings​

There are four additional fields included with page pings that indicate how a user has scrolled over a web page since the last page ping:

FieldTypeDescriptionReqd?Example
pp_xoffset_minintegerMinimum page x offset seen in the last ping periodNo0
pp_xoffset_maxintegerMaximum page x offset seen in the last ping periodNo100
pp_yoffset_minintegerMinimum page y offset seen in the last ping periodNo0
pp_yoffset_maxintegerMaximum page y offset seen in the last ping periodNo200

Ecommerce transactions​

There are a large number of fields specifically for transaction events.

Fields that start tr_ relate to the transaction as a whole. Fields that start ti_ refer to the specific item included in the transaction. (E.g. a product in the basket.) Single transactions typically span multiple lines of data: there will be a single line where event = transaction, where the tr_ fields are set, and multiple lines (one for each product included) where event = transaction_item and the ti_ fields are set.

FieldTypeDescriptionReqd?Example
tr_orderidtextOrder IDYes'#134'
tr_affiliationtextTransaction affiliation (e.g. store where sale took place)No'web'
tr_totaldecimalTotal transaction valueYes12.99
tr_taxdecimalTotal tax included in transaction valueNo3.00
tr_shippingdecimalDelivery cost chargedNo0.00
tr_total_base*decimalTotal in base currencyNo12.99
tr_tax_base*decimalTotal tax in base currencyNo3.00
tr_shipping_base*decimalDelivery cost in base currencyNo0.00
tr_citytextDelivery address, cityNo'London'
tr_statetextDelivery address, stateNo'Washington'
tr_countrytextDelivery address, countryNo'France'
tr_currencytextCurrencyNo'USD'
ti_orderidtextOrder IDYes'#134'
ti_skutextProduct SKUYes'pbz00123'
ti_nametextProduct nameNo'Cone pendulum'
ti_categorytextProduct categoryNo'New Age'
ti_pricedecimalProduct unit priceYes9.99
ti_price_base*decimalPrice in base currencyNo9.99
ti_quantityintegerNumber of product in transactionYes2
ti_currencytextCurrencyNo'EUR'
base_currency*textReporting currencyNo'GBP'

* Set exclusively by the Currency conversion enrichment.

Custom structured events​

note

Currently, custom events are not available in BDP Cloud.

If you wish to track an event that Snowplow does not recognise as a first class citizen (i.e. one of the events listed above), then you can track them using the generic 'custom structured events'. There are five fields that are available to store data related to custom events:

FieldTypeDescriptionReqd?Example
se_categorytextCategory of eventYes'ecomm', 'video'
se_actiontextAction performed / event nameYes'add-to-basket', 'play-video'
se_labeltextThe object of the action e.g. the ID of the video played or SKU of the product added-to-basketNo'pbz00123'
se_propertytextA property associated with the object of the actionNo'HD', 'large'
se_valuedecimalA value associated with the event / action e.g. the value of goods added-to-basketNo9.99

Custom unstructured events​

note

Currently, custom events are not available in BDP Cloud.

Custom unstructured events are a flexible tool that enable Snowplow users to define their own event types and send them into Snowplow.

When a user sends in a custom unstructured event, they do so as a JSON of name-value properties that conforms to a JSON schema defined for the event earlier.

The unstructured event is not part of the atomic.events table; instead, for users running on Redshift, it is shredded into its own table. The fields in this table will be determined by the JSON schema defined for the event in advance. Users can query just the table for that particular unstructured event, if that's all that's required for their analysis, or join that table back to the atomic.events table by

atomic.my_example_unstructured_event_table.root_id = atomic.events.event_id

Contexts​

Contexts enable Snowplow users to define their own entities that are related to events, and fields that are related to each of those entities. For example, an online retailer may choose to define a user context, to store information about a particular user, which might include data points like the users Facebook ID, age, membership details etc. In addition, they may also define a product context, with product data e.g. SKU, name, created date, description, tags etc.

An event can have any number of custom contexts attached. Each context is passed into Snowplow as a JSON. Additionally, the Snowplow Enrichment process can derive additional contexts.

Contexts are not part of the atomic.events table; instead, for users running on Redshift, Snowplow will shred each context JSON into a dedicated table in the atomic schema, making it much more efficient for analysts to query data passed in in any one of the contexts. Those contexts can be joined back to the core atomic.events table on atomic.my_custom_context_table.root_id = atomic.events.event_id, which is a one-to-one join or a many-to-one join.

Specific unstructured events and custom contexts​

These are also a variety of unstructured events and custom contexts defined by Snowplow. You can find their schemas here.

A note about storage data formats​

Currently, Snowplow data can be stored in S3, Google Cloud Storage (GCS), Redshift, BigQuery, Snowflake, Databricks and PostgreSQL. There are some differences between the structure of data in both formats. These relate to data structures that BigQuery, Snowflake and PostgreSQL support while Redshift does not (e.g. JSON/Record columns). Nevertheless, the structure of both is similar: representing a β€œfat” table.