Skip to main content

Querying Snowplow data

Basic queries​

You will typically find most of your Snowplow data in the events table. If you are using Redshift or Postgres, there will be extra tables for self-describing events and entities — see below.

Please refer to the structure of Snowplow data for the principles behind our approach, as well as the descriptions of the various standard columns.

Data models

Querying the events table directly can be useful for exploring your events or building custom analytics. However, for many common use cases it’s much easier to use our data models, which provide a pre-aggregated view of your data.

The simplest query could look like this:

SELECT * FROM <events>
WHERE event_name = 'page_view'

You will need to replace <events> with the appropriate location — the database, schema and table name will depend on your setup. See this first steps section for details.

caution

With large data volumes (read: any production system), you should always include a filter on the partition key (normally, collector_tstamp), for example:

WHERE ... AND collector_tstamp between timestamp '2023-10-23' and timestamp '2023-11-23'

This ensures that you read from the minimum number of (micro-)partitions necessary, making the query run much faster and reducing compute cost (where applicable).

Self-describing events​

Self-describing events can contain their own set of fields, defined by their schema.

For Redshift and Postgres users, self-describing events are not part of the standard events table. Instead, each type of event is in its own table. The table name and the fields in the table will be determined by the event’s schema. See how schemas translate to the warehouse for more details.

You can query just the table for that particular self-describing event, if that's all that's required for your analysis, or join that table back to the events table:

SELECT
...
FROM
<schema>.<events> ev
LEFT JOIN
<schema>.my_example_event_table sde
ON sde.root_id = ev.event_id AND sde.root_tstamp = ev.collector_tstamp
caution

You may need to take care of duplicate events.

Entities​

Entities (also known as contexts) provide extra information about the event, such as data describing a product or a user.

For Redshift and Postgres users, entities are not part of the standard events table. Instead, each type of entity is in its own table. The table name and the fields in the table will be determined by the entity’s schema. See how schemas translate to the warehouse for more details.

The entities can be joined back to the core events table by the following, which is a one-to-one join (for a single record entity) or a one-to-many join (for a multi-record entity), assuming no duplicates.

SELECT
...
FROM
<schema>.<events> ev
LEFT JOIN -- assumes no duplicates, and will return all events regardless of if they have this entity
<schema>.my_entity ent
ON ent.root_id = ev.event_id AND ent.root_tstamp = ev.collector_tstamp
caution

You may need to take care of duplicate events.

Failed events​

See Exploring failed events.

Dealing with duplicates​

In some cases, your data might contain duplicate events (full deduplication before the data lands in the warehouse is optionally available for Redshift, Snowflake and Databricks on AWS).

While our data models deal with duplicates for you, there may be cases where you need to de-duplicate the events table yourself.

In Redshift/Postgres you must first generate a ROW_NUMBER() on your events and use this to de-duplicate.

WITH unique_events AS (
SELECT
...
ROW_NUMBER() OVER (PARTITION BY a.event_id ORDER BY a.collector_tstamp) AS event_id_dedupe_index
FROM
<events> a
)

SELECT
...
FROM
unique_events
WHERE
event_id_dedupe_index = 1

Things get a little more complicated if you want to join your event data with a table containing entities.

Suppose your entity is called my_entity. If you know that each of your events has at most 1 such entity attached, the de-duplication requires the use of a row number over event_id to get each unique event:

WITH unique_events AS (
SELECT
ev.*,
ROW_NUMBER() OVER (PARTITION BY a.event_id ORDER BY a.collector_tstamp) AS event_id_dedupe_index
FROM
<schema>.<events> ev
),

unique_my_entity AS (
SELECT
ent.*,
ROW_NUMBER() OVER (PARTITION BY a.root_id ORDER BY a.root_tstamp) AS my_entity_index
FROM
<schema>.my_entity_1 ent
)

SELECT
...
FROM
unique_events u_ev
LEFT JOIN
unique_my_entity u_ent
ON u_ent.root_id = u_ev.event_id AND u_ent.root_tstamp = u_ev.collector_tstamp AND u_ent.my_entity_index = 1
WHERE
u_ev.event_id_dedupe_index = 1

If your events might have more than one my_entity attached, the logic is slightly more complex.

Details

First, de-duplicate the events table in the same way as above, but also keep track of the number of duplicates (see event_id_dedupe_count below). In the entity table, generate a row number per unique combination of all fields in the record. Then join on root_id and root_tstamp as before, but with an additional clause that the row number is a multiple of the number of duplicates, to support the 1-to-many join. This ensures all duplicates are removed while retaining all original records of the entity. This may look like a weird join condition, but it works.

Unfortunately, listing all fields manually can be quite tedious, but we have added support for this in the de-duplication logic of our dbt packages.

WITH unique_events AS (
SELECT
ev.*,
ROW_NUMBER() OVER (PARTITION BY a.event_id ORDER BY a.collector_tstamp) AS event_id_dedupe_index,
COUNT(*) OVER (PARTITION BY a.event_id) AS event_id_dedupe_count
FROM
<schema>.<events> ev
),

unique_my_entity AS (
SELECT
ent.*,
ROW_NUMBER() OVER (PARTITION BY a.root_id, a.root_tstamp, ... /*all columns listed here for your entity */ ORDER BY a.root_tstamp) AS my_entity_index
FROM
<schema>.my_entity_1 ent
)

SELECT
...
FROM
unique_events u_ev
LEFT JOIN
unique_my_entity u_ent
ON u_ent.root_id = u_ev.event_id AND u_ent.root_tstamp = u_ev.collector_tstamp AND mod(u_ent.my_entity_index, u_ev.event_id_dedupe_count) = 0
WHERE
u_ev.event_id_dedupe_index = 1