Clear History

Model your pipeline data

At this stage you should:

  • Have tracking and enrichment set-up
  • Have some data in the ATOMIC.EVENTS table
  • Enabled IAB, UA parser and YAUAA enrichments
  • Have a working dbt project with the web model configurations for the sample data

Step 1: Complete refresh of your Snowplow web package (Optional)

If you would like to use your current dbt environment that you set-up during modeling the sample data you might want to start from scratch.

While you can drop and recompute the incremental tables within this package using the standard --full-refresh flag, all manifest tables are protected from being dropped in production. Without dropping the manifest during a full refresh, the selected derived incremental tables would be dropped but the processing of events would resume from where the package left off (as captured by the snowplow_web_incremental_manifest table) rather than your snowplow__start_date.

In order to drop all the manifest tables and start again set the snowplow__allow_refresh variable to true at run time:

dbt run --select snowplow_web tag:snowplow_web_incremental --full-refresh --vars 'snowplow__allow_refresh: true'
# or using selector flag
dbt run --selector snowplow_web --full-refresh --vars 'snowplow__allow_refresh: true'

Step 2: Modify variables

Assuming that you followed the guide on how to run the data model on the sample data, here we will only highlight the differences in the set-up:

  • Remove the snowplow__events variable. This time the base table will be the default atomic.events, therefore no need to overwrite it.

  • Change the snowplow__start_date variable according to the data you have in your events table.

Optional:

  • snowplow__backfill_limit_days: The maximum number of days of new data to be processed since the latest event processed. Set it to 1.

We suggest changing snowplow__backfill_limit_days to 1 whilst working in your dev environment initially so that you can test how your incremental runs work. You will only have a few days of data available at this stage and if you leave it at the default 30 days, you will model all your data in one go.

Step 3: Run the model

Execute the following either through your CLI or from within dbt Cloud

dbt run --selector snowplow_web

Note there is no need to re-seed the data unless you change the schema names or dropped the tables in your warehouse

Depending on the period of data available since the snowplow__start_date and the snowplow__backfill_limit_days variable you might not process all your data during your first run. Each time the model runs it should display the period it processes and the timestamp of the last event processed for each model within the package. This gets saved in the snowplow__incremental_manifest table so you can always check the data processing state (see below).

manifest

Step 4: Run dbt test

Run our recommended selector specified tests to identify potential issues with the data:

dbt test --selector snowplow_web_lean_tests