Custom models
The Snowplow packages are designed to be easily customized or extended within your own dbt project by building your own custom models. The standard models we provide (base, page/screen views, sessions and users) are not designed to be modified by you.
In general there are 3 types of custom models you may end up building:
- Drop+recompute/view models: these models build from the derived tables but require a Drop+recompute each time they are built. This could include things like daily aggregations or rolling views; see the sketch after this list.
- Incremental models: these models build from the existing (or custom) this run table(s) and are incrementally upserted on each run. This could include things like a custom sessions table with additional user information. These are the most common type of custom model.
- This run models: these models build from the events this run table, which contains all columns for the events being (re)processed in the run. These can be used to feed custom incremental models.
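As a minimal sketch of the first type, a daily aggregation over a derived table might look like the following; the derived table and column names are assumptions and will vary by package:

-- Illustrative Drop+recompute model: rebuilt in full as a table on each run
-- The derived table and column names below are assumptions; adjust to your package
{{ config(materialized='table') }}

select
    {{ dbt.date_trunc('day', 'start_tstamp') }} as session_date,
    count(*) as sessions
from {{ ref('snowplow_<package>_sessions') }}
group by 1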
Alternatively, if you wish to use only the events this run logic and build everything on top of it yourself, you may be better off looking at the using different sessionisation example for how to set up a new package from scratch.
The best way to learn is by doing, so check out the Examples section for specific custom models.
Guidelines & Best Practice
The standard models carry out the heavy lifting in establishing an incremental structure and providing the core logic for the most common aggregation use cases. They also allow custom models to be plugged in without impeding the maintenance of the standard models.
The following best practices should be followed to ensure that updates and bug fixes to the model can be rolled out with minimal complication:
- Custom models should not modify any of the tables generated by the Snowplow packages i.e. the scratch, derived or manifest tables.
- The logic for custom SQL should be idempotent and restart-safe, i.e. it should be written in such a way that a failure mid-way, or a re-run of the model, will not change the deterministic output.
- Redshift/Postgres users: be careful about joining multi-valued entities to any events table. Deduplication logic needs to be applied to avoid many-to-many joins. See the using multi-valued entities example for how to manage this.
In short, the standard models can be treated as the source code for a distinct piece of software, and custom models can be treated as self-maintained, additive plugins - in much the same way as a Java package may permit one to leverage public classes in their own API, and provide an entry point for custom programs to run, but will not permit one to modify the original API.
The this run and derived tables are considered part of the 'public' class of tables in this model structure, and so we can give assurances that non-breaking releases won't alter them. The other tables may be used in custom SQL, but their logic and structure may change from release to release, or they may be removed.
Custom model usage
An overview of the detail provided below is available in this summary table:
Model Type | Materialization | Requires tagging | Requires backfilling | Requires specific refreshing | Use is_run_with_new_events | Use snowplow_optimize |
---|---|---|---|---|---|---|
Drop+recompute/view | table / view | ❌ | ❌ | ❌ | ❌ | ❌ |
Incremental | incremental | ✅ | ✅ | ✅ | ✅ | ✅ |
This Run | table / view / ephemeral | ✅ | ❌ | ❌ | ❌ | ❌ |
Tagging Models
Any incremental custom model, or one that builds from the events this run table, requires tagging so that it can be tracked in the manifest table as part of the incremental sessionisation logic. If you do not tag the model correctly you may get an error, or your package may get stuck in State 2.
The model(s) should be tagged with snowplow_<package>_incremental; the easiest way to achieve this is to place any custom models of this type in a specific directory in your project and tag the whole folder in your project yaml:
models:
your_dbt_project:
snowplow_<package>_custom_models:
+tags: snowplow_<package>_incremental #Adds tag to all models in the 'snowplow_<package>_custom_models' directory
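Alternatively, an individual model can be tagged directly in its config block using dbt's standard tags config (a minimal sketch; the model file itself is hypothetical):

-- In the custom model's .sql file, apply the same tag via the config block
{{ config(tags=["snowplow_<package>_incremental"]) }}

select
    ...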
Note that because Drop+recompute models are not tagged, they will not be run when using the snowplow_<package> selector.
Using is_run_with_new_events
Incremental type custom models, i.e. those that reference a _this_run type table, should make use of the is_run_with_new_events macro to only process the table when new events are available in the current run.
select
...
from {{ ref('snowplow_<package>_<table>_this_run') }}
where {{ snowplow_utils.is_run_with_new_events('snowplow_<package>') }} --returns false if run doesn't contain new events.
Using snowplow_optimize
To make use of our efficient upsert you should add snowplow_optimize=true to your custom model config for incremental models, setting an appropriate upsert_date_key and unique_key.
{{
config(
materialized='incremental',
upsert_date_key='my_date_field',
unique_key='unique_id_field',
snowplow_optimize=true,
...
)
}}
select
...
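Putting these pieces together, a custom incremental model might look something like the sketch below; the this run table name, date key, unique key, and selected columns are assumptions that will vary by package:

-- Tagged and configured so the package can track and efficiently upsert this model
{{
    config(
        materialized='incremental',
        unique_key='session_identifier',
        upsert_date_key='start_tstamp',
        snowplow_optimize=true,
        tags=["snowplow_<package>_incremental"]
    )
}}

select
    session_identifier, -- assumed unique key
    start_tstamp        -- assumed upsert date key
from {{ ref('snowplow_<package>_<table>_this_run') }}
where {{ snowplow_utils.is_run_with_new_events('snowplow_<package>') }} -- only process when the run contains new events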
Backfilling Custom Models
See our page on backfilling.
Refreshing Custom Models
See our page on full or partial refreshes.
Retiring Custom Models
If you want to retire a custom model, you should delete the models from your project or disable them.
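For example, a model can be disabled without deleting its file by using dbt's standard enabled config (a minimal sketch):

-- Disables the model so dbt no longer builds it
{{ config(enabled=false) }}

select
    ...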
There is no need to remove the models from the snowplow_<package>_incremental_manifest table. The package identifies enabled models tagged with snowplow_<package>_incremental within your project and selects these models from the manifest in order to calculate the state of the package as described above.
Do NOT just use --exclude on the retired models in your production job. Currently the package is unable to identify which models are due to be executed in a given run. As a result, if you exclude a model the package will get stuck in State 3 and continue to attempt to sync your excluded models with the remaining models.
Tips for developing custom models
Most of these tips apply only to Incremental type custom models.
While developing custom models you may benefit from the following:
- Minimizing the amount of data being processed to reduce cost and run time.
- Using recent events from your events table to ensure you have all the latest contexts and event types available.
- BigQuery only: automatic handling of evolving schemas for custom contexts and unstructured events.
1. Reducing Costs
By setting snowplow__backfill_limit_days to 1 in your dbt_project.yml file you will only process a day's worth of data per run.
We have provided the get_value_by_target macro to dynamically switch the backfill limit depending on your environment, i.e. dev vs. prod, with your environment determined by your target name:
vars:
snowplow_<package>:
snowplow__backfill_limit_days: "{{ snowplow_utils.get_value_by_target(
dev_value=1,
default_value=30,
dev_target_name='dev') }}"
2. Using Recent Data
This can be achieved by setting snowplow__start_date to a recent date. If a custom model only models newly tracked events, you should set this to the date that data started to be produced rather than your initial pipeline date. To dynamically change the start date depending on your environment, you can use the following:
vars:
snowplow_<package>:
snowplow__start_date: "{{ snowplow_utils.get_value_by_target(
dev_value=snowplow_utils.n_timedeltas_ago(1, 'weeks'),
default_value='2020-01-01',
dev_target_name='dev') }}"
3. Handling of schema evolution
As your schemas for custom contexts and unstructured events evolve, multiple versions of the same column will be created in your events table, e.g. custom_context_1_0_0, custom_context_1_0_1. These columns contain nested fields, i.e. are of datatype RECORD. When modeling Snowplow data it can be useful to combine or coalesce each nested field across all versions of the column for a continuous view over time.
The snowplow-utils package provides the combine_column_versions macro, which will automatically coalesce the fields within every version of the specified column. This mitigates the need for you to update your models every time a new column version is created.
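As an illustration only (the argument names, source reference, and column prefix below are assumptions, so check the snowplow-utils docs for the exact signature), a call might look along these lines:

select
    -- Coalesces each nested field across all versions of the context column
    {{ snowplow_utils.combine_column_versions(
        relation=source('atomic', 'events'),
        column_prefix='custom_context_1'
    ) }}
from {{ source('atomic', 'events') }}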
Please refer to the snowplow-utils docs for the full documentation on these macros.