Adding Fields to a Derived Table
You may often wish to add fields to the derived tables in our packages; perhaps there is an additional dimension you need for your analysis, or an extra calculation you wish to perform. In most cases this is relatively straightforward, and even in the most complex case you don't need to alter the logic of the derived table itself.
Option 1: Passthrough fields
Where possible, if the field you want to add is attached to the original event for that table (e.g. the page/screen view for views) then you can make use of the passthrough fields feature in our packages.
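For example, in the unified package session-level passthrough fields can be configured as a variable in your project YAML. The sketch below assumes the snowplow__session_passthroughs variable from snowplow-unified; check your package's documentation for the exact variable name for your table, as it varies by package and table.

```yml
# dbt_project.yml — a sketch, assuming the snowplow-unified package.
# Passthrough variables accept plain event field names, or a dict with
# a SQL expression and an alias.
vars:
  snowplow_unified:
    snowplow__session_passthroughs:
      - v_collector                  # field taken straight from the event
      - sql: "lower(app_id)"         # or a SQL expression...
        alias: app_id_lower          # ...with an alias for the output column
```

These fields are then carried through onto the derived table without any changes to the package's own models.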
Option 2: Custom aggregations
If you require aggregation at the level of the table (e.g. session identifier for sessions) you can make use of the custom aggregations feature in our packages that support it.
Option 3: Custom derived table
If none of the above options are suitable, the best approach is to build a custom version of our derived table that reads from the existing this run table and adds your additional fields.
For Redshift, if you need any additional self-describing event or entity fields in the events this run table, check out the modeling entities page for how to add these.
As an example, let's say we want to add a field to the sessions table in the unified package that, for simplicity, is just an app_type calculation based on the app_id.
Disable the derived model in the package
The first step is to disable the derived model in the package itself. You do this in your top-level project YAML:
...
models:
  snowplow_unified:
    sessions:
      snowplow_unified_sessions:
        +enabled: false
Add a new derived model
Next we add a new model to replace the one we just disabled; the easiest approach is to copy the contents of the disabled model. While you could name this model anything, keeping the original name is easiest for downstream use cases:
{{
config(
tags=['snowplow_unified_incremental', 'derived'],
...
)
}}
select *
{% if target.type in ['databricks', 'spark'] -%}
, DATE(start_tstamp) as start_tstamp_date
{%- endif %}
from {{ ref('snowplow_unified_sessions_this_run') }}
where {{ snowplow_utils.is_run_with_new_events('snowplow_unified') }} --returns false if run doesn't contain new events.
Modify the new model as needed
{{
config(
...
)
}}
select *
, case when app_id like '%_test' then 'test' else 'prod' end as app_type
{% if target.type in ['databricks', 'spark'] -%}
, DATE(start_tstamp) as start_tstamp_date
{%- endif %}
from {{ ref('snowplow_unified_sessions_this_run') }}
where {{ snowplow_utils.is_run_with_new_events('snowplow_unified') }} --returns false if run doesn't contain new events.
If you are using BigQuery, you should look at the combine column versions macro we provide to automatically combine minor versions of your schemas in the model.
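As a rough sketch of how that macro is used (argument names here are illustrative, and the column prefix is a hypothetical self-describing event column; consult the snowplow-utils documentation for the full signature), the macro returns a list of coalesced field expressions that you join into your select:

```sql
-- A sketch only: combining minor schema versions of a self-describing
-- event column on BigQuery. The column prefix below is hypothetical.
select
    {{ snowplow_utils.combine_column_versions(
        relation=ref('snowplow_unified_sessions_this_run'),
        column_prefix='unstruct_event_com_mycompany_my_event_1'
    ) | join(',\n    ') }}
from {{ ref('snowplow_unified_sessions_this_run') }}
```

This way a new minor version of your schema (e.g. 1-0-1 alongside 1-0-0) does not require a manual change to the model.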
(Optional) Backfill the model
Follow the steps to backfill models if you want this field to be populated historically.