Normalize Quickstart
Requirements
In addition to dbt being installed:
- Python 3.7 or later
Installation
Make sure to create a new dbt project and import this package via the packages.yml file, as recommended by dbt, or add it to an existing top-level project. Do not fork the packages themselves.
Check dbt Hub for the latest installation instructions, or read the dbt docs for more information on installing packages. If you are using multiple packages, you may need to up- or downgrade a specific package to ensure compatibility.
packages:
  - package: snowplow/snowplow_normalize
    version: 0.4.0
Make sure to run the dbt deps command after updating your packages.yml to ensure you have the specified version of each package installed in your project.
Setup
1. Override the dispatch order in your project
To take advantage of the optimized upsert that the Snowplow packages offer, you need to ensure that certain macros are called from snowplow_utils first, before dbt-core. This can be achieved by adding the following to the top level of your dbt_project.yml file:
dispatch:
  - macro_namespace: dbt
    search_order: ['snowplow_utils', 'dbt']
If you do not do this, the package will still work, but the incremental upserts will become more costly over time.
2. Adding the selectors.yml file
Within the packages we have provided a suite of suggested selectors to run and test the models within the package together with the normalize model. This leverages dbt's selector flag. You can find out more about each selector in the YAML Selectors section.
These are defined in the selectors.yml file (source) within the package; however, in order to use these selections you will need to copy this file into your own dbt project directory. This is a top-level file and therefore should sit alongside your dbt_project.yml file. If you are using multiple packages in your project, you will need to combine the contents of these files into a single selectors.yml.
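For orientation, here is a minimal sketch of the shape such a selector takes. It is illustrative only; copy the actual selectors.yml from the package source rather than writing your own:
selectors:
  - name: snowplow_normalize
    description: Illustrative sketch only - runs the models in the snowplow_normalize package
    definition:
      method: package
      value: snowplow_normalize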
3. Check source data
This package will by default assume your Snowplow events data is contained in the atomic schema of your target.database. In order to change this, please add the following to your dbt_project.yml file:
vars:
  snowplow_normalize:
    snowplow__atomic_schema: schema_with_snowplow_events
    snowplow__database: database_with_snowplow_events
Please note that your target.database is NULL if you are using Databricks. In Databricks, schemas and databases are used interchangeably, so the dbt implementation for Databricks always uses the schema value; adjust your snowplow__atomic_schema value if you need to.
4. Filter your data set
You can specify both the start_date at which to start processing events and the app_ids to filter for. By default, the start_date is set to 2020-01-01 and all app_ids are selected. To change this, please add the following to your dbt_project.yml file:
vars:
  snowplow_normalize:
    snowplow__start_date: 'yyyy-mm-dd'
    snowplow__app_id: ['my_app_1','my_app_2']
If you have events you are going to normalize with no value for the dvce_sent_tstamp field, you need to disable the days late filter by setting the snowplow__days_late_allowed variable to -1; otherwise these events will not be processed.
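As a sketch, combining the filters above with the days-late override might look like the following (the date and app_id values are placeholders):
vars:
  snowplow_normalize:
    snowplow__start_date: '2023-01-01' # placeholder start date
    snowplow__app_id: ['my_app_1'] # placeholder app_id filter
    snowplow__days_late_allowed: -1 # disables the days late filter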
5. Install additional Python packages
The script only requires two additional packages (jsonschema and requests) that are not built into Python by default. You can install these by running the command below, or by your preferred method.
pip install -r dbt_packages/snowplow_normalize/utils/requirements.txt
6. Setup the generator configuration file
You can use the example provided in utils/example_normalize_config.json as a starting point for your configuration file, which specifies the events, self-describing events, and contexts you wish to include in each table. For more information on this file, see the normalize package docs.
7. Setup your resolver connection file (optional)
If you are not using Iglu Central as your only Iglu registry, you will need to set up an Iglu resolver file and point to it in your generator config.
8. Generate your models
At the root of your dbt project, running python dbt_packages/snowplow_normalize/utils/snowplow_normalize_model_gen.py path/to/your/config.json will generate all models specified in your configuration.
9. Additional vendor specific configuration
BigQuery only: Verify which column your events table is partitioned on. It will likely be partitioned on collector_tstamp or derived_tstamp. If it is partitioned on collector_tstamp, you should set snowplow__derived_tstamp_partitioned to false. This will ensure only the collector_tstamp column is used for partition pruning when querying the events table:
vars:
  snowplow_normalize:
    snowplow__derived_tstamp_partitioned: false
Databricks only: Add the following variable to your dbt project's dbt_project.yml file:
vars:
  snowplow_normalize:
    snowplow__databricks_catalog: 'hive_metastore'
Depending on your use case, this should either be the catalog (for Unity Catalog users on the Databricks connector 1.1.1 and later; it defaults to 'hive_metastore') or the same value as your snowplow__atomic_schema (unless changed, this is 'atomic'). This is needed to handle the database property within models/base/src_base.yml.
A more detailed explanation for how to set up your Databricks configuration properly can be found in Unity Catalog support.
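For example, a Unity Catalog user might point the variable at their catalog instead; the catalog name below is a placeholder:
vars:
  snowplow_normalize:
    snowplow__databricks_catalog: 'my_unity_catalog' # placeholder catalog name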
10. Change the default partition timestamp (optional)
The package uses a configurable partition timestamp column, controlled by the snowplow__partition_tstamp variable:
vars:
  snowplow__partition_tstamp: "collector_tstamp" # Default value, any change should be a timestamp
The purpose of this variable is to adjust the partitioning of the derived tables to use a different timestamp (e.g., derived_tstamp) that is more suitable for analytics in the next layer.
If you change snowplow__partition_tstamp to a different column (e.g., "loader_tstamp"), you MUST ensure that this column is included in the event_columns list in your normalize configuration for each event. Failing to do so will cause the models to fail, as the partition column must be present in the normalized output.
Example configuration when using a custom partition timestamp:
{
  "events": [
    {
      "event_names": ["page_view"],
      "event_columns": [
        "domain_userid",
        "loader_tstamp", // Must include your custom partition timestamp here
        "app_id"
      ],
      // ... rest of configuration
    }
  ]
}
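To pair with the JSON example above, the corresponding entry in your dbt_project.yml would set the variable to the same column (a sketch, using the illustrative loader_tstamp choice from the example):
vars:
  snowplow__partition_tstamp: "loader_tstamp" # must also appear in event_columns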
11. Run your model(s)
You can now run your models for the first time by running the command below (see the operation page for more information on operating the package):
dbt run --selector snowplow_normalize