Skip to main content

Running custom dbt models via Snowplow BDP

info
This documentation only applies to Snowplow BDP. See the feature comparison page for more information about the different Snowplow offerings.

Overview

If you are a Snowplow BDP customer, you can get started with configuring and deploying dbt projects as outlined in the steps below. For more information about setting up your dbt project you can look at the Snowplow dbt docs.

As an initial overview, in your snowplow-pipeline repository, your data models reside in the dbt directory. To start with, your GitHub repository will look like this (you may have additional folders based on your project e.g. dbt_packages or docs):

.
├── datamodeling
| ├── datamodel-name
| └── dbt
| ├── analyses
| ├── logs
| ├── macros
| └── models
| └── my_custom_datamodel
| ├── model_name_1.sql
| └── model_name_2.sql
| ├── seeds
| ├── snapshots
| ├── tests
| ├── README.md
| ├── dbt_project.yml
| └── profiles.yml

Inside the datamodeling directory you can create the data models as subdirectories, giving them an appropriate name. The allowable characters for naming are a-z,0-9,-. Under this you must create a subdirectory named dbt

Each dbt subdirectory contains your dbt project folders, with the dbt_project.yml file at the top level of this folder. For further information about the structure of a dbt project see the dbt docs.

info

When the schedule kicks off, the data model configuration is loaded and validated. A corresponding data modeling DAG is autogenerated accordingly. This DAG will only contain 2 steps, a git clone cache and a dbt run command, dbt manages all model dependencies within the run command.

Read below for more details on the steps to configure and run your dbt data models with Snowplow BDP.

1. Setup your dbt profile

You need to provide your prod connection profile for the warehouse you are connecting to in the profiles.yml file for each datamodel. Ensure that your profile and target are set to prod. See the dbt adapters docs for more specific configuration information for each database.

See dbt connection setup for more information.

profile_name:
outputs:
prod:
type: redshift
threads: [1 or more]
host: [hostname.region.redshift.amazonaws.com]
port: 5439
user: datamodeling
pass: "{{ env_var('REDSHIFT_DATAMODELING_PASSWORD') }}"
dbname: [database name]
schema: [schema name]
target: prod
info

The warehouse password should be sent by secure form from the Snowplow BDP Console in order to set the environment variables.

Required Privileges

In addition to the standard privileges required by dbt, our packages by default write to additional schemas beyond just your profile schema. If your connected user does not have create schema privileges, you will need to ensure that the following schemas exist in your warehouse and the user can create tables in them:

  • <profile_schema>_derived
  • <profile_schema>_scratch
  • <profile_schema>_snowplow_manifest

Alternatively, you can override the output schemas our models write to, see the relevant package configuration page for how to do this.

Please refer to the Official Guide on setting up permissions.

2. The data modeling configuration

Data models can be configured via the Snowplow BDP Console:

In the first step, you can provide the data model name, a description, and the owner(s) of the data model that will be alerted in case of failure. Finally select the engine and version that should run your datamodel (dbt in this case):

When selecting a dbt-core version, below you'll find a table which contains the exact dbt-adapter versions that are bundled with each specific version of dbt-core:

dbt-coreBigQueryDatabricksPostgresRedshiftSnowflake
1.0.01.0.0-1.0.01.0.01.0.0
1.0.81.0.01.0.31.0.81.0.11.0.1
1.1.21.1.11.1.61.1.21.1.01.1.0
1.2.21.2.01.2.41.2.21.2.11.2.0
1.3.01.3.01.3.11.3.01.3.01.3.0
1.4.51.4.21.4.21.4.51.4.01.4.1
1.5.11.5.11.5.41.5.11.5.91.5.1
1.6.91.6.91.6.71.6.91.6.41.6.5
1.7.111.7.71.7.111.7.111.7.51.7.3

Please also note that the model name needs to match to the corresponding datamodeling subfolder name in GitHub. The dbt project for the model in the above example would therefore live under:

your snowplow-pipeline repo > datamodeling > data model name > dbt ...

In the second step, you can then add the schedule:

Please note that the cron schedule is in UTC.

caution

You should write your dbt models in such a way that if they fail part way through they can be re-run without issue. All Snowplow models are written in this way and if you write your models following the custom modules documentation you can take advantage of this.

In the third and last step you can configure the arguments and variables your dbt run call is excecuted with. The arguments (--selector or --select / --exclude) follow the syntax described on dbt docs, and so do variables (--vars flag):

Once you are happy with the configuration, you can create the model. The model will be disabled until you enable it:

Please make sure all your dbt project files are merged to the default branch in GitHub before enabling the model. Any changes merged to the default branch are available immediately.

3. Model execution

Once everything is set up, Snowplow BDP Console will run the following commands in this order:

  1. dbt deps (if a packages.yml file is present)
  2. dbt seed
  3. dbt snapshot
  4. dbt run
  5. dbt test

This ensures that the correct package dependencies are installed, that seeds are uploaded and refreshed, that snapshots are taken, that the dbt models are created, and that all specified tests are run.

4. Monitor your model in the Snowplow BDP Console

After everything has been set up and has executed, you can now monitor your data models running against your data warehouse from the Snowplow BDP Console, in the Jobs UI! There you can see the data modeling DAG generated, and monitor the status, duration and run times of the data model. You can also browse through the logs that dbt generates during it's runs. If all seeds, snapshots, models, and tests pass you will see the Result: SUCCEEDED status in the Jobs UI. If any of the steps fail (including tests that result in a warning), you will see the Result: FAILED status.