
Configuration

info

This page details general configurations that can apply across many of our packages. Each package also has specific configuration variables that define how the models run; please see each child page for the specifics of each package.

Variables

tip

Do not copy the project yaml from the package into your main project, as this can lead to issues. Only add the variables and model configurations that you wish to change from the default behavior. Each configuration page contains a UI to help you build your variables.

caution

When using multiple dbt packages you must be careful to specify which scope a variable or configuration is defined within. In general, always specify each value in your dbt_project.yml nested under the specific package, e.g.:

dbt_project.yml
vars:
  snowplow_web:
    snowplow__atomic_schema: schema_with_snowplow_web_events
  snowplow_mobile:
    snowplow__atomic_schema: schema_with_snowplow_mobile_events

You can read more about variable scoping in dbt's docs on variable precedence.

Warehouse-specific configurations

Postgres

To optimize performance on large Postgres datasets, you can create indexes in your dbt model config for columns that are commonly used in joins or WHERE clauses. For example:

# snowplow_web_sessions_custom.sql
{{
  config(
    ...
    indexes=[{'columns': ['domain_sessionid'], 'unique': True}]
  )
}}

Spark

For Spark environments, Iceberg is currently the supported file format for external tables. We have successfully tested this setup using both Glue and Thrift as connection methods. To use these models, create an external table from the Iceberg lake format in Spark and point your dbt model to this table.

Here's an example profiles.yml configuration for Spark using Thrift:

spark:
  type: spark
  host: localhost
  method: thrift
  port: 10000
  schema: default

In your dbt_project.yml, the file_format is set to iceberg by default for Spark. While you can override this in your project's dbt YAML file to use a different file format, please note that Iceberg is currently the only officially supported format.
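If you do need to set the format explicitly in your own project, a minimal sketch of the relevant models block is shown below; the snowplow_web package name is used purely as an illustration, and +file_format is the standard dbt-spark model config:

dbt_project.yml
models:
  snowplow_web:
    +file_format: iceberg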

Databricks

You can connect to Databricks using either the dbt-spark or the dbt-databricks adapter. The dbt-spark adapter does not let dbt take advantage of certain features that are unique to Databricks, whereas the dbt-databricks adapter does. Where possible, we recommend using the dbt-databricks adapter.
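As a rough sketch, a profiles.yml entry for the dbt-databricks adapter could look like the following; the host, http_path, and token values are placeholders you would replace with your own workspace details, and the catalog field is only relevant on adapter versions that support Unity Catalog (see below):

profiles.yml
databricks:
  type: databricks
  catalog: hive_metastore
  schema: default
  host: <your-workspace>.cloud.databricks.com
  http_path: <http-path-of-cluster-or-warehouse>
  token: <personal-access-token>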

Unity Catalog support

With the rollout of Unity Catalog (UC), the dbt-databricks adapter added support for the three-level namespace as of dbt-databricks>=1.1.1. As a result, we have introduced the snowplow__databricks_catalog variable, which should be used if your Databricks environment has UC enabled and you are using a version of the dbt-databricks adapter that supports it. The default value for this variable is hive_metastore, which is also the default catalog in Databricks, but you can change it to match your environment.

Since there are many different situations, we've created the following table to help guide your setup process (this should help resolve the Cannot set database in Databricks! error):

| | Adapter supports UC and UC enabled | Adapter supports UC and UC not enabled | Adapter does not support UC |
|---|---|---|---|
| Events land in default atomic schema | snowplow__databricks_catalog = '{name_of_catalog}' | Nothing needed | snowplow__databricks_catalog = 'atomic' |
| Events land in custom schema (not atomic) | snowplow__atomic_schema = '{name_of_schema}' and snowplow__databricks_catalog = '{name_of_catalog}' | snowplow__atomic_schema = '{name_of_schema}' | snowplow__atomic_schema = '{name_of_schema}' and snowplow__databricks_catalog = '{name_of_schema}' |
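For example, if your adapter supports UC, UC is enabled, and your events land in a custom schema, the corresponding variables could be set as in the sketch below; snowplow_web and the schema and catalog names are placeholders for your own package and environment:

dbt_project.yml
vars:
  snowplow_web:
    snowplow__atomic_schema: my_custom_schema
    snowplow__databricks_catalog: my_catalog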

Optimization of models

The dbt-databricks adapter allows our data models to take advantage of the auto-optimization features in Databricks. If you are using the dbt-spark adapter, you will need to manually alter the table properties of your derived and manifest tables using the following command after running the data model at least once. Run the command in your Databricks environment once for each table; we recommend applying this to the tables in the _derived and _snowplow_manifest schemas:

ALTER TABLE {TABLE_NAME} SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true);

BigQuery

As mentioned in the Quickstart, many of our packages allow you to specify which column your events table is partitioned on using the snowplow__session_timestamp variable. It will likely be partitioned on collector_tstamp or derived_tstamp. If it is partitioned on collector_tstamp, you should set snowplow__derived_tstamp_partitioned to false. This ensures only the collector_tstamp column is used for partition pruning when querying the events table:

dbt_project.yml
vars:
  snowplow_mobile:
    snowplow__derived_tstamp_partitioned: false