Databricks
The Databricks integration is available for Snowplow pipelines running on AWS, Azure and GCP.
The Snowplow Databricks integration allows you to load enriched event data (as well as failed events) into your Databricks environment for analytics, data modeling, and more.
Depending on the cloud provider for your Snowplow pipeline, there are different options for this integration:
Integration | AWS | Azure | GCP | Failed events support |
---|---|---|---|---|
Direct, batch-based (RDB Loader) | ✅ | ❌ | ❌ | ❌ |
Via Delta Lake (Lake Loader) | ❌* | ✅ | ✅ | ✅ |
Early release: Streaming / Lakeflow (Streaming Loader) | ✅ | ✅ | ✅ | ✅ |
*The Delta + Databricks combination is currently not supported for AWS pipelines. The Lake Loader uses DynamoDB tables to coordinate mutually exclusive writes to S3, a Delta feature that Databricks does not support (as of September 2025). As a result, it's not possible to alter the loaded data via Databricks (e.g. to run `OPTIMIZE` or to delete PII).
What you will need
Connecting to a destination always involves configuring cloud resources and granting permissions. It's a good idea to make sure you have sufficient privileges before you begin the setup process.
The list below is just a heads-up: the Snowplow Console will guide you through the exact steps to set up the integration.
Keep in mind that you will need to be able to do the following.
- Batch-based (AWS)
- Via Delta Lake (Azure, GCP)
- Streaming
- Provide a Databricks cluster along with its URL
- Specify the Unity catalog name and schema name
- Create an access token with the following permissions (a granting sketch follows this list):
  - `USE CATALOG` on the catalog
  - `USE SCHEMA` and `CREATE TABLE` on the schema
  - `CAN USE` on the SQL warehouse
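If you are granting these privileges yourself, the sketch below shows one way to do it from a Databricks notebook. The catalog, schema, and principal names are placeholders, not values from this guide, and `CAN USE` on the SQL warehouse is assigned via the warehouse's Permissions settings rather than SQL:

```python
# Minimal sketch: granting the Unity Catalog privileges listed above.
# All names here (catalog, schema, principal) are placeholders -- use your own.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks notebooks

catalog = "snowplow"                   # hypothetical catalog name
schema = "snowplow_events"             # hypothetical schema name
principal = "`loader-sp@example.com`"  # hypothetical service principal or user

spark.sql(f"GRANT USE CATALOG ON CATALOG {catalog} TO {principal}")
spark.sql(f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {principal}")
spark.sql(f"GRANT CREATE TABLE ON SCHEMA {catalog}.{schema} TO {principal}")

# CAN USE on the SQL warehouse is a workspace permission, assigned on the
# warehouse's Permissions page (or via the Permissions REST API), not via GRANT.
```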
Getting started
You can add a Databricks destination through the Snowplow Console.
- Batch-based (AWS)
- Via Delta Lake (Azure, GCP)
- Streaming
(For self-hosted customers, please refer to the Loader API reference instead.)
Step 1: Create a connection
- In Console, navigate to Destinations > Connections
- Select Set up connection
- Choose Loader connection, then Databricks
- Follow the steps to provide all the necessary values
- Click Complete setup to create the connection
Step 2: Create a loader
- In Console, navigate to Destinations > Destination list. Switch to the Available tab and select Databricks
- Select a pipeline: choose the pipeline where you want to deploy the loader
- Select your connection: choose the connection you configured in step 1
- Select the type of events: enriched events or failed events
- Click Continue to deploy the loader
You can review active destinations and loaders by navigating to Destinations > Destination list.
How loading works
The Snowplow data loading process is engineered for large volumes of data. In addition, our loader applications ensure the best representation of Snowplow events. That includes automatically adjusting the tables to account for your custom data, whether it's new event types or new fields.
- Batch-based (AWS)
- Via Delta Lake (Azure, GCP)
- Streaming
For more details on the loading flow, see the RDB Loader reference page, where you will find additional information and diagrams.
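One quick way to see the result of this automatic table evolution is to inspect the columns of the events table after loading. This is a minimal sketch, assuming a Databricks notebook (or any Spark session attached to your workspace) and a hypothetical `snowplow.snowplow_events` catalog and schema:

```python
# Minimal sketch: listing the columns of the events table, including any
# that the loader has added automatically for new event types or fields.
# The catalog and schema names are assumptions -- replace with your own.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("DESCRIBE TABLE snowplow.snowplow_events.events").show(truncate=False)
```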
Snowplow data format in Databricks
All events are loaded into a single table (`events`).
There are dedicated columns for atomic fields, such as `app_id`, `user_id` and so on:
app_id | collector_tstamp | ... | event_id | ... | user_id | ... |
---|---|---|---|---|---|---|
website | 2025-05-06 12:30:05.123 | ... | c6ef3124-b53a-4b13-a233-0088f79dcbcb | ... | c94f860b-1266-4dad-ae57-3a36a414a521 | ... |
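As an illustration, the atomic columns can be queried directly. The sketch below counts distinct users per app over the last week; the `snowplow.snowplow_events` catalog and schema are placeholders for your own setup:

```python
# Minimal sketch: querying atomic columns of the events table.
# Catalog and schema names are placeholders -- adjust to your environment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

daily_users = spark.sql("""
    SELECT
        app_id,
        DATE(collector_tstamp) AS event_date,
        COUNT(DISTINCT user_id) AS distinct_users
    FROM snowplow.snowplow_events.events
    WHERE collector_tstamp >= date_sub(current_date(), 7)
    GROUP BY app_id, DATE(collector_tstamp)
    ORDER BY event_date
""")

daily_users.show()
```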
Snowplow data also includes customizable self-describing events and entities. These use schemas to define which fields should be present, and of what type (e.g. string, number).
For self-describing events and entities, there are additional columns, like so:
app_id | ... | unstruct_event_com_acme_button_press_1 | contexts_com_acme_product_1 |
---|---|---|---|
website | ... | data for your custom `button_press` event (as `STRUCT`) | data for your custom `product` entities (as `ARRAY` of `STRUCT`) |
Note:
- "unstruct[ured] event" and "context" are the legacy terms for self-describing events and entities, respectively
- the `_1` suffix represents the major version of the schema (e.g. `1-x-y`)
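To give a flavour of querying these columns, here is a minimal sketch that reads fields from the example `com.acme` columns above. The field names (`button_id`, `sku`) and the catalog and schema are illustrative assumptions; the actual columns and fields depend on your own schemas:

```python
# Minimal sketch: reading self-describing event (STRUCT) and entity
# (ARRAY of STRUCT) columns. Column and field names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

presses = spark.sql("""
    SELECT
        event_id,
        -- STRUCT column: access fields with dot notation
        unstruct_event_com_acme_button_press_1.button_id AS button_id,
        -- ARRAY of STRUCT column: take the first attached product entity
        contexts_com_acme_product_1[0].sku AS first_product_sku
    FROM snowplow.snowplow_events.events
    WHERE unstruct_event_com_acme_button_press_1 IS NOT NULL
""")

presses.show()
```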
You can learn more in the API reference section.
Check this guide on querying Snowplow data.