Databricks Streaming Loader
This is an early release. It relies on Databricks features that are in preview as of August 2025, e.g. File Events.
Also, you will need a premium Databricks plan to use Lakeflow Declarative Pipelines.
The Databricks Streaming Loader is an application that integrates with a Databricks Lakeflow Declarative Pipeline to load Snowplow events into Databricks with low latency. It is available for pipelines running on AWS, GCP, and Azure.
There are two parts to how the Databricks Streaming Loader works.
In the first part, Snowplow's Databricks Streaming Loader writes staging files, in Parquet format, into a Unity Catalog volume.
In the second part, a Databricks Lakeflow Declarative Pipeline loads those staging files into a Streaming Live Table.
The Databricks Streaming Loader is published as a Docker image which you can run on any AWS VM.
docker pull snowplow/databricks-loader-kinesis:0.1.0
To run the loader, mount your config file into the docker image, and then provide the file path on the command line. We recommend setting your Databricks credentials via environment variables (e.g. DATABRICKS_CLIENT_SECRET), so that you can refer to them in the config file.
docker run \
--mount=type=bind,source=/path/to/myconfig,destination=/myconfig \
--env DATABRICKS_CLIENT_ID="${DATABRICKS_CLIENT_ID}" \
--env DATABRICKS_CLIENT_SECRET="${DATABRICKS_CLIENT_SECRET}" \
snowplow/databricks-loader-kinesis:0.1.0 \
--config=/myconfig/loader.hocon \
--iglu-config=/myconfig/iglu.hocon
Where loader.hocon is the loader's configuration file and iglu.hocon is the Iglu resolver configuration.
For more information on how events are stored in Databricks, check the mapping between Snowplow schemas and the corresponding Databricks column types.
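As a rough, hypothetical illustration of that mapping: atomic fields become ordinary columns, while each self-describing event or entity typically gets its own column named after the schema's vendor, name and model (a STRUCT for a self-describing event, an ARRAY of STRUCTs for an entity). The com.acme schemas and fields below are invented for the example:
-- Hypothetical: a com.acme/checkout/jsonschema/1-0-0 self-describing event and a
-- com.acme/product/jsonschema/1-0-0 entity attached to events would surface as
-- unstruct_event_com_acme_checkout_1 (STRUCT) and contexts_com_acme_product_1 (ARRAY<STRUCT>)
SELECT
  event_id,
  collector_tstamp,
  unstruct_event_com_acme_checkout_1.order_id,
  contexts_com_acme_product_1[0].sku
FROM events
LIMIT 10;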
Configuring the loader
The loader config file is in HOCON format, and it allows configuring many different properties of how the loader runs.
The simplest possible config file just needs a description of your pipeline inputs and outputs; the exact settings depend on whether your pipeline runs on AWS, GCP, or Azure.
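For a sense of the overall shape, here is a minimal sketch of an AWS configuration. The key names (input.streamName, output.good.*, output.bad.streamName) are assumptions modelled on other Snowplow streaming loaders, not an authoritative reference, so check the configuration reference before using them:
{
  # Kinesis stream of enriched Snowplow events (key names are illustrative assumptions)
  "input": {
    "streamName": "snowplow-enriched-events"
  }

  "output": {

    # Unity Catalog volume that receives the staging Parquet files
    "good": {
      "host": "https://dbc-xxxxxxxx.cloud.databricks.com"
      "catalog": "snowplow"
      "schema": "snowplow_schema"
      "volume": "snowplow_volume"

      # Service principal credentials, taken from the environment variables
      # passed to the container
      "clientId": ${DATABRICKS_CLIENT_ID}
      "clientSecret": ${DATABRICKS_CLIENT_SECRET}
    }

    # Kinesis stream for failed events
    "bad": {
      "streamName": "snowplow-bad-events"
    }
  }
}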
See the configuration reference for all possible configuration parameters.
Iglu
The Databricks Streaming Loader requires an Iglu resolver file which describes the Iglu repositories that host your schemas. This should be the same Iglu configuration file that you used in the Enrichment process.
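For reference, a minimal resolver file pointing only at Iglu Central looks like this (add your own repository to the repositories array if you host custom schemas):
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}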
Configuring the Databricks Lakeflow Declarative Pipeline
Create a Pipeline in your Databricks workspace, and copy the following SQL into the associated .sql file:
-- Streaming table of Snowplow events, clustered for efficient time-based queries
CREATE STREAMING LIVE TABLE events
CLUSTER BY (load_tstamp, event_name)
TBLPROPERTIES (
  -- Collect data-skipping statistics on the columns most commonly used in filters
  'delta.dataSkippingStatsColumns' =
    'load_tstamp,collector_tstamp,derived_tstamp,dvce_created_tstamp,true_tstamp,event_name'
)
AS SELECT
  *,
  current_timestamp() as load_tstamp
FROM cloud_files(
  "/Volumes/<CATALOG_NAME>/<SCHEMA_NAME>/<VOLUME_NAME>/events",
  "parquet",
  map(
    "cloudfiles.inferColumnTypes", "false",
    "cloudfiles.includeExistingFiles", "false", -- set to true to load files already present in the volume
    "cloudfiles.schemaEvolutionMode", "addNewColumns",
    "cloudfiles.partitionColumns", "",
    "cloudfiles.useManagedFileEvents", "true",
    "datetimeRebaseMode", "CORRECTED",
    "int96RebaseMode", "CORRECTED",
    "mergeSchema", "true"
  )
)
Replace /Volumes/<CATALOG_NAME>/<SCHEMA_NAME>/<VOLUME_NAME>/events with the correct path to your volume (Unity Catalog volume paths follow the pattern /Volumes/<catalog>/<schema>/<volume>/).
Note that the volume must be an external volume in order to use the cloudfiles.useManagedFileEvents option, which is highly recommended for this integration.
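If you do not yet have an external volume, you can create one along these lines (the catalog, schema, volume and storage location names below are placeholders, and the referenced storage path must already be covered by an external location in Unity Catalog):
-- Illustrative: create an external volume backed by your own cloud storage path
CREATE EXTERNAL VOLUME my_catalog.my_schema.snowplow_events_volume
LOCATION 's3://my-bucket/snowplow/staging';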
Metrics
The Databricks Streaming Loader can be configured to send the following custom metrics to a StatsD receiver:
| Metric | Definition |
|---|---|
| events_good | A count of events that are successfully written to the Databricks volume. |
| events_bad | A count of failed events that could not be loaded, and were instead sent to the bad output stream. |
| latency_millis | The time in milliseconds from when events are written to the source stream of enriched events (i.e. by Enrich) until they are read by the loader. |
| e2e_latency_millis | The end-to-end latency of the Snowplow pipeline: the time in milliseconds from when an event is received by the collector until it is written to the Databricks volume. |
See the monitoring.metrics.statsd options in the configuration reference for how to configure the StatsD receiver.
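As a rough sketch (the key names are assumptions based on other Snowplow applications, so confirm them against the configuration reference), a StatsD block might look like:
"monitoring": {
  "metrics": {
    "statsd": {
      # Key names are assumptions; check the configuration reference
      "hostname": "127.0.0.1"
      "port": 8125
      "tags": {
        "env": "prod"
      }
      "period": "1 minute"
      "prefix": "snowplow.databricks.loader"
    }
  }
}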
Telemetry notice
By default, Snowplow collects telemetry data for Databricks Streaming Loader (since version 0.1.0). Telemetry allows us to understand how our applications are used and helps us build a better product for our users (including you!).
This data is anonymous and minimal, and since our code is open source, you can inspect what’s collected.
If you wish to help us further, you can optionally provide your email (or just a UUID) in the telemetry.userProvidedId configuration setting.
If you wish to disable telemetry, you can do so by setting telemetry.disable to true.
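For example, in the loader's config file (a minimal sketch using the two settings named above):
"telemetry": {
  # Set to true to opt out of telemetry
  "disable": false
  # Optional: an email address or UUID that identifies you to us
  "userProvidedId": "me@example.com"
}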
See our telemetry principles for more information.