Databricks loader
Setting up Databricks
The following resources need to be created:
The events table and the database schema will be created automatically by the loader. You can configure the name of the database schema with the storage.schema config field. The table name (events) can't be changed.
Keep in mind that the Databricks Loader database user needs to have permissions to create schemas on the given database to be able to perform this operation. Check this page for more information about granting privileges in Databricks. You can also create the schema manually if you prefer.
Downloading the artifact
The asset is published as a jar file attached to the GitHub release notes for each version.
It's also available as a Docker image on Docker Hub under snowplow/rdb-loader-databricks:6.1.2.
Configuring rdb-loader-databricks
The loader takes two configuration files:
- a config.hocon file with application settings
- an iglu_resolver.json file with the resolver configuration for your Iglu schema registry.
An example of the minimal required config for the Databricks loader can be found here and a more detailed one here. For details about each setting, see the configuration reference.
See here for details on how to prepare the Iglu resolver file.
All self-describing schemas for events processed by RDB Loader must be hosted on Iglu Server 0.6.0 or above. Iglu Central is a registry containing Snowplow-authored schemas. If you want to use them alongside your own, you will need to add Iglu Central to your resolver file. Keep in mind that it could override your own private schemas if you give it higher priority. For details on this see here.
Running the Databricks loader
The two config files need to be passed in as base64-encoded strings:
$ docker run snowplow/rdb-loader-databricks:6.1.2 \
--iglu-config $RESOLVER_BASE64 \
--config $CONFIG_BASE64
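The command above expects RESOLVER_BASE64 and CONFIG_BASE64 to be set already. A minimal sketch of one way to produce them is shown below. The file paths, the CONFIG_FILE and RESOLVER_FILE variables, and the encode_file helper are assumptions for illustration and are not part of the loader.

```shell
# Encode a file as a single-line base64 string.
# tr removes the line wraps that some base64 implementations insert.
encode_file() {
  base64 < "$1" | tr -d '\n'
}

# Hypothetical local paths (assumptions); override them to match your setup.
CONFIG_FILE="${CONFIG_FILE:-config.hocon}"
RESOLVER_FILE="${RESOLVER_FILE:-iglu_resolver.json}"

# Only set the variables when both files are present.
if [ -f "$CONFIG_FILE" ] && [ -f "$RESOLVER_FILE" ]; then
  CONFIG_BASE64=$(encode_file "$CONFIG_FILE")
  RESOLVER_BASE64=$(encode_file "$RESOLVER_FILE")
fi
```

Stripping newlines keeps each value on one line, so it is passed to the loader as a single argument.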
Telemetry notice
By default, Snowplow collects telemetry data for Databricks Loader (since version 5.0.0). Telemetry allows us to understand how our applications are used and helps us build a better product for our users (including you!).
This data is anonymous and minimal, and since our code is open source, you can inspect what's collected.
If you wish to help us further, you can optionally provide your email (or just a UUID) in the telemetry.userProvidedId configuration setting.
If you wish to disable telemetry, you can do so by setting telemetry.disable to true.
See our telemetry principles for more information.