RDB Loader configuration reference

The configuration reference in this page is written for RDB Loader 5.0.0 or higher.

The configuration reference pages for previous versions can be found here.

Minimal Configuration	Extended Configuration
aws/redshift.config.minimal.hocon	aws/redshift.config.reference.hocon
aws/snowflake.config.minimal.hocon	aws/snowflake.config.reference.hocon
aws/databricks.config.minimal.hocon	aws/databricks.config.reference.hocon
gcp/snowflake.config.minimal.hocon	gcp/snowflake.config.reference.hocon
azure/snowflake.config.minimal.hocon	azure/snowflake.config.reference.hocon

All applications use a common module for core functionality, so only the storage sections are different in their config.

License

Since version 6.0.0, RDB Loader is released under the Snowplow Limited Use License (FAQ).

To accept the terms of license and run RDB Loader, set the ACCEPT_LIMITED_USE_LICENSE=yes environment variable. Alternatively, you can configure the license.accept option, like this:

license {
  accept = true
}

Redshift Loader `storage` section

Parameter	Description
`type`	Optional. The only valid value is the default: `redshift`.
`host`	Required. Host name of Redshift cluster.
`port`	Required. Port of Redshift cluster.
`database`	Required. Redshift database which the data will be loaded to.
`roleArn`	Required if 'NoCreds' is chosen as load auth method.. AWS Role ARN allowing Redshift to load data from S3.
`schema`	Required. Redshift schema name, eg “atomic”.
`username`	Required. DB user with permissions to load data.
`password`	Required. Password of DB user.
`maxError`	Optional. Configures the Redshift MAXERROR load option. The default is 10.
`loadAuthMethod.*` (since 5.2.0)	Optional, default method is `NoCreds`. Specifies the auth method to use with the `COPY` statement.
`loadAuthMethod.type`	Required if `loadAuthMethod` section is included. Specifies the type of the authentication method. The possible values are `NoCreds` and `TempCreds`. With `NoCreds`, no credentials will be passed to the `COPY` statement. Instead, Redshift cluster needs to be configured with an AWS Role ARN that allows it to load data from S3. This Role ARN needs to be passed in the `roleArn` setting above. You can find more information here. With 'TempCreds', temporary credentials will be created for every load operation and these temporary credentials will be passed to the `COPY` statement.
`loadAuthMethod.roleArn`	Required if `loadAuthMethod.type` is `TempCreds`. IAM role that is used while creating temporary credentials. This role should allow access to the S3 bucket the transformer will write data to, with the following permissions: `s3:GetObject*`, `s3:ListBucket`, and `s3:GetBucketLocation`.
`loadAuthMethod.credentialsTtl` since (5.4.0)	Optional, default value `1 hour`. If `TempCreds` load auth method is used, this value will be used as a session duration of temporary credentials used for loading data and folder monitoring. In that case, it can't be greater than 1 hour and can't be less than 15 minutes.
`jdbc.*`	Optional. Custom JDBC configuration. The default value is `{"ssl": true}`.
`jdbc.BlockingRowsMode`	Optional. Refer to the Redshift JDBC driver reference.
`jdbc.DisableIsValidQuery`	Optional. Refer to the Redshift JDBC driver reference.
`jdbc.DSILogLevel`	Optional. Refer to the Redshift JDBC driver reference.
`jdbc.FilterLevel`	Optional. Refer to the Redshift JDBC driver reference.
`jdbc.loginTimeout`	Optional. Refer to the Redshift JDBC driver reference.
`jdbc.loglevel`	Optional. Refer to the Redshift JDBC driver reference.
`jdbc.socketTimeout`	Optional. Refer to the Redshift JDBC driver reference.
`jdbc.ssl`	Optional. Refer to the Redshift JDBC driver reference.
`jdbc.sslMode`	Optional. Refer to the Redshift JDBC driver reference.
`jdbc.sslRootCert`	Optional. Refer to the Redshift JDBC driver reference.
`jdbc.tcpKeepAlive`	Optional. Refer to the Redshift JDBC driver reference.
`jdbc.TCPKeepAliveMinutes`	Optional. Refer to the Redshift JDBC driver reference.

Snowflake Loader `storage` section

Parameter	Description
`type`	Optional. The only valid value is the default: `snowflake`.
`snowflakeRegion`	Required. AWS Region used by Snowflake to access its endpoint.
`username`	Required. Snowflake user with necessary role granted to load data.
`role`	Optional. Snowflake role with permission to load data. If it is not provided, the default role in Snowflake will be used.
`password`	Required. Password of the Snowflake user. Can be plain text, read from the EC2 parameter store or GCP secret manager (see below).
`password.secretStore.parameterName`	Alternative way for passing in the user password.
`account`	Required. Target Snowflake account.
`warehouse`	Required. Snowflake warehouse which the SQL statements submitted by Snowflake Loader will run on.
`database`	Required. Snowflake database which the data will be loaded to.
`schema`	Required. Target schema
`transformedStage.*`	Required if `NoCreds` is chosen as load auth method. Snowflake stage for transformed events.
`transformedStage.name`	Required if `transformedStage` is included. The name of the stage.
`transformedStage.location`	Required if `transformedStage` is included. The S3 path used as stage location. (Not needed since 5.2.0 because it is auto-configured)
`folderMonitoringStage.*`	Required if `monitoring.folders` section is configured and `NoCreds` is chosen as load auth method. Snowflake stage to load folder monitoring entries into temporary Snowflake table.
`folderMonitoringStage.name`	Required if `folderMonitoringStage` is included. The name of the stage.
`folderMonitoringStage.location`	Required if `folderMonitoringStage` is included. The S3 path used as stage location. (Not needed since 5.2.0 because it is auto-configured)
`appName`	Optional. Name passed as 'application' property while creating Snowflake connection. The default is `Snowplow_OSS`.
`maxError`	Optional. A table copy statement will skip an input file when the number of errors in it exceeds the specified number. This setting is used during initial loading and thus can filter out only invalid JSONs (which is impossible situation if used with Transformer).
`jdbcHost`	Optional. Host for the JDBC driver that has priority over automatically derived hosts. If it is not given, host will be created automatically according to given `snowflakeRegion`.
`loadAuthMethod.*`	Optional, default method is `NoCreds`. Specifies the auth method to use with `COPY INTO` statement. Note that `TempCreds` auth method doesn't work when data is loaded from GCS.
`loadAuthMethod.type`	Required if `loadAuthMethod` section is included. Specifies the type of the auth method. The possible values are `NoCreds` and `TempCreds`. With `NoCreds`, no credentials will be passed to `COPY INTO` statement. Instead, `transformedStage` and `folderMonitoringStage` specified above will be used. More information can be found here. With `TempCreds`, temporary credentials will be created for every load operation and these temporary credentials will be passed to `COPY INTO` statement.
`loadAuthMethod.roleArn`	Required if `loadAuthMethod.type` is `TempCreds`. IAM role that is used while creating temporary credentials. This role should allow access to the S3 bucket the transformer will write data to. You can find the list of necessary permissions needs to be given to role in here.
`loadAuthMethod.credentialsTtl` since (5.4.0)	Optional, default value `1 hour`. If `TempCreds` load auth method is used, this value will be used as a session duration of temporary credentials used for loading data and folder monitoring. In that case, it can't be greater than 1 hour and can't be less than 15 minutes.
`readyCheck` since (5.4.0)	Optional. Either `ResumeWarehouse` (the default) or `Select1`. The command the loader runs to prepare the JDBC connection.

Databricks Loader `storage` section

Parameter	Description
`type`	Optional. The only valid value is the default: `databricks`.
`host`	Required. Hostname of Databricks cluster.
`password`	Required if `oauth` section is not defined. Databricks access token. Can be plain text, read from the EC2 parameter store or GCP secret manager (see below).
`oauth.clientId`	Required if `password` is not defined. Client id used in Databricks OAuth machine-to-machine authentication flow.
`oauth.clientSecret`	Required if `password` is not defined. Client secret used in Databricks OAuth machine-to-machine authentication flow.
`eventsOptimizePeriod`	Optional. The default value is `2 days`. Optimize period per table, that will be used as predicate for the `OPTIMIZE` command.
`password.secretManager.parameterName`	Alternative way for passing in the access token.
`schema`	Required. Target schema.
`port`	Required. Port of Databricks cluster.
`httpPath`	Required. Http Path of Databricks cluster. Get it from the JDBC connection details after the cluster has been created.
`catalog`	Optional. The default value is `hive_metastore`. Databricks unity catalog name.
`userAgent`	Optional. The default value is `snowplow-rdbloader-oss`. User agent name for Databricks connection.
`loadAuthMethod.*`	Optional, default method is `NoCreds`. Specifies the auth method to use with `COPY INTO` statement
`loadAuthMethod.type`	Required if `loadAuthMethod` section is included. Specifies the type of the auth method. The possible values are `NoCreds` and `TempCreds`. With `NoCreds`, no credentials will be passed to `COPY INTO` statement. Databricks cluster needs to have permission to access transformer output S3 bucket. More information can be found here. With `TempCreds`, temporary credentials will be created for every load operation and these temporary credentials will be passed to `COPY INTO` statement. With this way, Databricks cluster doesn't need permission to access to transformer output S3 bucket. This access will be provided by temporary credentials.
`loadAuthMethod.roleArn`	Required if `loadAuthMethod.type` is `TempCreds`. IAM role that is used while creating temporary credentials. This role should allow access to the S3 bucket the transformer will write data to, with the following permissions: `s3:GetObject*`, `s3:ListBucket`, and `s3:GetBucketLocation`.
`loadAuthMethod.credentialsTtl` since (5.4.0)	Optional, default value `1 hour`. If `TempCreds` load auth method is used, this value will be used as a session duration of temporary credentials used for loading data and folder monitoring. In that case, it can't be greater than 1 hour and can't be less than 15 minutes.
`logLevel` since (5.3.2)	Optional. The default value is 3. Specifies JDBC driver log level. 0 - Disable all logging. 1 - Log severe error events that lead the driver to abort. 2 - Log error events that might allow the driver to continue running. 3 - Log events that might result in an error if action is not taken. (default) 4 - Log general information that describes the progress of the driver. 5 - Log detailed information that is useful for debugging the driver. 6 - Log all driver activity.

AWS specific settings

Parameter	Description
`region`	Optional if it can be resolved with AWS region provider chain. AWS region of the S3 bucket.
`messageQueue.type`	Required. Type of the message queue. It should be `sqs` when application is run on AWS.
`messageQueue.queueName`	Required. The name of the SQS queue used by the transformer and loader to communicate.
`jsonpaths`	Optional. An S3 URI that holds JSONPath files.

GCP specific settings

Only Snowflake Loader can be run on GCP at the moment.

Parameter	Description
`messageQueue.type`	Type of the message queue. It should be `pubsub` when application is run on GCP.
`messageQueue.subscription`	Required. The name of the Pubsub subscription used by the transformer and loader to communicate.

Azure specific settings

Only Snowflake Loader can be run on Azure at the moment.

Parameter	Description
`blobStorageEndpoint`	Endpoint of Azure Blob Storage container that contains transformer's output.
`azureVaultName`	Name of the Azure Key Vault where application secrets are stored. Required if secret store is used in `storage.password` field.
`messageQueue.type`	Type of the message queue. It should be `kafka` when application is run on Azure.
`messageQueue.topicName`	Name of the Kafka topic used to communicate with Transformer.
`messageQueue.bootstrapServers`	A list of host:port pairs to use for establishing the initial connection to the Kafka cluster.
`messageQueue.consumerConf`	Optional. Kafka consumer configuration. See https://kafka.apache.org/documentation/#consumerconfigs for all properties.

Common loader settings

Parameter	Description
`schedules.*`	Optional. Tasks scheduled to run periodically.
`schedules.noOperation.[*]`	Optional. Array of objects which specifies no-operation windows during which periodically scheduled tasks (configured in this section) will not run.
`schedules.noOperation.[*].name`	Human-readable name of the no-op window.
`schedules.noOperation.[*].when`	Cron expression with second granularity.
`schedules.noOperation.[*].duration`	For how long the loader should be paused.
`schedules.optimizeEvents`	Optional. The default value is `"0 0 0 ? * *"` (i.e. every day at 00:00, JVM timezone). Cron expression with second granularity that specifies the schedule to run periodically an `OPTIMIZE` statement on event table. (Only for Databricks Loader)
`schedules.optimizeManifest`	Optional. The default value is `"0 0 5 ? * *"` (i.e. every day at 05:00 AM, JVM timezone). Cron expression with second granularity that specifies the schedule to run periodically an `OPTIMIZE` statement on manifest table. (Only for Databricks Loader)
`retryQueue.*`	Optional. Additional backlog of recently failed folders that could be automatically retried. Retry queue saves a failed folder and then re-reads the info from `shredding_complete` S3 file. (Despite the legacy name of the message, which is required for backward compatibility, this also works with wide row format data.)
`retryQueue.period`	Required if `retryQueue` section is configured. How often batch of failed folders should be pulled into a discovery queue.
`retryQueue.size`	Required if `retryQueue` section is configured. How many failures should be kept in memory. After the limit is reached new failures are dropped.
`retryQueue.maxAttempts`	Required if `retryQueue` section is configured. How many attempts to make for each folder. After the limit is reached new failures are dropped.
`retryQueue.interval`	Required if `retryQueue` section is configured. Artificial pause after each failed folder being added to the queue.
`retries.*`	Optional. Unlike `retryQueue` these retries happen immediately, without proceeding to another message.
`retries.backoff`	Required if `retries` section is configured. Starting backoff period, eg '30 seconds'.
`retries.strategy`	Backoff strategy used during retry. The possible values are `JITTER`, `CONSTANT`, `EXPONENTIAL`, `FIBONACCI`.
`retries.attempts`	Optional. How many attempts to make before sending the message into retry queue. If missing, `cumulativeBound` will be used.
`retries.cumulativeBound`	Optional. When backoff reaches this delay, eg '1 hour', the loader will stop retrying. If both this and `attempts` are not set, the loader will retry indefinitely.
`timeouts.loading`	Optional. How long, eg '1 hour', `COPY` statement execution can take before considering Redshift unhealthy. If no progress (ie, moving to a different subfolder) within this period, the loader will abort the transaction.
`timeouts.nonLoading`	Optional. How long, eg '10 mins', non-loading steps such as `ALTER TABLE` can take before considering Redshift unhealthy.
`timeouts.sqsVisibility`	Optional. The time window in which a message must be acknowledged. Otherwise it is considered abandoned. If a message has been pulled, but hasn't been acked, the time before it is again available to consumers is equal to this, eg '5 mins'. Another consequence is that if the loader has failed on processing a message, the next time it will get this (or anything) from the queue has this delay.
`readyCheck.*`	Optional. Check the target destination to make sure it is ready.
`readyCheck.backoff`	Optional. The default value is `15 seconds`. Starting backoff period.
`readyCheck.strategy`	Optional. The default value is `CONSTANT`. Backoff strategy used during retry. The possible values are `JITTER`, `CONSTANT`, `EXPONENTIAL`, `FIBONACCI`.
`initRetries.*`	Optional. Retries configuration for initialization block. It will retry on all exceptions from there.
`initRetries.backoff`	Required if `initRetries` section is configured. Starting backoff period, eg '30 seconds'.
`initRetries.strategy`	Backoff strategy used during retry. The possible values are `JITTER`, `CONSTANT`, `EXPONENTIAL`, `FIBONACCI`.
`initRetries.attempts`	Optional. How many attempts to make before sending the message into retry queue. If missing, `cumulativeBound` will be used.
`initRetries.cumulativeBound`	Optional. When backoff reaches this delay, eg '1 hour', the loader will stop retrying. If both this and `attempts` are not set, the loader will retry indefinitely.
`telemetry.disable`	Optional. Set to `true` to disable telemetry.
`telemetry.userProvidedId`	Optional. See here for more information.

Common monitoring settings

Parameter	Description
`monitoring.webhook.endpoint`	Optional. An HTTP endpoint where monitoring alerts should be sent.
`monitoring.webhook.tags`	Optional. Custom key-value pairs which can be added to the monitoring webhooks. Eg, `{"tag1": "label1"}`.
`monitoring.snowplow.appId`	Optional. When using Snowplow tracking, set this `appId` in the event.
`monitoring.snowplow.collector`	Optional. Set to a collector URL to turn on Snowplow tracking.
`monitoring.sentry.dsn`	Optional. For tracking runtime exceptions.
`monitoring.metrics.*`	Send metrics to a StatsD server or stdout.
`monitoring.metrics.period`	Optional. The default is 5 minutes. Period for metrics emitted periodically.
`monitoring.metrics.statsd.*`	Optional. For sending loading metrics (latency and event counts) to a StatsD server.
`monitoring.metrics.statsd.hostname`	Required if `monitoring.metrics.statsd` section is configured. The host name of the StatsD server.
`monitoring.metrics.statsd.port`	Required if `monitoring.metrics.statsd` section is configured. Port of the StatsD server.
`monitoring.metrics.statsd.tags`	Optional. Tags which are used to annotate the StatsD metric with any contextual information.
`monitoring.metrics.statsd.prefix`	Optional. Configures the prefix of StatsD metric names. The default is `snoplow.rdbloader`.
`monitoring.metrics.stdout.*`	Optional. For sending metrics to stdout.
`monitoring.metrics.stdout.prefix`	Optional. Overrides the default metric prefix.
`monitoring.folders.*`	Optional. Configuration for periodic unloaded / corrupted folders checks.
`monitoring.folders.staging`	Required if `monitoring.folders` section is configured. Path where loader could store auxiliary logs for folder monitoring. Loader should be able to write here, storage target should be able to load from here.
`monitoring.folders.period`	Required if `monitoring.folders` section is configured. How often to check for unloaded / corrupted folders.
`monitoring.folders.since`	Optional. Specifies from when folder monitoring will start to monitor. Note that this is a duration, eg `7 days`, relative to when the loader is launched.
`monitoring.folders.until`	Optional. Specifies until when folder monitoring will monitor. Note that this is a duration, eg `7 days`, relative to when the loader is launched.
`monitoring.folders.transformerOutput`	Required if `monitoring.folders` section is configured. Path to transformed archive.
`monitoring.folders.failBeforeAlarm`	Required if `monitoring.folders` section is configured. How many times the check can fail before generating an alarm. Within the specified tolerance, failures will log a `WARNING` instead.
`monitoring.healthCheck.*`	Optional. Periodic DB health check, raising a warning if DB hasn't responded to `SELECT 1`.
`monitoring.healthCheck.frequency`	Required if `monitoring.healthCheck` section is configured. How often to run a periodic DB health check.
`monitoring.healthCheck.timeout`	Required if `monitoring.healthCheck` section is configured. How long to wait for a health check response.

License​

Redshift Loader storage section​

Snowflake Loader storage section​

Databricks Loader storage section​

AWS specific settings​

GCP specific settings​

Azure specific settings​

Common loader settings​

Common monitoring settings​