RDB Loader configuration reference
The configuration reference on this page is written for RDB Loader 5.0.0 or higher.
The configuration reference pages for previous versions can be found here.
All loader applications use a common module for core functionality, so only the storage
sections differ between their configurations.
License
Since version 6.0.0, RDB Loader is released under the Snowplow Limited Use License (FAQ).
To accept the terms of the license and run RDB Loader, set the ACCEPT_LIMITED_USE_LICENSE=yes
environment variable. Alternatively, you can configure the license.accept
option, like this:
```
license {
  accept = true
}
```
Redshift Loader storage section
Parameter | Description |
---|---|
type | Optional. The only valid value is the default: redshift. |
host | Required. Host name of Redshift cluster. |
port | Required. Port of Redshift cluster. |
database | Required. Redshift database which the data will be loaded to. |
roleArn | Required if NoCreds is chosen as the load auth method. AWS role ARN allowing Redshift to load data from S3. |
schema | Required. Redshift schema name, eg 'atomic'. |
username | Required. DB user with permissions to load data. |
password | Required. Password of DB user. |
maxError | Optional. Configures the Redshift MAXERROR load option. The default is 10. |
loadAuthMethod.* (since 5.2.0) | Optional, default method is NoCreds . Specifies the auth method to use with the COPY statement. |
loadAuthMethod.type | Required if the loadAuthMethod section is included. Specifies the type of the authentication method. The possible values are NoCreds and TempCreds. With NoCreds, no credentials will be passed to the COPY statement. Instead, the Redshift cluster needs to be configured with an AWS role ARN that allows it to load data from S3. This role ARN needs to be passed in the roleArn setting above. You can find more information here. With TempCreds, temporary credentials will be created for every load operation and passed to the COPY statement. |
loadAuthMethod.roleArn | Required if loadAuthMethod.type is TempCreds . IAM role that is used while creating temporary credentials. This role should allow access to the S3 bucket the transformer will write data to, with the following permissions: s3:GetObject* , s3:ListBucket , and s3:GetBucketLocation . |
loadAuthMethod.credentialsTtl (since 5.4.0) | Optional, the default value is 1 hour. If the TempCreds load auth method is used, this value is used as the session duration of the temporary credentials used for loading data and folder monitoring. It can't be greater than 1 hour and can't be less than 15 minutes. |
jdbc.* | Optional. Custom JDBC configuration. The default value is {"ssl": true} . |
jdbc.BlockingRowsMode | Optional. Refer to the Redshift JDBC driver reference. |
jdbc.DisableIsValidQuery | Optional. Refer to the Redshift JDBC driver reference. |
jdbc.DSILogLevel | Optional. Refer to the Redshift JDBC driver reference. |
jdbc.FilterLevel | Optional. Refer to the Redshift JDBC driver reference. |
jdbc.loginTimeout | Optional. Refer to the Redshift JDBC driver reference. |
jdbc.loglevel | Optional. Refer to the Redshift JDBC driver reference. |
jdbc.socketTimeout | Optional. Refer to the Redshift JDBC driver reference. |
jdbc.ssl | Optional. Refer to the Redshift JDBC driver reference. |
jdbc.sslMode | Optional. Refer to the Redshift JDBC driver reference. |
jdbc.sslRootCert | Optional. Refer to the Redshift JDBC driver reference. |
jdbc.tcpKeepAlive | Optional. Refer to the Redshift JDBC driver reference. |
jdbc.TCPKeepAliveMinutes | Optional. Refer to the Redshift JDBC driver reference. |
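To show how these settings fit together, here is a minimal sketch of a Redshift storage section. The host, role ARN and credentials are placeholders rather than real values; adjust them to your own cluster and, where possible, read the password from a secret store.

```
storage {
  type = "redshift"
  host = "example-cluster.abc123.eu-central-1.redshift.amazonaws.com" # placeholder
  port = 5439                                                         # default Redshift port
  database = "snowplow"
  roleArn = "arn:aws:iam::123456789012:role/RedshiftLoadRole"         # placeholder
  schema = "atomic"
  username = "loader"
  password = "example-password"                                       # placeholder; prefer a secret store
  maxError = 10
  loadAuthMethod {
    type = "NoCreds"
  }
}
```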
Snowflake Loader storage section
Parameter | Description |
---|---|
type | Optional. The only valid value is the default: snowflake. |
snowflakeRegion | Required. AWS Region used by Snowflake to access its endpoint. |
username | Required. Snowflake user with necessary role granted to load data. |
role | Optional. Snowflake role with permission to load data. If it is not provided, the default role in Snowflake will be used. |
password | Required. Password of the Snowflake user. Can be plain text, read from the EC2 parameter store or GCP secret manager (see below). |
password.secretStore.parameterName | Alternative way for passing in the user password. |
account | Required. Target Snowflake account. |
warehouse | Required. Snowflake warehouse which the SQL statements submitted by Snowflake Loader will run on. |
database | Required. Snowflake database which the data will be loaded to. |
schema | Required. Target schema. |
transformedStage.* | Required if NoCreds is chosen as the load auth method. Snowflake stage for transformed events. |
transformedStage.name | Required if transformedStage is included. The name of the stage. |
transformedStage.location | Required if transformedStage is included. The S3 path used as stage location. (Not needed since 5.2.0 because it is auto-configured) |
folderMonitoringStage.* | Required if the monitoring.folders section is configured and NoCreds is chosen as the load auth method. Snowflake stage for loading folder monitoring entries into a temporary Snowflake table. |
folderMonitoringStage.name | Required if folderMonitoringStage is included. The name of the stage. |
folderMonitoringStage.location | Required if folderMonitoringStage is included. The S3 path used as stage location. (Not needed since 5.2.0 because it is auto-configured) |
appName | Optional. Name passed as 'application' property while creating Snowflake connection. The default is Snowplow_OSS . |
maxError | Optional. A table copy statement will skip an input file when the number of errors in it exceeds the specified number. This setting is used during the initial loading and can therefore only filter out invalid JSONs (which should never occur when the loader is used with the Transformer). |
jdbcHost | Optional. Host for the JDBC driver that takes priority over the automatically derived host. If it is not given, the host will be derived automatically from the given snowflakeRegion. |
loadAuthMethod.* | Optional, the default method is NoCreds. Specifies the auth method to use with the COPY INTO statement. Note that the TempCreds auth method doesn't work when data is loaded from GCS. |
loadAuthMethod.type | Required if the loadAuthMethod section is included. Specifies the type of the auth method. The possible values are NoCreds and TempCreds. With NoCreds, no credentials will be passed to the COPY INTO statement. Instead, the transformedStage and folderMonitoringStage specified above will be used. More information can be found here. With TempCreds, temporary credentials will be created for every load operation and passed to the COPY INTO statement. |
loadAuthMethod.roleArn | Required if loadAuthMethod.type is TempCreds. IAM role that is used while creating temporary credentials. This role should allow access to the S3 bucket the transformer will write data to. You can find the list of permissions that need to be granted to the role here. |
loadAuthMethod.credentialsTtl (since 5.4.0) | Optional, the default value is 1 hour. If the TempCreds load auth method is used, this value is used as the session duration of the temporary credentials used for loading data and folder monitoring. It can't be greater than 1 hour and can't be less than 15 minutes. |
readyCheck (since 5.4.0) | Optional. Either ResumeWarehouse (the default) or Select1. The command the loader runs to prepare the JDBC connection. |
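As an illustration, here is a minimal sketch of a Snowflake storage section using the NoCreds auth method. The account, warehouse, role and stage names are placeholders.

```
storage {
  type = "snowflake"
  snowflakeRegion = "us-east-1"
  username = "loader"
  role = "LOADER_ROLE"                 # placeholder
  password = "example-password"        # placeholder; or use password.secretStore.parameterName
  account = "acme"                     # placeholder Snowflake account
  warehouse = "snowplow_wh"            # placeholder
  database = "snowplow"
  schema = "atomic"
  transformedStage {
    name = "snowplow_stage"            # placeholder; required with the NoCreds auth method
  }
  loadAuthMethod {
    type = "NoCreds"
  }
}
```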
Databricks Loader storage section
Parameter | Description |
---|---|
type | Optional. The only valid value is the default: databricks. |
host | Required. Hostname of Databricks cluster. |
password | Required if oauth section is not defined. Databricks access token. Can be plain text, read from the EC2 parameter store or GCP secret manager (see below). |
oauth.clientId | Required if password is not defined. Client ID used in the Databricks OAuth machine-to-machine authentication flow. |
oauth.clientSecret | Required if password is not defined. Client secret used in the Databricks OAuth machine-to-machine authentication flow. |
eventsOptimizePeriod | Optional. The default value is 2 days. The optimize period per table, used as a predicate for the OPTIMIZE command. |
password.secretManager.parameterName | Alternative way for passing in the access token. |
schema | Required. Target schema. |
port | Required. Port of Databricks cluster. |
httpPath | Required. HTTP path of the Databricks cluster. Get it from the JDBC connection details after the cluster has been created. |
catalog | Optional. The default value is hive_metastore. Databricks Unity Catalog name. |
userAgent | Optional. The default value is snowplow-rdbloader-oss . User agent name for Databricks connection. |
loadAuthMethod.* | Optional, the default method is NoCreds. Specifies the auth method to use with the COPY INTO statement. |
loadAuthMethod.type | Required if the loadAuthMethod section is included. Specifies the type of the auth method. The possible values are NoCreds and TempCreds. With NoCreds, no credentials will be passed to the COPY INTO statement; the Databricks cluster needs permission to access the transformer output S3 bucket. More information can be found here. With TempCreds, temporary credentials will be created for every load operation and passed to the COPY INTO statement. In this case, the Databricks cluster doesn't need permission to access the transformer output S3 bucket, since access is provided by the temporary credentials. |
loadAuthMethod.roleArn | Required if loadAuthMethod.type is TempCreds . IAM role that is used while creating temporary credentials. This role should allow access to the S3 bucket the transformer will write data to, with the following permissions: s3:GetObject* , s3:ListBucket , and s3:GetBucketLocation . |
loadAuthMethod.credentialsTtl (since 5.4.0) | Optional, the default value is 1 hour. If the TempCreds load auth method is used, this value is used as the session duration of the temporary credentials used for loading data and folder monitoring. It can't be greater than 1 hour and can't be less than 15 minutes. |
logLevel (since 5.3.2) | Optional. The default value is 3. Specifies the JDBC driver log level: 0 - disable all logging; 1 - log severe error events that lead the driver to abort; 2 - log error events that might allow the driver to continue running; 3 - log events that might result in an error if action is not taken (default); 4 - log general information that describes the progress of the driver; 5 - log detailed information that is useful for debugging the driver; 6 - log all driver activity. |
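Below is a hedged sketch of a Databricks storage section using the TempCreds auth method. The host, HTTP path, access token and role ARN are placeholders and do not refer to any real deployment.

```
storage {
  type = "databricks"
  host = "dbc-1234abcd-5678.cloud.databricks.com"                # placeholder
  password = "example-access-token"                              # placeholder; or use password.secretManager.parameterName
  schema = "atomic"
  port = 443                                                     # typical HTTPS port for Databricks SQL endpoints
  httpPath = "/databricks/http/path"                             # placeholder; copy from the cluster's JDBC connection details
  catalog = "hive_metastore"
  loadAuthMethod {
    type = "TempCreds"
    roleArn = "arn:aws:iam::123456789012:role/DatabricksLoadRole" # placeholder
  }
}
```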
AWS specific settings
Parameter | Description |
---|---|
region | Optional if it can be resolved with AWS region provider chain. AWS region of the S3 bucket. |
messageQueue.type | Required. Type of the message queue. It should be sqs when the application is run on AWS. |
messageQueue.queueName | Required. The name of the SQS queue used by the transformer and loader to communicate. |
jsonpaths | Optional. An S3 URI that holds JSONPath files. |
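For example, an AWS-specific fragment might look like the sketch below; the queue name and bucket URI are placeholders.

```
region = "eu-central-1"
messageQueue {
  type = "sqs"
  queueName = "rdb-loader-queue"             # placeholder SQS queue name
}
jsonpaths = "s3://example-bucket/jsonpaths/" # placeholder
```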
GCP specific settings
Only Snowflake Loader can be run on GCP at the moment.
Parameter | Description |
---|---|
messageQueue.type | Type of the message queue. It should be pubsub when the application is run on GCP. |
messageQueue.subscription | Required. The name of the Pub/Sub subscription used by the transformer and loader to communicate. |
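A minimal GCP-specific fragment could look like this; the project and subscription names are placeholders.

```
messageQueue {
  type = "pubsub"
  subscription = "projects/example-project/subscriptions/rdb-loader" # placeholder
}
```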
Azure specific settings
Only Snowflake Loader can be run on Azure at the moment.
Parameter | Description |
---|---|
blobStorageEndpoint | Endpoint of the Azure Blob Storage container that contains the transformer's output. |
azureVaultName | Name of the Azure Key Vault where application secrets are stored. Required if a secret store is used in the storage.password field. |
messageQueue.type | Type of the message queue. It should be kafka when the application is run on Azure. |
messageQueue.topicName | Name of the Kafka topic used to communicate with the transformer. |
messageQueue.bootstrapServers | A list of host:port pairs to use for establishing the initial connection to the Kafka cluster. |
messageQueue.consumerConf | Optional. Kafka consumer configuration. See https://kafka.apache.org/documentation/#consumerconfigs for all properties. |
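An Azure-specific fragment might look like the sketch below; the storage endpoint, vault name, topic and broker addresses are placeholders, and consumerConf simply forwards standard Kafka consumer properties.

```
blobStorageEndpoint = "https://exampleaccount.blob.core.windows.net/transformed" # placeholder
azureVaultName = "example-key-vault"                                             # placeholder; only needed when secrets come from Key Vault
messageQueue {
  type = "kafka"
  topicName = "loader-topic"                                                     # placeholder
  bootstrapServers = "broker1.example.com:9092,broker2.example.com:9092"         # placeholder
  consumerConf {
    "group.id" = "rdb-loader"                                                    # standard Kafka consumer property
  }
}
```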
Common loader settings
Parameter | Description |
---|---|
schedules.* | Optional. Tasks scheduled to run periodically. |
schedules.noOperation.[*] | Optional. Array of objects specifying no-operation windows during which periodically scheduled tasks (configured in this section) will not run. |
schedules.noOperation.[*].name | Human-readable name of the no-op window. |
schedules.noOperation.[*].when | Cron expression with second granularity. |
schedules.noOperation.[*].duration | For how long the loader should be paused. |
schedules.optimizeEvents | Optional. The default value is "0 0 0 ? * *" (i.e. every day at 00:00, JVM timezone). Cron expression with second granularity that specifies the schedule for periodically running an OPTIMIZE statement on the events table. (Databricks Loader only.) |
schedules.optimizeManifest | Optional. The default value is "0 0 5 ? * *" (i.e. every day at 05:00 AM, JVM timezone). Cron expression with second granularity that specifies the schedule for periodically running an OPTIMIZE statement on the manifest table. (Databricks Loader only.) |
retryQueue.* | Optional. Additional backlog of recently failed folders that could be automatically retried. The retry queue saves a failed folder and then re-reads the info from the shredding_complete S3 file. (Despite the legacy name of the message, which is required for backward compatibility, this also works with wide row format data.) |
retryQueue.period | Required if retryQueue section is configured. How often batch of failed folders should be pulled into a discovery queue. |
retryQueue.size | Required if retryQueue section is configured. How many failures should be kept in memory. After the limit is reached new failures are dropped. |
retryQueue.maxAttempts | Required if retryQueue section is configured. How many attempts to make for each folder. After the limit is reached new failures are dropped. |
retryQueue.interval | Required if retryQueue section is configured. Artificial pause after each failed folder being added to the queue. |
retries.* | Optional. Unlike retryQueue, these retries happen immediately, without proceeding to another message. |
retries.backoff | Required if retries section is configured. Starting backoff period, eg '30 seconds'. |
retries.strategy | Backoff strategy used during retry. The possible values are JITTER , CONSTANT , EXPONENTIAL , FIBONACCI . |
retries.attempts | Optional. How many attempts to make before sending the message into the retry queue. If missing, cumulativeBound will be used. |
retries.cumulativeBound | Optional. When backoff reaches this delay, eg '1 hour', the loader will stop retrying. If both this and attempts are not set, the loader will retry indefinitely. |
timeouts.loading | Optional. How long, eg '1 hour', a COPY statement execution can take before considering Redshift unhealthy. If no progress (ie, moving to a different subfolder) is made within this period, the loader will abort the transaction. |
timeouts.nonLoading | Optional. How long, eg '10 mins', non-loading steps such as ALTER TABLE can take before considering Redshift unhealthy. |
timeouts.sqsVisibility | Optional. The time window in which a message must be acknowledged; otherwise it is considered abandoned. If a message has been pulled but hasn't been acked, the time before it becomes available to consumers again is equal to this value, eg '5 mins'. As a consequence, if the loader fails while processing a message, it will not receive another message from the queue until this delay has passed. |
readyCheck.* | Optional. Check the target destination to make sure it is ready. |
readyCheck.backoff | Optional. The default value is 15 seconds . Starting backoff period. |
readyCheck.strategy | Optional. The default value is CONSTANT . Backoff strategy used during retry. The possible values are JITTER , CONSTANT , EXPONENTIAL , FIBONACCI . |
initRetries.* | Optional. Retries configuration for the initialization block. The loader will retry on all exceptions raised there. |
initRetries.backoff | Required if initRetries section is configured. Starting backoff period, eg '30 seconds'. |
initRetries.strategy | Backoff strategy used during retry. The possible values are JITTER , CONSTANT , EXPONENTIAL , FIBONACCI . |
initRetries.attempts | Optional. How many attempts to make before sending the message into the retry queue. If missing, cumulativeBound will be used. |
initRetries.cumulativeBound | Optional. When backoff reaches this delay, eg '1 hour', the loader will stop retrying. If both this and attempts are not set, the loader will retry indefinitely. |
telemetry.disable | Optional. Set to true to disable telemetry. |
telemetry.userProvidedId | Optional. See here for more information. |
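To illustrate how the retry, timeout and scheduling settings combine, here is a hedged sketch; all durations and the no-op window are example values only.

```
retryQueue {
  period = "30 minutes"
  size = 64
  maxAttempts = 3
  interval = "5 seconds"
}
retries {
  backoff = "30 seconds"
  strategy = "EXPONENTIAL"
  attempts = 3
  cumulativeBound = "1 hour"
}
timeouts {
  loading = "1 hour"
  nonLoading = "10 minutes"
  sqsVisibility = "5 minutes"
}
schedules {
  noOperation = [
    {
      name = "Nightly maintenance window"  # placeholder
      when = "0 0 1 ? * *"                 # every day at 01:00, second-granularity cron
      duration = "2 hours"
    }
  ]
}
```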
Common monitoring settings
Parameter | Description |
---|---|
monitoring.webhook.endpoint | Optional. An HTTP endpoint where monitoring alerts should be sent. |
monitoring.webhook.tags | Optional. Custom key-value pairs which can be added to the monitoring webhooks. Eg, {"tag1": "label1"} . |
monitoring.snowplow.appId | Optional. When using Snowplow tracking, set this appId in the event. |
monitoring.snowplow.collector | Optional. Set to a collector URL to turn on Snowplow tracking. |
monitoring.sentry.dsn | Optional. For tracking runtime exceptions. |
monitoring.metrics.* | Send metrics to a StatsD server or stdout. |
monitoring.metrics.period | Optional. The default is 5 minutes. Period for metrics emitted periodically. |
monitoring.metrics.statsd.* | Optional. For sending loading metrics (latency and event counts) to a StatsD server. |
monitoring.metrics.statsd.hostname | Required if monitoring.metrics.statsd section is configured. The host name of the StatsD server. |
monitoring.metrics.statsd.port | Required if monitoring.metrics.statsd section is configured. Port of the StatsD server. |
monitoring.metrics.statsd.tags | Optional. Tags which are used to annotate the StatsD metric with any contextual information. |
monitoring.metrics.statsd.prefix | Optional. Configures the prefix of StatsD metric names. The default is snowplow.rdbloader. |
monitoring.metrics.stdout.* | Optional. For sending metrics to stdout. |
monitoring.metrics.stdout.prefix | Optional. Overrides the default metric prefix. |
monitoring.folders.* | Optional. Configuration for periodic unloaded / corrupted folders checks. |
monitoring.folders.staging | Required if monitoring.folders section is configured. Path where the loader can store auxiliary logs for folder monitoring. The loader should be able to write here, and the storage target should be able to load from here. |
monitoring.folders.period | Required if monitoring.folders section is configured. How often to check for unloaded / corrupted folders. |
monitoring.folders.since | Optional. Specifies from when folder monitoring will start to monitor. Note that this is a duration, eg 7 days , relative to when the loader is launched. |
monitoring.folders.until | Optional. Specifies until when folder monitoring will monitor. Note that this is a duration, eg 7 days , relative to when the loader is launched. |
monitoring.folders.transformerOutput | Required if monitoring.folders section is configured. Path to transformed archive. |
monitoring.folders.failBeforeAlarm | Required if monitoring.folders section is configured. How many times the check can fail before generating an alarm. Within the specified tolerance, failures will log a WARNING instead. |
monitoring.healthCheck.* | Optional. Periodic DB health check, raising a warning if the DB hasn't responded to SELECT 1 . |
monitoring.healthCheck.frequency | Required if monitoring.healthCheck section is configured. How often to run a periodic DB health check. |
monitoring.healthCheck.timeout | Required if monitoring.healthCheck section is configured. How long to wait for a health check response. |
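Putting the monitoring options together, a sketch might look like the following; the webhook endpoint, StatsD host and S3 paths are placeholders.

```
monitoring {
  webhook {
    endpoint = "https://hooks.example.com/rdb-loader"        # placeholder
    tags = { "pipeline": "production" }
  }
  metrics {
    period = "5 minutes"
    statsd {
      hostname = "statsd.example.com"                        # placeholder
      port = 8125
      tags = { "app": "rdb-loader" }
    }
  }
  folders {
    staging = "s3://example-bucket/loader-logs/"             # placeholder
    transformerOutput = "s3://example-bucket/transformed/"   # placeholder
    period = "1 hour"
    since = "14 days"
    failBeforeAlarm = 3
  }
  healthCheck {
    frequency = "20 minutes"
    timeout = "15 seconds"
  }
}
```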