RDB Loader 3.0.x

Caution: You are reading documentation for an outdated version.

An example of the minimal required config for the Redshift loader can be found here and a more detailed one here.

An example of the minimal required config for the Snowflake loader can be found here and a more detailed one here.

Both applications use a common module for core functionality, so only the storage sections are different in their config.

This is a complete list of the options that can be configured:

Redshift Loader storage section

| Parameter | Description |
|---|---|
| type | Optional. The only valid value is the default: redshift. |
| host | Required. Host name of the Redshift cluster. |
| port | Required. Port of the Redshift cluster. |
| database | Required. Redshift database which the data will be loaded to. |
| roleArn | Required. AWS Role ARN allowing Redshift to load data from S3. |
| schema | Required. Redshift schema name, eg "atomic". |
| username | Required. DB user with permissions to load data. |
| password | Required. Password of the DB user. |
| maxError | Optional. Configures the Redshift MAXERROR load option. The default is 10. |
| jdbc.* | Optional. Custom JDBC configuration. The default value is {"ssl": true}. |
| jdbc.BlockingRowsMode | Optional. Refer to the Redshift JDBC driver reference. |
| jdbc.DisableIsValidQuery | Optional. Refer to the Redshift JDBC driver reference. |
| jdbc.DSILogLevel | Optional. Refer to the Redshift JDBC driver reference. |
| jdbc.FilterLevel | Optional. Refer to the Redshift JDBC driver reference. |
| jdbc.loginTimeout | Optional. Refer to the Redshift JDBC driver reference. |
| jdbc.loglevel | Optional. Refer to the Redshift JDBC driver reference. |
| jdbc.socketTimeout | Optional. Refer to the Redshift JDBC driver reference. |
| jdbc.ssl | Optional. Refer to the Redshift JDBC driver reference. |
| jdbc.sslMode | Optional. Refer to the Redshift JDBC driver reference. |
| jdbc.sslRootCert | Optional. Refer to the Redshift JDBC driver reference. |
| jdbc.tcpKeepAlive | Optional. Refer to the Redshift JDBC driver reference. |
| jdbc.TCPKeepAliveMinutes | Optional. Refer to the Redshift JDBC driver reference. |
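
As a rough illustration of how the Redshift options above might fit together, here is a minimal sketch of a storage block. It assumes the options are nested under a top-level "storage" key. Every value (host, database, role ARN, credentials) is a hypothetical placeholder, not a recommended setting.

```hocon
# Sketch only: all values are hypothetical placeholders.
"storage": {
  # Optional; "redshift" is the only valid value
  "type": "redshift",
  "host": "example-cluster.abc123.us-east-1.redshift.amazonaws.com",
  "port": 5439,
  "database": "snowplow",
  # Role that lets Redshift load data from S3
  "roleArn": "arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole",
  "schema": "atomic",
  "username": "loader_user",
  "password": "change-me",
  # Optional; 10 is the documented default
  "maxError": 10,
  # Optional; {"ssl": true} is the documented default
  "jdbc": {
    "ssl": true
  }
}
```

Only the required keys plus two optional ones are shown. The remaining jdbc.* options would go inside the "jdbc" object under the names listed in the table.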

Snowflake Loader storage section

| Parameter | Description |
|---|---|
| type | Optional. The only valid value is the default: snowflake. |
| snowflakeRegion | Required. AWS Region used by Snowflake to access its endpoint. |
| username | Required. Snowflake user with the necessary role granted to load data. |
| role | Optional. Snowflake role with permission to load data. If it is not provided, the default role in Snowflake will be used. |
| password | Required. Password of the Snowflake user. Can be plain text, or read from the EC2 parameter store (see below). |
| password.ec2ParameterStore.parameterName | Alternative way of passing in the user password. |
| account | Required. Target Snowflake account. |
| warehouse | Required. Snowflake warehouse which the SQL statements submitted by Snowflake Loader will run on. |
| database | Required. Snowflake database which the data will be loaded to. |
| schema | Required. Target schema. |
| transformedStage | Required. Snowflake stage for transformed events. |
| folderMonitoringStage | Required if monitoring.folders section is configured. Snowflake stage used to load folder monitoring entries into a temporary Snowflake table. |
| appName | Optional. Name passed as the 'application' property when creating the Snowflake connection. The default is Snowplow_OSS. |
| maxError | Optional. A table copy statement will skip an input file when the number of errors in it exceeds the specified number. This setting is used during initial loading and thus can filter out only invalid JSONs (which is an impossible situation if used with the Transformer). |
| jdbcHost | Optional. Host for the JDBC driver that has priority over automatically derived hosts. If it is not given, the host is derived automatically from the given snowflakeRegion. |
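
The Snowflake options can be sketched in the same way. This fragment assumes the options sit under a top-level "storage" key and that the stage options take plain stage names. All names and values are hypothetical placeholders. The commented-out line shows how the password.ec2ParameterStore.parameterName path would nest if the password is read from the parameter store instead of being given as plain text.

```hocon
# Sketch only: all values are hypothetical placeholders.
"storage": {
  # Optional; "snowflake" is the only valid value
  "type": "snowflake",
  "snowflakeRegion": "us-west-2",
  "username": "LOADER_USER",
  # Optional; the user's default Snowflake role is used when omitted
  "role": "LOADER_ROLE",
  # Plain-text form of the password
  "password": "change-me",
  # Alternative form, reading the password from the EC2 parameter store:
  # "password": { "ec2ParameterStore": { "parameterName": "example.snowflake.password" } },
  "account": "example_account",
  "warehouse": "EXAMPLE_WH",
  "database": "SNOWPLOW",
  "schema": "ATOMIC",
  "transformedStage": "example_transformed_stage",
  # Needed only when monitoring.folders is configured
  "folderMonitoringStage": "example_monitoring_stage",
  # Optional; Snowplow_OSS is the documented default
  "appName": "Snowplow_OSS"
}
```

Use only one of the two password forms at a time.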

Common loader settings

| Parameter | Description |
|---|---|
| region | Optional if it can be resolved with the AWS region provider chain. AWS region of the S3 bucket. |
| messageQueue | Required. The name of the SQS queue used by the transformer and loader to communicate. |
| jsonpaths | Optional. An S3 URI that holds JSONPath files. |
| schedules.* | Optional. Periodic schedules to stop loading, eg for a Redshift maintenance window. |
| schedules.noOperation.[*] | Required if schedules section is configured. Array of objects which specify no-operation windows. |
| schedules.noOperation.[*].name | Human-readable name of the no-op window. |
| schedules.noOperation.[*].when | Cron expression with second granularity. |
| schedules.noOperation.[*].duration | For how long the loader should be paused. |
| retryQueue.* | Optional. Additional backlog of recently failed folders that can be retried automatically. The retry queue saves a failed folder and then re-reads the info from the shredding_complete S3 file. (Despite the legacy name of the message, which is required for backward compatibility, this also works with wide row format data.) |
| retryQueue.period | Required if retryQueue section is configured. How often a batch of failed folders should be pulled into a discovery queue. |
| retryQueue.size | Required if retryQueue section is configured. How many failures should be kept in memory. After the limit is reached, new failures are dropped. |
| retryQueue.maxAttempts | Required if retryQueue section is configured. How many attempts to make for each folder. After the limit is reached, new failures are dropped. |
| retryQueue.interval | Required if retryQueue section is configured. Artificial pause after each failed folder is added to the queue. |
| retries.* | Optional. Unlike retryQueue, these retries happen immediately, without proceeding to another message. |
| retries.backoff | Required if retries section is configured. Starting backoff period, eg '30 seconds'. |
| retries.strategy | The only possible value is EXPONENTIAL. |
| retries.attempts | Optional. How many attempts to make before sending the message into the retry queue. If missing, cumulativeBound will be used. |
| retries.cumulativeBound | Optional. When the backoff reaches this delay, eg '1 hour', the loader will stop retrying. If neither this nor attempts is set, the loader will retry indefinitely. |
| timeouts.loading | Optional. How long, eg '1 hour', COPY statement execution can take before Redshift is considered unhealthy. If there is no progress (ie, moving to a different subfolder) within this period, the loader will abort the transaction. |
| timeouts.nonLoading | Optional. How long, eg '10 mins', non-loading steps such as ALTER TABLE can take before Redshift is considered unhealthy. |
| timeouts.sqsVisibility | Optional. The time window in which a message must be acknowledged. Otherwise it is considered abandoned. If a message has been pulled but hasn't been acked, it becomes available to consumers again after this period, eg '5 mins'. Another consequence is that if the loader fails while processing a message, it will not get that message (or any other) from the queue until this delay has passed. |
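
To make the nesting of the common settings more concrete, the fragment below sketches them as top-level keys, which is an assumption based on the dotted paths in the table. The queue name, S3 URI, cron expression, and all durations and counts are illustrative placeholders, not defaults or tuned values.

```hocon
# Sketch only: all values are hypothetical placeholders.
"region": "us-east-1",
"messageQueue": "example-loader-queue",
"jsonpaths": "s3://example-bucket/jsonpaths/",

# Pause loading during a maintenance window
"schedules": {
  "noOperation": [
    {
      "name": "Maintenance window",
      # Cron expression with a leading seconds field: every day at 12:00:00
      "when": "0 0 12 * * ?",
      "duration": "1 hour"
    }
  ]
},

# Backlog of recently failed folders, retried later
"retryQueue": {
  "period": "30 minutes",
  "size": 64,
  "maxAttempts": 3,
  "interval": "5 seconds"
},

# Immediate retries, made before moving on to another message
"retries": {
  "backoff": "30 seconds",
  "strategy": "EXPONENTIAL",
  "attempts": 3,
  "cumulativeBound": "1 hour"
},

"timeouts": {
  "loading": "1 hour",
  "nonLoading": "10 minutes",
  "sqsVisibility": "5 minutes"
}
```

In this sketch the two retry mechanisms are layered: a failed load would first go through the immediate retries, and only after those are exhausted would the folder be handed to the retry queue.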

Common monitoring settings

| Parameter | Description |
|---|---|
| monitoring.webhook.endpoint | Optional. An HTTP endpoint where monitoring alerts should be sent. |
| monitoring.webhook.tags | Optional. Custom key-value pairs which can be added to the monitoring webhooks. Eg, {"tag1": "label1"}. |
| monitoring.snowplow.appId | Optional. When using Snowplow tracking, set this appId in the event. |
| monitoring.snowplow.collector | Optional. Set to a collector URL to turn on Snowplow tracking. |
| monitoring.sentry.dsn | Optional. For tracking runtime exceptions. |
| monitoring.metrics.* | Send metrics to a StatsD server or stdout. |
| monitoring.metrics.statsd.* | Optional. For sending loading metrics (latency and event counts) to a StatsD server. |
| monitoring.metrics.statsd.hostname | Required if monitoring.metrics.statsd section is configured. The host name of the StatsD server. |
| monitoring.metrics.statsd.port | Required if monitoring.metrics.statsd section is configured. Port of the StatsD server. |
| monitoring.metrics.statsd.tags | Optional. Tags which are used to annotate the StatsD metric with any contextual information. |
| monitoring.metrics.statsd.prefix | Optional. Configures the prefix of StatsD metric names. The default is snowplow.rdbloader. |
| monitoring.metrics.stdout.* | Optional. For sending metrics to stdout. |
| monitoring.metrics.stdout.prefix | Optional. Overrides the default metric prefix. |
| monitoring.folders.* | Optional. Configuration for periodic checks for unloaded or corrupted folders. |
| monitoring.folders.staging | Required if monitoring.folders section is configured. Path where the loader can store auxiliary logs for folder monitoring. The loader should be able to write here, and the storage target should be able to load from here. |
| monitoring.folders.period | Required if monitoring.folders section is configured. How often to check for unloaded or corrupted folders. |
| monitoring.folders.since | Optional. Specifies from when folder monitoring will start to monitor. Note that this is a duration, eg 7 days, relative to when the loader is launched. |
| monitoring.folders.until | Optional. Specifies until when folder monitoring will monitor. Note that this is a duration, eg 7 days, relative to when the loader is launched. |
| monitoring.folders.transformerOutput | Required if monitoring.folders section is configured. Path to the transformed archive. |
| monitoring.folders.failBeforeAlarm | Required if monitoring.folders section is configured. How many times the check can fail before generating an alarm. Within the specified tolerance, failures will log a WARNING instead. |
| monitoring.healthCheck.* | Optional. Periodic DB health check, raising a warning if the DB hasn't responded to SELECT 1. |
| monitoring.healthCheck.frequency | Required if monitoring.healthCheck section is configured. How often to run a periodic DB health check. |
| monitoring.healthCheck.timeout | Required if monitoring.healthCheck section is configured. How long to wait for a health check response. |
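
The monitoring options can be sketched as a single nested block, following the dotted paths in the table. All endpoints, DSNs, bucket paths, tags, and durations below are hypothetical placeholders chosen for illustration. Every subsection is optional, so a real config would likely include only the ones in use.

```hocon
# Sketch only: all values are hypothetical placeholders.
"monitoring": {
  "webhook": {
    "endpoint": "https://webhook.example.com/alerts",
    "tags": { "pipeline": "production" }
  },
  "snowplow": {
    "appId": "example-rdb-loader",
    "collector": "https://collector.example.com"
  },
  "sentry": {
    "dsn": "https://public-key@sentry.example.com/1"
  },
  "metrics": {
    "statsd": {
      "hostname": "statsd.example.com",
      "port": 8125,
      "tags": { "app": "rdb-loader" }
    },
    "stdout": {}
  },
  # Periodic check for unloaded or corrupted folders
  "folders": {
    "staging": "s3://example-bucket/loader-logs/",
    "period": "1 hour",
    # Both are durations relative to when the loader is launched
    "since": "14 days",
    "until": "7 days",
    "transformerOutput": "s3://example-bucket/transformed/",
    "failBeforeAlarm": 3
  },
  "healthCheck": {
    "frequency": "20 minutes",
    "timeout": "15 seconds"
  }
}
```

When the "folders" subsection is used with the Snowflake loader, the table for the Snowflake storage section notes that folderMonitoringStage must also be set.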