
Configuration reference

This is a complete list of the options that can be configured in the Snowplow BigQuery Loader HOCON config file. The example configs on GitHub show how to prepare a config file.

Required options

projectId: Required. The GCP project in which all required Pub/Sub, BigQuery and GCS resources are hosted, eg my-project.
loader.input.subscription: Required. Enriched events subscription consumed by Loader and StreamLoader, eg enriched-sub.
loader.output.good.datasetId: Required. Specify the dataset to which the events table belongs, eg snowplow.
loader.output.good.tableId: Required. The name of the events table, eg events.
loader.output.bad.topic: Required. The name of the topic where bad rows will be written, eg bad-topic.
loader.output.types.topic: Required. The name of the topic where observed types will be written, eg types-topic.
loader.output.failedInserts.topic: Required. The name of the topic where failed inserts will be written, eg failed-inserts-topic.
mutator.input.subscription: Required. A subscription on the loader.output.types.topic, eg types-sub.
mutator.output.good.*: Required. Equivalent to loader.output.good.*. Can be specified in detail or as ${loader.output.good}.
repeater.input.subscription: Required. Failed inserts subscription consumed by Repeater. Must be attached to the loader.output.failedInserts.topic, eg failed-inserts-sub.
repeater.output.good.*: Required. Equivalent to loader.output.good.*. Can be specified in detail or as ${loader.output.good}.
repeater.output.deadLetters.bucket: Required. Failed inserts that repeatedly fail to be inserted into BigQuery are stored on GCS in this bucket, eg gs://dead-letter-bucket.
monitoring.*: Optional. See below for details. Note: This was a required setting in 1.0.0. Can be left blank, ie {}, to disable this functionality in that version.
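
As an illustration, the required options above could be laid out in a HOCON file roughly as follows. This is a minimal sketch, not an official example: the nesting is inferred from the dotted option names, and every value is one of the "eg" placeholders from the list, so each must be replaced with the name of a real resource.

```hocon
{
  # Placeholder values taken from the examples in the list above
  "projectId": "my-project"

  "loader": {
    "input": {
      "subscription": "enriched-sub"
    }
    "output": {
      "good": {
        "datasetId": "snowplow"
        "tableId": "events"
      }
      "bad": {
        "topic": "bad-topic"
      }
      "types": {
        "topic": "types-topic"
      }
      "failedInserts": {
        "topic": "failed-inserts-topic"
      }
    }
  }

  "mutator": {
    "input": {
      # Must be a subscription on loader.output.types.topic
      "subscription": "types-sub"
    }
    "output": {
      # HOCON substitution: reuse the loader's dataset and table
      "good": ${loader.output.good}
    }
  }

  "repeater": {
    "input": {
      # Must be attached to loader.output.failedInserts.topic
      "subscription": "failed-inserts-sub"
    }
    "output": {
      "good": ${loader.output.good}
      "deadLetters": {
        "bucket": "gs://dead-letter-bucket"
      }
    }
  }

  # Left blank to disable monitoring (required in 1.0.0, optional otherwise)
  "monitoring": {}
}
```

Using the ${loader.output.good} substitution for the mutator and repeater keeps all three applications pointed at the same events table without repeating the dataset and table names.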

Monitoring options

monitoring.statsd.*: Optional. If set up, metrics will be emitted from StreamLoader and Repeater using the StatsD protocol.
monitoring.statsd.hostname: Optional, eg
monitoring.statsd.port: Optional, eg 1024.
monitoring.statsd.tags: Optional. You can use env vars, eg {"worker": ${HOST}}.
monitoring.statsd.period: Optional, eg 10 sec.
monitoring.statsd.prefix: Optional, eg snowplow.monitoring.
monitoring.dropwizard.*: Optional. If set up, metrics will be emitted from Loader using the Dropwizard protocol.
monitoring.dropwizard.period: Optional, eg 10000 ms.
monitoring.stdout.*: Optional. If set up, metrics will be logged to stdout at INFO level.
monitoring.sentry: Optional. If set up, errors will be sent to a Sentry endpoint.
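
A monitoring block built from the example values above might look like the sketch below. The hostname is a hypothetical placeholder, since no example value is given for it in the list, and the nesting is again inferred from the dotted option names. The stdout and sentry sections are omitted because their sub-keys are not described here.

```hocon
"monitoring": {
  "statsd": {
    # Hypothetical hostname; replace with your StatsD server
    "hostname": "statsd.example.internal"
    "port": 1024
    # ${HOST} is resolved from the environment when the config is loaded
    "tags": {"worker": ${HOST}}
    "period": "10 sec"
    "prefix": "snowplow.monitoring"
  }

  "dropwizard": {
    "period": "10000 ms"
  }
}
```

In HOCON, a substitution such as ${HOST} fails at load time if the variable is not defined; writing ${?HOST} instead makes it optional, in which case the field is simply left unset.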

Advanced options

The defaults should suit the overwhelming majority of deployments, so you should rarely, if ever, need to change these.

loader.loadMode.*: BigQuery supports two loading APIs:
- Streaming Inserts API
- Load Jobs API (experimental)
loader.loadMode.type: Defaults to StreamingInserts. The only other possible option is FileLoads.
loader.loadMode.retry: Defaults to false. Specifies whether failed inserts should be retried indefinitely or sent straight to the failedInserts topic. When set to true, a row that cannot be inserted is retried indefinitely, which can throttle the whole load; in that case a restart might be required. This setting is only supported by the Streaming Inserts API.
loader.loadMode.frequency: Defaults to null. Specifies how often the load job should be performed, in seconds. Unlike the near-real-time Streaming Inserts API, load jobs are more batch-oriented. This setting is only supported by the Load Jobs API. An example value is 60000.
loader.consumerSettings.maxQueueSize: Defaults to 3000. The maximum number of unacked messages that StreamLoader can hold in memory at once.
loader.consumerSettings.parallelPullCount: Defaults to 3. The number of pullers used to pull messages from the input subscription.
loader.consumerSettings.maxRequestBytes: Defaults to 50000000. The maximum size, in bytes, of unacked messages that StreamLoader can hold in memory at once.
loader.consumerSettings.maxAckExtensionPeriod: Defaults to 1 hour. The maximum period to which a message's ACK deadline will be extended.
loader.consumerSettings.awaitTerminatePeriod: Defaults to 10 seconds. If the underlying Pub/Sub subscriber fails to terminate cleanly, how long to wait before it is forcibly timed out.
loader.sinkSettings.good.*: Settings for the good sink value in the StreamLoader code. For more details see here. For recommended number of records in each request, see here. For the HTTP request size limit, see here.
loader.sinkSettings.bad.*: Settings for the bad sink value in the StreamLoader code. For more details see here.
loader.sinkSettings.types.*: Settings for the type sink value in the StreamLoader code. For more details see here.
loader.sinkSettings.failedInserts.*: Settings for the failed insert sink value in the StreamLoader code. For more details see here.
loader.retrySettings.*: Retry settings for the BigQuery client. For more details see here.
loader.terminationTimeout: Defaults to 1 minute. Specifies how long to wait before terminating the application after receiving SIGINT. This is meant to allow time for all events in-flight to be processed and acknowledged before exiting.
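
For reference, the sketch below writes out several of the advanced options with the default values stated above, so it should behave the same as leaving them unset. The nesting is inferred from the dotted option names and the duration strings are assumed to be accepted in this form. The sinkSettings and retrySettings sections are left out because their sub-keys are not listed here.

```hocon
"loader": {
  "loadMode": {
    # Default mode; "FileLoads" is the only other option
    "type": "StreamingInserts"
    # Only meaningful for StreamingInserts
    "retry": false
  }

  # Alternative (experimental): batch-oriented load jobs.
  # "frequency" only applies to FileLoads; 60000 is the example value above.
  # "loadMode": {
  #   "type": "FileLoads"
  #   "frequency": 60000
  # }

  "consumerSettings": {
    "maxQueueSize": 3000
    "parallelPullCount": 3
    "maxRequestBytes": 50000000
    "maxAckExtensionPeriod": "1 hour"
    "awaitTerminatePeriod": "10 seconds"
  }

  "terminationTimeout": "1 minute"
}
```

Because these are the defaults, a block like this is only useful as a starting point when one of the values has to be tuned, for example lowering maxQueueSize on a memory-constrained StreamLoader instance.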

Config parser hints

These settings only exist as hints to the config parsing library we use, so that the configuration can be represented as Scala code. They each only have one possible value and should never be changed.