BigQuery Loader configuration reference
The configuration reference on this page is written for BigQuery Loader 2.0.0.
BigQuery configuration
Parameter | Description |
---|---|
output.good.project | Required. The GCP project to which the BigQuery dataset belongs |
output.good.dataset | Required. The BigQuery dataset to which events will be loaded |
output.good.table | Optional. Default value events . Name to use for the events table |
output.good.credentials | Optional. Service account credentials (JSON). If not set, default credentials will be sourced from the usual locations, e.g. file pointed to by the GOOGLE_APPLICATION_CREDENTIALS environment variable |
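For orientation, here is a minimal sketch of how these options might appear in the loader's HOCON configuration file; the project and dataset names are placeholders, not defaults:

```
{
  "output": {
    "good": {
      "project": "my-gcp-project"  # placeholder GCP project
      "dataset": "snowplow"        # placeholder BigQuery dataset
      "table": "events"            # optional; shown with its default value
    }
  }
}
```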
Streams configuration
AWS (Kinesis)
Parameter | Description |
---|---|
input.streamName | Required. Name of the Kinesis stream with the enriched events |
input.appName | Optional, default snowplow-bigquery-loader . Name to use for the DynamoDB table, used by the underlying Kinesis Client Library for managing leases. |
input.initialPosition | Optional, default LATEST . Allowed values are LATEST , TRIM_HORIZON , AT_TIMESTAMP . When the loader is deployed for the first time, this controls from where in the Kinesis stream it should start consuming events. On all subsequent deployments of the loader, the loader will resume from the offsets stored in the DynamoDB table. |
input.initialPosition.timestamp | Required if input.initialPosition is AT_TIMESTAMP . A timestamp in ISO8601 format from where the loader should start consuming events. |
input.retrievalMode | Optional, default Polling. Change to FanOut to enable the enhanced fan-out feature of Kinesis. |
input.retrievalMode.maxRecords | Optional. Default value 1000. How many events the Kinesis client may fetch in a single poll. Only used when `input.retrievalMode` is Polling. |
input.workerIdentifier | Optional. Defaults to the HOSTNAME environment variable. The name of this KCL worker used in the DynamoDB lease table. |
input.leaseDuration | Optional. Default value 10 seconds . The duration of shard leases. KCL workers must periodically refresh leases in the DynamoDB table before this duration expires. |
input.maxLeasesToStealAtOneTimeFactor | Optional. Default value 2.0 . Controls how to pick the max number of shard leases to steal at one time. E.g. If there are 4 available processors, and maxLeasesToStealAtOneTimeFactor = 2.0 , then allow the loader to steal up to 8 leases. Allows bigger instances to more quickly acquire the shard-leases they need to combat latency. |
input.checkpointThrottledBackoffPolicy.minBackoff | Optional. Default value 100 milliseconds . Initial backoff used to retry checkpointing if we exceed the DynamoDB provisioned write limits. |
input.checkpointThrottledBackoffPolicy.maxBackoff | Optional. Default value 1 second . Maximum backoff used to retry checkpointing if we exceed the DynamoDB provisioned write limits. |
output.bad.streamName | Required. Name of the Kinesis stream that will receive failed events. |
output.bad.throttledBackoffPolicy.minBackoff | Optional. Default value 100 milliseconds . Initial backoff used to retry sending failed events if we exceed the Kinesis write throughput limits. |
output.bad.throttledBackoffPolicy.maxBackoff | Optional. Default value 1 second . Maximum backoff used to retry sending failed events if we exceed the Kinesis write throughput limits. |
output.bad.recordLimit | Optional. Default value 500. The maximum number of records we are allowed to send to Kinesis in 1 PutRecords request. |
output.bad.byteLimit | Optional. Default value 5242880. The maximum number of bytes we are allowed to send to Kinesis in 1 PutRecords request. |
output.bad.maxRecordSize | Optional. Default value 1000000. Any single failed event sent to Kinesis should not exceed this size in bytes. |
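As an illustration only, a minimal sketch of how the Kinesis options might look in the HOCON configuration file; the stream names are placeholders:

```
{
  "input": {
    "streamName": "enriched-good"          # placeholder Kinesis stream with enriched events
    "appName": "snowplow-bigquery-loader"  # optional; shown with its default value
  }
  "output": {
    "bad": {
      "streamName": "bq-loader-bad"        # placeholder Kinesis stream for failed events
    }
  }
}
```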
GCP (Pub/Sub)
Parameter | Description |
---|---|
input.subscription | Required, e.g. projects/myproject/subscriptions/snowplow-enriched . Name of the Pub/Sub subscription with the enriched events |
input.parallelPullFactor | Optional. Default value 0.5. parallelPullFactor * cpu count will determine the number of threads used internally by the Pub/Sub client library for fetching events |
input.durationPerAckExtension | Optional. Default value 60 seconds . Pub/Sub ack deadlines are extended for this duration when needed. |
input.minRemainingAckDeadline | Optional. Default value 0.1. Controls when ack deadlines are re-extended, for a message that is close to exceeding its ack deadline. For example, if durationPerAckExtension is 60 seconds and minRemainingAckDeadline is 0.1 then the loader will wait until there are 6 seconds left of the remaining deadline before re-extending the message deadline. |
input.maxMessagesPerPull | Optional. Default value 1000. How many Pub/Sub messages to pull from the server in a single request. |
input.debounceRequests | Optional. Default value 100 milliseconds . Adds an artificial delay between consecutive requests to Pub/Sub for more messages. Under some circumstances, this was found to slightly alleviate a problem in which Pub/Sub might re-deliver the same messages multiple times. |
output.bad.topic | Required, e.g. projects/myproject/topics/snowplow-bad . Name of the Pub/Sub topic that will receive failed events. |
output.bad.batchSize | Optional. Default value 1000. Bad events are sent to Pub/Sub in batches not exceeding this count. |
output.bad.requestByteThreshold | Optional. Default value 1000000. Bad events are sent to Pub/Sub in batches with a total size not exceeding this byte threshold |
output.bad.maxRecordSize | Optional. Default value 10000000. Any single failed event sent to Pub/Sub should not exceed this size in bytes |
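As an illustration only, a minimal sketch of the Pub/Sub options in the HOCON configuration file, using the placeholder names from the table above:

```
{
  "input": {
    "subscription": "projects/myproject/subscriptions/snowplow-enriched"  # placeholder subscription
  }
  "output": {
    "bad": {
      "topic": "projects/myproject/topics/snowplow-bad"  # placeholder topic
    }
  }
}
```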
Azure (Kafka)
Parameter | Description |
---|---|
input.topicName | Required. Name of the Kafka topic for the source of enriched events. |
input.bootstrapServers | Required. Hostname and port of Kafka bootstrap servers hosting the source of enriched events. |
input.consumerConf.* | Optional. A map of key/value pairs for any standard Kafka consumer configuration option. |
output.bad.topicName | Required. Name of the Kafka topic that will receive failed events. |
output.bad.bootstrapServers | Required. Hostname and port of Kafka bootstrap servers hosting the bad topic |
output.bad.producerConf.* | Optional. A map of key/value pairs for any standard Kafka producer configuration option. |
output.bad.maxRecordSize | Optional. Default value 1000000. Any single failed event sent to Kafka should not exceed this size in bytes. |
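As an illustration only, a minimal sketch of the Kafka options in the HOCON configuration file; the topic names and bootstrap servers are placeholders (for Event Hubs, the bootstrap server is typically the namespace host on port 9093):

```
{
  "input": {
    "topicName": "snowplow-enriched"                               # placeholder topic
    "bootstrapServers": "mynamespace.servicebus.windows.net:9093"  # placeholder host:port
  }
  "output": {
    "bad": {
      "topicName": "snowplow-bad"                                  # placeholder topic
      "bootstrapServers": "mynamespace.servicebus.windows.net:9093"
    }
  }
}
```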
Event Hubs Authentication
You can use the input.consumerConf and output.bad.producerConf options to configure authentication to Azure Event Hubs using SASL. For example:
"input.consumerConf": {
"security.protocol": "SASL_SSL"
"sasl.mechanism": "PLAIN"
"sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"\$ConnectionString\" password=<PASSWORD>;"
}
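For Event Hubs, the SASL password is typically the full Event Hubs connection string (with the username fixed to "$ConnectionString", as above). The same properties can be mirrored under output.bad.producerConf so that the producer for failed events authenticates in the same way.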
Other configuration options
Parameter | Description |
---|---|
batching.maxBytes | Optional. Default value 10000000 . Events are emitted to BigQuery when the batch reaches this size in bytes |
batching.maxDelay | Optional. Default value 1 second . Events are emitted to BigQuery after a maximum of this duration, even if the maxBytes size has not been reached |
batching.writeBatchConcurrency | Optional. Default value 2. How many batches can we send simultaneously over the network to BigQuery |
cpuParallelism.parseBytesFactor | Optional. Default value 0.1 . Controls how many batches of bytes we can parse into enriched events simultaneously. E.g. If there are 2 cores and parseBytesFactor = 0.1 then only one batch gets processed at a time. Adjusting this value can cause the app to use more or less of the available CPU. |
cpuParallelism.transformFactor | Optional. Default value 0.75 . Controls how many batches of enriched events we can transform into BigQuery format simultaneously. E.g. If there are 4 cores and transformFactor = 0.75 then 3 batches get processed in parallel. Adjusting this value can cause the app to use more or less of the available CPU. |
retries.setupErrors.delay | Optional. Default value 30 seconds . Configures exponential backoff on errors related to how BigQuery is set up for this loader. Examples include authentication errors and permissions errors. Errors of this class are reported periodically to the monitoring webhook. |
retries.transientErrors.delay | Optional. Default value 1 second . Configures exponential backoff on errors that are likely to be transient. Examples include server errors and network errors. |
retries.transientErrors.attempts | Optional. Default value 5. Maximum number of attempts to make before giving up on a transient error. |
skipSchemas | Optional, e.g. ["iglu:com.example/skipped1/jsonschema/1-0-0"] or with wildcards ["iglu:com.example/skipped2/jsonschema/1-*-*"] . A list of schemas that won't be loaded to BigQuery. This feature could be helpful when recovering from edge-case schemas which for some reason cannot be loaded to the table. |
legacyColumnMode | Optional. Default value false . When this mode is enabled, the loader uses the legacy column style used by the v1 BigQuery loader. For example, an entity for a 1-0-0 schema is loaded into a column ending in _1_0_0 , instead of a column ending in _1 . This feature could be helpful when migrating from the v1 loader to the v2 loader. |
legacyColumns | Optional, e.g. ["iglu:com.example/legacy/jsonschema/1-0-0"] or with wildcards ["iglu:com.example/legacy/jsonschema/1-*-*"] . Schemas for which to use the legacy column style used by the v1 BigQuery loader, even when legacyColumnMode is disabled. |
exitOnMissingIgluSchema | Optional. Default value true . Whether the loader should crash and exit if it fails to resolve an Iglu schema. We recommend `true` because Snowplow enriched events have already passed validation, so a missing schema normally indicates an error that needs addressing. Change to false so that events go to the failed events stream instead of crashing the loader. |
monitoring.metrics.statsd.hostname | Optional. If set, the loader sends statsd metrics over UDP to a server on this host name. |
monitoring.metrics.statsd.port | Optional. Default value 8125. If the statsd server is configured, this UDP port is used for sending metrics. |
monitoring.metrics.statsd.tags.* | Optional. A map of key/value pairs to be sent along with the statsd metric. |
monitoring.metrics.statsd.period | Optional. Default 1 minute . How often to report metrics to statsd. |
monitoring.metrics.statsd.prefix | Optional. Default snowplow.bigquery-loader . Prefix used for the metric name when sending to statsd. |
monitoring.webhook.endpoint | Optional, e.g. https://webhook.example.com . The loader will send a payload to this webhook containing details of any error related to how BigQuery is set up for this loader. |
monitoring.webhook.tags.* | Optional. A map of key/value strings to be included in the payload content sent to the webhook. |
monitoring.webhook.heartbeat.* | Optional. Default value 5 minutes . How often to send a heartbeat event to the webhook when healthy. |
monitoring.sentry.dsn | Optional. Set to a Sentry URI to report unexpected runtime exceptions. |
monitoring.sentry.tags.* | Optional. A map of key/value strings which are passed as tags when reporting exceptions to Sentry. |
telemetry.disable | Optional. Set to true to disable telemetry. |
telemetry.userProvidedId | Optional. See the telemetry documentation for more information. |
http.client.maxConnectionsPerServer | Optional. Default value 4. Configures the internal HTTP client used for iglu resolver, alerts and telemetry. The maximum number of open HTTP requests to any single server at any one time. |
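As an illustration only, a sketch combining a few of the options above in the HOCON configuration file, using the documented default values; the webhook endpoint is a placeholder:

```
{
  "batching": {
    "maxBytes": 10000000
    "maxDelay": "1 second"
    "writeBatchConcurrency": 2
  }
  "retries": {
    "transientErrors": {
      "delay": "1 second"
      "attempts": 5
    }
  }
  "monitoring": {
    "webhook": {
      "endpoint": "https://webhook.example.com"  # placeholder endpoint
    }
  }
}
```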