Lake Loader configuration reference

This configuration reference is written for Lake Loader 0.3.0.

Table configuration

| Parameter | Description |
|---|---|
| `output.good.location` | Required, e.g. `gs://mybucket/events`. URI of the bucket location to which to write Snowplow enriched events in Delta format. The URI should start with `s3a://` on AWS, `gs://` on GCP, or `abfs://` on Azure. |
| `output.good.dataSkippingColumns` | Optional. A list of column names which will be brought to the "left-hand side" of the events table, to enable Delta's data skipping feature. Defaults to the important Snowplow timestamp columns: `load_tstamp`, `collector_tstamp`, `derived_tstamp`, `dvce_created_tstamp`. |
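As a sketch, the dotted parameter paths above correspond to nested HOCON blocks; the bucket name is illustrative, and the exact nesting is inferred from the parameter names:

```hocon
{
  "output": {
    "good": {
      # Illustrative GCS path; use s3a:// on AWS or abfs:// on Azure
      "location": "gs://mybucket/events"

      # Columns moved to the left-hand side of the table for Delta data skipping
      "dataSkippingColumns": [
        "load_tstamp"
        "collector_tstamp"
        "derived_tstamp"
        "dvce_created_tstamp"
      ]
    }
  }
}
```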

Streams configuration

| Parameter | Description |
|---|---|
| `input.streamName` | Required. Name of the Kinesis stream with the enriched events. |
| `input.appName` | Optional, default `snowplow-lake-loader`. Name to use for the DynamoDB table, used by the underlying Kinesis Client Library for managing leases. |
| `input.initialPosition` | Optional, default `LATEST`. Allowed values are `LATEST`, `TRIM_HORIZON`, `AT_TIMESTAMP`. When the loader is deployed for the first time, this controls from where in the Kinesis stream it should start consuming events. On all subsequent deployments, the loader resumes from the offsets stored in the DynamoDB table. |
| `input.initialPosition.timestamp` | Required if `input.initialPosition` is `AT_TIMESTAMP`. A timestamp in ISO 8601 format from where the loader should start consuming events. |
| `input.retrievalMode` | Optional, default `Polling`. Change to `FanOut` to enable the enhanced fan-out feature of Kinesis. |
| `input.retrievalMode.maxRecords` | Optional. Default value 1000. How many events the Kinesis client may fetch in a single poll. Only used when `input.retrievalMode` is `Polling`. |
| `input.bufferSize` | Optional. Default value 1. The number of batches of events which are pre-fetched from Kinesis. The default value is known to work well. |
| `output.bad.streamName` | Required. Name of the Kinesis stream that will receive failed events. |
| `output.bad.throttledBackoffPolicy.minBackoff` | Optional. Default value 100 milliseconds. Initial backoff used to retry sending failed events if we exceed the Kinesis write throughput limits. |
| `output.bad.throttledBackoffPolicy.maxBackoff` | Optional. Default value 1 second. Maximum backoff used to retry sending failed events if we exceed the Kinesis write throughput limits. |
| `output.bad.recordLimit` | Optional. Default value 500. The maximum number of records we are allowed to send to Kinesis in one PutRecords request. |
| `output.bad.byteLimit` | Optional. Default value 5242880. The maximum number of bytes we are allowed to send to Kinesis in one PutRecords request. |
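A minimal streams configuration might look like the following; the stream names are illustrative, and the nested shape of `initialPosition` and `retrievalMode` is an assumption based on the dotted parameter paths:

```hocon
{
  "input": {
    "streamName": "enriched-events"  # illustrative stream name

    # Start from a fixed point in the stream on first deployment;
    # the timestamp field is only required for AT_TIMESTAMP
    "initialPosition": {
      "type": "AT_TIMESTAMP"
      "timestamp": "2023-11-01T00:00:00Z"
    }

    # Polling is the default; switch to FanOut for Kinesis enhanced fan-out
    "retrievalMode": {
      "type": "Polling"
      "maxRecords": 1000
    }
  }

  "output": {
    "bad": {
      "streamName": "bad-events"  # illustrative stream name
      "throttledBackoffPolicy": {
        "minBackoff": "100 milliseconds"
        "maxBackoff": "1 second"
      }
    }
  }
}
```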

Other configuration options

| Parameter | Description |
|---|---|
| `windowing` | Optional. Default value 5 minutes. Controls how often the loader writes/commits pending events to the lake. |
| `spark.taskRetries` | Optional. Default value 3. How many times the internal Spark context should retry a task in case of failure. |
| `spark.conf.*` | Optional. A map of key/value strings which are passed to the internal Spark context. |
| `monitoring.metrics.statsd.hostname` | Optional. If set, the loader sends statsd metrics over UDP to a server on this host name. |
| `monitoring.metrics.statsd.port` | Optional. Default value 8125. If the statsd server is configured, this UDP port is used for sending metrics. |
| `monitoring.metrics.statsd.tags.*` | Optional. A map of key/value pairs to be sent along with the statsd metric. |
| `monitoring.metrics.statsd.period` | Optional. Default 1 minute. How often to report metrics to statsd. |
| `monitoring.metrics.statsd.prefix` | Optional. Default `snowplow.lakeloader`. Prefix used for the metric name when sending to statsd. |
| `sentry.dsn` | Optional. Set to a Sentry URI to report unexpected runtime exceptions. |
| `sentry.tags.*` | Optional. A map of key/value strings which are passed as tags when reporting exceptions to Sentry. |
| `telemetry.disable` | Optional. Set to `true` to disable telemetry. |
| `telemetry.userProvidedId` | Optional. See here for more information. |
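Putting several of these options together, a sketch of the monitoring and tuning section could look like this; the hostname, tag values, and Spark setting are illustrative, and the nesting is inferred from the dotted parameter paths:

```hocon
{
  "windowing": "5 minutes"  # how often pending events are committed to the lake

  "spark": {
    "taskRetries": 3
    "conf": {
      # Illustrative example of a key/value pair passed to the Spark context
      "spark.sql.shuffle.partitions": "10"
    }
  }

  "monitoring": {
    "metrics": {
      "statsd": {
        "hostname": "statsd.example.internal"  # illustrative host
        "port": 8125
        "tags": {
          "env": "prod"  # illustrative tag
        }
        "period": "1 minute"
        "prefix": "snowplow.lakeloader"
      }
    }
  }

  "telemetry": {
    "disable": false
  }
}
```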