S3 loader configuration reference

This is a complete list of the options that can be configured in the S3 loader HOCON config file. The example configs on GitHub show how to prepare an input file.

| parameter | description |
| --- | --- |
| `purpose` | Required. Use `RAW` to sink data exactly as-is. Use `ENRICHED_EVENTS` to also enable event latency metrics. Use `SELF_DESCRIBING` to enable partitioning self-describing data by its schema. |
| `input.appName` | Required. Kinesis Client Library app name (corresponds to the DynamoDB table name). |
| `input.streamName` | Required. Name of the Kinesis stream from which to read. |
| `input.position` | Required. Use `TRIM_HORIZON` to start streaming at the last untrimmed record in the shard, which is the oldest data record in the shard. Or use `LATEST` to start streaming just after the most recent record in the shard. |
| `input.customEndpoint` | Optional. Overrides the default endpoint for Kinesis client API calls. |
| `input.maxRecords` | Required. How many records the client should pull from Kinesis each time. |
| `output.s3.path` | Required. Full path to output data, e.g. `s3://acme-snowplow-output/raw/`. |
| `output.s3.partitionFormat` | Optional. Added in version 2.1.0. Configures how files are partitioned into S3 directories. When loading raw files, you might choose to partition by `date={yy}-{mm}-{dd}`. When loading self-describing JSONs, you might choose to partition by `{vendor}.{name}/model={model}/date={yy}-{mm}-{dd}`. Valid substitutions are `{vendor}`, `{name}`, `{format}`, `{model}` for self-describing JSONs; and `{yy}`, `{mm}`, `{dd}`, `{hh}` for year, month, day and hour. Defaults to `{vendor}.{schema}` when loading self-describing JSONs, or blank (no partitioning) when loading raw or enriched events. |
| `output.s3.filenamePrefix` | Optional. Adds a prefix to output file names. |
| `output.s3.compression` | Required. Either `LZO` or `GZIP`. |
| `output.s3.maxTimeout` | Required. Maximum time for which the application is allowed to keep failing, e.g. in case of an S3 outage. |
| `output.s3.customEndpoint` | Optional. Overrides the default endpoint for S3 client API calls. |
| `region` | Optional. When used with the `output.s3.customEndpoint` option, this sets the region of the bucket. Also sets the region of the DynamoDB table. Defaults to the current region. |
| `output.bad.streamName` | Required. Name of the Kinesis stream to which failures are written. |
| `buffer.byteLimit` | Required. Maximum bytes to read from Kinesis before flushing a file to S3. |
| `buffer.recordLimit` | Required. Maximum records to read from Kinesis before flushing a file to S3. |
| `buffer.timeLimit` | Required. Maximum time to wait, in milliseconds, between writing files to S3. |
| `monitoring.snowplow.collector` | Optional. URI of a Snowplow collector, e.g. `https://snplow.acme.com`. Used for monitoring application lifecycle and failure events. |
| `monitoring.snowplow.appId` | Required only if the collector URI is also configured. Sets the `appId` field of the Snowplow events. |
| `monitoring.sentry.dsn` | Optional. For tracking uncaught runtime exceptions. |
| `monitoring.metrics.cloudwatch` | Optional boolean, default `true`. Set to `false` to disable sending metrics to CloudWatch. |
| `monitoring.metrics.hostname` | Optional. For sending loading metrics (latency and event counts) to a StatsD server. |
| `monitoring.metrics.port` | Optional. Port of the StatsD server. |
| `monitoring.metrics.tags` | E.g. `{ "key1": "value1", "key2": "value2" }`. Tags are used to annotate the StatsD metric with any contextual information. |
| `monitoring.metrics.prefix` | Optional, default `snowplow.s3loader`. Configures the prefix of StatsD metric names. |
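
As a minimal sketch of how the dotted parameter names above might nest in a HOCON file, the fragment below sets only the options marked Required, plus one optional partition format. The nesting is inferred from the dotted names, and every value except the `s3://acme-snowplow-output/raw/` path (app name, stream names, limits, timeout) is a hypothetical placeholder, so compare it with the example configs before relying on it.

```hocon
{
  # Sink data exactly as-is
  "purpose": "RAW"

  "input": {
    # Placeholder names; appName corresponds to the DynamoDB table name
    "appName": "acme-s3-loader"
    "streamName": "acme-raw-events"
    # Start just after the most recent record in the shard
    "position": "LATEST"
    # Placeholder batch size
    "maxRecords": 10
  }

  "output": {
    "s3": {
      "path": "s3://acme-snowplow-output/raw/"
      # Optional: partition raw files by date
      "partitionFormat": "date={yy}-{mm}-{dd}"
      "compression": "GZIP"
      # Placeholder value; the unit is not stated in the table above
      "maxTimeout": 2000
    }
    "bad": {
      # Placeholder name of the Kinesis stream that receives failures
      "streamName": "acme-s3-loader-failures"
    }
  }

  "buffer": {
    # Placeholder limits; a file is flushed to S3 when any limit is reached
    "byteLimit": 2048
    "recordLimit": 10
    # Milliseconds
    "timeLimit": 5000
  }
}
```

The optional `monitoring` and `region` settings are left out here; when omitted, the defaults described in the table apply.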