- Load collector payloads from the "raw" stream, to maintain an archive of the original data, before enrichment.
- Load enriched events from the "enriched" stream. These serve as input for the RDB loader when loading to a warehouse.
- Load failed events from the "bad" stream.
Records that can't be successfully written to S3 are written to a second Kinesis stream with the error message.
Records are treated as raw byte arrays. Elephant Bird's BinaryBlockWriter class is used to serialize them as a Protocol Buffers array (so it is clear where one record ends and the next begins) before compressing them.
The compression process generates both compressed .lzo files and small .lzo.index files (splittable LZO). Each index file contains the byte offsets of the LZO blocks in the corresponding compressed file, meaning that the blocks can be processed in parallel.
LZO encoding is generally used for raw data produced by Snowplow Collector.
The records are treated as byte arrays containing UTF-8 encoded strings (whether CSV, JSON or TSV). New lines are used to separate records written to a file. This format can be used with the Snowplow Kinesis Enriched stream, among other streams.
Gzip encoding is generally used for both enriched data and bad data.
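Because records in this format are separated by new lines, a gzip file can be inspected with standard command-line tools. The following is a minimal sketch that uses a locally created stand-in file rather than a real object downloaded from S3; the file name and the record contents are hypothetical.

```shell
# Create a stand-in file with three newline-separated records.
# (Hypothetical content; real files hold enriched TSV or bad-row JSON.)
workdir=$(mktemp -d)
printf '%s\n' 'event-1' 'event-2' 'event-3' > "$workdir/sample"

# Compress it; gzip replaces "sample" with "sample.gz".
gzip "$workdir/sample"

# Stream the decompressed content and count records (one per line).
record_count=$(gzip -dc "$workdir/sample.gz" | wc -l | tr -d ' ')
echo "records: $record_count"
```

The same `gzip -dc ... | wc -l` pipeline should work on a downloaded file, assuming it was written with Gzip encoding.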
Available on Terraform Registry
A Terraform module which deploys the Snowplow S3 Loader on AWS EC2 for use with Kinesis. For installing in other environments, please see the other installation options below.
We publish three different flavours of the docker image.
- :2.2.4 tag if you only need GZip output format
- :2.2.4-lzo tag if you also need LZO output format
- :2.2.4-distroless tag for a lightweight alternative to
docker pull snowplow/snowplow-s3-loader:2.2.4
docker pull snowplow/snowplow-s3-loader:2.2.4-lzo
docker pull snowplow/snowplow-s3-loader:2.2.4-distroless
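If the image is pulled from a script, the tag can be derived from the output format you need. This is only a sketch: `OUTPUT_FORMAT` is a hypothetical variable introduced for illustration, not a loader setting, and the script prints the pull command instead of running it.

```shell
# OUTPUT_FORMAT is a hypothetical variable for this sketch; defaults to gzip.
OUTPUT_FORMAT="${OUTPUT_FORMAT:-gzip}"
VERSION="2.2.4"

# Map the desired output format to the matching image tag.
case "$OUTPUT_FORMAT" in
  lzo)  IMAGE_TAG="${VERSION}-lzo" ;;
  gzip) IMAGE_TAG="${VERSION}" ;;
  *)    IMAGE_TAG=""; echo "unsupported format: $OUTPUT_FORMAT" >&2 ;;
esac

# Print the command rather than executing it, so the sketch runs without Docker.
if [ -n "$IMAGE_TAG" ]; then
  echo "docker pull snowplow/snowplow-s3-loader:${IMAGE_TAG}"
fi
```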
Here is a standard command to run the loader on an EC2 instance in AWS:
docker run \
--name snowplow-s3-loader \
--restart always \
--log-driver awslogs \
--log-opt awslogs-group=snowplow-s3-loader \
--log-opt awslogs-stream="$(ec2metadata --instance-id)" \
--network host \
-v $(pwd):/snowplow/config \
-e 'JAVA_OPTS=-Xms512M -Xmx1024M -Dorg.slf4j.simpleLogger.defaultLogLevel=WARN' \
JARs can be found attached to the GitHub release. Only pick the -lzo version of the JAR file if you need to output in LZO format.
java -jar snowplow-s3-loader-2.2.4.jar --config config.hocon
java -jar snowplow-s3-loader-lzo-2.2.4.jar --config config.hocon
Running the jar requires the native LZO binaries to be installed. For example, on Debian this can be done with:
sudo apt-get install lzop liblzo2-dev