
Lake Loader

Overview

The Lake Loader is an application that loads Snowplow events to a cloud storage bucket using Open Table Formats.

Open Table Formats

The Lake Loader supports the three major Open Table Formats: Delta, Iceberg and Hudi.

For Iceberg tables, the loader supports AWS Glue as the catalog.

The Lake Loader on AWS is a fully streaming application that continually pulls events from Kinesis and writes to S3.

The Lake Loader is published as a Docker image which you can run on any AWS VM. You do not need a Spark cluster to run this loader.

docker pull snowplow/lake-loader-aws:0.5.0

To run the loader, mount your config files into the container and provide the file paths on the command line:

docker run \
--mount=type=bind,source=/path/to/myconfig,destination=/myconfig \
snowplow/lake-loader-aws:0.5.0 \
--config=/myconfig/loader.hocon \
--iglu-config=/myconfig/iglu.hocon

For some output formats, you need to pull a slightly different tag to get a compatible Docker image. The configuration reference page explains when this is needed.

Configuring the loader

The loader config file is in HOCON format and lets you configure many aspects of how the loader runs.

The simplest possible config file just needs a description of your pipeline inputs and outputs:

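The sketch below shows roughly what such a minimal file (config/config.aws.minimal.hocon) looks like, assuming a Kinesis input stream and an S3 lake location. The field names here are illustrative, so treat the configuration reference as authoritative:

{
  "input": {
    # Kinesis stream of enriched Snowplow events (illustrative name)
    "streamName": "enriched-good"
  }

  "output": {
    "good": {
      # Destination of the lake table (illustrative)
      "location": "s3://my-bucket/events"
    }

    "bad": {
      # Kinesis stream to receive failed events (illustrative name)
      "streamName": "bad"
    }
  }
}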

See the configuration reference for all possible configuration parameters.

Windowing

"Windowing" is an important config setting, which controls how often the Lake Loader commits a batch of events to the data lake. If you adjust this config setting, you should be aware that data lake queries are most efficient when the size of the parquet files in the lake are relatively large.

  • If you set this to a low value, the loader will write events to the lake more frequently, reducing latency. However, the output parquet files will be smaller, which will make querying the data less efficient.
  • Conversely, if you set this to a high value, the loader will generate bigger output parquet files, which are efficient for queries, at the cost of events arriving in the lake with more delay.

The default setting is 5 minutes. For moderate to high volumes, this strikes a good balance between large output parquet files and reasonably low-latency data.

{
  "windowing": "5 minutes"
}

If you tune this setting correctly, then your lake can support efficient analytic queries without the need to run an OPTIMIZE job on the files.

Iglu

The Lake Loader requires an Iglu resolver file which describes the Iglu repositories that host your schemas. This should be the same Iglu configuration file that you used in the Enrichment process.
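For reference, a resolver file pointing at Iglu Central typically looks like the sketch below. This follows the standard Iglu resolver-config schema; substitute your own repositories as needed.

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}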

Metrics

The Lake Loader can be configured to send the following custom metrics to a StatsD receiver:

  • events_committed: A count of events that are successfully written and committed to the lake. Because the loader works in timed windows of several minutes, this metric is "spiky": it is often zero and then periodically spikes up to larger values.
  • events_received: A count of events received by the loader. Unlike events_committed, this is a smoothly varying metric, because the loader receives events continuously throughout a timed window.
  • events_bad: A count of failed events that could not be loaded and were instead sent to the bad output stream.
  • latency_millis: The time in milliseconds from when events are written to the source stream (i.e. by Enrich) until they are read by the loader.
  • processing_latency_millis: For each window of events, the time in milliseconds from when the first event is read from the stream until all events are written and committed to the lake.

See the monitoring.metrics.statsd options in the configuration reference for how to configure the StatsD receiver.
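As an illustration, that block might look like the sketch below; the hostname and port are placeholders, and the exact option names under monitoring.metrics.statsd should be verified in the configuration reference:

{
  "monitoring": {
    "metrics": {
      "statsd": {
        # Placeholder address of your StatsD receiver
        "hostname": "127.0.0.1",
        "port": 8125
      }
    }
  }
}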

Telemetry notice

By default, Snowplow collects telemetry data for Lake Loader (since version 0.1.0). Telemetry allows us to understand how our applications are used and helps us build a better product for our users (including you!).

This data is anonymous and minimal, and since our code is open source, you can inspect what’s collected.

If you wish to help us further, you can optionally provide your email (or just a UUID) in the telemetry.userProvidedId configuration setting.

If you wish to disable telemetry, you can do so by setting telemetry.disable to true.
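Both settings live in the same HOCON file as the rest of the loader config. For example, a minimal snippet that opts out of telemetry (using the setting names above; the nesting is the natural HOCON form of those paths):

{
  "telemetry": {
    # Disable telemetry collection
    "disable": true

    # Or, to help us further, identify yourself with an email or a UUID:
    # "userProvidedId": "me@example.com"
  }
}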

See our telemetry principles for more information.