BigQuery Loader

Overview

The BigQuery Streaming Loader on AWS is a fully streaming application that continually pulls events from Kinesis and writes to BigQuery using the BigQuery Storage API.

The BigQuery Loader is published as a Docker image which you can run on any AWS VM.

docker pull snowplow/bigquery-loader-kinesis:2.0.0

To run the loader, mount the directory containing your config files into the Docker container, and then provide the file paths on the command line.

docker run \
--mount=type=bind,source=/path/to/myconfig,destination=/myconfig \
snowplow/bigquery-loader-kinesis:2.0.0 \
--config=/myconfig/loader.hocon \
--iglu-config /myconfig/iglu.hocon

Here, loader.hocon is the loader's configuration file and iglu.hocon is the Iglu resolver configuration file.

Schemas in BigQuery

For more information on how events are stored in BigQuery, check the mapping between Snowplow schemas and the corresponding BigQuery column types.

Configuring the loader

The loader config file is in HOCON format and lets you configure many aspects of how the loader runs.

The simplest possible config file just needs a description of your pipeline inputs and outputs:

config/config.kinesis.minimal.hocon

See the configuration reference for all possible configuration parameters.

Iglu

The BigQuery Loader requires an Iglu resolver file which describes the Iglu repositories that host your schemas. This should be the same Iglu configuration file that you used in the Enrichment process.

Metrics

The BigQuery Loader can be configured to send the following custom metrics to a StatsD receiver:

Metric | Definition
events_good | A count of events that are successfully written to BigQuery.
events_bad | A count of failed events that could not be loaded, and were instead sent to the bad output stream.
latency_millis | The time in milliseconds from when events are written to the source stream of events (i.e. by Enrich) until when they are read by the loader.
e2e_latency_millis | The end-to-end latency of the Snowplow pipeline. The time in milliseconds from when an event was received by the collector, until it is written into BigQuery.

See the monitoring.metrics.statsd options in the configuration reference for how to configure the StatsD receiver.

Telemetry notice

By default, Snowplow collects telemetry data for BigQuery Loader (since version 2.0.0). Telemetry allows us to understand how our applications are used and helps us build a better product for our users (including you!).

This data is anonymous and minimal, and since our code is open source, you can inspect what’s collected.

If you wish to help us further, you can optionally provide your email (or just a UUID) in the telemetry.userProvidedId configuration setting.

If you wish to disable telemetry, you can do so by setting telemetry.disable to true.
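As a rough sketch, the two telemetry settings named above might appear in loader.hocon as shown below. This assumes the dotted setting names map to a nested telemetry block, which is how HOCON treats dotted paths. The UUID is a placeholder, not a real identifier. Check the configuration reference before relying on this layout.

```hocon
# Illustrative fragment of loader.hocon; merge it into your existing config.
"telemetry": {
  # Set to true to disable telemetry.
  "disable": false

  # Optional: your email address or a UUID of your choosing (placeholder shown).
  "userProvidedId": "00000000-0000-0000-0000-000000000000"
}
```

Because HOCON treats dotted keys and nested objects as equivalent, writing telemetry.disable = true on a single line should have the same effect as setting "disable": true inside the block.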

See our telemetry principles for more information.