What is deployed?

Let’s take a look at what is deployed when you follow the quick start guide.

tip

You can very easily edit the script or run each of the Terraform modules independently, giving you the flexibility to design the topology of your pipeline according to your needs.

Overview

The main components of the pipeline are shown below.

[Diagram: the main pipeline components, including archival and failed events, with the Redshift Loader as the example destination.]

For more information about the Redshift Loader, see the documentation on the available destinations.

Collector load balancer

This is an application load balancer for your inbound HTTP(S) traffic. Traffic is routed from the load balancer to the Collector instances.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).

Collector

This is an application that receives raw Snowplow events over HTTP(S), serializes them, and then writes them to Kinesis (on AWS), Pub/Sub (on GCP) or Kafka / Event Hubs (on Azure). More details can be found here.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
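As a sketch of what the Collector accepts, the snippet below builds the querystring for a pixel-style `GET /i` request. The parameter names follow the Snowplow tracker protocol; the collector domain is a placeholder for your load balancer's DNS name, and real trackers send many more fields.

```python
from urllib.parse import urlencode

# Placeholder domain -- in practice this is your Collector load balancer's DNS name.
COLLECTOR = "https://collector.example.com"

def pixel_url(app_id: str, event_type: str = "pv") -> str:
    """Build a minimal pixel-style GET request URL.

    Parameter names follow the Snowplow tracker protocol:
    e = event type ("pv" = page view), p = platform, aid = application ID.
    """
    params = {"e": event_type, "p": "web", "aid": app_id}
    return f"{COLLECTOR}/i?{urlencode(params)}"

print(pixel_url("shop"))
# https://collector.example.com/i?e=pv&p=web&aid=shop
```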

Enrich

This is an application that reads the raw Snowplow events, validates them (including validation against schemas), enriches them and writes the enriched events to another stream. More details can be found here.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
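Validation against schemas relies on self-describing data: each entity carries an Iglu schema URI of the form `iglu:<vendor>/<name>/<format>/<model>-<revision>-<addition>`. A minimal sketch of checking that shape (a format check only, not full schema validation):

```python
import re

# iglu:<vendor>/<name>/<format>/<model>-<revision>-<addition>
IGLU_URI = re.compile(
    r"^iglu:"
    r"(?P<vendor>[a-zA-Z0-9_.-]+)/"
    r"(?P<name>[a-zA-Z0-9_-]+)/"
    r"(?P<format>[a-zA-Z0-9_-]+)/"
    r"(?P<version>\d+-\d+-\d+)$"
)

def is_valid_iglu_uri(uri: str) -> bool:
    """Check that a string has the shape of an Iglu schema URI."""
    return IGLU_URI.match(uri) is not None

# A schema shipped with every Snowplow pipeline:
print(is_valid_iglu_uri("iglu:com.snowplowanalytics.snowplow/page_view/jsonschema/1-0-0"))  # True
print(is_valid_iglu_uri("not-a-schema"))  # False
```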

Iglu

The Iglu stack allows you to manage schemas.

Iglu load balancer

This load balancer distributes inbound HTTP(S) traffic and routes it to the Iglu Server.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).

Iglu Server

The Iglu Server serves requests for Iglu schemas stored in your schema registry.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
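The lookup is simple enough to sketch: given an `iglu:` schema URI, the server resolves it at a path derived from the URI. This is a minimal sketch; the host, and the exact API surface of your Iglu Server deployment, may differ.

```python
def lookup_path(iglu_uri: str) -> str:
    """Translate an 'iglu:...' schema URI into the path the Iglu Server
    serves it under (a sketch of the schema-lookup route)."""
    # Strip the "iglu:" scheme, keeping vendor/name/format/version intact.
    return "/api/schemas/" + iglu_uri.removeprefix("iglu:")

print(lookup_path("iglu:com.snowplowanalytics.snowplow/page_view/jsonschema/1-0-0"))
# /api/schemas/com.snowplowanalytics.snowplow/page_view/jsonschema/1-0-0
```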

Iglu database

This is the Iglu Server database (RDS on AWS, CloudSQL on GCP and PostgreSQL on Azure) where the Iglu schemas themselves are stored.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).

Streams

The various streams (Kinesis on AWS, Pub/Sub on GCP and Kafka / Event Hubs on Azure) are key to ensuring a non-lossy pipeline: they provide a durable buffer between applications, and they serve as a mechanism for driving real-time use cases from the enriched stream.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).

AWS only — DynamoDB

On the first run of each application consuming from Kinesis (e.g. Enrich), the Kinesis Client Library creates a DynamoDB table to keep track of what that application has consumed from the stream so far. Each Kinesis consumer maintains its own checkpoint information.

The DynamoDB autoscaling module enables autoscaling for a target DynamoDB table. Note the kcl_write_max_capacity variable, which can be set to your expected requests per second; setting it higher will, of course, incur more cost.

You can find further details in the DynamoDB Terraform module.

Raw stream

Collector payloads are written to the raw stream, before being picked up by the Enrich application.

Enriched stream

Events that have been validated and enriched by the Enrich application are written to the enriched stream. Depending on your cloud and destination, different loaders pick up the data from this stream, as shown on the diagram above.
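Enriched events are tab-separated lines with a fixed column order. The sketch below turns one line into a dict using only the first few of the 100+ fields; in practice the Snowplow Analytics SDKs handle the full format, and the sample line is illustrative.

```python
# The first few columns of the enriched TSV format. The full format has
# 100+ fields; the Snowplow Analytics SDKs cover all of them.
FIELDS = [
    "app_id", "platform", "etl_tstamp", "collector_tstamp",
    "dvce_created_tstamp", "event", "event_id",
]

def parse_enriched(line: str) -> dict:
    """Zip a TSV line against the known column names (truncated to FIELDS)."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

# Illustrative line, matching the columns above.
sample = "shop\tweb\t2024-01-01 00:00:00\t2024-01-01 00:00:00\t2024-01-01 00:00:00\tpage_view\tad9f6c2a-0000-0000-0000-000000000000"
event = parse_enriched(sample)
print(event["event"])  # page_view
```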

AWS only

The S3 loader (enriched) also reads from this enriched stream and writes to the enriched folder on S3.

Bad 1 stream

This bad stream is for failed events, which are created when the Collector, Enrich, or various loader applications fail to process the data.
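Failed events are themselves self-describing JSON, pairing a failed-events ("bad rows") schema with the failure details and the original payload. A hedged sketch of pulling the failure type out of one row; the sample row here is illustrative, not a verbatim bad row.

```python
import json

# Illustrative failed-event row. Real rows follow the
# com.snowplowanalytics.snowplow.badrows schemas.
row = json.dumps({
    "schema": "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0",
    "data": {"failure": {"messages": ["field 'price' is missing"]}, "payload": {}},
})

def failure_type(bad_row: str) -> str:
    """Return the schema name of the bad row, e.g. 'schema_violations'."""
    schema_uri = json.loads(bad_row)["schema"]
    # iglu:<vendor>/<name>/<format>/<version>  ->  <name>
    return schema_uri.split("/")[1]

print(failure_type(row))  # schema_violations
```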

Other streams

The Bad 2 stream is for failed events generated by the S3 loader as it tries to write from the Bad 1 stream to the bad folder on S3.

If you selected Redshift, Snowflake or Databricks as your destination, the loader will use an extra SQS stream internally as explained in the reference documentation.

Archival and failed events

Aside from your main destination, the data is written to S3 by the S3 Loader applications for archival and to deal with failed events.

See the S3 Loader and S3 Terraform modules for further details on the resources, default and required input variables, and outputs.

The following loaders and folders are available:

  • Enriched loader, enriched/: enriched events, in GZipped blobs of enriched TSV. Historically, this has been used as the staging ground for loading into data warehouses via the Batch transformer application. However, it’s no longer used in the quick start examples.
  • Bad loader, bad/: failed events. You can query them using Athena.

Loaders

RDB Loader is a set of applications that load enriched events into Redshift.

See the following Terraform modules for further details on the resources, default and required input variables, and outputs: