
What is deployed?

info
This documentation only applies to Snowplow Community Edition. See the feature comparison page for more information about the different Snowplow offerings.

Let’s take a look at what is deployed when you follow the quick start guide.

tip

You can edit the script or run each of the Terraform modules independently, which lets you design the topology of your pipeline according to your needs.

Overview

[Diagrams: the main components of the pipeline; archival and failed events]

Collector load balancer

This is an application load balancer for your inbound HTTP(S) traffic. Traffic is routed from the load balancer to the Collector instances.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).

Collector

This is an application that receives raw Snowplow events over HTTP(S), serializes them to a Thrift record format, and then writes them to Kinesis (on AWS), Pub/Sub (on GCP) or Kafka / Event Hubs (on Azure). More details can be found here.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).

Enrich

This is an application that reads the raw Snowplow events, validates them (including validation against schemas), enriches them and writes the enriched events to another stream. More details can be found here.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).

Iglu

The Iglu stack allows you to manage schemas.

Iglu load balancer

This load balancer routes inbound traffic to the Iglu Server.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).

Iglu Server

The Iglu Server serves requests for Iglu schemas stored in your schema registry.
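To make "requests for Iglu schemas" more concrete, here is a minimal sketch of how a schema reference of the form iglu:vendor/name/format/version could be split into its parts and turned into a request path. The example URI (com.acme/button_click) is made up for illustration, and the /api/schemas/... path layout is an assumption about the server API rather than something stated on this page; check the Iglu Server documentation before relying on it.

```python
from typing import NamedTuple


class SchemaKey(NamedTuple):
    vendor: str
    name: str
    fmt: str
    version: str


def parse_iglu_uri(uri: str) -> SchemaKey:
    """Split an 'iglu:vendor/name/format/version' reference into its parts."""
    prefix = "iglu:"
    if not uri.startswith(prefix):
        raise ValueError(f"not an Iglu URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected vendor/name/format/version in {uri!r}")
    return SchemaKey(*parts)


def schema_path(key: SchemaKey) -> str:
    # ASSUMPTION: this path layout is illustrative, not taken from this page.
    return f"/api/schemas/{key.vendor}/{key.name}/{key.fmt}/{key.version}"


# Hypothetical schema reference, used only for illustration.
key = parse_iglu_uri("iglu:com.acme/button_click/jsonschema/1-0-0")
path = schema_path(key)
```

The resulting path would be appended to the address of the Iglu load balancer described above.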

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).

Iglu database

This is the Iglu Server database (RDS on AWS, CloudSQL on GCP and PostgreSQL on Azure) where the Iglu schemas themselves are stored.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).

Streams

The streams (Kinesis on AWS, Pub/Sub on GCP and Kafka / Event Hubs on Azure) are key to a non-lossy pipeline: they provide a crucial back-up and, in the case of the enriched stream, a mechanism to drive real-time use cases.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).

AWS only — DynamoDB

On the first run of each application that consumes from Kinesis (e.g. Enrich), the Kinesis Client Library (KCL) creates a DynamoDB table to keep track of what that application has consumed from the stream so far. Each Kinesis consumer maintains its own checkpoint information.
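The per-consumer checkpointing described above can be illustrated with a small in-memory model. This is a conceptual sketch only: CheckpointTable and consume are hypothetical names, a Python list stands in for a stream shard, and the real DynamoDB table holds more state than a single position.

```python
class CheckpointTable:
    """In-memory stand-in for one application's checkpoint table."""

    def __init__(self):
        self._positions = {}

    def get(self, shard_id):
        # A shard that was never read starts at position 0.
        return self._positions.get(shard_id, 0)

    def put(self, shard_id, position):
        self._positions[shard_id] = position


def consume(stream, table, shard_id, batch_size):
    """Read the next batch after the stored checkpoint, then advance it."""
    start = table.get(shard_id)
    batch = stream[start:start + batch_size]
    table.put(shard_id, start + len(batch))
    return batch


stream = ["e1", "e2", "e3", "e4", "e5"]

# Each consuming application has its own table, so progress is independent.
enrich_table = CheckpointTable()
loader_table = CheckpointTable()

first = consume(stream, enrich_table, "shard-0", 2)
second = consume(stream, enrich_table, "shard-0", 2)
loader_first = consume(stream, loader_table, "shard-0", 3)
```

Because the two applications keep separate checkpoints, the second consumer starts from the beginning of the stream even though the first has already moved past the first four records.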

The DynamoDB autoscaling module enables autoscaling for a target DynamoDB table. You can set the kcl_write_max_capacity variable to your expected RPS, but higher values incur more cost.

You can find further details in the DynamoDB Terraform module.

Raw stream

Collector payloads are written to the raw stream, before being picked up by the Enrich application.

AWS only

The S3 loader (raw) also reads from this raw stream and writes to the raw S3 folder.

Enriched stream

Events that have been validated and enriched by the Enrich application are written to the enriched stream. Depending on your cloud and destination, different loaders pick up the data from this stream, as shown on the diagram above.

AWS only

The S3 loader (enriched) also reads from this enriched stream and writes to the enriched folder on S3.

Bad 1 stream

This stream carries failed events, which are created when the Collector, Enrich or one of the loader applications fails to process the data.
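The flow across the raw, enriched and bad streams can be sketched as a toy model. Everything here is illustrative: the Streams class, the is_valid check and the record shapes are hypothetical stand-ins, and the real applications validate against schemas and write to Kinesis, Pub/Sub or Kafka rather than to Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class Streams:
    raw: list = field(default_factory=list)
    enriched: list = field(default_factory=list)
    bad: list = field(default_factory=list)


def collect(streams, payload):
    """Collector step: write the received payload to the raw stream."""
    streams.raw.append(payload)


def is_valid(payload):
    # Hypothetical stand-in for schema validation.
    return isinstance(payload, dict) and "event" in payload


def enrich(streams):
    """Enrich step: read the raw stream and route each record."""
    # The raw stream is read, not drained, so other consumers
    # (such as a raw loader) can still read the same records.
    for payload in streams.raw:
        if is_valid(payload):
            streams.enriched.append({**payload, "enriched": True})
        else:
            streams.bad.append({"payload": payload, "error": "validation failed"})


streams = Streams()
collect(streams, {"event": "page_view"})
collect(streams, {"unexpected": 1})
enrich(streams)
```

After this run, one record sits on the enriched stream, one on the bad stream, and both original payloads remain on the raw stream.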

Other streams

The Bad 2 stream is for failed events generated by the S3 loader as it tries to write from the Bad 1 stream to the bad folder on S3.

If you selected Redshift, Snowflake or Databricks as your destination, the loader will use an extra SQS queue internally, as explained in the description of the loading process.

Archival and failed events

Aside from your main destination, the data is written to S3 by the S3 Loader applications for archival and to deal with failed events.

See the S3 Loader and S3 Terraform modules for further details on the resources, default and required input variables, and outputs.

The following loaders and folders are available:

  • Raw loader, raw/: events that come straight out of the Collector and have not yet been validated or enriched by the Enrich application. They are Thrift records, so decoding them takes extra work. There are few reasons to use this data directly, but backing it up lets you replay it should something go wrong further downstream in the pipeline.
  • Enriched loader, enriched/: enriched events, in GZipped blobs of enriched TSV. Historically, this has been used as the staging ground for loading into data warehouses via the Batch transformer application. However, it’s no longer used in the quick start examples.
  • Bad loader, bad/: failed events. You can query them using Athena.
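As an illustration of the enriched/ format mentioned in the list above, the following sketch decompresses a gzipped blob and splits it into tab-separated fields using only the Python standard library. The three-column sample is made up for the example and does not reflect the real enriched event column layout; reading the object from S3 is left out.

```python
import gzip


def read_enriched_blob(blob: bytes) -> list:
    """Decompress a gzipped TSV blob into a list of rows (lists of fields)."""
    text = gzip.decompress(blob).decode("utf-8")
    # split("\t") keeps empty fields, which matters for sparse TSV rows.
    return [line.split("\t") for line in text.splitlines() if line]


# Hypothetical sample rows; the real format has many more columns.
sample_lines = ["my-app\tweb\tpage_view", "my-app\t\tscreen_view"]
blob = gzip.compress("\n".join(sample_lines).encode("utf-8"))

rows = read_enriched_blob(blob)
```

Keeping empty fields in place means each value stays at its column index, so a row can be matched against the column order of the enriched format.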

Also, if you choose Postgres as your destination, the Postgres loader will load all failed events into Postgres.

Loaders

The Postgres Loader loads enriched events and failed events to Postgres.

For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).