What is deployed?
Let's take a look at what is deployed when you follow the quick start guide.
You can easily edit the script or run each of the Terraform modules independently, giving you the flexibility to design the topology of your pipeline according to your needs.
Overview
The architecture diagrams for each cloud (AWS, GCP, Azure) and destination (Postgres, Redshift, Snowflake, Databricks) illustrate the main components of the pipeline and the archival and failed events flow.
Collector load balancer
This is an application load balancer for your inbound HTTP(S) traffic. Traffic is routed from the load balancer to the Collector instances.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Collector
This is an application that receives raw Snowplow events over HTTP(S), serializes them to a Thrift record format, and then writes them to Kinesis (on AWS), Pub/Sub (on GCP) or Kafka / Event Hubs (on Azure). More details can be found here.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
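To make the wiring concrete, here is a minimal Terraform-style sketch of a Collector deployment on AWS. The module source and every input name are illustrative assumptions, not the module's documented interface; refer to the linked Terraform module for the real inputs.

```hcl
# Illustrative only: the module source and variable names are assumptions,
# not the documented interface of the Collector Terraform module.
module "collector" {
  source = "snowplow-devops/collector-kinesis-ec2/aws" # assumption: check the Terraform Registry

  name       = "snowplow-collector"
  vpc_id     = var.vpc_id
  subnet_ids = var.public_subnet_ids

  # Raw payloads go to the raw stream; anything unprocessable goes to the Bad 1 stream.
  good_stream_name = "snowplow-raw"
  bad_stream_name  = "snowplow-bad-1"
}
```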
Enrich
This is an application that reads the raw Snowplow events, validates them (including validation against schemas), enriches them and writes the enriched events to another stream. More details can be found here.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
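The same kind of sketch for Enrich, showing the stream wiring described above (again, the module source and input names are assumptions for illustration only):

```hcl
# Illustrative only: the module source and variable names are assumptions.
module "enrich" {
  source = "snowplow-devops/enrich-kinesis-ec2/aws" # assumption: check the Terraform Registry

  name       = "snowplow-enrich"
  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnet_ids

  in_stream_name       = "snowplow-raw"      # raw events from the Collector
  enriched_stream_name = "snowplow-enriched" # validated and enriched events
  bad_stream_name      = "snowplow-bad-1"    # failed events
}
```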
Iglu
The Iglu stack allows you to manage schemas.
Iglu load balancer
This load balancer routes inbound traffic to the Iglu Server.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Iglu Server
The Iglu Server serves requests for Iglu schemas stored in your schema registry.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Iglu database
This is the Iglu Server database (RDS on AWS, Cloud SQL on GCP and PostgreSQL on Azure) where the Iglu schemas themselves are stored.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
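As a rough, hedged illustration of how the three Iglu components hang together on AWS (all module sources, input names and output names below are assumptions; the linked modules define the real interfaces):

```hcl
# Illustrative only: all sources, inputs and outputs are assumptions.
module "iglu_db" {
  source = "snowplow-devops/rds/aws" # assumption

  name    = "iglu-db"
  db_name = "iglu"
}

module "iglu_server" {
  source = "snowplow-devops/iglu-server-ec2/aws" # assumption

  name    = "iglu-server"
  db_host = module.iglu_db.address     # schemas are stored in the Iglu database
  api_key = var.iglu_super_api_key     # used to read and publish schemas
}

module "iglu_lb" {
  source = "snowplow-devops/alb/aws" # assumption

  name = "iglu-lb" # routes inbound schema requests to the Iglu Server instances
}
```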
Streams
The various streams (Kinesis on AWS, Pub/Sub on GCP and Kafka / Event Hubs on Azure) are key to keeping the pipeline non-lossy: they provide a crucial back-up and serve as the mechanism for driving real-time use cases from the enriched stream.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
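For example, on AWS the three core streams boil down to plain Kinesis resources along these lines (stream names and settings are illustrative; the quick start modules create and size them for you):

```hcl
# A minimal sketch of "a stream per stage" on AWS. Names and capacity settings
# are illustrative; the quick start Terraform creates the real streams.
resource "aws_kinesis_stream" "raw" {
  name             = "snowplow-raw"
  shard_count      = 1
  retention_period = 24 # hours of replay buffer
}

resource "aws_kinesis_stream" "enriched" {
  name             = "snowplow-enriched"
  shard_count      = 1
  retention_period = 24
}

resource "aws_kinesis_stream" "bad_1" {
  name             = "snowplow-bad-1"
  shard_count      = 1
  retention_period = 24
}
```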
AWS only: DynamoDB
On the first run of each of the applications consuming from Kinesis (e.g. Enrich), the Kinesis Client Library (KCL) creates a DynamoDB table to keep track of what it has consumed from the stream so far. Each Kinesis consumer maintains its own checkpoint information.
The DynamoDB autoscaling module enables autoscaling for a target DynamoDB table. Note that there is a `kcl_write_max_capacity` variable, which can be set to your expected RPS, but setting it high will of course incur more cost.
You can find further details in the DynamoDB Terraform module.
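As a rough illustration of what such autoscaling amounts to under the hood (not the module's exact implementation; the table name is illustrative and the maximum capacity corresponds conceptually to `kcl_write_max_capacity`):

```hcl
# Rough sketch only: the DynamoDB autoscaling Terraform module may wire this up
# differently. The checkpoint table name is illustrative.
resource "aws_appautoscaling_target" "kcl_table_write" {
  service_namespace  = "dynamodb"
  resource_id        = "table/snowplow-enrich-kcl-checkpoints"
  scalable_dimension = "dynamodb:table:WriteCapacityUnits"
  min_capacity       = 1
  max_capacity       = 50 # conceptually what kcl_write_max_capacity controls
}

resource "aws_appautoscaling_policy" "kcl_table_write" {
  name               = "kcl-write-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.kcl_table_write.service_namespace
  resource_id        = aws_appautoscaling_target.kcl_table_write.resource_id
  scalable_dimension = aws_appautoscaling_target.kcl_table_write.scalable_dimension

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "DynamoDBWriteCapacityUtilization"
    }
    target_value = 70 # keep consumed write capacity around 70% of provisioned
  }
}
```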
Raw stream
Collector payloads are written to the raw stream, before being picked up by the Enrich application.
The S3 loader (raw) also reads from this raw stream and writes to the raw S3 folder.
Enriched stream
Events that have been validated and enriched by the Enrich application are written to the enriched stream. Depending on your cloud and destination, different loaders pick up the data from this stream, as shown on the diagram above.
The S3 loader (enriched) also reads from this enriched stream and writes to the enriched folder on S3.
Bad 1 stream
This stream is for failed events, which are created when the Collector, Enrich or the various loader applications fail to process the data.
Other streams
On AWS, the Bad 2 stream is for failed events generated by the S3 loader as it tries to write data from the Bad 1 stream to the bad folder on S3. If you selected Redshift, Snowflake or Databricks as your destination, the loader also uses an extra SQS queue internally, as explained in the loading process.
On GCP, if you selected BigQuery as your destination, the Bad Rows stream will contain events that could not be inserted into BigQuery by the loader. This includes data that is not valid against its schema or that is somehow corrupted in a way the loader cannot handle. In addition, the loader uses a few streams internally, as explained in the loading process.
On Azure, there are no other streams.
Archival and failed events
On AWS, aside from your main destination, the data is written to S3 by the S3 Loader applications for archival and to deal with failed events.
See the S3 Loader and S3 Terraform modules for further details on the resources, default and required input variables, and outputs.
The following loaders and folders are available:
- Raw loader, `raw/`: events that come straight out of the Collector and have not yet been validated or enriched by the Enrich application. They are Thrift records and are therefore a little tricky to decode. There are not many reasons to use this data, but backing it up gives you the flexibility to replay it should something go wrong further downstream in the pipeline.
- Enriched loader, `enriched/`: enriched events, in GZipped blobs of enriched TSV. Historically, this has been used as the staging ground for loading into data warehouses via the Batch transformer application. However, it's no longer used in the quick start examples.
- Bad loader, `bad/`: failed events. You can query them using Athena.
Also, if you choose Postgres as your destination, the Postgres loader will load all failed events into Postgres.
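A hedged sketch of the three S3 Loader instances on AWS; the module source, input names and bucket variable are assumptions, so treat the linked S3 Loader module as the source of truth:

```hcl
# Illustrative only: the module source and input names are assumptions.
module "s3_loader_raw" {
  source           = "snowplow-devops/s3-loader-kinesis-ec2/aws" # assumption
  name             = "s3-loader-raw"
  in_stream_name   = "snowplow-raw"
  s3_bucket_name   = var.pipeline_bucket
  s3_object_prefix = "raw/"
}

module "s3_loader_enriched" {
  source           = "snowplow-devops/s3-loader-kinesis-ec2/aws" # assumption
  name             = "s3-loader-enriched"
  in_stream_name   = "snowplow-enriched"
  s3_bucket_name   = var.pipeline_bucket
  s3_object_prefix = "enriched/"
}

module "s3_loader_bad" {
  source           = "snowplow-devops/s3-loader-kinesis-ec2/aws" # assumption
  name             = "s3-loader-bad"
  in_stream_name   = "snowplow-bad-1"
  s3_bucket_name   = var.pipeline_bucket
  s3_object_prefix = "bad/" # failures of this loader itself go to the Bad 2 stream
}
```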
On GCP, if you choose Postgres as your destination, the Postgres loader will load all failed events into Postgres, although they are not written to GCS.
If you choose BigQuery as your destination, there will be a "dead letter" GCS bucket. It will have the suffix `-bq-loader-dead-letter` and will contain events that the loader fails to insert into BigQuery, but not any other kind of failed events. To store all failed events, you will need to manually deploy the GCS Loader application.
On Azure, failed events are currently only available in the `bad` Kafka / Event Hubs topic.
Loaders
Postgres
The Postgres Loader loads enriched events and failed events to Postgres.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
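A hedged sketch of what deploying the Postgres Loader can look like on AWS; the module source and input names are assumptions, and a second instance pointed at the bad stream would handle failed events:

```hcl
# Illustrative only: the module source and input names are assumptions.
module "postgres_loader_enriched" {
  source = "snowplow-devops/postgres-loader-kinesis-ec2/aws" # assumption

  name           = "pg-loader-enriched"
  in_stream_name = "snowplow-enriched"

  db_host     = var.pipeline_db_host
  db_name     = "snowplow"
  db_username = var.pipeline_db_username
  db_password = var.pipeline_db_password
  schema_name = "atomic" # enriched events land in an atomic events table

  # A second instance reading "snowplow-bad-1" would load failed events.
}
```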
Redshift
RDB Loader is a set of applications that loads enriched events into Redshift.
See the RDB Loader Terraform modules for further details on the resources, default and required input variables, and outputs.
BigQuery
The BigQuery Loader is a set of applications that loads enriched events into BigQuery.
See the Terraform module for further details on the resources, default and required input variables, and outputs.
There will be a new dataset available in BigQuery with the suffix `_snowplow_db`. Within this dataset, there will be a table called `events`; all of your collected events will be available there, generally within a few seconds of being sent into the pipeline.
Snowflake
RDB Loader is a set of applications that loads enriched events into Snowflake.
Alternatively, for AWS you can choose the newer Snowflake Streaming Loader, which is a single application.
See the following Terraform modules for further details on the resources, default and required input variables, and outputs:
- RDB Loader:
  - Transformer component (AWS: Kinesis, Azure: Kafka / Event Hubs)
  - Snowflake Loader component (AWS, Azure)
- Snowflake Streaming Loader (AWS)
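To show how the two RDB Loader components relate on AWS, here is a hedged sketch; module sources and input names are assumptions, but the shape (the transformer stages batches, then signals the loader over SQS) follows the loading process described above:

```hcl
# Illustrative only: module sources and input names are assumptions.
module "transformer" {
  source = "snowplow-devops/transformer-kinesis-ec2/aws" # assumption

  name           = "rdb-transformer"
  in_stream_name = "snowplow-enriched"
  s3_bucket_name = var.transformed_bucket # staging area for transformed batches
  sqs_queue_name = "rdb-loader-messages"  # tells the loader a batch is ready
}

module "snowflake_loader" {
  source = "snowplow-devops/snowflake-loader-ec2/aws" # assumption

  name           = "snowflake-loader"
  sqs_queue_name = "rdb-loader-messages" # consumes the transformer's notifications

  snowflake_account   = var.snowflake_account
  snowflake_database  = "SNOWPLOW"
  snowflake_schema    = "ATOMIC"
  snowflake_warehouse = var.snowflake_warehouse
}
```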
Databricks (direct)
RDB Loader is a set of applications that loads enriched events into Databricks.
See the RDB Loader Terraform modules for further details on the resources, default and required input variables, and outputs.
Databricks (via lake)
Lake Loader is an application that loads enriched events into a data lake so that they can be queried via Databricks (or other means).
See the Lake Loader Terraform module for further details on the resources, default and required input variables, and outputs.
The Terraform stack for the pipeline will deploy a storage account and a storage container where the loader will write the data.
Synapse Analytics
Lake Loader is an application that loads enriched events into a data lake so that they can be queried via Synapse Analytics (or Fabric, OneLake, etc.).
See the Lake Loader Terraform module for further details on the resources, default and required input variables, and outputs.
The Terraform stack for the pipeline will deploy a storage account and a storage container where the loader will write the data.
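For context, that storage account and container correspond to standard azurerm resources along these lines (names are illustrative; the pipeline's Terraform stack creates the real ones for you):

```hcl
# Illustrative only: names are placeholders; the pipeline's Terraform stack
# creates equivalent resources for the Lake Loader to write into.
resource "azurerm_storage_account" "lake" {
  name                     = "snowplowlakesa" # must be globally unique
  resource_group_name      = var.resource_group_name
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  is_hns_enabled           = true # hierarchical namespace, i.e. ADLS Gen2
}

resource "azurerm_storage_container" "lake" {
  name                  = "lake-container"
  storage_account_name  = azurerm_storage_account.lake.name
  container_access_type = "private"
}
```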