What is deployed?
Let’s take a look at what is deployed when you follow the quick start guide.
You can very easily edit the script or run each of the Terraform modules independently, giving you the flexibility to design the topology of your pipeline according to your needs.
Overview
- AWS
 - GCP
 - Azure
 
- Redshift
 - Snowflake
 - Databricks
 
The main components of the pipeline
Archival and failed events
Redshift Loader
For more information about the Redshift Loader, see the documentation on the available destinations.
Collector load balancer
This is an application load balancer for your inbound HTTP(S) traffic. Traffic is routed from the load balancer to the Collector instances.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Collector
This is an application that receives raw Snowplow events over HTTP(S), serializes them, and then writes them to Kinesis (on AWS), Pub/Sub (on GCP) or Kafka / Event Hubs (on Azure). More details can be found here.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Enrich
This is an application that reads the raw Snowplow events, validates them (including validation against schemas), enriches them and writes the enriched events to another stream. More details can be found here.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Iglu
The Iglu stack allows you to manage schemas.
Iglu load balancer
This load balances the inbound traffic and routes traffic to the Iglu Server.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Iglu Server
The Iglu Server serves requests for Iglu schemas stored in your schema registry.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Iglu database
This is the Iglu Server database (RDS on AWS, CloudSQL on GCP and PostgreSQL on Azure) where the Iglu schemas themselves are stored.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Streams
The various streams (Kinesis on AWS, Pub/Sub on GCP and Kafka / Event Hubs on Azure) are a key component of ensuring a non-lossy pipeline, providing crucial back-up, as well as serving as a mechanism to drive real time use cases from the enriched stream.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
AWS only — DynamoDB
On the first run of each of the applications consuming from Kinesis (e.g. Enrich), the Kinesis Connectors Library creates a DynamoDB table to keep track of what they have consumed from the stream so far. Each Kinesis consumer maintains its own checkpoint information.
The DynamoDB autoscaling module enables autoscaling for a target DynamoDB table. Note that there is a kcl_write_max_capacity variable which can be set to your expected RPS, but setting it high will of course incur more cost.
You can find further details in the DynamoDB Terraform module.
Raw stream
Collector payloads are written to the raw stream, before being picked up by the Enrich application.
Enriched stream
Events that have been validated and enriched by the Enrich application are written to the enriched stream. Depending on your cloud and destination, different loaders pick up the data from this stream, as shown on the diagram above.
The S3 loader (enriched) also reads from this enriched stream and writes to the enriched folder on S3.
Bad 1 stream
This bad stream is for failed events, which get created when the Collector, Enrich or various loader applications fail to process the data.
Other streams
- AWS
 - GCP
 - Azure
 
The Bad 2 stream is for failed events generated by the S3 loader as it tries to write from the Bad 1 stream to the bad folder on S3.
If you selected Redshift, Snowflake or Databricks as your destination, the loader will use an extra SQS stream internally as explained in the reference documentation.
If you selected BigQuery as your destination, the Bad Rows stream will contain events that could not be inserted into BigQuery by the loader. This includes data that is not valid against its schema or that is somehow corrupted in a way that the loader cannot handle. In addition, the loader users a few streams internally, as explained in the reference documentation.
No other streams.
Archival and failed events
- AWS
 - GCP
 - Azure
 
Aside from your main destination, the data is written to S3 by the S3 Loader applications for archival and to deal with failed events.
See the S3 Loader and S3 Terraform modules for further details on the resources, default and required input variables, and outputs.
The following loaders and folders are available:
- Enriched loader, 
enriched/: enriched events, in GZipped blobs of enriched TSV. Historically, this has been used as the staging ground for loading into data warehouses via the Batch transformer application. However, it’s no longer used in the quick start examples. - Bad loader, 
bad/: failed events. You can query them using Athena. 
To store all failed events, you will need to manually deploy the GCS Loader application.
Currently, failed events are only available in the bad Kafka / Event Hubs topic.
Loaders
- Redshift
 - BigQuery
 - Snowflake
 - Databricks (direct)
 - Databricks (via lake)
 - Synapse Analytics
 
RDB Loader is a set of applications that loads enriched events into Redshift.
See the following Terraform modules for further details on the resources, default and required input variables, and outputs:
The BigQuery Loader is a set of applications that loads enriched events into BigQuery.
See the Terraform module for further details on the resources, default and required input variables, and outputs.
There will be a new dataset available in BigQuery with the suffix _snowplow_db. Within this dataset, there will be a table called events — all of your collected events will be available here generally within a few seconds after they are sent into the pipeline.
Snowflake Streaming Loader is an application that loads data into Snowflake.
See the following Terraform module for further details on the resources, default and required input variables, and outputs:
- Snowflake Streaming Loader (AWS)
 
RDB Loader is a set of applications that loads enriched events into Databricks.
See the following Terraform modules for further details on the resources, default and required input variables, and outputs:
Lake Loader is an application that loads enriched events into a data lake so that they can be queried via Databricks (or other means).
See the Lake Loader Terraform module for further details on the resources, default and required input variables, and outputs.
The Terraform stack for the pipeline will deploy a storage account and a storage container where the loader will write the data.
Lake Loader is an application that loads enriched events into a data lake so that they can be queried via Synapse Analytics (or Fabric, OneLake, etc).
See the Lake Loader Terraform module for further details on the resources, default and required input variables, and outputs.
The Terraform stack for the pipeline will deploy a storage account and a storage container where the loader will write the data.