What is deployed?
Let’s take a look at what is deployed when you follow the quick start guide.
You can very easily edit the script or run each of the Terraform modules independently, giving you the flexibility to design the topology of your pipeline according to your needs.
Overview
- AWS
- GCP
- Azure
- Postgres
- Redshift
- Snowflake
- Databricks
The main components of the pipeline
Collector load balancer
This is an application load balancer for your inbound HTTP(S) traffic. Traffic is routed from the load balancer to the Collector instances.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Collector
This is an application that receives raw Snowplow events over HTTP(S), serializes them to a Thrift record format, and then writes them to Kinesis (on AWS), Pub/Sub (on GCP) or Kafka / Event Hubs (on Azure). More details can be found here.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Enrich
This is an application that reads the raw Snowplow events, validates them (including validation against schemas), enriches them and writes the enriched events to another stream. More details can be found here.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Iglu
The Iglu stack allows you to manage schemas.
Iglu load balancer
This load balances the inbound traffic and routes traffic to the Iglu Server.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Iglu Server
The Iglu Server serves requests for Iglu schemas stored in your schema registry.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
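To make "serves requests for Iglu schemas" more concrete, the following Python sketch turns a schema reference into the URL a client might request from the Iglu Server. It assumes the usual Iglu URI shape (iglu:vendor/name/format/version) and an /api/schemas/... path on the server. The host name and the helper names (parse_iglu_uri, schema_url) are illustrative placeholders, not part of any Snowplow library.

```python
import re
from dataclasses import dataclass

# Assumed URI shape: iglu:<vendor>/<name>/<format>/<model>-<revision>-<addition>
IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[a-zA-Z0-9_.-]+)/(?P<name>[a-zA-Z0-9_-]+)/"
    r"(?P<format>[a-zA-Z0-9_-]+)/(?P<version>\d+-\d+-\d+)$"
)


@dataclass(frozen=True)
class SchemaKey:
    vendor: str
    name: str
    format: str
    version: str


def parse_iglu_uri(uri: str) -> SchemaKey:
    """Split an Iglu URI into its four parts, or raise ValueError."""
    match = IGLU_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid Iglu URI: {uri!r}")
    return SchemaKey(**match.groupdict())


def schema_url(base_url: str, key: SchemaKey) -> str:
    """Build the lookup URL; the /api/schemas prefix is an assumption."""
    base = base_url.rstrip("/")
    return f"{base}/api/schemas/{key.vendor}/{key.name}/{key.format}/{key.version}"


if __name__ == "__main__":
    key = parse_iglu_uri("iglu:com.acme/checkout/jsonschema/1-0-2")
    # iglu.example.com is a placeholder for your Iglu load balancer address
    print(schema_url("https://iglu.example.com", key))
```

In a real deployment the base URL would be the address of the Iglu load balancer described above, and requests for non-public schemas would typically also need an API key.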
Iglu database
This is the Iglu Server database (RDS on AWS, CloudSQL on GCP and PostgreSQL on Azure) where the Iglu schemas themselves are stored.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
Streams
The streams (Kinesis on AWS, Pub/Sub on GCP and Kafka / Event Hubs on Azure) are key to keeping the pipeline non-lossy: they provide a crucial backup and also serve as a mechanism to drive real-time use cases from the enriched stream.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
AWS only — DynamoDB
On the first run of each application that consumes from Kinesis (e.g. Enrich), the Kinesis Client Library (KCL) creates a DynamoDB table to keep track of what that application has consumed from the stream so far. Each Kinesis consumer maintains its own checkpoint information.
The DynamoDB autoscaling module enables autoscaling for a target DynamoDB table. Note that there is a kcl_write_max_capacity variable which can be set to your expected RPS, but setting it high will incur more cost.
You can find further details in the DynamoDB Terraform module.
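The idea of per-consumer checkpoints can be illustrated with a small in-memory model. This is only a sketch: the class and method names are hypothetical, a Python dict stands in for the DynamoDB table, and the real table layout is not reproduced here.

```python
from typing import Dict, Optional, Tuple


class CheckpointStore:
    """In-memory stand-in for the checkpoint tables (illustrative only)."""

    def __init__(self) -> None:
        # (application name, shard id) -> last processed sequence number
        self._table: Dict[Tuple[str, str], int] = {}

    def checkpoint(self, app: str, shard: str, sequence_number: int) -> None:
        """Record progress for one application on one shard."""
        key = (app, shard)
        # Never move a checkpoint backwards.
        if sequence_number > self._table.get(key, -1):
            self._table[key] = sequence_number

    def resume_from(self, app: str, shard: str) -> Optional[int]:
        """Return the last checkpoint, or None if the app has never run."""
        return self._table.get((app, shard))


if __name__ == "__main__":
    store = CheckpointStore()
    store.checkpoint("enrich", "shard-0", 120)
    store.checkpoint("s3-loader-raw", "shard-0", 95)
    # Each consumer resumes from its own position on the same shard.
    print(store.resume_from("enrich", "shard-0"))
    print(store.resume_from("s3-loader-raw", "shard-0"))
```

Because every consumer writes its own checkpoints, write capacity on these tables grows with throughput, which is why a high kcl_write_max_capacity costs more.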
Raw stream
Collector payloads are written to the raw stream, before being picked up by the Enrich application.
The S3 loader (raw) also reads from this raw stream and writes to the raw S3 folder.
Enriched stream
Events that have been validated and enriched by the Enrich application are written to the enriched stream. Depending on your cloud and destination, different loaders pick up the data from this stream, as shown on the diagram above.
The S3 loader (enriched) also reads from this enriched stream and writes to the enriched folder on S3.
Bad 1 stream
This bad stream is for failed events, which get created when the Collector, Enrich or various loader applications fail to process the data.
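The relationship between the raw, enriched and bad streams can be sketched as a toy routing function. Plain Python lists stand in for the streams, and the validation and enrichment callbacks are placeholders supplied by the caller. The real Enrich application does considerably more than this, so treat the function and field names as assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]


def run_enrich(
    raw_stream: List[Event],
    is_valid: Callable[[Event], bool],
    add_context: Callable[[Event], Event],
) -> Tuple[List[Event], List[Event]]:
    """Route each raw payload to the enriched stream or the bad stream."""
    enriched_stream: List[Event] = []
    bad_stream: List[Event] = []
    for payload in raw_stream:
        if is_valid(payload):
            enriched_stream.append(add_context(payload))
        else:
            # Failed events keep the original payload so nothing is lost.
            bad_stream.append({"payload": payload, "failure": "validation"})
    return enriched_stream, bad_stream


if __name__ == "__main__":
    raw = [{"event": "page_view", "app_id": "web"}, {"app_id": "web"}]
    enriched, bad = run_enrich(
        raw,
        is_valid=lambda e: "event" in e,
        add_context=lambda e: {**e, "enriched": True},
    )
    print(len(enriched), len(bad))
```

Every payload ends up in exactly one of the two output streams, which is the property that keeps the pipeline non-lossy.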
Other streams
- AWS
- GCP
- Azure
On AWS, the Bad 2 stream is for failed events generated by the S3 loader as it tries to write from the Bad 1 stream to the bad folder on S3.
If you selected Redshift, Snowflake or Databricks as your destination on AWS, the loader will use an extra SQS queue internally as explained in the loading process.
On GCP, if you selected BigQuery as your destination, the Bad Rows stream will contain events that could not be inserted into BigQuery by the loader. This includes data that is not valid against its schema or that is corrupted in a way that the loader cannot handle. In addition, the loader uses a few streams internally, as explained in the loading process.
On Azure, there are no other streams.
Archival and failed events
- AWS
- GCP
- Azure
On AWS, aside from your main destination, the data is written to S3 by the S3 Loader applications for archival and to deal with failed events.
See the S3 Loader and S3 Terraform modules for further details on the resources, default and required input variables, and outputs.
The following loaders and folders are available:
- Raw loader, raw/: events that come straight out of the Collector and have not yet been validated or enriched by the Enrich application. They are Thrift records and are therefore a little tricky to decode. There are not many reasons to use this data, but backing this data up gives you the flexibility to replay this data should something go wrong further downstream in the pipeline.
- Enriched loader, enriched/: enriched events, in GZipped blobs of enriched TSV. Historically, this has been used as the staging ground for loading into data warehouses via the Batch transformer application. However, it’s no longer used in the quick start examples.
- Bad loader, bad/: failed events. You can query them using Athena.
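If you want to inspect an archived file from the enriched/ folder after downloading it, a minimal Python sketch using only the standard library might look like the following. The function name is hypothetical, and the code assumes each file is a gzip-compressed, tab-separated text file with one event per line.

```python
import csv
import gzip
from typing import Iterator, List


def read_enriched_rows(path: str) -> Iterator[List[str]]:
    """Yield each enriched event as a list of TSV fields."""
    # "rt" decompresses and decodes; newline="" lets csv handle line endings.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters inside JSON fields untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


if __name__ == "__main__":
    import sys

    # Usage: python read_enriched.py <path-to-downloaded-file.gz>
    for fields in read_enriched_rows(sys.argv[1]):
        print(len(fields), "fields; first field:", fields[0])
```

Disabling quote handling matters because some fields hold JSON, which is full of double quotes that a default CSV reader would try to interpret.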
Also, if you choose Postgres as your destination, the Postgres loader will load all failed events into Postgres.
On GCP, if you choose Postgres as your destination, the Postgres loader will load all failed events into Postgres, but they will not be written to GCS.
If you choose BigQuery as your destination, there will be a “dead letter” GCS bucket. It will have the suffix -bq-loader-dead-letter and will contain events that the loader fails to insert into BigQuery, but not any other kind of failed events. To store all failed events, you will need to manually deploy the GCS Loader application.
On Azure, failed events are currently only available in the bad Kafka / Event Hubs topic.
Loaders
- Postgres
- Redshift
- BigQuery
- Snowflake
- Databricks (direct)
- Databricks (via lake)
- Synapse Analytics
The Postgres Loader loads enriched events and failed events to Postgres.
For further details on the resources, default and required input variables, and outputs, see the Terraform module (AWS, GCP, Azure).
RDB Loader is a set of applications that loads enriched events into Redshift.
See the following Terraform modules for further details on the resources, default and required input variables, and outputs:
The BigQuery Loader is a set of applications that loads enriched events into BigQuery.
See the Terraform module for further details on the resources, default and required input variables, and outputs.
There will be a new dataset available in BigQuery with the suffix _snowplow_db. Within this dataset, there will be a table called events. All of your collected events will be available here, generally within a few seconds after they are sent into the pipeline.
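As a quick way to check that events are arriving, you could build a query against that table. The Python sketch below only assembles the table name and SQL text. It assumes the dataset name is your deployment prefix followed by _snowplow_db, and the project ID, the prefix and both function names are placeholders for illustration. Running the query is left to the BigQuery console or a client of your choice.

```python
def events_table(project_id: str, prefix: str) -> str:
    """Fully qualified name of the events table (naming is an assumption)."""
    return f"{project_id}.{prefix}_snowplow_db.events"


def latest_events_query(project_id: str, prefix: str, limit: int = 10) -> str:
    """SQL that lists the most recently collected events."""
    table = events_table(project_id, prefix)
    return (
        "SELECT event_id, event_name, collector_tstamp\n"
        f"FROM `{table}`\n"
        "ORDER BY collector_tstamp DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # "my-project" and "sp" are hypothetical values for your own deployment
    print(latest_events_query("my-project", "sp"))
```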
RDB Loader is a set of applications that loads enriched events into Snowflake.
Alternatively, for AWS you can choose the newer Snowflake Streaming Loader, which is a single application.
See the following Terraform modules for further details on the resources, default and required input variables, and outputs:
- RDB Loader:
  - Transformer component (AWS: Kinesis; Azure: Kafka / Event Hubs)
  - Snowflake Loader component (AWS, Azure)
- Snowflake Streaming Loader (AWS)
RDB Loader is a set of applications that loads enriched events into Databricks.
See the following Terraform modules for further details on the resources, default and required input variables, and outputs:
Lake Loader is an application that loads enriched events into a data lake so that they can be queried via Databricks (or other means).
See the Lake Loader Terraform module for further details on the resources, default and required input variables, and outputs.
The Terraform stack for the pipeline will deploy a storage account and a storage container where the loader will write the data.
Lake Loader is an application that loads enriched events into a data lake so that they can be queried via Synapse Analytics (or Fabric, OneLake, etc).
See the Lake Loader Terraform module for further details on the resources, default and required input variables, and outputs.
The Terraform stack for the pipeline will deploy a storage account and a storage container where the loader will write the data.