
Quick start guide

info
This documentation only applies to Snowplow Community Edition. See the feature comparison page for more information about the different Snowplow offerings.

This guide will take you through how to spin up a Snowplow Community Edition pipeline using the Snowplow Terraform modules. (Not familiar with Terraform? Take a look at Infrastructure as code with Terraform.)

Prerequisites

Sign up for Snowplow Community Edition and follow the link in the email to get a copy of the repository containing the Terraform code.

Install Terraform 1.0.0 or higher. Follow the instructions to make sure the terraform binary is available on your PATH. You can also use tfenv to manage your Terraform installation.

Install AWS CLI version 2.

Configure the CLI against a role that has the AdministratorAccess policy attached.

caution

AdministratorAccess allows all actions on all AWS services and shouldn't be used in production.
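
In practice, one way to configure and sanity-check the CLI (assuming you have created an access key for a suitable role or user) is:

aws configure                  # enter the access key, secret key, default region and output format
aws sts get-caller-identity    # confirm which account and role/user the CLI is now using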

Details on how to configure the AWS Terraform Provider can be found on the registry.

Storage options

The sections below will guide you through setting up your destination to receive Snowplow data, but for now here is an overview.

Warehouse          | AWS | GCP | Azure
Postgres           | ✅  | ✅  | —
Snowflake          | ✅  | ✅  | ✅
Databricks         | ✅  | ✅  | ✅
Redshift           | ✅  | —   | —
BigQuery           | —   | ✅  | —
Synapse Analytics  | —   | —   | ✅

Real-time streaming options

As part of the deployment, your data will be available in real-time streams corresponding to the cloud provider you have chosen. You can consume data directly from these streams, either in addition to or instead of the data warehouse.

Stream     | AWS | GCP | Azure
Kinesis    | ✅  | —   | —
Pub/Sub    | —   | ✅  | —
EventHubs  | —   | —   | ✅

For an out-of-the-box solution for accessing this data in real-time streams, you can check out our Snowbridge tool. Alternatively, if you want to develop a custom consumer, you can leverage our Analytics SDKs to parse the event formats more easily.
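
If you deploy on AWS and want a quick peek at the enriched data, one option is to read a few records from the Kinesis stream with the AWS CLI. The stream name below is a placeholder for whatever your pipeline module created, and the returned records are base64-encoded TSV.

aws kinesis get-shard-iterator \
  --stream-name <your-enriched-stream-name> \
  --shard-id shardId-000000000000 \
  --shard-iterator-type TRIM_HORIZON
aws kinesis get-records --shard-iterator <iterator-from-the-previous-command>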

note

EventHubs topics are deployed in a Kafka-compatible model, so you can consume from them using standard Kafka connector libraries.

On AWS, there are four main storage options to select from: Postgres, Redshift, Snowflake and Databricks. For Snowflake, you can choose between the newer Streaming Loader (recommended) and the RDB Loader. Additionally, there is an S3 option, which is primarily used to archive enriched (and/or raw) events and to store failed events.

We recommend loading data into only a single destination, but nothing prevents you from loading into multiple destinations with the same pipeline (e.g. for testing purposes).

Set up a VPC to deploy into

AWS provides a default VPC in every region for your sub-account. Make a note of the identifiers of this VPC and the associated subnets for later parts of the deployment.
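
If you prefer the command line, one way to look these up (assuming the AWS CLI is configured for the right region) is:

aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query 'Vpcs[].VpcId'
aws ec2 describe-subnets --filters Name=vpc-id,Values=<your-vpc-id> --query 'Subnets[].SubnetId'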

Set up Iglu Server

The first step is to set up the Iglu Server stack required by the rest of your pipeline.

This will allow you to create and evolve your own custom events and entities. Iglu Server stores the schemas for your events and entities and fetches them as your events are processed by the pipeline.

Step 1: Update the iglu_server input variables

Once you have cloned the code repository, you will need to navigate to the iglu_server directory to update the input variables in terraform.tfvars.

cd terraform/aws/iglu_server/default
nano terraform.tfvars # or other text editor of your choosing

To update your input variables, you’ll need to know a few things:

  • Your IP Address. Help.
  • A UUIDv4 to be used as the Iglu Server’s API Key. Help.
  • How to generate an SSH Key.
tip

On most systems, you can generate an SSH Key with: ssh-keygen -t rsa -b 4096. This will output where your public key is stored, for example: ~/.ssh/id_rsa.pub. You can get the value with cat ~/.ssh/id_rsa.pub.
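
Similarly, on most systems you can generate a UUIDv4 and look up your public IP address with something like the following (checkip.amazonaws.com is just one of several services that echo your IP back):

uuidgen                             # prints a UUID (random/v4 on most systems) to use as the Iglu Server API Key
curl https://checkip.amazonaws.com  # prints the public IP address you are connecting from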

Telemetry notice

By default, Snowplow collects telemetry data for each of the Quick Start Terraform modules. Telemetry allows us to understand how our applications are used and helps us build a better product for our users (including you!).

This data is anonymous and minimal, and since our code is open source, you can inspect what’s collected.

If you wish to help us further, you can optionally provide your email (or just a UUID) in the user_provided_id variable.

If you wish to disable telemetry, you can do so by setting telemetry_enabled to false.

See our telemetry principles for more information.
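
To illustrate, a filled-in terraform.tfvars for the Iglu Server might look roughly like the sketch below. The exact variable names can differ between module versions, so follow the comments in the file you cloned:

# illustrative values only -- keep the variable names used in your terraform.tfvars
ssh_public_key     = "ssh-rsa AAAA..."                       # contents of ~/.ssh/id_rsa.pub
ssh_ip_allowlist   = ["203.0.113.10/32"]                     # your IP address in CIDR notation
iglu_super_api_key = "00000000-0000-4000-8000-000000000000"  # the UUIDv4 you generated
telemetry_enabled  = true
user_provided_id   = "you@example.com"                       # optional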

Step 2: Run the iglu_server Terraform script

You can now use Terraform to create your Iglu Server stack.

You will be asked to select a region; you can find more information about available AWS regions here.

terraform init
terraform plan
terraform apply

The deployment will take roughly 15 minutes.

Once the deployment is done, it will output iglu_server_dns_name. Make a note of this; you'll need it when setting up your pipeline. If you have attached a custom SSL certificate and set up your own DNS records, then you don't need this value.
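
If you need the value again later, Terraform can print it from the saved state:

terraform output iglu_server_dns_name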

Prepare the destination

Depending on the destination(s) you've chosen, you might need to perform a few extra steps to prepare for loading data there.

tip

Feel free to go ahead with these while your Iglu Server stack is deploying.

For Postgres, no extra steps are needed — the necessary resources like a PostgreSQL instance, database, table and user will be created by the Terraform modules.

Set up the pipeline

In this section, you will update the input variables for the Terraform module, and then run the Terraform script to set up your pipeline. At the end you will have a working Snowplow pipeline ready to receive web, mobile or server-side data.

Step 1: Update the pipeline input variables

Navigate to the pipeline directory in the code repository and update the input variables in terraform.tfvars.

cd terraform/aws/pipeline/default
nano terraform.tfvars # or other text editor of your choosing

To update your input variables, you’ll need to know a few things:

  • Your IP Address. Help.
  • Your Iglu Server’s domain name from the previous step
  • Your Iglu Server’s API Key from the previous step
  • How to generate an SSH Key.
tip

On most systems, you can generate an SSH Key with: ssh-keygen -t rsa -b 4096. This will output where your public key is stored, for example: ~/.ssh/id_rsa.pub. You can get the value with cat ~/.ssh/id_rsa.pub.

Destination-specific variables

As mentioned above, there are several options for the pipeline’s destination database. For each destination you’d like to configure, set the <destination>_enabled variable (e.g. redshift_enabled) to true and fill all the relevant configuration options (starting with <destination>_).

When in doubt, refer back to the destination setup section where you have picked values for many of the variables.
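
For example, enabling Redshift might look roughly like the sketch below; the individual redshift_* option names are placeholders here, so use the ones documented in your terraform.tfvars.

redshift_enabled = true
# ...followed by the relevant redshift_* options from your terraform.tfvars,
# e.g. cluster endpoint, database, schema and loader credentials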

Snowflake + Streaming Loader

If you are using Snowflake with the Streaming Loader, you will need to provide the private key you generated during destination setup.

Here’s how to do it:

snowflake_streaming_loader_private_key = <<EOT
-----BEGIN PRIVATE KEY-----
MIIEvAIBADANBgkqhkiG9w0BAQEFAASCBKYwggSiAgEAAoIBAQCd2dEYSUp3hdyK
5hWwpkNGG56hLFWDK47oMf/Niu+Yh+8Wm4p9TlPje+UuKOnK5N4nAbM4hlhKyEJv
...
99Xil8uas3v7o2OSe7FfLA==
-----END PRIVATE KEY-----
EOT
caution

For all active destinations, change any _password setting to a value that only you know.

If you are using Postgres, set postgres_db_ip_allowlist to a list of CIDR addresses that need to access the database — these can be systems like BI tools, or your local IP address so that you can query the database from your laptop.
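
For instance (the CIDR values below are placeholders, and the password variable name is illustrative — use the *_password variables present in your file):

postgres_db_ip_allowlist = ["203.0.113.10/32", "198.51.100.0/24"]    # e.g. your laptop and a BI tool
# postgres_db_password   = "<a strong password that only you know>"  # name illustrative -- see your terraform.tfvars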

Step 2: Run the pipeline Terraform script

You will be asked to select a region; you can find more information about available AWS regions here.

terraform init
terraform plan
terraform apply

This will output your collector_dns_name, postgres_db_address, postgres_db_port and postgres_db_id.

Make a note of the outputs: you'll need them when sending events and (in some cases) connecting to your data.
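
To pull these values out again, or to quickly check that the Collector is reachable, something like this works (the /health path assumes the default Collector setup):

terraform output collector_dns_name
curl http://<collector_dns_name>/health   # a healthy Collector responds with OK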

Empty outputs

Depending on your cloud and chosen destination, some of these outputs might be empty — you can ignore those.

If you have attached a custom SSL certificate and set up your own DNS records, then you don't need collector_dns_name, as you will use your own DNS record to send events from the Snowplow trackers.

Terraform errors

For solutions to some common Terraform errors that you might encounter when running terraform plan or terraform apply, see the FAQs section.

Configure the destination

For Postgres, no extra steps are needed.

Configure HTTPS (optional)

Now that you have a working pipeline, you can optionally configure your Collector and Iglu Server to have an HTTPS-enabled endpoint. This might be required in some cases to track events on strictly SSL-only websites, as well as to enable first-party tracking (by putting the Collector endpoint on the same sub-domain as your website).

  1. Navigate to Amazon Certificate Manager (ACM) in the AWS Console
  2. Request a public certificate from the ACM portal for the domain you want to host these endpoints under (e.g. for the Collector this might be c.acme.com) - make sure you are in the same region as your pipeline
  • Fully qualified domain name will be something like c.acme.com
  • Validation method is whatever works best for you - generally DNS validation is going to be the easiest to action
  • Key algorithm should be left as RSA 2048
  3. Once you have requested the certificate, it should show up in the ACM certificate list as Pending validation - complete the DNS / email validation steps and wait until the status changes to Issued
  4. Copy the issued certificate's ARN and paste it into your terraform.tfvars file under ssl_information.certificate_arn
  5. Change ssl_information.enabled to true (see the example snippet after this list)
  6. Apply the iglu_server / pipeline module as you have done previously to attach the certificate to the Load Balancer
  7. Add a CNAME DNS record for your requested domain pointing to the AWS Load Balancer (e.g. c.acme.com -> <lb-identifier>.<region>.elb.amazonaws.com)
  • If you are using Route53 for DNS record management, you can instead set up an Alias record, which can help circumvent certain CNAME cloaking tracking protections
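
As referenced in step 5, the relevant part of terraform.tfvars might end up looking like this (the ARN below is a placeholder):

ssl_information = {
  certificate_arn = "arn:aws:acm:<region>:<account-id>:certificate/<certificate-id>"
  enabled         = true
}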

You should now be able to access your service over HTTPS. Verify this by going to your newly set up endpoint in a browser — you should get a valid response with a valid SSL certificate.
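
You can also check from the command line; for example, using the Collector domain from the example above:

curl -v https://c.acme.com/health   # the TLS handshake should succeed and a healthy Collector responds with OK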


If you are curious, here’s what has been deployed. Now it’s time to send your first events to your pipeline!