Quick start guide
This guide will take you through how to spin up a Snowplow Community Edition pipeline using the Snowplow Terraform modules. (Not familiar with Terraform? Take a look at Infrastructure as code with Terraform.)
Prerequisites
Sign up for Snowplow Community Edition and follow the link in the email to get a copy of the repository containing the Terraform code.
Install Terraform 1.0.0 or higher. Follow the instructions to make sure the terraform binary is available on your PATH. You can also use tfenv to manage your Terraform installation.
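As a quick sanity check before continuing, the sketch below compares the installed Terraform version against the 1.0.0 minimum. The version_ge helper is a hypothetical name introduced for illustration, and the parsing assumes that the first line printed by terraform version has the form "Terraform v<version>".

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils and recent BSD sort.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v terraform >/dev/null 2>&1; then
  # Assumption: the first line of `terraform version` looks like "Terraform v1.5.7".
  installed="$(terraform version | head -n 1 | sed 's/^Terraform v//')"
  if version_ge "$installed" "1.0.0"; then
    echo "Terraform $installed is new enough"
  else
    echo "Terraform $installed is older than 1.0.0, please upgrade"
  fi
else
  echo "terraform was not found on your PATH"
fi
```

If the last message is printed, revisit the installation instructions before moving on.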
- AWS
- GCP
- Azure
AWS
Install AWS CLI version 2.
Configure the CLI against a role that has the AdministratorAccess policy attached.
Note that AdministratorAccess allows all actions on all AWS services and shouldn't be used in production.
Details on how to configure the AWS Terraform Provider can be found on the registry.
GCP
Install Google Cloud SDK.
Make sure the following APIs are active in your GCP account (this list might not be exhaustive and is subject to change as GCP APIs evolve):
- Compute Engine API
- Cloud Resource Manager API
- Identity and Access Management (IAM) API
- Cloud Pub/Sub API
- Cloud SQL Admin API
Configure a Google Cloud service account. See details on using the service account with the Cloud SDK. You will need to:
- Navigate to your service account on Google Cloud Console
- Create a new JSON Key and store it locally
- Create the environment variable by running export GOOGLE_APPLICATION_CREDENTIALS="KEY PATH" in your terminal
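The APIs listed above can also be enabled from the command line. The following is a sketch that assumes the standard service identifiers for those five APIs, and a gcloud installation that is already authenticated with a default project selected. Adjust the list if your deployment turns out to need more.

```shell
# Service identifiers for the APIs listed above (assumed to be the current names).
REQUIRED_APIS="compute.googleapis.com cloudresourcemanager.googleapis.com iam.googleapis.com pubsub.googleapis.com sqladmin.googleapis.com"

if command -v gcloud >/dev/null 2>&1; then
  # Word splitting is intentional: each identifier becomes a separate argument.
  # shellcheck disable=SC2086
  gcloud services enable $REQUIRED_APIS \
    || echo "Could not enable the APIs; check your gcloud authentication and project"
else
  echo "gcloud was not found on your PATH"
fi
```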
Azure
Install the Azure CLI.
If your organisation has an existing Azure account, make sure your user has been granted the following roles on a valid Azure Subscription:
User Access Administrator allows the user to modify, create and delete permissions across Azure resources, and shouldn't be used in production. Instead, you can use a custom role with the following permissions:
- Microsoft.Authorization/roleAssignments/write to deploy the stacks below
- Microsoft.Authorization/roleAssignments/delete to destroy them
If you don’t have an Azure account yet, you can get started with a new pay-as-you-go account.
Details on how to configure the Azure Terraform Provider can be found on the registry.
Storage options
The sections below will guide you through setting up your destination to receive Snowplow data, but for now here is an overview.
Warehouse | AWS | GCP | Azure |
---|---|---|---|
Postgres | ✅ | ✅ | ❌ |
Snowflake | ✅ | ❌ | ✅ |
Databricks | ✅ | ❌ | ✅ |
Redshift | ✅ | — | — |
BigQuery | — | ✅ | — |
Synapse Analytics | — | — | ✅ |
Real-time streaming options
As part of the deployment, your data will be available in real-time streams corresponding to the cloud provider you have chosen. You can consume data directly from these streams, either in addition to or instead of the data warehouse.
Stream | AWS | GCP | Azure |
---|---|---|---|
Kinesis | ✅ | ❌ | ❌ |
Pub/Sub | ❌ | ✅ | ❌ |
EventHubs | ❌ | ❌ | ✅ |
For an out-of-the-box solution to accessing this data in real-time streams, you can check out our Snowbridge tool. Alternatively, if you want to develop a custom consumer, you can leverage our Analytics SDKs to parse the event formats more easily.
EventHubs topics are deployed in a Kafka-compatible model, so you can consume from them using standard Kafka connector libraries.
- AWS
- GCP
- Azure
AWS
There are four main storage options for you to select: Postgres, Redshift, Snowflake and Databricks. For Snowflake, you can choose between the newest Streaming Loader (recommended) or RDB Loader. Additionally, there is an S3 option, which is primarily used to archive enriched (and/or raw) events and to store failed events.
We recommend loading data into a single destination only, but nothing prevents you from loading into multiple destinations with the same pipeline (e.g. for testing purposes).
GCP
There are two storage options for you to select: Postgres and BigQuery.
We recommend loading data into a single destination only, but nothing prevents you from loading into multiple destinations with the same pipeline (e.g. for testing purposes).
Azure
There are two storage options for you to select: Snowflake and data lake (ADLS). The latter option enables querying data from Databricks and Synapse Analytics.
We recommend loading data into a single destination only (Snowflake or data lake), but nothing prevents you from loading into both with the same pipeline (e.g. for testing purposes).
Set up a VPC to deploy into
- AWS
- GCP
- Azure
AWS
AWS provides a default VPC in every region for your sub-account. Take note of the identifiers of this VPC and the associated subnets for later parts of the deployment.
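If you prefer the command line to the AWS Console, the identifiers can be looked up with the AWS CLI. This sketch assumes the CLI is already configured for the region you plan to deploy into; default_vpc_id and vpc_subnet_ids are hypothetical helper names.

```shell
# Prints the ID of the default VPC in the currently configured region.
default_vpc_id() {
  aws ec2 describe-vpcs \
    --query 'Vpcs[?IsDefault].VpcId | [0]' \
    --output text
}

# Prints the IDs of the subnets that belong to the VPC passed as $1.
vpc_subnet_ids() {
  aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$1" \
    --query 'Subnets[].SubnetId' \
    --output text
}

# Example:
#   VPC_ID="$(default_vpc_id)"
#   vpc_subnet_ids "$VPC_ID"
```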
GCP
GCP provides a default VPC for your project. In the steps below, it is sufficient to set network = default and leave subnetworks empty, and Terraform will discover the correct network to deploy into.
Azure
Azure does not provide a default VPC or resource group, so we have added a helper module to create a working network we can deploy into.
To use our out-of-the-box network, you will need to navigate to the terraform/azure/base directory in the code repository and update the input variables in terraform.tfvars.
Once that's done, you can use Terraform to create your base network.
terraform init
terraform plan
terraform apply
After the deployment completes, you should get an output like this:
...
vnet_subnets_name_id = {
"collector-agw1" = "/subscriptions/<...>/resourceGroups/<...>/providers/Microsoft.Network/virtualNetworks/<...>/subnets/collector-agw1"
"iglu-agw1" = "/subscriptions/<...>/resourceGroups/<...>/providers/Microsoft.Network/virtualNetworks/<...>/subnets/iglu-agw1"
"iglu1" = "/subscriptions/<...>/resourceGroups/<...>/providers/Microsoft.Network/virtualNetworks/<...>/subnets/iglu1"
"pipeline1" = "/subscriptions/<...>/resourceGroups/<...>/providers/Microsoft.Network/virtualNetworks/<...>/subnets/pipeline1"
}
These are the subnet identifiers, e.g. "/subscriptions/<...>/resourceGroups/<...>/providers/Microsoft.Network/virtualNetworks/<...>/subnets/pipeline1" is the identifier of the pipeline1 subnet. Take note of these four identifiers, as you will need them in the following steps.
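Rather than copying the identifiers by hand, you can read them back from the Terraform output. This is a sketch that assumes jq is installed and that you run it from the terraform/azure/base directory; subnet_id is a hypothetical helper.

```shell
# Reads a JSON map of subnet name -> identifier on stdin and prints the
# identifier of the subnet named in $1.
subnet_id() {
  jq -r --arg name "$1" '.[$name]'
}

# Example (run inside terraform/azure/base after `terraform apply`):
#   terraform output -json vnet_subnets_name_id | subnet_id pipeline1
#   terraform output -json vnet_subnets_name_id | subnet_id iglu-agw1
```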
Set up Iglu Server
The first step is to set up the Iglu Server stack required by the rest of your pipeline.
This will allow you to create and evolve your own custom events and entities. Iglu Server stores the schemas for your events and entities and fetches them as your events are processed by the pipeline.
Step 1: Update the iglu_server input variables
Once you have cloned the code repository, you will need to navigate to the iglu_server directory to update the input variables in terraform.tfvars.
- AWS
- GCP
- Azure
AWS
cd terraform/aws/iglu_server/default
nano terraform.tfvars # or other text editor of your choosing
GCP
cd terraform/gcp/iglu_server/default
nano terraform.tfvars # or other text editor of your choosing
Azure
cd terraform/azure/iglu_server
nano terraform.tfvars # or other text editor of your choosing
If you used our base module, you will need to set these variables as follows:
- resource_group_name: use the same value as you supplied in base
- subnet_id_lb: use the identifier of the iglu-agw1 subnet from base
- subnet_id_servers: use the identifier of the iglu1 subnet from base
To update your input variables, you’ll need to know a few things:
- Your IP Address. Help.
- A UUIDv4 to be used as the Iglu Server’s API Key. Help.
- How to generate an SSH Key.
On most systems, you can generate an SSH Key with: ssh-keygen -t rsa -b 4096. This will output where your public key is stored, for example: ~/.ssh/id_rsa.pub. You can get the value with cat ~/.ssh/id_rsa.pub.
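The values above can be gathered from a terminal. The sketch below makes a few assumptions: the UUID comes from the Linux kernel or, failing that, from uuidgen (which should produce a random, version 4 UUID on most systems), checkip.amazonaws.com is used as one possible IP lookup service, and new_uuid is a hypothetical helper name.

```shell
# Prints a lowercase UUID to use as the Iglu Server API key.
new_uuid() {
  if [ -r /proc/sys/kernel/random/uuid ]; then
    cat /proc/sys/kernel/random/uuid        # Linux: random (version 4) UUID
  else
    uuidgen | tr '[:upper:]' '[:lower:]'    # macOS and other systems
  fi
}

IGLU_API_KEY="$(new_uuid)"
echo "Iglu Server API key: $IGLU_API_KEY"

# Your public IP address (requires network access; any similar service works):
#   curl -s https://checkip.amazonaws.com
# Your public SSH key, once generated with ssh-keygen as described above:
#   cat ~/.ssh/id_rsa.pub
```

Keep the generated API key somewhere safe, since you will need the same value again when configuring the pipeline.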
Step 2: Run the iglu_server Terraform script
You can now use Terraform to create your Iglu Server stack.
- AWS
- GCP
- Azure
AWS
You will be asked to select a region. You can find more information about available AWS regions here.
terraform init
terraform plan
terraform apply
The deployment will take roughly 15 minutes.
GCP
terraform init
terraform plan
terraform apply
Azure
terraform init
terraform plan
terraform apply
Once the deployment is done, it will output iglu_server_dns_name. Make a note of this, as you'll need it when setting up your pipeline. If you have attached a custom SSL certificate and set up your own DNS records, then you don't need this value.
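The output can also be read back at any time from the iglu_server directory, which makes it easy to check that the server responds. In the sketch below, terraform output -raw is standard Terraform, but the /api/meta/health path is an assumption about the Iglu Server API that you should confirm against the Iglu Server documentation, and iglu_health_url is a hypothetical helper.

```shell
# Builds a health check URL for an Iglu Server reachable at the DNS name in $1.
# Assumption: the server exposes a health endpoint at /api/meta/health.
iglu_health_url() {
  printf 'http://%s/api/meta/health\n' "$1"
}

# Example (run inside the iglu_server directory after `terraform apply`):
#   IGLU_DNS="$(terraform output -raw iglu_server_dns_name)"
#   curl -s -o /dev/null -w '%{http_code}\n' "$(iglu_health_url "$IGLU_DNS")"
```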
Prepare the destination
Depending on the destination(s) you've chosen, you might need to perform a few extra steps to prepare for loading data there.
Feel free to go ahead with these while your Iglu Server stack is deploying.
- Postgres
- Redshift
- BigQuery
- Snowflake
- Databricks
- Synapse Analytics
Postgres
No extra steps needed: the necessary resources like a PostgreSQL instance, database, table and user will be created by the Terraform modules.
Redshift
Assuming you already have an active Redshift cluster, execute the following SQL (replace the ${...} variables with your desired values). You will need the permissions to create databases, users and schemas in the cluster.
-- 1. (Optional) Create a new database - you can also use an existing one if you prefer
CREATE DATABASE ${redshift_database};
-- Log back into Redshift with the new database:
-- psql --host <host> --port <port> --username <admin> --dbname ${redshift_database}
-- 2. Create a schema within the database
CREATE SCHEMA IF NOT EXISTS ${redshift_schema};
-- 3. Create the loader user
CREATE USER ${redshift_loader_user} WITH PASSWORD '${redshift_password}';
-- 4. Ensure the schema is owned by the loader user
ALTER SCHEMA ${redshift_schema} OWNER TO ${redshift_loader_user};
You will need to ensure that the loader can access the Redshift cluster over whatever port is configured for the cluster (usually 5439).
BigQuery
No extra steps needed.
Snowflake
If you are going to use the Snowflake Streaming Loader (currently only provided for AWS), you will need to generate a key pair following the Snowflake documentation. Make sure to enter an empty passphrase, as the Terraform module below does not support keys with passphrases (for simplicity).
If you are not using the Snowflake Streaming Loader, you will need to pick a password.
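For reference, an unencrypted key pair of the kind described in the Snowflake documentation can be generated with OpenSSL roughly as follows. The file names rsa_key.p8 and rsa_key.pub are only examples, and the files are written to the current directory.

```shell
if command -v openssl >/dev/null 2>&1; then
  # Private key in PKCS#8 format, without a passphrase (-nocrypt).
  openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt

  # Matching public key.
  openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub

  # Public key body without the BEGIN/END lines, joined onto a single line,
  # ready to paste as the RSA_PUBLIC_KEY value in the SQL that follows.
  grep -v 'PUBLIC KEY' rsa_key.pub | tr -d '\n'
  echo
else
  echo "openssl was not found on your PATH"
fi
```

Because the private key is not protected by a passphrase, store rsa_key.p8 somewhere only you can read.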
Execute the following SQL (replace the ${...} variables with your desired values). You will need access to both SYSADMIN and SECURITYADMIN level roles to action this:
-- 1. (Optional) Create a new database - you can also use an existing one if you prefer
CREATE DATABASE IF NOT EXISTS ${snowflake_database};
-- 2. Create a schema within the database
CREATE SCHEMA IF NOT EXISTS ${snowflake_database}.${snowflake_schema};
-- 3. Create a warehouse which will be used to load data
CREATE WAREHOUSE IF NOT EXISTS ${snowflake_warehouse} WITH WAREHOUSE_SIZE = 'XSMALL' WAREHOUSE_TYPE = 'STANDARD' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
-- 4. Create a role that will be used for loading data
CREATE ROLE IF NOT EXISTS ${snowflake_loader_role};
GRANT USAGE, OPERATE ON WAREHOUSE ${snowflake_warehouse} TO ROLE ${snowflake_loader_role};
GRANT USAGE ON DATABASE ${snowflake_database} TO ROLE ${snowflake_loader_role};
GRANT ALL ON SCHEMA ${snowflake_database}.${snowflake_schema} TO ROLE ${snowflake_loader_role};
-- 5. Create a user that can be used for loading data
CREATE USER IF NOT EXISTS ${snowflake_loader_user}
RSA_PUBLIC_KEY='MIIBIj...' -- fill out if using Snowflake Streaming Loader
PASSWORD='...' -- fill out otherwise
MUST_CHANGE_PASSWORD = FALSE
DEFAULT_ROLE = ${snowflake_loader_role}
EMAIL = 'loader@acme.com';
GRANT ROLE ${snowflake_loader_role} TO USER ${snowflake_loader_user};
-- 6. (Optional) Grant this role to SYSADMIN to make debugging easier from admin users
GRANT ROLE ${snowflake_loader_role} TO ROLE SYSADMIN;
Databricks
On Azure, we currently support loading data into Databricks via a data lake. You can still follow Step 1 below to create the cluster; however, you should skip the rest of these steps. Instead, proceed with deploying the pipeline. We will return to configuring Databricks at the end of this guide.
Step 1: Create a cluster
The cluster spec described below should be sufficient for a monthly event volume of up to 10 million events. If your event volume is greater, then you may need to increase the size of the cluster.
Create a new cluster, following the Databricks documentation, with the following settings:
- the runtime version must be 13.0 or greater (but not 13.1 or 13.2)
- single node cluster
- "smallest" size node type
- auto-terminate after 30 minutes.
Step 2: Note the JDBC connection details
- In the Databricks UI, click on "Compute" in the sidebar
- Click on the RDB Loader cluster and navigate to "Advanced options"
- Click on the "JDBC/ODBC" tab
- Note down the JDBC connection URL, specifically the host, the port and the http_path
Step 3: Create an access token for the loader
The access token must not have a specified lifetime. Otherwise, the loader will stop working when the token expires.
- Navigate to the user settings in your Databricks workspace
- For Databricks hosted on AWS, the "Settings" link is in the lower left corner in the side panel
- For Databricks hosted on Azure, "User Settings" is an option in the drop-down menu in the top right corner.
- Go to the "Access Tokens" tab
- Click the "Generate New Token" button
- Optionally enter a description (comment). Leave the expiration period empty
- Click the "Generate" button
- Copy the generated token and store it in a secure location
Step 4: Create the catalog and the schema
Execute the following SQL (replace the ${...} variables with your desired values). The default catalog is called hive_metastore and is what you should use in the loader unless you specify your own.
-- USE CATALOG ${catalog_name}; -- Uncomment if you want to use a custom Unity catalog and replace with your own value.
CREATE SCHEMA IF NOT EXISTS ${schema_name}
-- LOCATION s3://<custom_location>/ -- Uncomment if you want tables created by Snowplow to be located in a non-default bucket or directory.
;
Synapse Analytics
No extra steps needed. Proceed with deploying the pipeline. We will return to configuring Synapse at the end of this guide.
Set up the pipeline
In this section, you will update the input variables for the Terraform module, and then run the Terraform script to set up your pipeline. At the end you will have a working Snowplow pipeline ready to receive web, mobile or server-side data.
Step 1: Update the pipeline input variables
Navigate to the pipeline directory in the code repository and update the input variables in terraform.tfvars.
- AWS
- GCP
- Azure
AWS
cd terraform/aws/pipeline/default
nano terraform.tfvars # or other text editor of your choosing
GCP
cd terraform/gcp/pipeline/default
nano terraform.tfvars # or other text editor of your choosing
Azure
cd terraform/azure/pipeline
nano terraform.tfvars # or other text editor of your choosing
If you used our base module, you will need to set these variables as follows:
- resource_group_name: use the same value as you supplied in base
- subnet_id_lb: use the identifier of the collector-agw1 subnet from base
- subnet_id_servers: use the identifier of the pipeline1 subnet from base
To update your input variables, you’ll need to know a few things:
- Your IP Address. Help.
- Your Iglu Server’s domain name from the previous step
- Your Iglu Server’s API Key from the previous step
- How to generate an SSH Key.
On most systems, you can generate an SSH Key with: ssh-keygen -t rsa -b 4096. This will output where your public key is stored, for example: ~/.ssh/id_rsa.pub. You can get the value with cat ~/.ssh/id_rsa.pub.
Destination-specific variables
- AWS
- GCP
- Azure
AWS
As mentioned above, there are several options for the pipeline's destination database. For each destination you'd like to configure, set the <destination>_enabled variable (e.g. redshift_enabled) to true and fill in all the relevant configuration options (starting with <destination>_).
When in doubt, refer back to the destination setup section where you have picked values for many of the variables.
For all active destinations, change any _password setting to a value that only you know.
If you are using Postgres, set the postgres_db_ip_allowlist to a list of CIDR addresses that will need to access the database. This can be systems like BI Tools, or your local IP address, so that you can query the database from your laptop.
GCP
As mentioned above, there are two options for the pipeline's destination database. For each destination you'd like to configure, set the <destination>_enabled variable (e.g. postgres_db_enabled) to true and fill in all the relevant configuration options (starting with <destination>_).
Change the postgres_db_password setting to a value that only you know.
Set the postgres_db_authorized_networks to a list of CIDR addresses that will need to access the database. This can be systems like BI Tools, or your local IP address, so that you can query the database from your laptop.
Azure
As mentioned above, there are two options for the pipeline's destination: Snowflake and data lake (the latter enabling Databricks and Synapse Analytics). For each destination you'd like to configure, set the <destination>_enabled variable (e.g. snowflake_enabled) to true and fill in all the relevant configuration options (starting with <destination>_).
When in doubt, refer back to the destination setup section where you have picked values for many of the variables.
If loading into Snowflake, change the snowflake_loader_password setting to a value that only you know.
Step 2: Run the pipeline Terraform script
- AWS
- GCP
- Azure
AWS
You will be asked to select a region. You can find more information about available AWS regions here.
terraform init
terraform plan
terraform apply
This will output your collector_dns_name, postgres_db_address, postgres_db_port and postgres_db_id.
GCP
terraform init
terraform plan
terraform apply
This will output your collector_ip_address, postgres_db_address, postgres_db_port, bigquery_db_dataset_id, bq_loader_dead_letter_bucket_name and bq_loader_bad_rows_topic_name.
Azure
terraform init
terraform plan
terraform apply
This will output your collector_lb_ip_address and collector_lb_fqdn.
Make a note of the outputs: you'll need them when sending events and (in some cases) connecting to your data.
Depending on your cloud and chosen destination, some of these outputs might be empty — you can ignore those.
If you have attached a custom SSL certificate and set up your own DNS records, then you don't need collector_dns_name, as you will use your own DNS record to send events from the Snowplow trackers.
For solutions to some common Terraform errors that you might encounter when running terraform plan or terraform apply, see the FAQs section.
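Once the apply has finished, a quick way to confirm that the Collector is reachable is to call its health endpoint. This is only a sketch: it assumes the Collector serves /health over plain HTTP (before any certificate is attached), and the example uses the AWS output name collector_dns_name; on GCP or Azure, substitute collector_ip_address or collector_lb_fqdn. collector_health_url is a hypothetical helper.

```shell
# Builds the health check URL for a Collector reachable at the host in $1.
collector_health_url() {
  printf 'http://%s/health\n' "$1"
}

# Example (run inside the pipeline directory after `terraform apply`):
#   COLLECTOR="$(terraform output -raw collector_dns_name)"
#   curl -s -o /dev/null -w '%{http_code}\n' "$(collector_health_url "$COLLECTOR")"
```

A 200 status code suggests the load balancer and the Collector are both up. Anything else is worth investigating before you start sending events.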
Configure the destination
- Postgres
- Redshift
- BigQuery
- Snowflake
- Databricks
- Synapse Analytics
Postgres
No extra steps needed.
Redshift
No extra steps needed.
BigQuery
No extra steps needed.
Snowflake
No extra steps needed.
Databricks
On Azure, we currently support loading data into Databricks via a data lake. To complete the setup, you will need to configure Databricks to access your data on ADLS.
First, follow the Databricks documentation to set up authentication using either Azure service principal, shared access signature tokens or account keys. (The latter mechanism is not recommended, but is arguably the easiest for testing purposes.)
You will need to know a couple of things:
- Storage account name: this is the value of the storage_account_name variable in the pipeline terraform.tfvars file
- Storage container name: lake-container
Once authentication is set up, you can create an external table using Spark SQL (replace <storage-account-name> with the corresponding value):
CREATE TABLE events
LOCATION 'abfss://lake-container@<storage-account-name>.dfs.core.windows.net/events/';
Synapse Analytics
Your data is loaded into ADLS. To access it, follow the Synapse documentation and use the OPENROWSET function.
You will need to know a couple of things:
- Storage account name: this is the value of the storage_account_name variable in the pipeline terraform.tfvars file
- Storage container name: lake-container
Example query
We recommend creating a data source, which simplifies future queries (note that unlike the previous URL, this one does not end with /events/):
CREATE EXTERNAL DATA SOURCE SnowplowData
WITH (LOCATION = 'https://<storage-account-name>.blob.core.windows.net/lake-container/');
Example query with data source
You can also consume your ADLS data via Fabric and OneLake:
- First, create a Lakehouse or use an existing one.
- Next, create a OneLake shortcut to your storage account. In the URL field, specify https://<storage-account-name>.blob.core.windows.net/lake-container/events/.
- You can now use Spark notebooks to explore your Snowplow data.
Note that not all Fabric services currently support the nested fields present in the Snowplow data.
Configure HTTPS (optional)
Now that you have a working pipeline, you can optionally configure your Collector and Iglu Server to have an HTTPS-enabled endpoint. This might be required in some cases to track events on strictly SSL-only websites, as well as to enable first-party tracking (by putting the Collector endpoint on the same sub-domain as your website).
- AWS
- GCP
- Azure
AWS
- Navigate to AWS Certificate Manager (ACM) in the AWS Console
- Request a public certificate from the ACM portal for the domain you want to host these endpoints under (e.g. for the Collector this might be c.acme.com). Make sure you are in the same region as your pipeline
  - Fully qualified domain name: will be something like c.acme.com
  - Validation method: whatever works best for you; generally DNS validation is going to be easiest to action
  - Key algorithm: should be left as RSA 2048
- Once you have requested the certificate, it should show up in the ACM certificate list as Pending validation. Complete the DNS / Email validation steps and wait until the status changes to Issued
- Copy the issued certificate's ARN and paste it into your terraform.tfvars file under ssl_information.certificate_arn
- Change ssl_information.enabled to true
- Apply the iglu_server / pipeline module as you have done previously to attach the certificate to the Load Balancer
- Add a CNAME DNS record for your requested domain pointing to the AWS Load balancer (e.g. c.acme.com -> <lb-identifier>.<region>.elb.amazonaws.com)
  - If you are using Route53 for DNS record management, you can instead set up an Alias route, which can help circumvent certain CNAME cloaking tracking protections
You should now be able to access your service over HTTPS. Verify this by going to your newly set up endpoint in a browser — you should get a valid response with a valid SSL certificate.
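The certificate request in the AWS steps above can also be made with the AWS CLI instead of the console. This is a minimal sketch, assuming DNS validation; request_certificate is a hypothetical helper, and the domain and region in the example are placeholders.

```shell
# Requests a public ACM certificate for domain $1 in region $2 and prints its ARN.
# The region should match the region your pipeline is deployed in.
request_certificate() {
  aws acm request-certificate \
    --domain-name "$1" \
    --validation-method DNS \
    --region "$2" \
    --query 'CertificateArn' \
    --output text
}

# Example:
#   request_certificate c.acme.com eu-west-1
```

The printed ARN is the value to paste under ssl_information.certificate_arn once the certificate has been validated and issued.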
GCP
- Navigate to Google Certificate Manager in the GCP Console
- Select the Classic Certificates tab and then press Create SSL Certificate
  - Name: should be something unique for the certificate you are creating
  - Description: describe the service you are creating the certificate for
  - Create mode: allows you to either supply your own certificate or create a Google-managed one. We generally recommend the latter for ease of implementation and updates
  - Domains: add the domain you want to host the service under (e.g. for the Collector this might be c.acme.com)
- Add an A DNS record for the requested domain pointing to the GCP Load balancer (e.g. c.acme.com -> <lb-ipv4-address>)
- Copy the Certificate ID into your terraform.tfvars file under ssl_information.certificate_id
  - The ID takes the form projects/{{project}}/global/sslCertificates/{{name}}
- Change ssl_information.enabled to true
- Apply the iglu_server / pipeline module as you have done previously to attach the certificate to the Load Balancer
Once the certificate is issued, you should now be able to access your service over HTTPS. Verify this by going to your newly set up endpoint in a browser — you should get a valid response with a valid SSL certificate.
It’s worth noting that GCP Managed Certificates can take up to 24 hours to be provisioned successfully, so it might take a while before you can access your service over HTTPS.
Azure
Unlike AWS or GCP, Azure doesn't have a Certificate Management layer, so to follow this guide you will need to have bought your own valid SSL certificate covering the domain(s) you want to bind to.
- Convert your SSL Certificate into the correct format (pkcs12) needed for the Azure Application Load Balancer
- Copy the pkcs12 certificate password into your terraform.tfvars file under ssl_information.password
- Copy the pkcs12 certificate into your terraform.tfvars file under ssl_information.data
  - Ensure the certificate is base64-encoded (e.g. cat cert.p12 | base64)
- Change ssl_information.enabled to true
- Apply the iglu_server / pipeline module as you have done previously to attach the certificate to the Load Balancer
- Add an A DNS record for your requested domain pointing to the Azure Load Balancer (e.g. c.acme.com -> <lb-ipv4-address>)
You should now be able to access your service over HTTPS. Verify this by going to your newly set up endpoint in a browser — you should get a valid response with a valid SSL certificate.
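The conversion and encoding steps for Azure can be done with OpenSSL. The following is a sketch that assumes your certificate and private key are available as PEM files; to_pkcs12_base64 and the file names in the example are hypothetical.

```shell
# Bundles a PEM private key ($1) and a PEM certificate ($2) into a PKCS#12 file
# protected by the password in $3, then prints the bundle base64-encoded on one line.
to_pkcs12_base64() {
  bundle="$(mktemp)"
  openssl pkcs12 -export -inkey "$1" -in "$2" -out "$bundle" -passout "pass:$3" \
    || { rm -f "$bundle"; return 1; }
  # Strip newlines so the value can be pasted as a single string.
  base64 < "$bundle" | tr -d '\n'
  echo
  rm -f "$bundle"
}

# Example:
#   to_pkcs12_base64 privkey.pem cert.pem 'my-password'
```

The printed string is what goes under ssl_information.data, and the password you chose goes under ssl_information.password.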
If you are curious, here’s what has been deployed. Now it’s time to send your first events to your pipeline!