Setup Snowplow Open Source on AWS

Quick Start

We have built a set of Terraform modules that automate the setup and deployment of the infrastructure and applications required for an operational Snowplow open source pipeline, with just a handful of input variables required on your side.

Quick Start Installation Guide on AWS

Installation Guide


  1. Setup your AWS environment
  2. Setup a Snowplow Collector
  3. Setup one or more sources using trackers or webhooks
  4. Setup Enrich
  5. Setup alternative data stores (e.g. Redshift, PostgreSQL)
  6. Data modeling in Redshift
  7. Analyze your data!

Setup a Snowplow Collector

The Snowplow collector receives data from Snowplow trackers and webhooks, and writes it to a stream for further processing. Setting up a collector is the first step in the Snowplow setup process.

Setup a Data Source

Snowplow supports two types of data sources: trackers (SDKs) for integrating your own apps, and webhooks for ingesting data from third parties via HTTP(S).

Setup a Snowplow Tracker

Snowplow trackers generate event data and send it to Snowplow collectors to be captured. The most popular Snowplow tracker to date is the JavaScript Tracker, which is integrated into websites (either directly or via a tag management solution) in the same way as any other web analytics tracker (e.g. Google Analytics or Omniture tags).
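To make "send it to a collector" concrete, here is a minimal Python sketch of what a tracker does for a single page view: it packs the event into query-string parameters and requests a collector endpoint. The parameter names (`e`, `url`, `page`, `aid`, `p`, `tv`, `eid`, `dtm`) and the `/i` path reflect our reading of the Snowplow tracker protocol and should be checked against the protocol reference; the collector host, the tracker version label and the function name are hypothetical. The sketch only builds the URL and does not send anything.

```python
import time
import uuid
from urllib.parse import urlencode


def build_page_view_url(collector_host, page_url, page_title, app_id):
    """Build the GET URL a tracker might request for one page-view event."""
    params = {
        "e": "pv",                            # event type: page view (assumed name)
        "url": page_url,                      # URL of the page being viewed
        "page": page_title,                   # page title
        "aid": app_id,                        # application identifier
        "p": "web",                           # platform
        "tv": "sketch-0.1",                   # hypothetical tracker version label
        "eid": str(uuid.uuid4()),             # unique event ID
        "dtm": str(int(time.time() * 1000)),  # device timestamp in milliseconds
    }
    return f"https://{collector_host}/i?{urlencode(params)}"


# Hypothetical collector host and application ID, for illustration only.
print(build_page_view_url("collector.example.com", "https://example.com/", "Home", "shop"))
```

In practice you would use one of the official trackers rather than building requests by hand; the sketch is only meant to show the shape of the data that reaches the collector.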

Setup a Third-Party Webhook

Snowplow allows you to collect events via the webhooks of supported third-party software.

Webhooks allow this third-party software to send its own internal event streams to Snowplow collectors to be captured. Webhooks are sometimes referred to as "streaming APIs" or "HTTP response APIs".

Note: once you have set up a collector and a tracker or webhook, you can pause and perform the remainder of the setup steps later, because your data is already being generated and logged. When you eventually set up Enrich, you will be able to process all the data you have logged since setup.

Setup Enrich

The Snowplow enrichment process takes raw events from a collector and:

  1. Cleans up the data into a format that is easier to parse / analyze
  2. Enriches the data (e.g. infers the location of the visitor from their IP address and infers the search engine keywords from the query string)
  3. Stores the cleaned, enriched data

Once you have set up Enrich, the process of taking the raw data generated by the collector, then cleaning and enriching it, is automated.
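As an illustration of the kind of work an enrichment performs, the following Python sketch infers search engine keywords from a referrer URL's query string. It is not Snowplow's implementation: the `refr` field name, the `search_terms` output field and the small `SEARCH_PARAMS` lookup table are assumptions made for this example.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical lookup: search engine host -> query parameter holding the keywords.
SEARCH_PARAMS = {
    "www.google.com": "q",
    "www.bing.com": "q",
    "search.yahoo.com": "p",
}


def enrich(raw_event):
    """Return a copy of the event with a 'search_terms' field added."""
    enriched = dict(raw_event)
    referrer = urlparse(raw_event.get("refr", ""))
    param = SEARCH_PARAMS.get(referrer.netloc)
    # parse_qs decodes percent-escapes and '+' into plain text.
    terms = parse_qs(referrer.query).get(param, []) if param else []
    enriched["search_terms"] = terms[0] if terms else None
    return enriched


event = {"refr": "https://www.google.com/search?q=snowplow+analytics"}
print(enrich(event)["search_terms"])
```

Events whose referrer is missing, or is not a known search engine, simply get `search_terms` set to `None`, so the enriched output keeps the same shape for every event.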

Setup alternative data stores (e.g. Redshift, SnowflakeDB, Elastic, Indicative)

Most Snowplow users store their web event data in at least two places: S3 for processing in Spark (e.g. to enable machine learning via MLlib) and a database (e.g. Redshift) for more traditional OLAP analysis.

The RDB Loader is an EMR step that regularly transfers data from S3 into other databases such as Redshift. If you only wish to process your data using Spark on EMR, you do not need to set up the RDB Loader. However, if you would find it convenient to have your data in another data store (e.g. Redshift), you can set this up at this stage.

Data modeling

Once your data is stored in S3 and Redshift, the basic setup is complete. You now have access to the event stream: a long list of packets of data, where each packet represents a single event. While it is possible to do analysis directly on this event stream, it is common to:

  1. Join event-level data with other data sets (e.g. customer data)
  2. Aggregate event-level data into smaller data sets (e.g. sessions)
  3. Apply business logic (e.g. user segmentation)

We call this process data modeling.
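As a small example of the aggregation step, the sketch below groups an event stream into sessions in Python. It assumes each event is reduced to a `(user_id, timestamp_in_seconds)` pair and that a session ends after 30 minutes of inactivity; both the input shape and the timeout are assumptions for illustration, not part of any Snowplow data model.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout


def sessionize(events):
    """Aggregate (user_id, timestamp) events into per-user sessions."""
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = []
    for user_id, stamps in by_user.items():
        stamps.sort()
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev > SESSION_GAP_SECONDS:
                # Gap too long: close the current session and start a new one.
                sessions.append(
                    {"user_id": user_id, "start": start, "end": prev, "events": count}
                )
                start, count = ts, 0
            prev = ts
            count += 1
        sessions.append(
            {"user_id": user_id, "start": start, "end": prev, "events": count}
        )
    return sessions


events = [("u1", 0), ("u1", 600), ("u1", 5000), ("u2", 100)]
for session in sessionize(events):
    print(session)
```

In a warehouse such as Redshift the same logic would normally be written in SQL over the events table; the Python version is only meant to show how many event rows collapse into one session row.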

Analyze your data!

Now that your data is stored in S3 and potentially also Redshift, you are in a position to start analyzing the event stream or, if a data model has been built, the derived tables in Redshift. As part of the setup guide we run through the steps necessary to perform some initial analysis and plug in a couple of analytics tools to get you started.

You now have all six Snowplow subsystems working! The Snowplow setup is complete!