Event Manifest Populator
Overview
Event Manifest Populator is an Apache Spark job allowing you to backpopulate a Snowplow event manifest in DynamoDB with the metadata of some or all enriched events from your archive in S3.
This one-off job solves the "cold start" problem for identifying cross-batch natural duplicates in Snowplow's Relational Database Shredder step.
In other words, you can still deduplicate events across batches without running this job, but if Relational Database Shredder encounters a duplicate of an event that was shredded before you enabled cross-batch deduplication, the duplicate will land in shredded/good.
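The cold start problem can be illustrated with a toy model. This is only a sketch, not the Shredder's actual logic: the `shred` function, the in-memory set standing in for the DynamoDB manifest, and the choice of `(event_id, fingerprint)` as the manifest key are all assumptions made for illustration.

```python
# Toy model of cross-batch deduplication; NOT Snowplow's actual implementation.
# Assumption: an event is identified in the manifest by (event_id, fingerprint),
# and a plain Python set stands in for the DynamoDB table.

def shred(batch, manifest):
    """Return the events that pass deduplication, recording them in the manifest."""
    good = []
    for event in batch:
        key = (event["event_id"], event["fingerprint"])
        if key in manifest:
            continue  # already seen in an earlier batch: drop as a duplicate
        manifest.add(key)
        good.append(event)
    return good

# An event that was already shredded before deduplication was enabled.
old_event = {"event_id": "e1", "fingerprint": "f1"}
new_batch = [old_event, {"event_id": "e2", "fingerprint": "f2"}]

# Cold start: the manifest is empty, so the old duplicate slips through.
cold = shred(new_batch, manifest=set())

# Back-populated: the manifest already knows about the archived event.
warm = shred(new_batch, manifest={("e1", "f1")})

print(len(cold), len(warm))  # 2 1
```

With an empty manifest both events are kept, while a back-populated manifest lets the duplicate of the archived event be dropped.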
Usage
In order to use Event Manifest Populator, you need to have boto2 installed:
$ pip install boto
Next, grab the run.py file, which contains the instructions for running the job on AWS EMR.
You can download it directly from GitHub:
$ wget https://raw.githubusercontent.com/snowplow/snowplow/master/5-data-modeling/event-manifest-populator/run.py
Now you can run Event Manifest Populator with a single command (inside the directory containing run.py):
$ python run.py $ENRICHED_ARCHIVE_S3_PATH $STORAGE_CONFIG_PATH $IGLU_RESOLVER_PATH
The task has three required arguments:
- Path to the enriched events archive. It can be found in the aws.s3.buckets.enriched.archive setting in your config.yml.
- Local path to the Duplicate storage configuration JSON.
- Local path to the Iglu resolver configuration JSON.
Optionally, you can also pass the following arguments:
- --since to reduce the amount of data stored in DynamoDB. If this option is passed, Manifest Populator will process only enriched events from after the specified date. The input date supports two formats: YYYY-MM-dd and YYYY-MM-dd-HH-mm-ss.
- --log-path to store EMR job logs on S3. Normally, Manifest Populator does not produce any logs or output, but if an error occurs you will be able to inspect it in the EMR logs stored at this path.
- --profile to specify the AWS profile used to create this EMR job.
- --jar to specify the S3 path to a custom JAR.
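As a rough sketch of how the arguments above fit together, the following helper checks a --since value against the two documented formats and assembles the command line. The function names `parse_since` and `build_command` are hypothetical, the S3 paths and file names are placeholders, and placing the optional flags after the positional arguments is an assumption; none of this is taken from run.py itself.

```python
from datetime import datetime

# strptime equivalents of the two documented formats:
# YYYY-MM-dd-HH-mm-ss and YYYY-MM-dd.
SINCE_FORMATS = ("%Y-%m-%d-%H-%M-%S", "%Y-%m-%d")

def parse_since(value):
    """Parse a --since value, accepting either documented format."""
    for fmt in SINCE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError("expected YYYY-MM-dd or YYYY-MM-dd-HH-mm-ss, got %r" % value)

def build_command(archive, storage_config, resolver,
                  since=None, log_path=None, profile=None, jar=None):
    """Assemble the argument list for run.py (hypothetical helper)."""
    cmd = ["python", "run.py", archive, storage_config, resolver]
    if since is not None:
        parse_since(since)  # fail early on a malformed date
        cmd += ["--since", since]
    if log_path is not None:
        cmd += ["--log-path", log_path]
    if profile is not None:
        cmd += ["--profile", profile]
    if jar is not None:
        cmd += ["--jar", jar]
    return cmd

# Placeholder paths, for illustration only.
command = build_command(
    "s3://example-bucket/enriched/archive/",
    "dynamodb_config.json",
    "iglu_resolver.json",
    since="2017-01-15",
    log_path="s3://example-bucket/logs/",
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.check_call` if you prefer to launch the job from a script rather than from the shell.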
Duplicate storage configuration JSON
The configuration JSON should conform to the amazon_dynamodb_config JSON Schema.
The properties of the schema are:
- name: a descriptive name for this Snowplow storage target
- accessKeyId: AWS Access Key Id
- secretAccessKey: AWS Secret Access Key
- awsRegion: AWS region
- dynamodbTable: DynamoDB table to store information about processed events
- purpose: common for all targets. Amazon DynamoDB supports only "DUPLICATE_TRACKING"
- id: (optional) machine-readable config id
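A minimal sanity check for these properties might look like the sketch below. It only covers the properties listed above and does not replace validation against the actual amazon_dynamodb_config JSON Schema; the `check_storage_config` helper is hypothetical, all values are placeholders, and any outer envelope the real configuration file may require is not shown.

```python
import json

# Properties listed above; "id" is optional and therefore not required here.
REQUIRED = ("name", "accessKeyId", "secretAccessKey",
            "awsRegion", "dynamodbTable", "purpose")

def check_storage_config(config):
    """Return a list of problems found in the config data (empty if none)."""
    problems = ["missing property: " + key for key in REQUIRED if key not in config]
    if config.get("purpose") != "DUPLICATE_TRACKING":
        problems.append('purpose must be "DUPLICATE_TRACKING"')
    return problems

# Placeholder values, for illustration only.
example = {
    "name": "Example duplicate tracking target",
    "accessKeyId": "EXAMPLE_ACCESS_KEY_ID",
    "secretAccessKey": "EXAMPLE_SECRET_ACCESS_KEY",
    "awsRegion": "eu-west-1",
    "dynamodbTable": "example-event-manifest",
    "purpose": "DUPLICATE_TRACKING",
}

print(json.dumps(example, indent=2))
print(check_storage_config(example))  # []
```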
Note that Event Manifest Populator must be used only with run ids produced by a version of Snowplow newer than R73 Cuban Macaw, because the format of the TSV files has changed.