Skip to main content


This documentation only applies to Snowplow Community Edition. See the feature comparison page for more information about the different Snowplow offerings.

Spark module is now deprecated. Kinesis integration is not as reliable as we'd like. We suggest migrating to Flink module.

The Spark job reads bad rows from an S3 location and stores the recovered payloads in Kinesis, unrecovered and unrecoverable in other S3 buckets.


To build the fat jar, run:

sbt spark/assembly


Event recovery jobs are usually ran using our dataflow-runner application. A configuration templates for running is available in the repository.

To run a transient EMR cluster and execute the job using the templates, download Spark dataflow-runner templates from the repository and run:

dataflow-runner run-transient \
--emr-playbook spark-playbook.json.tmpl \
--emr-config spark-cluster.json.tmpl \


  • BUCKET_INPUT - S3 bucket containing bad events to be recovered (eg. geoffs-bad-events)
  • BUCKET_INPUT_DIRECTORY - directory in the BUCKET_INPUT to use as the source for bad events
  • AWS_REGION - region in which the job is being ran (eg. eu-central-1)
  • AWS_SUBNET - network subnet to run the job in (eg. subnet-435010347a21886ab)
  • AWS_IAM_ROLE - AWS IAM role to assume while running the job (eg. geoffs-recovery-role)
  • AWS_KEYPAIR - AWS EC2 key pair to use for the instances (eg. geoffs-keypair)
  • JOB_OWNER - tag to use to mark the owner of the job (eg. goeff)
  • RECOVERY_VERSION - application version (eg. 0.6.0)
  • RECOVERY_CONFIG - recovery job config as described in Configuration
  • IGLU_RESOLVER - iglu resolver configuration as described in Configuration
  • KINESIS_OUTPUT - Kinesis stream to output recovered events to
Was this page helpful?