Spark
info
This documentation only applies to Snowplow Community Edition. See the feature comparison page for more information about the different Snowplow offerings.
danger
Spark module is now deprecated. Kinesis integration is not as reliable as we'd like. We suggest migrating to Flink module.
The Spark job reads bad rows from an S3 location and stores the recovered payloads in Kinesis, unrecovered and unrecoverable in other S3 buckets.
Building
To build the fat jar, run:
sbt spark/assembly
Running
Event recovery jobs are usually ran using our dataflow-runner application. A configuration templates for running is available in the repository.
To run a transient EMR cluster and execute the job using the templates, download Spark dataflow-runner templates from the repository and run:
dataflow-runner run-transient \
--emr-playbook spark-playbook.json.tmpl \
--emr-config spark-cluster.json.tmpl \
--vars bucket,$BUCKET_INPUT,region,$AWS_REGION,subnet,$AWS_SUBNET,role,$AWS_IAM_ROLE,keypair,$AWS_KEYPAIR,client,$JOB_OWNER,version,$RECOVERY_VERSION,config,$RECOVERY_CONFIG,resolver,$IGLU_RESOLVER,output,$KINESIS_OUTPUT,inputdir,$BUCKET_INPUT_DIRECTORY
Where:
BUCKET_INPUT
- S3 bucket containing bad events to be recovered (eg. geoffs-bad-events)BUCKET_INPUT_DIRECTORY
- directory in theBUCKET_INPUT
to use as the source for bad eventsAWS_REGION
- region in which the job is being ran (eg. eu-central-1)AWS_SUBNET
- network subnet to run the job in (eg. subnet-435010347a21886ab)AWS_IAM_ROLE
- AWS IAM role to assume while running the job (eg. geoffs-recovery-role)AWS_KEYPAIR
- AWS EC2 key pair to use for the instances (eg. geoffs-keypair)JOB_OWNER
- tag to use to mark the owner of the job (eg. goeff)RECOVERY_VERSION
- application version (eg. 0.6.0)RECOVERY_CONFIG
- recovery job config as described in ConfigurationIGLU_RESOLVER
- iglu resolver configuration as described in ConfigurationKINESIS_OUTPUT
- Kinesis stream to output recovered events to