Spark module is now deprecated. Kinesis integration is not as reliable as we'd like. We suggest migrating to Flink module.
The Spark job reads bad rows from an S3 location and stores the recovered payloads in Kinesis, unrecovered and unrecoverable in other S3 buckets.
To build the fat jar, run:
Event recovery jobs are usually ran using our dataflow-runner application. A configuration templates for running is available in the repository.
To run a transient EMR cluster and execute the job using the templates, download Spark dataflow-runner templates from the repository and run:
dataflow-runner run-transient \
--emr-playbook spark-playbook.json.tmpl \
--emr-config spark-cluster.json.tmpl \
BUCKET_INPUT- S3 bucket containing bad events to be recovered (eg. geoffs-bad-events)
BUCKET_INPUT_DIRECTORY- directory in the
BUCKET_INPUTto use as the source for bad events
AWS_REGION- region in which the job is being ran (eg. eu-central-1)
AWS_SUBNET- network subnet to run the job in (eg. subnet-435010347a21886ab)
AWS_IAM_ROLE- AWS IAM role to assume while running the job (eg. geoffs-recovery-role)
AWS_KEYPAIR- AWS EC2 key pair to use for the instances (eg. geoffs-keypair)
JOB_OWNER- tag to use to mark the owner of the job (eg. goeff)
RECOVERY_VERSION- application version (eg. 0.6.0)
RECOVERY_CONFIG- recovery job config as described in Concepts
IGLU_RESOLVER- iglu resolver configuration as described in Concepts
KINESIS_OUTPUT- Kinesis stream to output recovered events to