
Run EmrEtlRunner


Run command

The most useful command is run, which actually runs your EMR job:

./snowplow-emr-etl-runner run

The available options are as follows:

Usage: run [options]
    -c, --config CONFIG              configuration file
    -n, --enrichments ENRICHMENTS    enrichments directory
    -r, --resolver RESOLVER          Iglu resolver file
    -t, --targets TARGETS            targets directory
    -d, --debug                      enable EMR Job Flow debugging
    -f {enrich,shred,elasticsearch,archive_raw,rdb_load,analyze,archive_enriched,archive_shredded,staging_stream_enrich},
        --resume-from                resume from the specified step
    -x {staging,enrich,shred,elasticsearch,archive_raw,rdb_load,consistency_check,analyze,load_manifest_check,archive_enriched,archive_shredded,staging_stream_enrich},
        --skip                       skip the specified step(s)
    -i, --include {vacuum}           include additional step(s)
    -l, --lock PATH                  where to store the lock
        --ignore-lock-on-start       ignore the lock if it is set when starting
        --consul ADDRESS             address to the Consul server

Note that the config and resolver options are mandatory.

Note that in Stream Enrich mode you can neither skip nor resume from staging, enrich and archive_raw. Instead of staging and enrich, Stream Enrich mode uses a single special step, staging_stream_enrich.
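
As a minimal sketch of how the options above combine, the wrapper below always passes the two mandatory options and forwards anything else to run. The file paths, the RUNNER variable and the run_etl function name are assumptions made for illustration; they are not part of EmrEtlRunner.

```shell
# Hypothetical locations; adjust to wherever your files actually live.
RUNNER="${RUNNER:-./snowplow-emr-etl-runner}"
CONFIG="${CONFIG:-config/config.yml}"
RESOLVER="${RESOLVER:-config/iglu_resolver.json}"

# Invoke `run` with the mandatory --config and --resolver options,
# forwarding any extra options (e.g. --skip, --resume-from, --debug).
run_etl() {
  "$RUNNER" run --config "$CONFIG" --resolver "$RESOLVER" "$@"
}
```

With this in place, run_etl --resume-from shred would resume from the shred step, and run_etl --skip analyze would skip the analyze step.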

Lint commands

Other useful commands are the lint commands, which let you check that your resolver or enrichments are valid against their respective schemas.

If you want to lint your resolver:

./snowplow-emr-etl-runner lint resolver

The mandatory options are:

Usage: lint resolver [options]
    -r, --resolver RESOLVER          Iglu resolver file

If you want to lint your enrichments:

./snowplow-emr-etl-runner lint enrichments

The mandatory options are:

Usage: lint enrichments [options]
    -r, --resolver RESOLVER          Iglu resolver file
    -n, --enrichments ENRICHMENTS    enrichments directory
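
The two lint commands can be chained into a single pre-flight check before a run. The sketch below is one way to do that. The paths, the RUNNER variable and the lint_all function name are hypothetical, and it assumes the lint commands exit with a non-zero status when validation fails, which this page does not state.

```shell
# Hypothetical locations; adjust to your own setup.
RUNNER="${RUNNER:-./snowplow-emr-etl-runner}"
RESOLVER="${RESOLVER:-config/iglu_resolver.json}"
ENRICHMENTS="${ENRICHMENTS:-config/enrichments}"

# Lint the resolver first, then the enrichments; stop at the first failure.
# Assumption: each lint command exits non-zero when validation fails.
lint_all() {
  "$RUNNER" lint resolver --resolver "$RESOLVER" &&
    "$RUNNER" lint enrichments --resolver "$RESOLVER" --enrichments "$ENRICHMENTS"
}
```

Calling lint_all before a run then gives a single exit status to check.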

Checking the results

Once EmrEtlRunner has run, you should be able to inspect, in S3, the folder specified by the :out: parameter in your config.yml file and see newly generated files. These contain the cleaned data, ready either for loading into a storage target (e.g. Redshift) or for analysing directly on EMR with Hive, Spark or another query tool.

Note: most Snowplow users run the 'spark' version of the ETL process, in which case the generated data is saved into subfolders with names of the form part-000.... If, however, you are running the legacy 'hive' ETL (e.g. because you want to use Hive as your storage target rather than Redshift, which is the only storage target the 'spark' ETL currently supports), the subfolder names will be of the form dt=....
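
To inspect the output folder from a terminal rather than the S3 console, one option is the AWS CLI. The sketch below assumes the AWS CLI is installed and has credentials that can read the bucket. The S3 location is a placeholder and should be replaced with the value of the :out: parameter from your own config.yml. The AWS variable and the list_output function name are likewise illustrative.

```shell
# Assumption: the AWS CLI is installed and configured with read access.
AWS="${AWS:-aws}"
# Hypothetical location; use the :out: value from your config.yml.
OUT_LOCATION="${OUT_LOCATION:-s3://my-bucket/enriched/good/}"

# Recursively list every object under the output location.
list_output() {
  "$AWS" s3 ls "$OUT_LOCATION" --recursive
}
```

Running list_output after a job should show the newly generated files, with key names following whichever of the two subfolder patterns described above applies to your ETL.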

Next steps

Comfortable using EmrEtlRunner? Then schedule it so that it regularly takes new data generated by Stream Enrich, shreds it (using RDB Shredder) and loads it into Redshift (using RDB Loader).
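
One common way to schedule a command is cron. The sketch below wraps a run with a targets directory and a lock path, and shows a possible crontab entry in a comment. Every path, the schedule, the RUNNER variable and the scheduled_run function name are assumptions. The --lock option is used only as described in the option list above ("where to store the lock"); its behaviour when runs overlap is not covered here.

```shell
# Hypothetical locations; adjust to your own setup.
RUNNER="${RUNNER:-./snowplow-emr-etl-runner}"
CONFIG="${CONFIG:-config/config.yml}"
RESOLVER="${RESOLVER:-config/iglu_resolver.json}"
TARGETS="${TARGETS:-config/targets}"
LOCK="${LOCK:-/var/lock/snowplow-emr-etl-runner.lock}"

# One scheduled run: the mandatory options plus a targets directory
# and a lock path.
scheduled_run() {
  "$RUNNER" run --config "$CONFIG" --resolver "$RESOLVER" \
    --targets "$TARGETS" --lock "$LOCK"
}

# Example crontab entry (daily at 03:00), assuming this file is saved as
# /opt/snowplow/run-etl.sh and ends with a call to scheduled_run:
#   0 3 * * * /opt/snowplow/run-etl.sh >> /var/log/snowplow/etl.log 2>&1
```

How often to run depends on your data volume and how long each EMR job takes, so treat the daily schedule as a placeholder.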