Snowplow event extractor
Overview
Azure Data Lake is a secure and scalable data storage and analytics service. Azure Data Lake Analytics includes U-SQL, a big-data query language for writing queries that analyze data.
Event Extractor
Snowplow Event Extractor is a custom extractor for Azure Data Lake Analytics (ADLA) that parses Snowplow enriched events. Snowplow’s enrichment process outputs each enriched event as one line of tab-separated values (TSV) consisting of 131 fields.
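To make the TSV layout concrete, here is a minimal sketch of splitting one enriched-event line into its fields. The class and method names (EnrichedEventLine, SplitFields) are hypothetical and are not part of the extractor; the field count of 131 is taken from the description above.

```csharp
using System;

public static class EnrichedEventLine
{
    // Field count stated in the description above.
    public const int ExpectedFieldCount = 131;

    // Splits one enriched-event line on tabs. Empty fields are kept,
    // so the position of every field is preserved.
    public static string[] SplitFields(string line)
    {
        string[] fields = line.Split('\t');
        if (fields.Length != ExpectedFieldCount)
        {
            throw new FormatException(
                $"Expected {ExpectedFieldCount} fields but found {fields.Length}.");
        }
        return fields;
    }
}
```

Rejecting lines with an unexpected field count is a design choice of this sketch; a real extractor might skip or log such rows instead.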
EventExtractor extends IExtractor, the base class that U-SQL custom extractors derive from:
using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor]
public class EventExtractor : IExtractor
{
    // Fields of an enriched event are tab-separated.
    private const char FIELD_DELIMITER = '\t';

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        // Split each input row into fields on FIELD_DELIMITER.
        // Set the extracted values on the output object.
        // EventExtractor only emits the columns that are declared in the output schema.
    }
}
Usage
The following is a basic U-SQL script that uses the Event Extractor:
DECLARE @input_file string = @"\snowplow\event.tsv";
@rs0 =
    EXTRACT
        app_id string,
        platform string
    FROM @input_file
    USING new Snowplow.EventExtractor();
The most complex piece of processing is the handling of the self-describing JSONs found in the enriched event's unstruct_event, contexts and derived_contexts fields.
Consider the following contexts value found in the TSV:
{
    "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0",
    "data": [{
        "schema": "iglu:org.schema/WebPage/jsonschema/1-0-0",
        "data": {
            "genre": "blog",
            "inLanguage": "en-US",
            "datePublished": "2014-11-06T00:00:00Z",
            "author": "Devesh Shetty",
            "breadcrumb": ["blog", "releases"]
        }
    }, {
        "schema": "iglu:org.w3/PerformanceTiming/jsonschema/1-0-0",
        "data": {
            "navigationStart": 1415358089861,
            "unloadEventStart": 1415358090270,
            "unloadEventEnd": 1415358090287,
            "redirectStart": 0,
            "redirectEnd": 0
        }
    }]
}
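Outside of U-SQL, the nesting in this structure can be explored with a few lines of C#. The following is a minimal sketch, assuming the System.Text.Json library is available; ContextsReader and FindContextData are hypothetical names and are not part of the extractor. It walks the outer data array and returns the inner data object of every context whose schema URI contains a given fragment.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class ContextsReader
{
    // Returns the inner "data" object of each context whose "schema"
    // URI contains schemaFragment (for example "org.schema/WebPage").
    public static List<JsonElement> FindContextData(string contextsJson, string schemaFragment)
    {
        var matches = new List<JsonElement>();
        using (JsonDocument doc = JsonDocument.Parse(contextsJson))
        {
            // The outer "data" property is the array of individual contexts.
            foreach (JsonElement context in doc.RootElement.GetProperty("data").EnumerateArray())
            {
                var schema = context.GetProperty("schema").GetString();
                if (schema != null && schema.Contains(schemaFragment))
                {
                    // Clone so the element stays usable after the document is disposed.
                    matches.Add(context.GetProperty("data").Clone());
                }
            }
        }
        return matches;
    }
}
```

With the value shown above, calling FindContextData(json, "org.schema/WebPage") would yield one element, and reading its genre property with GetProperty("genre").GetString() would give the string blog.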
One way to fetch data from a context is to use a user-defined function (UDF):
DECLARE @input_file string = @"\snowplow\event.tsv";
// Extract the context column from the TSV
@rs0 =
    EXTRACT
        context string
    FROM @input_file
    USING new Snowplow.EventExtractor();
/*
The context value wraps a nested data array
*/
@parseData =
    SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(context, "data[*]").Values AS data_arr
    FROM @rs0;
/*
Each element of the nested data array is itself a self-describing JSON, from which we parse the inner data field
*/
@parseGenre =
    SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(data_arr, "$.data.genre").Values AS genre
    FROM @parseData;
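This approach becomes complex quickly, because every level of nesting needs its own parsing step.
To abstract that complexity away, Snowplow Event Extractor follows a simple mapping in which the nested field is named directly in the EXTRACT clause: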
DECLARE @input_file string = @"\snowplow\event.tsv";
// Extract genre from the context directly
@rsGenre =
    EXTRACT
        context.data.genre
    FROM @input_file
    USING new Snowplow.EventExtractor();
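How a dotted name such as context.data.genre is resolved is not spelled out here. One plausible reading, sketched below purely as an assumption, is that the last segment names a property that is looked up in the inner data object of each context, with the first match returned. DottedColumnLookup and Resolve are hypothetical names, the sketch assumes System.Text.Json is available, and the extractor's actual behavior may differ.

```csharp
using System;
using System.Text.Json;

public static class DottedColumnLookup
{
    // Hypothetical resolver: for "context.data.genre", returns the first
    // "genre" value found in any context's inner "data" object,
    // or null when no context carries that property.
    public static string Resolve(string columnName, string contextsJson)
    {
        string[] parts = columnName.Split('.');
        if (parts.Length != 3 || parts[1] != "data")
        {
            throw new ArgumentException(
                "Expected a name of the form context.data.<property>.", nameof(columnName));
        }
        string property = parts[2];

        using (JsonDocument doc = JsonDocument.Parse(contextsJson))
        {
            foreach (JsonElement context in doc.RootElement.GetProperty("data").EnumerateArray())
            {
                if (context.GetProperty("data").TryGetProperty(property, out JsonElement value))
                {
                    // Strings are returned unquoted; other values as raw JSON text.
                    return value.ValueKind == JsonValueKind.String
                        ? value.GetString()
                        : value.GetRawText();
                }
            }
        }
        return null;
    }
}
```

Returning only the first match keeps the sketch short; when several contexts carry the same property name, a real implementation would need a rule for choosing between them, for example by schema.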