Skip to main content
Release Version Early Release Snowplow Personal and Academic License
danger

Make sure you upgrade to v.0.4.0 of Unified Package if you would like to use the v0.2.0 of Attribution Package

Snowplow Attribution Package

The package source code can be found in the snowplow/dbt-snowplow-attribution repo, and the docs for the macro design are here.

tip

Although there is no direct dependency, by default the attribution package is dependent on the snowplow_unified_views as well as the snowplow_unified_conversions table created by the snowplow_unified dbt package. As long as the field names are matched you can use other source tables, too.

Overview

​ The Snowplow Attribution dbt package produces a set of incremental derived tables to provide a basis for in-depth Marketing Attribution Analysis on an ongoing basis. It allows you to attribute the value of a conversion to one or more channels and campaigns depending on the conversion pathway. As a result, it becomes possible to determine the revenue per pathway (channel or campaign), as well as ROAS (Return On Ad Spend - the amount of revenue that is earned for every dollar spent on advertising) once you have cost data for each marketing channel or campaign.

This package consists of a series of dbt models that produce the following tables:

  • snowplow_attribution_paths_to_conversion: Customer id and the paths the customer has followed that have lead to conversion
  • snowplow_attribution_campaign_attributions: By campaign path and conversion level incremental table that attributes the conversion value based on various algorithms
  • snowplow_attribution_channel_attributions: By channel path and conversion level incremental table that attributes the conversion value based on various algorithms
  • snowplow_attribution_overview: The user defined report view (potentially showing ROAS)
  • snowplow_attribution_path_summary: For each unique path, a summary of associated conversions, optionally non-conversions and revenue
  • snowplow_attribution_paths_to_non_conversion: Customer id and the paths the customer has followed that have not lead to conversion. Optional drop and recompute table, disabled by default.

In the Quick Start section you will find a step-by-step guide on how to operate the package as a whole.

Package flowPackage flow

Attribution Models

The package currently offers 4 different attribution models all of which are calculated by default into both the incremental base tables and the report tables. If you would like to filter on them for reporting you can do so with var('snowplow__attribution_list'). Please note, however, that they will still be available for the incremental base tables.

  • first_touch: Assigns 100% attribution to the first channel in each path.
  • last_touch: Assigns 100% attribution to the last channel in each path.
  • position_based: The first and last channels get 40% of the credit each, with the remaining channels getting the leftover 20% distributed evenly.
  • linear: Assigns attribution evenly between all channels on the path. ​

    Data processing model for the attribution packageData processing model for the attribution package

What is Marketing Attribution?

Marketing attribution determines which marketing tactics are contributing to sales or conversions by analysing the marketing touchpoints a consumer encounters on their journey to purchase. The aim is to determine which channels and marketing campaigns had the greatest impact on the decision to convert. There are many popular attribution models used by marketers which give insight into customers' behaviors, more specifically the pathways they took to purchase the product or service. This allows marketing teams to improve ROAS by changing marketing strategies and campaigns.

Benefits of using our package

The package was implemented with a glassbox philosophy in mind, making it very transparent for data consumers how the data is transformed. It is processed all in SQL inside your warehouse, there are no external dependencies e.g. extra Python packages or black-box ML algorithms.

It is also very flexible, there are no limitations on the touchpoints (it works on all channels or campaigns). You can use ad spend data from any 3rd party data sources. Although there are no direct dependencies on data sources either, it is recommended to be used together with the Unified Digital dbt package for a seamless setup without the need to overwrite default dbt macros.

It also does the heavy-lifting for you and provides you with incremental data models reducing unnecessary costs even if you would like to regularly analyze the data based on a shifting time window in your reports (e.g. last 30 days).

How to do Marketing Attribution with this package?

The purpose of this package is to allow an incremental, efficient way to do marketing attribution without having to read all your data every time.

In the below guide we will walk you through the data transformation process step-by-step in order for you to see how the source data changes downstream. This will give you and your team a transparent and easy-to-understand way to see how this package will lead you to valuable insights.

We also provide the Marketing Attribution Data App specifically to help your analysis by visualizing the output in the form of interactive dashboards as well as letting you capture datasets for comparison. It works in tandem with the package and will auto-update daily in case your package has been processed since then.

Sources you are going to need

1. Conversion source

You will also need a table where the conversion events are stored. If you use the snowplow_unified model and configure conversions to be modelled, you will have this information in your derived.snowplow_unified_conversions table (default):


In case you store your conversion events elsewhere in the data warehouse (e.g. because those are transactional data that you decide you do not want to process through Snowplow), you can mutate that source table into the format of the conversions table generated by the unified package, store it as a sql view, and refer to that in the variable snowplow__conversion_source (in the format of schema.view_name).

You could also union multiple sources (e.g. one transactional event data for automatic conversion events not dictated by user actions and then the rest, ending up in the unified_conversions table). Bear in mind though, that the incremental logic is dictated based on the last processed conversion in your conversion source, you need to make sure those tables are synced before the model is run.

Please bear in mind, if you would like to calculate the attribution for conversion events not processed by Snowplow, you would need to make sure these events contain a user_id type field which the package can rely on to find previous website visits for it to be able to attribute them to these conversions. Otherwise, your overall revenue will be more than the total attributed revenue.

Conversion formatting requirements

Fields to use (including the name) when creating the view:

  • event_id: can be dummy UUID, needs to be unique though
  • session_identifier (OPTIONAL): can be left out if it is the only source or NULL for unions with the unified.conversions table
  • user_identifier: the user id field to be used for identifying pre conversion website visits
  • user_id: the user id field to be used for identifying pre-conversion website visits
  • stitched_user_id: the user id field to be used for identifying pre conversion website visits (this will be used if conversion_stiching is enabled)
  • cv_value: the conversion amount
  • cv_tstamp: a timestamp field when the conversion took place
  • dvce_created_tstamp (OPTIONAL): a timestamp field, can be left out if it is the only source or the same as the cv_tstamp for unions with the unified.conversions table
  • cv_type: you can name the types of conversions to differentiate them, if needed, it should not be left NULL/blank
  • cv_tstamp_date: (DATABRICKS/SPARK ONLY): the date of the cv_tstamp value

2. Path source

You will also need a source table to track your user journey / path with fields to be able to classify your marketing channels. The perfect table for this is your derived.snowplow_unified_views table (default):


Alternatively, you could use the derived.snowplow_unified_sessions table as well, but bear in mind that this will mean only the first channel/campaign will be counted within a session, and you will have to make sure the correct field reference is used in the snowplow_attribution_paths_to_conversion table by overwriting the paths_to_conversion() macro in your project (e.g first_page_urlhost instead of page_urlhost).

As for campaigns,

tip

To fully finish the config you might need to overwrite the channel_classification() macro. In case your classification logic for attribution analysis needs to be the same as the one already configured in the snowplow_unified model you can simply leave the default macro which refers to that field.

You most likely have a warehouse with marketing (ad) spend information by channel and date, something like this:


To make it flexible to use what you already have, we suggest creating a view on top of the table you have, rename the fields that the model will use and add that view reference in var('snowplow__spend_source'). For more details on how to do this check out our Quick Start Guide.

4. User mapping source

In order to be able to attribute the value of a conversion to one or more channels and campaigns it is essential to have a shared user identifier field between the conversion events and the preceding website visits. If the Unified dbt package is used for the conversion and conversion path source, there are two out-of-the box stitching options to use:

  1. By enabling BOTH the snowplow__view_stitching the snowplow__conversion_stitching in the unified package, both sources will get the latest stitched_user_id field based on logged in user activity (see details here). By setting the snowplow__conversion_stitching variable in the attribution package to true, the package will consider that field to take as a basis for both the conversion and the path source. Please note that this could potentially become a costly operation due to the volume of the snowplow_unified_views table, as after each run all the id fields will be updated by the latest mapping.

  2. Alternatively (default from v.0.3.0 onwards), the package will rely on the snowplow__user_mapping_source defaulted to the derived.snowplow_unified_user_mapping table which will get used to take the latest logged in business user_id field per user_identifier, if available, otherwise it will keep the user_identifier value for both the conversions and conversion path source before the two are joined.

  3. When using custom sources or solutions, this logic could be overwritten in the dbt project using the snowplow_attribution_paths_to_conversion() dispatch macro. In case the optional snowplow_attribution_paths_to_non_conversion table is also in use, and it needs custom stitching logic, it is advised to disable it and create a custom model with the desired stitching logic.

Another option is if you alias your unique user id field as stitched_user_id in both your custom path and conversion source and set the snowplow__conversion_stitching variable to true, then the macro doesn't need to be altered but it will rely on these fields when linking conversions to paths.

One-off setup

Once you have the sources ready you need to adjust the package settings that fit your business logic. This has to be done once, otherwise your data might be misaligned due to the incremental logic or you would need to run a full-refresh:

Decide on your sessionization logic

By default, Snowplow only considers the first pageview of a session important from an attribution point of view and disregards campaign information from subsequent page_views. Google Analytics, on the other hand separates session as soon as campaign information is given. Set var('consider_intrasession_channels') variable to false in case you would like to follow Snowplow's logic, not GA's. If you opt for this calculation consider changing the var('snowplow__conversion_path_source') to {{ target.schema }}.derived.snowplow_unified_sessions for performance benefits.

Use the variable snowplow__conversion_hosts to restrict which hosts to take into account for conversions.

Filter unwanted / wanted channels & campaigns

You can specify a list of channels for the variable snowplow__channels_to_exclude to exclude them from analysis (if kept empty all channels are kept). For example, users may want to exclude the 'Direct' channel from the analysis.

You can also do the opposite, filter on certain channels to include in your analysis. You can do so by specifying them in the list captured within the variable snowplow__channels_to_include.

You can do either for campaigns, too, with the snowplow__channels_to_exclude and snowplow__channels_to_include variables.

Transform paths

In order to reduce unneccesarily long paths you can apply a number of path transformations that are created as part of user defined functions automatically in your warehouse by the package.

In order to apply these transformations, all you have to do is to define them in the snowplow__path_transforms variable as a list of dictionaries, with the transformation name as key and optionally the parameter as value (for remove_if_last_and_not_all and remove_if_not_all). If the transformation requires no parameter you can just use null as values for the dictionary. For more details on how to do this, check out the configuration page E.g.: {'exposure_path': null, 'remove_if_last_and_not_all': 'direct'}

Path transform options

Paths to conversion are often similar, but not identical. As such, path transforms reduce unnecessary complexity in similar paths before running the attribution algorithm. The following transformations are available: ​

  1. exposure (default): the same events in succession are reduced to one: A → A → B becomes A → B, a compromise between first and unique

  2. unique: all events in a path are treated as unique (no reduction of complexity). Best for smaller datasets (small lookback window) without a lot of retargeting

  3. first: keep only the first occurrence of any event: A → B → A becomes A → B, best for brand awareness marketing

  4. remove_if_last_and_not_all: requires a channel to be added as a parameter, which gets removed from the latest paths unless it removes the whole path as it is trying to reach a non-matching channel parameter: E.g target element: A path: A → B → A → A becomes A → B

    A common requirement is to take the last non-direct clicks only into account during attribution analysis. If you use the remove_if_last_and_not_all path transform with direct as the parameter to remove it from the end of the path (unless they are all direct) the last-touch attribution results should show the outcome you need. You can also filter out the direct channels altogether with the snowplow__channels_to_exclude variable.

  5. remove_if_not_all: requires a channel to be added as a parameter, which gets removed from the path altogether unless it would result in the whole path's removal: E.g target element: A path: A → B → A → A becomes B

    Apart from this, you can also restrict how far in time (var('snowplow_path_lookback_days')) and steps (var('snowplow_path_lookback_steps')) you want to allow your path to go from the actual conversion event.

Other, macro based setup

tip

To overwrite these macros correctly with those in your project, ensure you prefix the macro name by default__ in the definition e.g.

{% macro default__channel_classification() %}
case when (...)
{% endmacro %}

channel_classification macro

The channel_classification macro is used to classify each marketing touchpoint from your path source deciding which marketing channel it was driven by. Be default it expects an already classified field called default_channel_group. In case the Attribution package is used together with the Unified Digital package, it is advisable to classify your pageviews and sessions upstream using the channel_group_guery() macro. For an in-depth guide on how to achive this check the channel group query macro documentation.

attribution_overview macro

Defines the sql for the view called attribution overview which provides the main report calculating the return on advertising spend (ROAS). In order to do that you would need a marketing spend table as an additional source which will contain the spend information by channel and or campaign with a timestamp to filter on the period. If you are happy with the logic of the macro, you can just define your spend table in the snowplow__spend_source variable and let it run.

paths_to_conversion macro

Macro to allow flexibility for users to modify the definition of the paths_to_conversion incremental table. By default the incremental table uses the snowplow_unified_views and snowplow_unified_conversions source tables but this macro allows for flexibility around that. You can also restrict which conversion events to take

Adding tests (Optional)

Please note that the Unified data model allows nulls on user_identifier field, but the Attribution package does not, by default we filter out nulls within the paths_to_conversion macro when joining on the path_source and also on the conversion_source using the conversion_clause variable, which by default includes the null filter. We encourage users to add their own tests in the appropriate level (e.g. in the derived.snowplow_unified_views level) to make sure you don't accidentally exclude those events without user_identifier that are null (e.g. due to a tracking issue).

Output

Incremental data models to prepare for attribution analysis:

  1. The derived.snowplow_attribution_paths_to_conversion model will aggregate the paths the customer has followed that have lead to conversion based on the path transformation and other limitations such as the path_lookback_step or path_lookback_days variable i.e. it combines the path and conversion source tables to produce an outcome. It looks like this:
customer_idcv_tstamprevenuechannel_pathchannel_transformed_pathcampaign_pathcampaign_transformed_path
user_id12022-06-11 15:3320.42DirectDirectcamp1 > camp2camp1 > camp2
user_id22022-07-30 11:5524Direct > DirectDirectcamp1camp1
user_id32022-06-08 20:1850Direct > DirectDirectcamp2 > camp1camp2 > camp1
user_id12022-07-25 07:52140Organic_Search > Direct > Organic_SearchOrganic_Search > Direct > Organic_SearchCampaign 2 > Campaign 2 > Campaign 1 > Campaign 1Campaign 2 > Campaign 1
  1. The derived.snowplow_attribution_channel_attributions unnests the paths from paths_to_conversion into their separate rows and calculates the attribution amount for that specific path step for each of the sql based attribution models:
composite_keyevent_idcustomer_idcv_tstampcv_total_revenuechannel_transformed_pathchannelsource_indexpath_lengthfirst_touch_attributionlast_touch_attributionlinear_attributionposition_based_attribution
id1_Video0event_1user_id12023-07-07 13:05:55.000200VideoVideo01200200200200
id2_Display_Other0event_2user_id22023-07-19 04:27:51.00066.5Display_Other > Organic_SearchDisplay_Other0266.5033.2533.25
id3_Organic_Search1event_2user_id22023-07-19 04:27:51.00066.5Display_Other > Organic_SearchOrganic_Search12066.533.2533.25
  1. The derived.snowplow_attribution_campaign_attributions does the same, only for campaigns not channels.

Drop and recompute reporting tables/views:

  1. The derived.snowplow_attribution_path_summary shows the campaign/channel paths and the associated conversions (and optionally non-conversions, if the path_to_non_conversions table is enabled through its related variable enable_path_to_non_conversions)
transformed_pathconversionsnon_conversionsrevenue
Direct32594.42
Organic_Search226206.5
Referral > Direct010
Video12200
Organic_Search > Paid_Search_Other020
  1. The view called derived.snowplow_attribution_overview is tied to a dispatch macro of the same name which lets you overwrite it in your project, if needed. Given you specify your var('snowplow__spend_source') it will calculate the ROAS for you for each channel and campaign:
path_typeattribution_typetouch_pointin_n_conversion_pathsattributed_conversionsmin_cv_tstampmax_cv_tstampspendsum_cv_total_revenueattributed_revenueroas
campaignfirst_touchCampaign 2212023-07-19 04:27:51.0002023-07-25 07:52:34.000100,000206.5206.50.002065
campaignlast_touchCampaign 1312023-07-07 13:05:55.0002023-07-30 11:55:24.000100,0003643640.00364
channelfirst_touchDisplay_Other112023-07-19 04:27:51.0002023-07-19 04:27:51.000100,00066.566.50.000665
channelfirst_touchOrganic_Search202023-07-19 04:27:51.0002023-07-25 07:52:34.000100,000206.500

Manifest table

We have included a manifest table to log information about the setup variables each time the paths_to_conversion incremental table runs to help prevent and debug issues.