Fractribution Quickstart
The Fractribution package is no longer maintained. Please refer to the Attribution package for marketing attribution analysis with Snowplow.
Requirements
In addition to dbt being installed and a web events dataset being available in your database, you need to:

- have the `snowplow_web_page_views` derived table available as a source (generated by the snowplow_web package)
- have a table with revenue data by users (`domain_userid`, `user_id`) that serves as another source for the fractribution calculations; you can choose any of the following options:
  - your `atomic.events` table with any self-describing event that captures revenue data
  - the `snowplow_ecommerce_transaction_interactions` derived table generated by the snowplow_ecommerce package
  - any custom incremental table that is built on top of the `snowplow_web` model and results in an aggregated revenue dataset
- have `python` or `docker` installed (or you can use Snowpark on Snowflake)
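For the custom incremental table option, a minimal sketch of such a model might look like the following. Note that the model name `my_revenue_events` and the `transaction_amount` column are illustrative assumptions, not part of any Snowplow package:

```sql
-- Hypothetical aggregated revenue model built on top of snowplow_web output.
-- All table and column names below are assumptions; adapt them to your schema.
select
    ev.domain_userid,
    ev.user_id,
    ev.derived_tstamp,
    ev.transaction_amount as revenue
from {{ ref('my_revenue_events') }} as ev  -- your own revenue event model
where ev.transaction_amount > 0            -- keep only events with revenue
```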
Installation
Make sure to create a new dbt project and import this package via the `packages.yml` as recommended by dbt, or add it to an existing top level project. Do not fork the packages themselves.
Check dbt Hub for the latest installation instructions, or read the dbt docs for more information on installing packages. If you are using multiple packages you may need to up/downgrade a specific package to ensure compatibility.
```yaml
packages:
  - package: snowplow/snowplow_fractribution
    version: 0.3.6
```
Make sure to run the `dbt deps` command after updating your `packages.yml` to ensure you have the specified version of each package installed in your project.
Setup
1. Override the dispatch order in your project
To take advantage of the optimized upsert that the Snowplow packages offer, you need to ensure that certain macros are called from `snowplow_utils` first, before `dbt-core`. This can be achieved by adding the following to the top level of your `dbt_project.yml` file:
```yaml
dispatch:
  - macro_namespace: dbt
    search_order: ['snowplow_utils', 'dbt']
```
If you do not do this, the package will still work, but the incremental upserts will become more costly over time.
2. Set variables
The package has some variables that need to be set before it can be run; you should edit these in your `dbt_project.yml` file. Further customization can be done via the variables listed in the configuration page.
- `snowplow__conversion_window_start_date`: The start date in UTC for the window of conversions to include
- `snowplow__conversion_window_end_date`: The end date in UTC for the window of conversions to include
- `snowplow__conversion_hosts`: `url_hosts` to process
- `snowplow__path_transforms`: A dictionary of path transforms and their arguments (see the Path Transform Options section)
```yaml
vars:
  snowplow_fractribution:
    snowplow__conversion_window_start_date: '2022-01-01'
    snowplow__conversion_window_end_date: '2023-02-01'
    snowplow__conversion_hosts: ['mysite.com']
    snowplow__path_transforms: {'exposure_path': null}
```
If you are using Snowflake, you can automatically run the python scripts using Snowpark when running the dbt package. This is done using macros that create and run a stored procedure on Snowpark after the dbt models have completed.
To enable this you need to set some additional variables. For example, to enable this and use the `last_touch` attribution model:
```yaml
vars:
  snowplow_fractribution:
    snowplow__run_python_script_in_snowpark: true
    snowplow__attribution_model_for_snowpark: 'last_touch'
```
Modifying the conversions source
By default the `snowplow__conversions_source` is your atomic events table. In most cases this is likely to be what you want to use; however, you may wish to use the in-built conversions modeling that is part of our web package if you have already defined this, by setting `snowplow__conversions_source` to `"{{ ref('snowplow_web_sessions') }}"`.
Alternatively, if you are using Redshift/Postgres you may wish to include additional fields from a Self-Describing Event, or an Entity. To do this, you should create a new model in your project, e.g. `models/snowplow/snowplow_joined_events_table.sql`, which should have something like the following content:
For more information about dealing with duplicates and the macro in this code, make sure to see our deduplication docs.
```sql
with {{ snowplow_utils.get_sde_or_context('atomic',
                                          'my_custom_context',
                                          "'{{ get_lookback_date_limits("min") }}'",
                                          "'{{ get_lookback_date_limits("max") }}'",
                                          'my_prefix') }}

select
    events.*,
    b.*
from {{ source('atomic', 'events') }} as events
left join nl_basjes_my_prefix_1 b on
    events.event_id = b.my_prefix__id
    and events.collector_tstamp = b.my_prefix__tstamp
where
    -- filter on the appropriate partition key in addition to this;
    -- add a bit of a buffer if it is not derived_tstamp
    date(derived_tstamp) >= '{{ get_lookback_date_limits("min") }}'
    and date(derived_tstamp) <= '{{ get_lookback_date_limits("max") }}'
```
Finally, ensure you set the `snowplow__conversions_source` to `"{{ ref('snowplow_joined_events_table') }}"`.
3. Configure macros
While the macro and matching columns/tables use the term channel, it is entirely possible to define this in terms of something else, e.g. a campaign.
All the below macros are created with the intention of letting users modify them to fit their personal use case. If you wish to change one, copy the macro from the macros folder in the `snowplow_fractribution` package (at `[dbt_project_name]/dbt_packages/snowplow_fractribution/macros/conversion_clause.sql`) and add it to the macros folder of your own dbt project, where you are free to make any alterations. You will find a detailed guide / illustration with sample code within the individual macros themselves.
To overwrite these macros correctly with those in your project, ensure you prefix the macro name with `default__` in the definition, e.g.:
```sql
{% macro default__conversion_value() %}
    tr_total/100
{% endmacro %}
```
`conversion_clause` macro
The `conversion_clause` macro specifies how to filter Snowplow events to only successful conversion events. How this is filtered will depend on your definition of a conversion. The default is filtering to events where `tr_total > 0`, but this could instead filter on `event_name = 'checkout'`, for example. If you are using the e-commerce model, you will still need to set this for the fractribution code to run (even though all events are conversions in the e-commerce model); in this case change it to `transaction_revenue > 0`.
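As an illustrative sketch (not the package's shipped logic, which filters on `tr_total > 0`), an override that counts checkout events as conversions could look like this, using the `default__` prefix:

```sql
{% macro default__conversion_clause() %}
    -- Illustrative: treat only checkout events as conversions.
    event_name = 'checkout'
{% endmacro %}
```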
`conversion_value` macro
The `conversion_value` macro specifies either a single column or a calculated value that represents the value associated with that conversion. The default is `tr_total`, but `revenue`, or a calculation using `revenue` and `discount_amount` from the default e-commerce schema, for example, could similarly be used.
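For instance, a sketch of an override using the e-commerce `revenue` and `discount_amount` columns mentioned above (the exact expression here is an assumption; adapt it to your own schema and null-handling needs):

```sql
{% macro default__conversion_value() %}
    -- Illustrative: net revenue after discounts, falling back to gross
    -- revenue when no discount is recorded.
    coalesce(revenue - discount_amount, revenue)
{% endmacro %}
```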
`channel_classification` macro
The `channel_classification` macro is used to perform channel classifications. This can be altered to generate your expected channels if they differ from the channels generated in the default macro. It is highly recommended that you examine and configure this macro when using your own data, as the ROAS calculations and attribution calculations will run against these channel definitions, and the default values will not consider any custom marketing parameters.
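The packaged default contains a fuller set of rules, but a simplified sketch of the kind of logic this macro holds might look like the following, classifying on the standard Snowplow `mkt_medium` and `mkt_source` fields (the channel names here are illustrative assumptions):

```sql
{% macro default__channel_classification() %}
    -- Simplified illustration; the shipped default macro has more rules.
    case
        when lower(mkt_medium) in ('cpc', 'ppc') then 'Paid_Search'
        when lower(mkt_medium) = 'email' then 'Email'
        when lower(mkt_medium) = 'social' then 'Social'
        when mkt_medium is null and mkt_source is null then 'Direct'
        else 'Other'
    end
{% endmacro %}
```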
`channel_spend` macro
The `channel_spend` macro is used to query the spend by channels. It requires a user-supplied SQL script to extract the total ad spend by channel.

Required output format:
- `channel`: STRING NOT NULL
- `spend`: FLOAT64 (or other warehouse equivalent; use the same monetary units as conversion revenue, and NULL if unknown)
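As a sketch of that required output shape, an override could aggregate spend from your own table; the model name `my_ad_spend` below is a hypothetical placeholder, not part of the package:

```sql
{% macro default__channel_spend() %}
    -- Illustrative only: total ad spend per channel from a user-maintained
    -- table, in the same monetary units as conversion revenue.
    select
        channel,              -- STRING NOT NULL, matching your channel names
        sum(spend) as spend   -- FLOAT, NULL if unknown
    from {{ ref('my_ad_spend') }}
    group by 1
{% endmacro %}
```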
4. Run the model
Execute the following either through your CLI, within dbt Cloud, or within Snowplow BDP:
```bash
dbt run --select snowplow_fractribution
```
5. Run the python script to generate the final models
Depending on your setup, please follow the appropriate steps below. All these methods will create the following tables in your warehouse:

- `snowplow_fractribution_path_summary_with_channels`
- `snowplow_fractribution_report_table`
- `snowplow_fractribution_channel_attribution`
Run on Snowflake using Snowpark
If you enabled this already, the tables will have already been built as part of step 4. If you wish to re-run just the attribution modeling for some reason, you can run the following:
```bash
dbt run --select snowplow_fractribution_call_snowpark_macros
```
Locally run python
To run the fractribution script locally in Python, we recommend using a virtual environment such as one in `conda` or `pyenv`.
Example using conda:
```bash
conda create --name fractribution_env -c https://repo.anaconda.com/pkgs/snowflake python=3.8 absl-py
conda activate fractribution_env
```
I. Install packages
You can install the packages using `pip install -r dbt_packages/snowplow_fractribution/utils/requirements.txt` (or the appropriate path from your terminal working directory).
Please note that some of the libraries are adapter specific. These are listed in the `requirements` file, and you can also find the necessary list for each adapter below:
- BigQuery:
  - `absl-py==1.2.0`
  - `google-cloud-bigquery==3.5.0`
- Databricks:
  - `absl-py==1.2.0`
  - `databricks-sql-connector==2.1.0`
  - `pandas`
- Snowflake:
  - `absl-py==1.2.0`
  - `snowflake-snowpark-python==0.11.0`
- Redshift:
  - `absl-py==1.2.0`
  - `redshift_connector==2.0.910`
M1 Instructions (for Snowflake only)
There is an issue with running Snowpark on M1 chips. A workaround recommended by Snowflake is to set up a virtual environment that uses x86 Python:
```bash
CONDA_SUBDIR=osx-64 conda create -n fractribution_env python=3.8 absl-py -c https://repo.anaconda.com/pkgs/snowflake
conda activate fractribution_env
conda config --env --set subdir osx-64
```
II. Set the connection parameters in your terminal
- BigQuery:

```bash
export project_id=project_id
export bigquery_dataset=bigquery_dataset
export google_application_credentials=google_application_credentials
```

- Databricks:

```bash
export databricks_schema=derived_schema_name
export databricks_server_hostname=hostname
export databricks_http_path=http_path
export databricks_token=token
```

- Snowflake:

```bash
export snowflake_account=my_account
export snowflake_user=sf_user
export snowflake_password=password
export snowflake_user_role=special_role
export snowflake_warehouse=warehouse_name
export snowflake_database=database_name
export snowflake_schema=derived_schema_name
```

- Redshift:

```bash
export redshift_host=redshift_host
export redshift_database=redshift_database
export redshift_port=redshift_port
export redshift_user=redshift_user
export redshift_password=redshift_password
export redshift_schema=redshift_schema
```
III. Run the fractribution script
Run the adapter specific main fractribution script by specifying the conversion window start and end dates, and the attribution model (if you are not using the default `shapley`, see here for more options). Example:
- BigQuery:

```bash
python main_snowplow_bigquery.py --conversion_window_start_date '2022-06-03' --conversion_window_end_date '2022-08-01' --attribution_model last_touch
```

- Databricks:

```bash
python main_snowplow_databricks.py --conversion_window_start_date '2022-06-03' --conversion_window_end_date '2022-08-01' --attribution_model last_touch
```

- Snowflake:

```bash
python main_snowplow_snowflake.py --conversion_window_start_date '2022-06-03' --conversion_window_end_date '2022-08-01' --attribution_model last_touch
```

- Redshift:

```bash
python main_snowplow_redshift.py --conversion_window_start_date '2022-06-03' --conversion_window_end_date '2022-08-01' --attribution_model last_touch
```
To help you debug more easily, we have included logs that are printed each time a table is created in the warehouse; there should be 3 in total (`snowplow_fractribution_path_summary_with_channels`, `snowplow_fractribution_channel_attribution` and `snowplow_fractribution_report_table`). There is also a `--verbose` flag you can include in your shell prompt when executing the python scripts, which may give you an even more detailed breakdown depending on your warehouse.
Run python with docker
I. Pull the docker image
You can pull the latest docker image from Docker Hub: `docker pull snowplow/fractribution:latest`. Alternatively, you can pull it based on the package version: `docker pull snowplow/fractribution:0.2.0`.
II. Set the environment variables
Add the necessary environment variables to an environment file, e.g. `configs.env`. The necessary variables will differ depending on the data warehouse you are using. The easiest way to determine the variables you need to set is to check the Dockerfile in the fractribution dbt package: `dbt-snowplow-fractribution/utils/Dockerfile`. Please note that in the case of BigQuery, the `google_application_credentials` env var is not needed for Docker, as you mount this as a volume at run time.
Below is an example of the `configs.env` file (set up for Snowflake). Note that the last 4 variables are named the same across all warehouses. You do not need to specify the attribution model if using the default, `shapley`:
```text
snowflake_account=youraccount.ap-southeast-2
snowflake_user=user
snowflake_password=abc123
snowflake_user_role=DBT
snowflake_warehouse=WH
snowflake_database=snowplow
snowflake_schema=FRACTRIBUTION_DERIVED
conversion_window_start_date=2022-06-03
conversion_window_end_date=2022-08-01
attribution_model=last_touch
warehouse=snowflake
```
III. Run the docker container
- BigQuery (you need to mount your service account keyfile when running the docker image):

```bash
docker run --rm --env-file /path/to/env/file/configs.env -v /path/to/yourkeyfile.json:/keyfile.json -it snowplow/fractribution:latest
```

- Databricks, Snowflake or Redshift:

```bash
docker run --rm --env-file /path/to/env/file/configs.env -it snowplow/fractribution:latest
```
To help you debug more easily, we have included logs that are printed each time a table is created in the warehouse; there should be 3 in total (`snowplow_fractribution_path_summary_with_channels`, `snowplow_fractribution_channel_attribution` and `snowplow_fractribution_report_table`). To run the script with the `--verbose` flag, set the `verbose_run` variable to "true" in your env file, which may give you an even more detailed breakdown depending on your warehouse.