Using multi-valued entities
When building custom derived models you may find yourself needing to use a multi-valued entity, i.e. an entity with multiple entries in the array. These fields should not be extracted as part of the events this run table because that would lead to duplicate rows, and so they are only extracted as part of the custom model, where you either keep the rows at that level or aggregate back up.
On this page we showcase the SQL required, including some of our specialist macros, to extract these entities into a 1-row-per-value table, which you can then use in any specific model.
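To make the duplication problem concrete, here is a hypothetical, warehouse-agnostic sketch (all values made up): an event with two entries in its entity array becomes two rows once the array is flattened, duplicating every event-level column, which is why the extraction happens in the custom model rather than in the events this run table.
-- one event, two entries in its entity array, already flattened to one row per value
with flattened_values as (
    select 'e1' as event_id, 'a' as my_field
    union all
    select 'e1' as event_id, 'b' as my_field
)

select event_id, my_field
from flattened_values
-- returns two rows for the single event e1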
SQL to extract each value
- Snowflake
- BigQuery
- Databricks
- Redshift
To make use of a multi-valued entity in Snowflake, you can use the LATERAL FLATTEN function.
SELECT
...
r.value:myField::varchar, -- field will be variant type so important to cast
...
FROM
{{ ref('snowplow_<package_name>_base_events_this_run') }} t,
LATERAL FLATTEN(input => t.contexts_my_entity_1) r
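If you also need to keep events that have no values in this entity, FLATTEN accepts an outer argument that emits a single row of nulls for empty or missing arrays. A minimal sketch of that variant, with illustrative field and column names:
SELECT
...
r.value:myField::varchar AS my_field, -- still a variant, so cast as before
...
FROM
{{ ref('snowplow_<package_name>_base_events_this_run') }} t,
LATERAL FLATTEN(input => t.contexts_my_entity_1, outer => true) r -- outer => true keeps events with no entity values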
To make use of a multi-valued entity in BigQuery, you can use the unnest function. Note that you will need to do this for each version of the context; a sketch covering multiple versions follows the example below.
SELECT
...
my_ent.my_field AS my_field,
...
FROM
{{ ref('snowplow_<package_name>_base_events_this_run') }}
LEFT JOIN
unnest(contexts_my_entity_1_0_0) AS my_ent -- left join to avoid discarding events without values in this entity
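As a sketch of handling multiple versions, assuming a second contexts_my_entity_1_1_0 column exists with the same field, you could flatten each version separately and union the results. Note that the comma join used here, unlike the LEFT JOIN above, drops events with no values in a given version:
WITH flattened AS (

SELECT e.event_id, my_ent.my_field
FROM {{ ref('snowplow_<package_name>_base_events_this_run') }} e,
unnest(e.contexts_my_entity_1_0_0) AS my_ent

UNION ALL

SELECT e.event_id, my_ent.my_field
FROM {{ ref('snowplow_<package_name>_base_events_this_run') }} e,
unnest(e.contexts_my_entity_1_1_0) AS my_ent

)

SELECT * FROM flattened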
To make use of a multi-valued entity in Databricks, you can use the LATERAL VIEW EXPLODE function.
SELECT
...
my_ent.my_field,
...
FROM
{{ ref('snowplow_<package_name>_base_events_this_run') }}
LATERAL VIEW EXPLODE(contexts_my_entity_1) AS my_ent
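Note that EXPLODE discards events whose entity array is null or empty; Spark SQL also offers an OUTER variant if you want to keep them. A minimal sketch, with illustrative names:
SELECT
event_id,
my_ent.my_field
FROM
{{ ref('snowplow_<package_name>_base_events_this_run') }}
LATERAL VIEW OUTER EXPLODE(contexts_my_entity_1) AS my_ent -- OUTER keeps events with no values, emitting nulls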
As the fields in Redshift are part of a different table, we need to join that table onto the events this run model. We do this in 3 steps:
- Get the limits for the current run to reduce the scan on the entity table
- Use a CTE and macro to get the records from the entity table, adding fields needed to deduplicate
- Join this onto events this run, using a specific set of conditions to manage duplicates but keep genuine same-valued records.
We use the snowplow_utils.get_sde_or_context macro, which takes 6 arguments:
- schema: The schema in your warehouse that the table is in
- identifier: The table name for the entity
- lower_limit: The lower limit of the event time to scan
- upper_limit: The upper limit of the event time to scan
- prefix: The prefix to apply to all column names
- single_entity: If it is a single-valued entity
In this case, we want to make sure to set single_entity to false as we are explicitly using a multi-valued entity.
{# Get the limits for the run to reduce the scan on the entity table#}
{%- set lower_limit, upper_limit = snowplow_utils.return_limits_from_model(ref('snowplow_<package_name>_base_sessions_this_run'),
'start_tstamp',
'end_tstamp') %}
{# Use the macro as part of a CTE which will select the relevant data, prefix the columns, and add the de-dupe fields #}
with {{ snowplow_utils.get_sde_or_context(var('snowplow__atomic_schema', 'atomic'), 'contexts_my_entity_1', lower_limit, upper_limit, 'my_ent', single_entity = false) }}
select
a.*,
b.my_ent_my_field
from {{ ref('snowplow_<package_name>_base_events_this_run') }} a
left join contexts_my_entity_1 b on
a.event_id = b.my_ent__id -- match the event id between the tables, note the double underscore
and a.collector_tstamp = b.my_ent__tstamp -- match the collector timestamp between the tables, note the double underscore
and mod(b.my_ent__index, a.event_id_dedupe_count) = 0 -- ensure one version of each potentially duplicated entity in context
This table can then be used for any further analysis required, either in the same model or in another model, for example by aggregating back up to one row per event as sketched below.
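A minimal sketch of that aggregation, where my_entity_values_this_run stands in for a hypothetical model built with the SQL above:
select
event_id,
count(*) as my_entity_count, -- number of entries in the entity for this event
min(my_field) as first_my_field -- a deterministic representative value
from {{ ref('my_entity_values_this_run') }}
group by event_id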