BigQuery Loader 2.0.0 upgrade guide

Configuration

BigQuery Loader 2.0.0 simplifies the loading setup. You no longer need to configure and deploy three independent applications (Loader, Repeater and Mutator in 1.x) to load your data into BigQuery. Starting from 2.0.0, only one application is needed, which introduces some breaking changes to the configuration file structure.

See the configuration reference for all possible configuration parameters and the minimal configuration samples for each of the supported cloud environments.

Infrastructure

In addition to Repeater and Mutator, the following infrastructure components are now obsolete:

The types PubSub topic connecting Loader and Mutator.
The failedInserts PubSub topic connecting Loader and Repeater.
The deadLetter GCS bucket used by Repeater to store data that repeatedly failed to be inserted into BigQuery.

Events table format

Starting from 2.0.0, BigQuery Loader changes its output column naming strategy. For example, for the ad_click event:

Before the upgrade, the corresponding column is named unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1_0_0.
After the upgrade, the new column is named unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1.

All self-describing events and entities are loaded into new "major version"-oriented columns. Old "full version"-oriented columns remain unchanged, but no new data is loaded into them (the 2.0.0 loader simply ignores these columns).
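The difference between the two naming schemes can be sketched in a few lines of Python. This is only an illustration, not the loader's actual code: the `column_name` function is hypothetical, and it assumes that the only normalisation needed is replacing dots in the vendor with underscores.

```python
def column_name(prefix: str, vendor: str, name: str, version: str,
                legacy: bool = False) -> str:
    """Build a column name for a self-describing event or entity.

    prefix  -- "unstruct_event" or "contexts"
    version -- schema version in MODEL-REVISION-ADDITION form, e.g. "1-0-0"
    legacy  -- True for the 1.x "full version" style, False for the 2.x style
    """
    major, minor, patch = version.split("-")
    # Assumption: dots in the vendor become underscores; any other
    # normalisation the real loader may apply is ignored in this sketch.
    base = f"{prefix}_{vendor.replace('.', '_')}_{name}"
    if legacy:
        return f"{base}_{major}_{minor}_{patch}"  # 1.x: one column per full version
    return f"{base}_{major}"                      # 2.x: one column per major version


old = column_name("unstruct_event", "com.snowplowanalytics.snowplow.media",
                  "ad_click_event", "1-0-0", legacy=True)
new = column_name("unstruct_event", "com.snowplowanalytics.snowplow.media",
                  "ad_click_event", "1-0-0")
```

With these inputs, `old` and `new` reproduce the two column names shown above. Under the new scheme, versions 1-0-0, 1-0-1 and 1-1-0 of the same schema all map to the same column.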

The new column naming scheme has several advantages:

Fewer columns created (BigQuery has a limit on the total number of columns)
No need to update data models (or use complex macros) every time a new minor version of a schema is created

The catch is that you have to follow the rules of schema evolution more strictly to ensure data from different schema versions can fit in the same column — see below.

Consolidating old and new columns

If you are using Snowplow dbt models, they will automatically consolidate the data between _1_0_0 and _1 style columns, because they look at the major version prefix (e.g. _1), which is common to both.

If you are not using Snowplow dbt models but still use dbt, you can employ this macro to manually aggregate the data across old and new columns.
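To illustrate the idea of matching on the shared major version prefix, here is a minimal Python sketch that groups old and new style column names together. It is not the dbt macro itself; the helper names are hypothetical, and it assumes old-style columns always end in `_<major>_<minor>_<patch>`.

```python
import re
from collections import defaultdict

# An old-style column ends in "_<major>_<minor>_<patch>"; capture everything
# up to and including the major version number.
_FULL_VERSION = re.compile(r"^(?P<major_prefix>.+_\d+)_\d+_\d+$")


def major_prefix(column: str) -> str:
    """Return the major version prefix shared by old and new style columns."""
    match = _FULL_VERSION.match(column)
    return match.group("major_prefix") if match else column


def group_by_major(columns):
    """Group column names that hold data for the same major schema version."""
    groups = defaultdict(list)
    for column in columns:
        groups[major_prefix(column)].append(column)
    return dict(groups)


columns = [
    "unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1_0_0",
    "unstruct_event_com_snowplowanalytics_snowplow_media_ad_click_event_1",
]
groups = group_by_major(columns)
```

Both columns end up under the single key ending in `_1`, which is the set of columns a query would need to combine, for example with `COALESCE`.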

Enable legacy mode for the old table format

To simplify migration to the new table format, it is possible to run the 2.x loader in legacy mode, so it loads self-describing events and entities using the old column names of the 1.x loader.

Option 1: In the configuration file, set legacyColumnMode to true. When this mode is enabled, the loader uses the legacy column style for all self-describing events and entities.

Option 2: In the configuration file, set legacyColumns to list specific schemas for which to use the legacy column style. This list is used when legacyColumnMode is false (the default).

For example:

json
"legacyColumns": [
  "iglu:com.example/legacy_a/jsonschema/1-0-0",
  "iglu:com.example/legacy_b/jsonschema/1-*-*"
]
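How the two options combine can be sketched as follows. This Python snippet only models the decision described above; the function names are hypothetical, and the wildcard matching (a `*` matches any value in that position of the version) is an assumption about how entries such as `1-*-*` are interpreted.

```python
def _split(uri: str):
    """Split "iglu:vendor/name/format/version" into its parts."""
    if not uri.startswith("iglu:"):
        raise ValueError(f"not an Iglu URI: {uri}")
    vendor, name, fmt, version = uri[len("iglu:"):].split("/")
    return vendor, name, fmt, version.split("-")


def matches(criterion: str, schema_uri: str) -> bool:
    """Check a schema URI against a legacyColumns entry (assumed semantics)."""
    c_vendor, c_name, c_fmt, c_version = _split(criterion)
    s_vendor, s_name, s_fmt, s_version = _split(schema_uri)
    if (c_vendor, c_name, c_fmt) != (s_vendor, s_name, s_fmt):
        return False
    # A "*" in the criterion accepts any value in that version position.
    return len(c_version) == len(s_version) and all(
        c == "*" or c == s for c, s in zip(c_version, s_version)
    )


def uses_legacy_column(schema_uri, legacy_column_mode, legacy_columns):
    """Decide whether a schema is loaded with the 1.x column naming style."""
    if legacy_column_mode:  # Option 1: legacy style for everything
        return True
    # Option 2: legacy style only for the listed schemas
    return any(matches(c, schema_uri) for c in legacy_columns)


legacy_columns = [
    "iglu:com.example/legacy_a/jsonschema/1-0-0",
    "iglu:com.example/legacy_b/jsonschema/1-*-*",
]
```

Under these assumptions, `legacy_b` at version 1-3-2 would use the legacy style, while `legacy_a` at version 1-0-1 would not, because only 1-0-0 is listed for it.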

Recovery columns

What is schema evolution?

One of Snowplow’s key features is the ability to define custom schemas and validate events against them. Over time, users often evolve the schemas, e.g. by adding new fields or changing existing fields. To accommodate these changes, BigQuery Loader 2.0.0 automatically adjusts the database tables in the warehouse accordingly.

There are two main types of schema changes:

Breaking: The schema version has to be changed in a major way (1-2-3 → 2-0-0). As of BigQuery Loader 2.0.0, each major schema version has its own column (..._1, ..._2, etc., for example: contexts_com_snowplowanalytics_ad_click_1).

Non-breaking: The schema version can be changed in a minor way (1-2-3 → 1-3-0 or 1-2-3 → 1-2-4). Data is stored in the same database column.

The loader tries to format the incoming data according to the latest version of the schema it saw (for a given major version, e.g. 1-*-*). For example, if a batch contains events with schema versions 1-0-0, 1-0-1 and 1-0-2, the loader derives the output schema based on version 1-0-2. Then the loader instructs BigQuery to adjust the database column and load the data.
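Picking the version that drives the output schema can be sketched in Python as below. This is an illustration of the selection step only, with hypothetical function names; it assumes "latest" means the highest version when the three parts are compared numerically.

```python
def parse_version(version: str):
    """Turn "1-0-2" into (1, 0, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("-"))


def latest_per_major(versions):
    """For each major version, return the highest version seen in the batch."""
    latest = {}
    for version in versions:
        parsed = parse_version(version)
        major = parsed[0]
        if major not in latest or parsed > latest[major]:
            latest[major] = parsed
    return {major: "-".join(str(p) for p in parsed)
            for major, parsed in latest.items()}


batch = ["1-0-0", "1-0-2", "1-0-1"]
chosen = latest_per_major(batch)
```

For the batch from the example above, `chosen` maps major version 1 to 1-0-2. Comparing parsed tuples rather than strings matters once a part reaches two digits, for example 1-10-0 versus 1-9-0.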

Recovering from invalid schema evolution

Let's consider these two schemas as an example of breaking schema evolution (changing the type of a field from integer to string) using the same major version (1-0-0 and 1-0-1):

json
{
   // 1-0-0
   "properties": {
      "a": {"type": "integer"}
   }
}

json
{
   // 1-0-1
   "properties": {
      "a": {"type": "string"}
   }
}

With BigQuery Loader 1.x, data for each version would go to its own column — no issue. With BigQuery Loader 2.x, there is only one column. But strings and integers can’t coexist!

To avoid crashing or losing data, BigQuery Loader 2.0.0 proceeds by creating a new column for the data with schema 1-0-1, e.g. contexts_com_snowplowanalytics_ad_click_1_0_1_recovered_9999999, where:

1_0_1 is the version of the offending schema;
9999999 is a hash code unique to the schema (i.e. it will change if the schema is overwritten with a different one).

If you create a new schema 1-0-2 that reverts the offending changes and is again compatible with 1-0-0, the data for events with that schema will be written to the original column as expected.
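The routing described in this section can be sketched in Python as follows. This is a simplified model, not the loader's implementation: `target_column` is a hypothetical function, a "conflict" is reduced to a field changing its declared type, and the hash is passed in as a placeholder rather than computed.

```python
def target_column(stem, column_types, version, properties, schema_hash):
    """Return the column that data for this schema version would land in.

    stem         -- column name without any version suffix
    column_types -- field types already present in the major-version column
                    (updated in place when the schema is compatible)
    schema_hash  -- placeholder for the hash code unique to the schema
    """
    major, minor, patch = version.split("-")
    incoming = {field: spec["type"] for field, spec in properties.items()}
    # Simplification: a conflict is any known field whose type differs.
    conflict = any(
        field in column_types and column_types[field] != field_type
        for field, field_type in incoming.items()
    )
    if conflict:
        return f"{stem}_{major}_{minor}_{patch}_recovered_{schema_hash}"
    column_types.update(incoming)
    return f"{stem}_{major}"


stem = "contexts_com_snowplowanalytics_ad_click"
types = {}
col_100 = target_column(stem, types, "1-0-0", {"a": {"type": "integer"}}, "9999999")
col_101 = target_column(stem, types, "1-0-1", {"a": {"type": "string"}}, "9999999")
col_102 = target_column(stem, types, "1-0-2", {"a": {"type": "integer"}}, "9999999")
```

In this model, 1-0-0 and 1-0-2 both resolve to the original `_1` column, while 1-0-1 resolves to the `_1_0_1_recovered_9999999` column.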

tip

You might find that some of your schemas were evolved incorrectly in the past, which results in the creation of these “recovery” columns after the upgrade. To address this for a given schema, create a new minor schema version that reverts the breaking changes introduced in previous versions. (Or, if you want to keep the breaking change, create a new major schema version.) You can set it to supersede the previous version(s), so that events are automatically validated against the new schema.

note

If events with incorrectly evolved schemas never arrive, the recovery column is never created.

You can read more about schema evolution and how recovery columns work here.