Versioning, patching, and marking schemas as superseded
Every schema has a version, starting from 1-0-0. As your schema evolves, you will need to create new versions of it. All previous versions of a schema remain available to ensure backward compatibility.
Versioning has an important role in telling Snowplow Loaders how to handle the changes when loading into your data warehouse. For example, there may be a need to create new columns, update columns, or even create new tables.
It's important to understand when your change is breaking, and version correctly.
There are two kinds of schema changes:
- Non-breaking - a non-breaking change is backward compatible with historical data and increments the patch number, i.e. 1-0-0 -> 1-0-1, or the middle digit, i.e. 1-0-0 -> 1-1-0.
- Breaking - a breaking change is not backward compatible with historical data and increments the model number, i.e. 1-0-0 -> 2-0-0.
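As a rough illustration of how the version digits relate to the two kinds of change, here is a minimal Python sketch. The function names `parse_version` and `classify_bump` are made up for this example and are not part of any Snowplow tooling.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a version such as '1-0-2' into (model, middle, patch) integers."""
    model, middle, patch = (int(part) for part in version.split("-"))
    return model, middle, patch


def classify_bump(old: str, new: str) -> str:
    """Classify a version increment by which digit changed.

    A higher model number signals a breaking change; a higher middle digit
    or patch number (with the same model) signals a non-breaking change.
    """
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not a later version than {old}")
    return "breaking" if new_v[0] > old_v[0] else "non-breaking"


print(classify_bump("1-0-0", "1-0-1"))
print(classify_bump("1-0-0", "2-0-0"))
```

Comparing the parsed tuples rather than the raw strings avoids ordering mistakes such as treating 1-0-10 as earlier than 1-0-9.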
Different data warehouses handle schema evolution slightly differently. Use the table below as a guide for incrementing the schema version appropriately.
| | Redshift | Snowflake, BigQuery, Databricks |
|---|---|---|
| Add / remove / rename an optional field | Non-breaking | Non-breaking |
| Add / remove / rename a required field | Breaking | Breaking |
| Change a field from optional to required | Breaking | Breaking |
| Change a field from required to optional | Breaking | Non-breaking |
| Change the type of an existing field | Breaking | Breaking |
| Change the size of an existing field | Non-breaking | Non-breaking |
In Redshift and Databricks, changing the size of a field may also change its type, for example when the maximum value of an integer field is raised from 30000 to 100000. See our documentation on how schemas translate to database types.
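If you want to consult the table above from a script (for example in a schema review checklist), it can be encoded as a simple lookup. This is only a sketch: the change labels and the `is_breaking` helper are hypothetical names chosen for this example.

```python
# Each value is (Redshift, Snowflake/BigQuery/Databricks); True means "breaking".
# The keys are illustrative labels for the rows of the table.
CHANGE_RULES = {
    "optional_field_added_removed_renamed": (False, False),
    "required_field_added_removed_renamed": (True, True),
    "optional_to_required": (True, True),
    "required_to_optional": (True, False),
    "type_changed": (True, True),
    # In Redshift and Databricks a size change can also imply a type change,
    # in which case "type_changed" is the row that applies.
    "size_changed": (False, False),
}

OTHER_WAREHOUSES = {"snowflake", "bigquery", "databricks"}


def is_breaking(change: str, warehouse: str) -> bool:
    """Look up whether a kind of change is breaking for the given warehouse."""
    redshift, others = CHANGE_RULES[change]
    name = warehouse.lower()
    if name == "redshift":
        return redshift
    if name in OTHER_WAREHOUSES:
        return others
    raise ValueError(f"unknown warehouse: {warehouse}")


print(is_breaking("required_to_optional", "Redshift"))
print(is_breaking("required_to_optional", "Snowflake"))
```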
Changing a schema without incrementing the version has several risks:
- Events that were previously valid could become invalid against the new changes
- Your warehouse Loader, which updates the table according to the schema, could get stuck if it’s not possible to cast the data in the existing table column to the new definition (e.g. if you change a field type from a string to a number)
- Similarly, data models or other applications consuming the data downstream might not be able to deal with the changes
Aim to treat each schema version as immutable. However, if you can't avoid making changes to an existing schema version, you have two options: patching the schema, or marking it as superseded.
Patch a schema
Patching is available if your schema version isn't yet in production.
For Snowplow CDI customers, patching is disabled for production pipelines.
Patching in a production environment can break your loading, especially if your patch contains breaking changes.
Also, never patch a schema version that exists in a production environment, even if you are doing the patching in a development environment. This will lead to problems later when you try to promote that schema to production.
If you are working on a new schema version in a development environment, patching is safer as it won't corrupt any production data.
Follow these instructions to patch a schema.
Mark a schema as superseded
If your events are failing in production because of an incorrect schema, you might not be able to instantly update the tracking code to use a new schema version. This is a common situation for mobile tracking, for example. You can resolve this by marking the old schema version as superseded by the new schema version.
You need to be on Enrich 3.8.0+ and Iglu Server 0.11.0+ to use this feature. Additionally, if you are using Snowplow Mini or Snowplow Micro, you will need version 0.17.0+ or 1.7.1+ respectively.
Here's how this works, at a glance:
- Suppose schema 1-0-2 is wrong.
- Draft a new schema version correcting the issue.
- In the new schema, add the following field at the root: "$supersedes": ["1-0-2"].
- Set the version of the new schema as usual, i.e. 1-0-3 if there are no breaking changes or 2-0-0 if there are.
- Add the new schema to your production environment.
- Events or entities that use schema 1-0-2 will now be automatically updated (in the Enrich application) to use version 1-0-3, and will be validated against that version. (A special entity will be added to these events to record this fact.)
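The drafting steps above can be sketched in Python as follows. This is an illustration only: `draft_superseding_schema` is a hypothetical helper, and it assumes the fix consists of adding properties to the old schema.

```python
import copy


def draft_superseding_schema(old_schema: dict, new_version: str,
                             extra_properties: dict) -> dict:
    """Copy old_schema, apply the fix, and mark the copy as superseding it."""
    new_schema = copy.deepcopy(old_schema)  # leave the old version untouched
    new_schema["$supersedes"] = [old_schema["self"]["version"]]
    new_schema["self"]["version"] = new_version
    new_schema["properties"].update(extra_properties)
    return new_schema


old = {
    "self": {"vendor": "com.acme", "name": "geolocation",
             "format": "jsonschema", "version": "1-0-2"},
    "type": "object",
    "properties": {"latitude": {"type": "number"},
                   "longitude": {"type": "number"}},
    "additionalProperties": False,
}
new = draft_superseding_schema(old, "1-0-3", {"altitude": {"type": "number"}})
print(new["$supersedes"], new["self"]["version"])
```

The deep copy matters because the old schema version should stay exactly as it was published.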
Example
Let’s say we have a mobile application. We are sending certain events from this application, and these events contain entities with the following schema:
Geolocation 1-0-2
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for client geolocation contexts",
  "self": {
    "vendor": "com.acme",
    "name": "geolocation",
    "format": "jsonschema",
    "version": "1-0-2"
  },
  "type": "object",
  "properties": {
    "latitude": {
      "type": "number"
    },
    "longitude": {
      "type": "number"
    }
  },
  "additionalProperties": false
}
Later, we realize that when implementing tracking, we have mistakenly included an altitude field in the entity objects:
Wrong tracking code (iOS)
let event = ScreenView(name: "Screen")
event.entities.add(
    SelfDescribingJson(schema: "iglu:com.acme/geolocation/jsonschema/1-0-2",
        andDictionary: [
            "latitude": 38.7223,
            "longitude": 9.1393,
            "altitude": 20 // extra field not defined in the schema
        ])!)
tracker.track(event)
Since additionalProperties is set to false, all events with the altitude field end up as failed events.
We can create a new schema with version 1-0-3 that contains the altitude field and use it in the next version of the application, which would make the new events valid. However, users will not all update the application at once. Events from older versions will keep arriving, so there will still be failed events until every user has moved to a newer version.
To solve this problem, we simply add the $supersedes definition to the new schema.
Geolocation 1-0-3 with $supersedes
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "$supersedes": ["1-0-2"],
  "description": "Schema for client geolocation contexts",
  "self": {
    "vendor": "com.acme",
    "name": "geolocation",
    "format": "jsonschema",
    "version": "1-0-3"
  },
  "type": "object",
  "properties": {
    "latitude": {
      "type": "number"
    },
    "longitude": {
      "type": "number"
    },
    "altitude": {
      "type": "number"
    }
  },
  "additionalProperties": false
}
Now, when we receive events from the mobile application that use schema 1-0-2, these events will be updated to use schema 1-0-3 and will be validated against that schema. Therefore, these events will be valid.
To record this fact, an extra entity will be added to all such events:
{
"schema": "iglu:com.snowplowanalytics.iglu/validation_info/jsonschema/1-0-0",
"data": {
"originalSchema": "iglu:com.acme/geolocation/jsonschema/1-0-2",
"validatedWith": "1-0-3"
}
}
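To make the flow concrete, the sketch below simulates it in Python. It is not how Enrich is implemented: `validate` is a deliberately tiny stand-in for JSON Schema validation (it only understands `"type": "number"` and `"additionalProperties": false`), and `process_entity`, `schemas`, and `superseded_by` are hypothetical names used only here.

```python
def validate(data: dict, schema: dict) -> bool:
    """Minimal stand-in for JSON Schema validation (numbers and extra fields only)."""
    props = schema.get("properties", {})
    for key, value in data.items():
        if key not in props:
            if schema.get("additionalProperties") is False:
                return False
            continue
        if props[key].get("type") == "number":
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
    return True


def process_entity(entity: dict, schemas: dict, superseded_by: dict):
    """Validate an entity against the superseding version, if there is one.

    Returns (is_valid, validation_info); validation_info is None when the
    entity's schema version has not been superseded.
    """
    _, original_version = entity["schema"].rsplit("/", 1)
    version = superseded_by.get(original_version, original_version)
    is_valid = validate(entity["data"], schemas[version])
    info = None
    if version != original_version:
        info = {
            "schema": "iglu:com.snowplowanalytics.iglu/validation_info/jsonschema/1-0-0",
            "data": {"originalSchema": entity["schema"], "validatedWith": version},
        }
    return is_valid, info


number = {"type": "number"}
schemas = {
    "1-0-2": {"properties": {"latitude": number, "longitude": number},
              "additionalProperties": False},
    "1-0-3": {"properties": {"latitude": number, "longitude": number,
                             "altitude": number},
              "additionalProperties": False},
}
entity = {
    "schema": "iglu:com.acme/geolocation/jsonschema/1-0-2",
    "data": {"latitude": 38.7223, "longitude": 9.1393, "altitude": 20},
}

before = process_entity(entity, schemas, superseded_by={})
after = process_entity(entity, schemas, superseded_by={"1-0-2": "1-0-3"})
print(before[0], after[0])
```

With an empty `superseded_by` mapping the entity fails because of the extra altitude field; once 1-0-2 maps to 1-0-3, the same entity passes and the validation info is produced.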
Finally, if we browse schema version 1-0-2, we will see that Iglu Server automatically keeps track of which schema supersedes which. Specifically, it will now contain a $supersededBy definition:
Geolocation 1-0-2 with $supersededBy
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "$supersededBy": "1-0-3",
  "description": "Schema for client geolocation contexts",
  "self": {
    "vendor": "com.acme",
    "name": "geolocation",
    "format": "jsonschema",
    "version": "1-0-2"
  },
  "type": "object",
  "properties": {
    "latitude": {
      "type": "number"
    },
    "longitude": {
      "type": "number"
    }
  },
  "additionalProperties": false
}
Usage
The $supersedes field states that the schema version defined in the self part supersedes the schema versions listed in the $supersedes field (one or more). Its value must be an array of strings (even if it only includes one item). For example:
...
"$supersedes": ["1-0-2", "1-0-3"],
...
Once you’ve defined the $supersedes field for a schema version, you can’t update it — even in the development environment where patching is allowed. However, you can change which schema version supersedes which by creating new schema versions.
For example, if version 1-0-2 is defined to supersede version 1-0-1, and you create version 1-0-3 which also supersedes 1-0-1, then 1-0-1 will be superseded by the newest version, i.e. 1-0-3. See the rules below for more information on how this is determined.
Note that various pipeline components, most importantly Enrich (including Enrich embedded in Snowplow Mini and Snowplow Micro), cache schemas to improve performance. The default caching time is 10 minutes (it’s controlled by the Iglu Resolver configuration). This means that the effect of superseding a schema will not be immediate.
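The delay can be pictured with a small time-to-live cache. This is a generic sketch, not Enrich's actual cache: `SchemaCache`, `fetch`, and `clock` are hypothetical names, and the 600-second value simply mirrors the 10-minute default mentioned above.

```python
import time


class SchemaCache:
    """Cache looked-up schemas and only refetch them after ttl_seconds."""

    def __init__(self, fetch, ttl_seconds: float = 600.0, clock=time.monotonic):
        self._fetch = fetch        # function that loads a schema by key
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the example can fake time
        self._entries = {}         # key -> (time fetched, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # still fresh: registry changes are not seen
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value


fake_now = [0.0]
registry = {"1-0-2": "original definition"}
cache = SchemaCache(fetch=lambda key: registry[key], ttl_seconds=600,
                    clock=lambda: fake_now[0])

first = cache.get("1-0-2")
registry["1-0-2"] = "definition with $supersededBy"  # schema gets superseded
fake_now[0] = 300.0
during_ttl = cache.get("1-0-2")   # cached copy, change not visible yet
fake_now[0] = 601.0
after_ttl = cache.get("1-0-2")    # TTL expired, change is picked up
print(first, during_ttl, after_ttl, sep=" | ")
```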
Rules
A schema version can only supersede previous versions
For example, 1-0-2 can supersede 1-0-1, but can’t supersede 1-0-3, 1-1-0, or 2-0-0. Iglu Server will reject a schema with a definition that breaks this rule.
| ✅ OK | ❌ Invalid |
|---|---|
| 1-0-2 supersedes 1-0-1 | 1-0-2 supersedes 1-0-3, 1-1-0, or 2-0-0 |
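Before uploading a schema, you could check this rule (and the requirement that `$supersedes` is an array of strings) locally. The sketch below is an assumption-laden illustration: `check_supersedes` is a hypothetical helper and does not reproduce Iglu Server's own validation or error messages.

```python
import re

VERSION_PATTERN = re.compile(r"^\d+-\d+-\d+$")


def version_key(version: str) -> tuple:
    """Turn '1-0-2' into (1, 0, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("-"))


def check_supersedes(schema: dict) -> list:
    """Return a list of problems with the $supersedes field (empty if none)."""
    own = schema["self"]["version"]
    supersedes = schema.get("$supersedes", [])
    if not isinstance(supersedes, list):
        return ["$supersedes must be an array of strings"]
    problems = []
    for item in supersedes:
        if not isinstance(item, str) or not VERSION_PATTERN.match(item):
            problems.append(f"{item!r} is not a version string")
        elif version_key(item) >= version_key(own):
            problems.append(f"{own} cannot supersede {item}: not a previous version")
    return problems


ok = {"self": {"version": "1-0-2"}, "$supersedes": ["1-0-1"]}
bad = {"self": {"version": "1-0-2"}, "$supersedes": ["1-0-3", "1-1-0", "2-0-0"]}
print(check_supersedes(ok))
print(check_supersedes(bad))
```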
A schema version can supersede multiple previous versions at once
Events referencing any of those previous versions will be treated as explained above.
| ✅ OK |
|---|
| 1-0-4 supersedes 1-0-2 and 1-0-3 |
At any given moment, a schema version can only be superseded by a single schema version
Iglu Server automatically upholds this rule.
For example, if you specify that 1-0-3 supersedes 1-0-2 and (later) that 1-0-4 also supersedes 1-0-2, the latest schema — 1-0-4 — will automatically become the one that supersedes 1-0-2.
| Specified | Becomes |
|---|---|
| 1-0-3 supersedes 1-0-2; 1-0-4 supersedes 1-0-2 | 1-0-4 supersedes 1-0-2 |
The same happens if you specify “chains”, e.g. 1-0-3 supersedes 1-0-2 and 1-0-4 supersedes 1-0-3. This will be automatically updated so that 1-0-4 supersedes 1-0-2 and 1-0-3.
| Specified | Becomes |
|---|---|
| 1-0-3 supersedes 1-0-2; 1-0-4 supersedes 1-0-3 | 1-0-4 supersedes 1-0-2 and 1-0-3 |
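One way to reproduce the two outcomes described in this section is sketched below in Python. It is an interpretation of those two examples only: `resolve_superseded_by` is a hypothetical function, and Iglu Server's actual logic may behave differently in cases the examples do not cover.

```python
def version_key(version: str) -> tuple:
    """Turn '1-0-2' into (1, 0, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("-"))


def resolve_superseded_by(specified: dict) -> dict:
    """Map each superseded version to the single version that supersedes it.

    `specified` maps a schema version to the list in its $supersedes field.
    """
    # Step 1: if several versions supersede the same one, keep the newest.
    direct = {}
    for new, olds in specified.items():
        for old in olds:
            if version_key(new) <= version_key(old):
                raise ValueError(f"{new} cannot supersede {old}")
            if old not in direct or version_key(new) > version_key(direct[old]):
                direct[old] = new
    # Step 2: follow chains to their end. This terminates because every
    # step moves to a strictly later version.
    resolved = {}
    for old, target in direct.items():
        while target in direct:
            target = direct[target]
        resolved[old] = target
    return resolved


same_target = resolve_superseded_by({"1-0-3": ["1-0-2"], "1-0-4": ["1-0-2"]})
chain = resolve_superseded_by({"1-0-3": ["1-0-2"], "1-0-4": ["1-0-3"]})
print(same_target)
print(chain)
```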