Skip to main content

How schema definitions translate to the warehouse

Self-describing events and entities use schemas to define which fields should be present, and of what type (e.g. string, number). This page explains what happens to this information in the warehouse.

Location

Where can you find the data carried by a self-describing event or an entity?

Each type of self-describing event and each type of entity get their own dedicated tables. The name of such a table is composed of the schema vendor, schema name and its major version (more on versioning later).

note

All characters are converted to lowercase and all symbols (like .) are replaced with an underscore.

Examples:

KindSchemaResulting table
Self-describing eventcom.example/button_press/jsonschema/1-0-0com_example_button_press_1
Entitycom.example/user/jsonschema/1-0-0com_example_user_1

Inside the table, there will be columns corresponding to the fields in the schema. Their types are determined according to the logic described below.

note

The name of each column is the name of the schema field converted to snake case.

caution

If an event or entity includes fields not defined in the schema, those fields will not be stored in the warehouse.

For example, suppose you have the following field in the schema:

"lastName": {
"type": "string",
"maxLength": 100
}

It will be translated into a column called last_name (notice the underscore), of type VARCHAR(100).

Versioning

What happens when you evolve your schema to a new version?

Because the table name for the self-describing event or entity includes the major schema version, each major version of a schema gets a new table:

SchemaResulting table
com.example/button_press/jsonschema/1-0-0com_example_button_press_1
com.example/button_press/jsonschema/1-2-0com_example_button_press_1
com.example/button_press/jsonschema/2-0-0com_example_button_press_2

When you evolve your schema within the same major version, (non-destructive) changes are applied to the existing table automatically. For example, if you change the maxLength of a string field, the limit of the VARCHAR column would be updated accordingly.

Breaking changes

If you make a breaking schema change (e.g. change a type of a field from a string to a number) without creating a new major schema version, the loader will not be able to modify the table to accommodate the new data.

In this case, upon receiving the first event with the offending schema, the loader will instead create a new table, with a name like com_example_button_press_1_0_1_recovered_9999999, where:

  • 1-0-1 is the version of the offending schema
  • 9999999 is a hash code unique to the schema (i.e. it will change if the schema is overwritten with a different one)

To resolve this situation:

  • Create a new schema version (e.g. 1-0-2) that reverts the offending changes and is again compatible with the original table. The data for events with that 1-0-2 schema will start going to the original table as expected.
  • You might also want to manually adapt the data in the ..._recovered_... table and copy it to the original one.

Note that this behavior was introduced in RDB Loader 6.0.0. In older versions, breaking changes will halt the loading process.

Nullability

Once the loader creates a column for a given schema version as NULLABLE or NOT NULL, it will never alter the nullability constraint for that column. For example, if a field is nullable in schema version 1-0-0 and not nullable in version 1-0-1, the column will remain nullable. (In this example, the Enrich application will still validate data according to the schema, accepting null values for 1-0-0 and rejecting them for 1-0-1.)

Types

How do schema types translate to the database types?

Nullability

All non-required schema fields translate to nullable columns.

Required fields translate to NOT NULL columns:

{
"properties": {
"myRequiredField": {"type": ...}
},
"required": [ "myRequiredField" ]
}

However, it is possible to define a required field where null values are allowed (the Enrich application will still validate that the field is present, even if it’s null):

"myRequiredField": {
"type": ["null", ...]
}

OR

"myRequiredField": {
"enum": ["null", ...]
}

In this case, the column will be nullable. It does not matter if "null" is in the beginning, middle or end of the list of types or enum values.

info

See also how versioning affects this.

Types themselves

note

The row order in this table is important. Type lookup stops after the first match is found scanning from top to bottom.

Json SchemaRedshift/Postgres Type
{
"enum": [E1, E2, ...]
}

The enum can contain more than one JavaScript type: string, number|integer, boolean. For the purposes of this number and integer are the same.

array, object, NaN and other types in the enum will be cast as fallback VARCHAR(4096).

If content size is longer than 4096 it would be truncated when inserted into the Redshift.

VARCHAR(M)

M is the maximum size of json.stringify(E*)

{
"type": ["boolean", "integer"]
}

OR

{
"type": ["integer", "boolean"]
}

VARCHAR(10)

{
"type": [T1, T2, ...]
}

VARCHAR(4096)

If content size is longer than 4096 it would be truncated when inserted into the Redshift.

{
"type": "string",
"format": "date-time"
}

TIMESTAMP

{
"type": "string",
"format": "date"
}

DATE

{
"type": "array"
}

VARCHAR(65535)

Content is stringified and quoted.

If content size is longer than 65535 it would be truncated when inserted into the Redshift.

{
"type": "integer",
"maximum": M
}
  • M 32767

SMALLINT

{
"type": "integer",
"maximum": M
}
  • 32767 < M 2147483647

INT

{
"type": "integer",
"maximum": M
}
  • M >2147483647

BIGINT

{
"type": "integer",
"enum": [E1, E2, ...]
}
  • Maximum E* 32767

SMALLINT

{
"type": "integer",
"enum": [E1, E2, ...]
}
  • 32767 < maximum E* 2147483647

INT

{
"type": "integer",
"enum": [E1, E2, ...]
}
  • Maximum E* > 2147483647

BIGINT

{
"type": "integer"
}

BIGINT

{
"multipleOf": B
}

INT

{
"type": "number",
"multipleOf": B
}
  • Only works for B=0.01

DECIMAL(36,2)

{
"type": "number"
}

DOUBLE

{
"type": "boolean"
}

BOOLEAN

{
"type": "string",
"minLength": M,
"maxLength": M
}
  • M is the same in minLength and maxLength

CHAR(M)

{
"type": "string",
"format": "uuid"
}

CHAR(36)

{
"type": "string",
"format": "ipv6"
}

VARCHAR(39)

{
"type": "string",
"format": "ipv4"
}

VARCHAR(15)

{
"type": "string",
"format": "email"
}

VARCHAR(255)

{
"type": "string",
"maxLength": M
}
  • enum is not defined

VARCHAR(M)

{
"enum": ["E1"]
}
  • E1 is the only element

CHAR(M)

M is the size of json.stringify("E1")

If nothing matches above, this is a catch-all.

VARCHAR(4096)

Values will be quoted as in JSON.

If content size is longer than 4096 it would be truncated when inserted into the Redshift.