Schema Resolution
This page describes the Schema resolution algorithm, which is standard for all Iglu clients. Currently only the Iglu Scala client fully follows this algorithm; other clients may miss some parts, but we're working on making their behavior consistent.
1. Prerequisites
Before going further, it is important to understand basic Iglu client configuration and essential concepts such as Resolver, Registry (or Repository) and Schema. Here is a quick overview of these concepts; if you're familiar with them, you may want to skip this section.
Iglu clients are configured via a JSON object described by a dedicated Schema called resolver-config. Here we'll be using the JSON resolver configuration, which is platform-independent and the most widespread.
1.1 Resolver
Resolver is the primary object of the Iglu client library. It contains all the logic necessary to fetch a requested Schema from the appropriate registry (repository) and cache it properly. A Resolver has two main properties: cache size (cacheSize) and a list of registries (repositories).
1.2 Registries
NOTE: the term repository is deprecated. Registry is the default term for Schema storage. We have not yet renamed all occurrences, so for now the two terms can be used interchangeably.
Each registry in the resolver configuration has several values common to all registry types, such as name, vendorPrefixes and priority. Each registry also has a type, which is defined inside the connection property. The only important thing about the registry type is that each type has its own priority hardcoded inside the client library. Below we'll refer to this hardcoded priority as classPriority and to the user-defined priority as instancePriority.
Usually, the "safer" a registry is, the higher its classPriority, so local repositories are preferred over remote ones.
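To make the properties above concrete, here is a minimal sketch of a resolver configuration, written as a Python dictionary and serialized to JSON. The registry names, path and URI are made up for illustration, and the shape of the self-describing envelope and of the connection objects (embedded with a path, http with a uri) is an assumption rather than a definitive reference. Check the resolver-config Schema itself before relying on it.

```python
import json

# Illustrative resolver configuration; names, path and URI are hypothetical.
resolver_config = {
    # Assumed URI of the resolver-config Schema describing this object.
    "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
    "data": {
        "cacheSize": 500,
        "repositories": [
            {
                "name": "Embedded registry",
                "priority": 0,  # the user-defined instancePriority
                "vendorPrefixes": ["com.acme"],
                # The registry type is the key inside `connection`.
                "connection": {"embedded": {"path": "/iglu-schemas"}},
            },
            {
                "name": "Remote registry",
                "priority": 1,
                "vendorPrefixes": ["com.acme", "org.example"],
                "connection": {"http": {"uri": "http://registry.example.com"}},
            },
        ],
    },
}


def registry_type(registry):
    """Return the registry type, i.e. the single key inside `connection`."""
    (kind,) = registry["connection"].keys()
    return kind


config_json = json.dumps(resolver_config, indent=2)
```

Note that classPriority does not appear anywhere in the configuration: it is derived by the client from the registry type.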
1.3 Cache
All Iglu clients use an internal cache to store registry responses. Thanks to this, it is safe to launch Hadoop/Spark jobs with an embedded Iglu client, as it will not generate an enormous number of IO calls.
1.3.1 Cache algorithm
The cache stores not just plain Schemas, but information about the responses from each registry. This allows us to make different decisions depending on what exactly went wrong with a particular request. Once a Schema has been successfully fetched, it is stored until it gets evicted by the LRU cache algorithm. This eviction in turn happens only if the cache map has reached its limit (defined in cacheSize) and the particular Schema has gone unrequested for longer than all the others.
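The eviction rule can be sketched as follows. This is an illustrative LRU cache, not the client's actual implementation; the class name LruCache and its methods are hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Hypothetical LRU cache keyed by Schema key."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self._entries = OrderedDict()  # least recently used entry comes first

    def get(self, schema_key):
        if schema_key not in self._entries:
            return None
        # A lookup counts as a request, so the entry becomes most recent.
        self._entries.move_to_end(schema_key)
        return self._entries[schema_key]

    def put(self, schema_key, response):
        self._entries[schema_key] = response
        self._entries.move_to_end(schema_key)
        # Evict only when the limit is exceeded, oldest request first.
        if len(self._entries) > self.cache_size:
            self._entries.popitem(last=False)


cache = LruCache(cache_size=2)
cache.put("iglu:com.acme/a/jsonschema/1-0-0", {"type": "object"})
cache.put("iglu:com.acme/b/jsonschema/1-0-0", {"type": "object"})
cache.get("iglu:com.acme/a/jsonschema/1-0-0")  # refreshes schema "a"
cache.put("iglu:com.acme/c/jsonschema/1-0-0", {"type": "object"})  # evicts "b"
```

With a cacheSize of 2, the third insertion evicts schema "b", because "a" was requested more recently.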
1.3.2 Cache TTL
Since version 0.5.0, the Iglu Scala Client supports the cacheTtl property. It is especially useful for real-time pipelines, which can otherwise keep a cached "failure" for a very long time; the TTL is a mechanism to ensure that a day's worth of data won't go to the bad stream. Note, however, that the client also tries to re-resolve successfully fetched schemas. This allows operators to patch (re-upload) schemas without bringing the pipeline down (although this is not recommended).
cacheTtl is available since version 1-0-2 of the resolver config.
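The effect of a TTL can be illustrated with a small sketch. It assumes cacheTtl is expressed in seconds and that an expired entry is simply treated as absent, so the next lookup goes back to the registries. The class TtlCache and the injectable clock are hypothetical and exist only to make the behavior easy to follow.

```python
import time


class TtlCache:
    """Hypothetical cache whose entries (successes and failures) expire."""

    def __init__(self, cache_ttl, clock=time.monotonic):
        self.cache_ttl = cache_ttl  # assumed to be seconds; None disables expiry
        self._clock = clock
        self._entries = {}  # schema key -> (response, time stored)

    def put(self, schema_key, response):
        self._entries[schema_key] = (response, self._clock())

    def get(self, schema_key):
        entry = self._entries.get(schema_key)
        if entry is None:
            return None
        response, stored_at = entry
        expired = (
            self.cache_ttl is not None
            and self._clock() - stored_at >= self.cache_ttl
        )
        if expired:
            # Drop the stale entry so the caller re-resolves the Schema.
            del self._entries[schema_key]
            return None
        return response
```

Because both failures and successes expire, a Schema uploaded after an initial failed lookup is picked up once the TTL elapses, and a patched Schema eventually replaces the cached copy.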
2. Lookup algorithm
Overall, the Schema resolution algorithm can be described by the following flowchart:
A few important things to note:
- If a registry responded with a "NotFound" error, a "missing" value will be cached and this registry won't be queried again until the "missing" value is evicted by the LRU algorithm.
- If a registry responded with an error other than "NotFound", for example "TimeoutError", "NetworkError", "ServerFault" etc., a "needToRetry" value will be cached and the Resolver will give this registry 3 more chances. After three failed lookups, a "missing" value will be cached.
- These "missing" and "needToRetry" values in the cache are per-registry, not per-schema, which means that if registryA responded "NotFound" for Schema iglu:com.acme/event/jsonschema/1-0-0 and registryB responded with TimeoutError, the resolver will immediately abandon registryA and keep trying to query registryB 3 more times.
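The failure handling described in these notes can be sketched as a small state tracker. This is one reading of the rules, not the client's actual code: it assumes that the first non-"NotFound" error grants three further attempts and that the registry is marked "missing" once those three also fail. The class LookupState and its methods are hypothetical, and the sketch tracks the outcomes for a single Schema key.

```python
NOT_FOUND = "NotFound"
MAX_RETRIES = 3  # assumed meaning of "3 more chances"


class LookupState:
    """Hypothetical cached outcomes, per registry, for one Schema key."""

    def __init__(self):
        self.status = {}        # registry name -> "missing" or "needToRetry"
        self.retries_left = {}  # registry name -> remaining retry attempts

    def record_failure(self, registry, error):
        if self.status.get(registry) == "missing":
            return  # already abandoned
        if error == NOT_FOUND:
            self.status[registry] = "missing"
        elif registry not in self.retries_left:
            # First transient error: remember it and allow further attempts.
            self.status[registry] = "needToRetry"
            self.retries_left[registry] = MAX_RETRIES
        else:
            self.retries_left[registry] -= 1
            if self.retries_left[registry] == 0:
                self.status[registry] = "missing"

    def should_query(self, registry):
        return self.status.get(registry) != "missing"


state = LookupState()
state.record_failure("registryA", "NotFound")
state.record_failure("registryB", "TimeoutError")
```

After these two calls registryA is abandoned immediately, while registryB stays eligible until three more lookups fail.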
3. Registry priority
For each particular Schema lookup, registries are prioritized. In other words, they are sorted according to the following input parameters (ordered by significance):
- vendorPrefix - the Resolver will always look first into those registries whose vendorPrefixes match the SchemaKey's vendor. This does not mean registries with an unmatched vendorPrefix will be skipped; it means they will be queried last.
- classPriority - a value hardcoded in the client library for each type of registry. However high a priority (low integer value) is set in the configuration for a particular registry, it will be overridden by classPriority, so an embedded repository will always be checked before an HTTP one (unless the order is influenced by vendorPrefix).
- instancePriority - a user-defined value. It influences only repositories within the same classPriority.
One important thing to note is that both priorities (classPriority and instancePriority) order registries in ascending order. That is, a lower number means a higher priority. Think of it as an ascending list of numbers, [1,2,3,4]: the smaller number always comes first.
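The ordering can be expressed as a sort on a three-part key. The sketch below is illustrative: the Registry class, the concrete classPriority numbers and the use of a string prefix test for vendor matching are all assumptions, not values taken from the client library.

```python
from dataclasses import dataclass


@dataclass
class Registry:
    name: str
    vendor_prefixes: list
    class_priority: int     # hardcoded per registry type (values here are made up)
    instance_priority: int  # the user-defined `priority` from the configuration


def prioritize(registries, vendor):
    """Sort registries for a lookup of a Schema with the given vendor."""

    def sort_key(registry):
        # Assumption: a prefix "matches" when the vendor starts with it.
        matches = any(vendor.startswith(p) for p in registry.vendor_prefixes)
        # Matching registries first, then ascending class and instance priority.
        return (0 if matches else 1,
                registry.class_priority,
                registry.instance_priority)

    return sorted(registries, key=sort_key)


registries = [
    Registry("remote-other", ["org.example"], 100, 1),
    Registry("remote-acme-2", ["com.acme"], 100, 1),
    Registry("embedded", ["org.example"], 0, 5),
    Registry("remote-acme", ["com.acme"], 100, 0),
]
ordered = [r.name for r in prioritize(registries, "com.acme")]
```

For the vendor com.acme, both remote registries with a matching prefix are queried before the embedded one, which shows vendorPrefix outranking classPriority. Among the non-matching registries, the embedded one still comes first.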