Understanding failed events
“Failed events” is an umbrella term for events that the pipeline had a problem processing.
These problems can arise at various stages:
- Collection (e.g. invalid payload format)
- Validation (e.g. event does not match the schema)
- Enrichment (e.g. external API unavailable)
- Loading (very unlikely with modern versions of Snowplow)
Failed events are not written to your atomic events table, which contains only high-quality data. See below for how to deal with them.
Common failures
The two most common types of failed events are:
Validation failures. These happen when an event or an entity does not match its schema, usually because of incorrect tracking code. Validation can also fail if the schema is not available, e.g. because it was never added to the production schema registry before the tracking code went live.
Enrichment failures. These can happen when an API enrichment calls an external API that is down, or when the custom code in a JavaScript enrichment fails.
In many cases, you will be able to fix the underlying problem directly, e.g. by altering your tracking code, by providing the correct schema, or by changing your enrichment configuration.
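For example, a validation failure often comes down to a mismatch between what the tracking code sends and what the schema expects. The sketch below uses the Snowplow browser tracker; the schema URI, field, collector URL, and app ID are made up for illustration:

```typescript
import { newTracker, trackSelfDescribingEvent } from '@snowplow/browser-tracker';

// Hypothetical collector endpoint and app ID.
newTracker('sp', 'https://collector.example.com', { appId: 'my-app' });

// Suppose iglu:com.example/button_click/jsonschema/1-0-0 defines `label` as a string.
// Sending a number instead will fail validation in the pipeline,
// and the event will end up as a failed event rather than in the atomic table.
trackSelfDescribingEvent({
  event: {
    schema: 'iglu:com.example/button_click/jsonschema/1-0-0',
    data: {
      label: 42, // should be a string, e.g. 'checkout'
    },
  },
});
```

In this case the fix sits in the tracking code itself: send `label` as a string, or update the schema if numeric values are genuinely expected.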
Other failures
Other failures generally fall into three categories:
Bots or malicious activity. Bots, vulnerability scans, and so on can send completely invalid events to the Collector. The format might be wrong, or the payload might be extraordinarily large.
Pipeline misconfiguration. For example, a loader could be reading from the wrong stream (with events in the wrong format). This is quite rare, especially for Snowplow BDP, where all relevant pipeline configuration is automatic.
Temporary infrastructure issue. This is again rare. One example would be Iglu Server (schema registry) not being available.
Unlike the common failures above, you typically can’t address these upstream, for example by changing your tracking code or schemas.
Dealing with failed events
Snowplow BDP provides a dashboard and alerts for failed events. See Monitoring failed events.
For the common failures (validation and enrichment), you can configure continuous loading of any offending events into a separate table in your warehouse or lake. This way, you can easily inspect them and decide how they might be patched up (e.g. with SQL) and merged with the rest of your data.
This feature is not retroactive, i.e. only failed events that occur after it’s enabled will be loaded into your desired destination.
The events will include a special column with the details of the failure, and any invalid columns will be set to null. Otherwise, the format is the same as for your atomic events.
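As a rough sketch of what inspecting these events might look like, the example below queries a Snowflake warehouse using the snowflake-sdk Node.js driver. The table name (`failed_events`) and the failure details column are assumptions made for illustration; check your own setup for the actual names:

```typescript
import * as snowflake from 'snowflake-sdk';

// Connection details are placeholders.
const connection = snowflake.createConnection({
  account: 'my_account',
  username: 'my_user',
  password: 'my_password',
  database: 'SNOWPLOW',
  schema: 'ATOMIC',
});

connection.connect((err) => {
  if (err) throw err;

  connection.execute({
    // Assumed table and column names; the failure details column
    // describes why each event failed validation or enrichment.
    sqlText: `
      SELECT event_id, collector_tstamp, failure_details
      FROM failed_events
      WHERE collector_tstamp > DATEADD(day, -1, CURRENT_TIMESTAMP())
      LIMIT 100
    `,
    complete: (execErr, _stmt, rows) => {
      if (execErr) throw execErr;
      // Inspect the failure details to decide how the events can be patched up.
      rows?.forEach((row) => console.log(row));
    },
  });
});
```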
See Exploring failed events for more details and setup instructions.
Finally, on AWS and GCP all failed events are backed up in object storage (S3 and GCS respectively). Sometimes, but not in all cases (e.g. not if the original events exceeded size limits), it’s possible to recover them by replaying them through the pipeline. This is a complicated process mainly reserved for internal failures and outages. Refer to Recovering failed events.
You can find a full list of failed event types in the API reference section.