
Monitor failed events

Failed events are events the pipeline had some problem processing. Snowplow pipelines separate events that are problematic in order to keep data quality high in downstream systems.

We provide tooling for monitoring, loading, and recovering failed events.

Snowplow offers two different ways to monitor failed events in Snowplow Console:

  • The data quality dashboard that surfaces failed events directly from your warehouse in a secure manner, making debugging easier
  • The default view

You can also use the Console API to monitor failed events.

Monitoring dashboard comparison

The data quality dashboard and the default view are both useful in monitoring for failed events. Which one to choose depends on your requirements. This table summarises the differences:

| Feature | Data quality dashboard | Default view |
| --- | --- | --- |
| Requirements | Deployed Failed Events Loader | No additional infrastructure needed |
| Detail provided | Complete failed events JSON | Aggregated data with redacted error messages |
| PII security | Data never leaves your infrastructure | Redacted data flows through Snowplow systems |
| Warehouse support | Snowflake and BigQuery only | All warehouses |

Data quality dashboard

The data quality dashboard allows you to see failed events directly from your warehouse. Your browser connects to an API running within your infrastructure, so no failed event information flows through Snowplow. This architecture is worth understanding, as it highlights the trade-offs between the two ways of monitoring failed events. The API is a simple proxy that connects to your warehouse and serves failed events to your browser. The connection is fully secure: it uses an encrypted channel (HTTPS) and authenticates and authorizes via the same mechanism used by Console.

To deploy and use the data quality dashboard, you need to sink failed events to the warehouse via a Failed Events Loader. The data quality dashboard currently supports Snowflake and BigQuery connections.

Architecture diagram showing user-generated events flowing through the pipeline to a warehouse, with the Data Quality API connecting directly to the warehouse, and the Console UI in the browser sending failed event queries to the Data Quality API while authenticating via the Console API.

In this setup, you expose an additional interface (Data Quality API) to the public internet, and all failed events information is served via that interface.

The user experience of the data quality dashboard is similar to the default view, but offers substantially more information to support resolution of tracking issues. The overview page looks as follows:

Data quality dashboard overview showing 15.29k total data volume (10.55k valid events, 4.73k failed events), a stacked bar chart of failed event volumes by day, and a table listing ValidationError issues by data structure, app ID, app version, first and last seen dates, volume, and trend.

As in the default view, you can change the time horizon from the 7-day default to the last hour, last day, or last 30 days. You get a corresponding overview of event volumes, both successful and failed, for the whole time period, as well as a split per day/hour/minute in the bar chart below.
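
The relationship between the selected time horizon and the bar-chart bucket size can be sketched as follows. The exact mapping is an assumption based on the day/hour/minute split described above, not documented behavior:

```python
# Hypothetical sketch: map the dashboard's time-horizon selection to the
# bucket granularity used in the failed-events bar chart.
# The exact mapping is an assumption, not the documented behavior.
HORIZON_BUCKETS = {
    "last hour": "minute",
    "last day": "hour",
    "last 7 days": "day",   # the default horizon
    "last 30 days": "day",
}

def bucket_for(horizon: str) -> str:
    """Return the chart bucket size for a given time horizon."""
    return HORIZON_BUCKETS[horizon.lower()]

print(bucket_for("Last hour"))    # minute
print(bucket_for("Last 7 days"))  # day
```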

Complementing the graphical overview, a table lists all errors (currently, only validation and resolution errors are supported). For each type of failed event, you can quickly see a short description of the root cause, the offending data structure, first and last seen timestamps, and the total volume.

Clicking on a particular error type will take you to a detailed view:

Issue details page for a ValidationError showing the error message "3 classes (array, null, string) is required", the offending data structure and app IDs, and a sample data table with columns including COLLECTOR_TSTAMP and CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_FAILURE_1 containing JSON failure context.

The detailed view also shows a description of the root cause and the application version (web, mobile). It provides a sample of the failed events in their entirety, as found in your warehouse.

Some columns are too wide to fit in the table: click on them to see the full pretty-printed, syntax-highlighted content. The most useful column to explore is probably CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_FAILURE_1, which contains the actual error information encoded as a JSON object:

Value details modal displaying the full content of a CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_FAILURE_1 column as syntax-highlighted JSON, showing schema version 1-0-0, componentName snowplow-enrich-kinesis, and lookupHistory errors with NotFound responses from Iglu Central.
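
As a sketch of how you might inspect this column programmatically, the snippet below parses an illustrative failure-context payload. The sample document and field names (such as `componentName` and `lookupHistory`) are loosely modeled on the screenshot above; treat the exact schema layout as an assumption:

```python
import json

# Illustrative failure-context payload; the exact field layout of the
# failure schema is an assumption based on the screenshot above.
raw = """
{
  "schema": "iglu:com.snowplowanalytics.snowplow/failure/jsonschema/1-0-0",
  "data": {
    "componentName": "snowplow-enrich-kinesis",
    "failureType": "ResolutionError",
    "errors": [
      {
        "lookupHistory": [
          {"repository": "Iglu Central", "errors": [{"error": "NotFound"}]}
        ]
      }
    ]
  }
}
"""

payload = json.loads(raw)
data = payload["data"]
print(data["componentName"])   # which pipeline component produced the failure
print(data["failureType"])     # the class of failure
for err in data["errors"]:
    for lookup in err.get("lookupHistory", []):
        # Each lookup records which Iglu repository was tried and why it failed
        print(lookup["repository"], lookup["errors"][0]["error"])
```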

Finally, you can click on the View SQL query button to see the SQL query that was used to fetch the failed events from your warehouse:

View SQL query modal showing the SELECT statement used to fetch failed events from atomic_failed.events, including lateral flatten operations on the failure context column and a WHERE clause filtering by error hash and load timestamp range.
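
The general shape of such a query can be sketched as a parameterized string. The table name `atomic_failed.events` and the `error_hash`/`load_tstamp` filters follow the modal described above, but treat the exact names as assumptions for your deployment:

```python
def failed_events_query(error_hash: str, start_ts: str, end_ts: str) -> str:
    """Build a query resembling the one shown in the modal (illustrative only).

    In real code, use bound query parameters rather than string
    interpolation to avoid SQL injection.
    """
    return (
        "SELECT *\n"
        "FROM atomic_failed.events\n"          # table name is an assumption
        f"WHERE error_hash = '{error_hash}'\n"  # column names are assumptions
        f"  AND load_tstamp BETWEEN '{start_ts}' AND '{end_ts}'"
    )

print(failed_events_query("abc123", "2024-05-01", "2024-05-08"))
```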

Missing warehouse permissions

When deploying a loader with the data quality add-on (API), you may encounter permission errors that prevent the dashboard from querying your warehouse.

If your service account lacks the required permission to create BigQuery jobs, you may receive the following error:

  • Error code: 21xxx
  • Description: Missing permission 'bigquery.jobs.create' on BigQuery...

Check if your service account has the required role:

```bash
gcloud projects get-iam-policy <PROJECT_ID> \
  --flatten="bindings[].members" \
  --filter="bindings.members:<SERVICE_ACCOUNT_EMAIL>" \
  --format="table(bindings.role)"
```

To fix the error, grant the required role to your service account (recommended):

```bash
gcloud projects add-iam-policy-binding <PROJECT_ID> \
  --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
  --role="roles/bigquery.jobUser"
```

Alternatively, if you need more granular control, create a custom role with only the bigquery.jobs.create permission:

```bash
gcloud iam roles create customBigQueryJobCreator \
  --project=<PROJECT_ID> \
  --title="BigQuery Job Creator" \
  --description="Create BigQuery jobs for Data Quality Dashboard" \
  --permissions="bigquery.jobs.create"

gcloud projects add-iam-policy-binding <PROJECT_ID> \
  --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
  --role="projects/<PROJECT_ID>/roles/customBigQueryJobCreator"
```

Query timeouts

Long-running queries, a large volume of failed events, or resource pool exhaustion can cause the data quality dashboard to time out when fetching failed events. You may receive the following errors:

  • Error codes: 12xxx or 22xxx
  • Description: Query exceeded timeout or Query execution time limit exceeded

Diagnose the cause of the error by running:

```sql
-- Check recent query performance
SELECT
  job_id,
  user_email,
  total_slot_ms,
  total_bytes_processed,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_seconds
FROM `<PROJECT_ID>.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND state = 'DONE'
  AND statement_type = 'SELECT'
ORDER BY total_slot_ms DESC
LIMIT 10;
```

To fix these errors, try:

  • Reducing query scope
    • For Console, try switching to legacy failed events based on telemetry data
    • If using the API, try using smaller time windows, e.g., "Last hour" or "Last day" instead of "Last 30 days"
    • Query specific error types or schemas rather than all failed events
  • Optimizing warehouse performance
    • Review your warehouse configuration and query patterns
    • Consider implementing partitioning, clustering, or other optimization strategies
    • Monitor resource usage, and adjust warehouse size as needed

Default view

The default view is a relatively simple interface that shows the number of failed events over time, alongside a coarse-grained description of the problem at hand. In some cases, the error message can be cryptic and hard to use for diagnosing the root cause. This is because the information flows through Snowplow infrastructure: it is substantially redacted before it leaves your pipeline, to ensure that PII does not traverse Snowplow systems. The aggregates are served by the Console's backend:

Architecture diagram showing user-generated events flowing through the pipeline, which contains a Failed Events Aggregator, to the warehouse. Redacted aggregates pass from the aggregator through Snowplow Infrastructure and the Console API to the Console UI in the browser.

In this setup, you expose no additional interface to the public internet, and all failed events information is served by the Console's APIs.

Below is an example view of the failed events screen in Snowplow Console:

Tracking quality report (30-day view) showing 502.51m total events with 2.31k failed events (0.00%), a bar chart of 30-day failure trends, and a failed events by type table listing two Validation errors: load_succeeded/2-0-0 with 1.80k failures and page_view/1-0-0 with 335 failures.

This interface is intended to give you a quick representation of the volume of event failures so that action can be taken. The UI focuses on schema violation and enrichment errors at present.

We've filtered on these two types of errors to reduce noise, such as bot traffic that causes adapter failures. All failed events can be found in your S3 or GCS storage targets, partitioned by type and then date/time.

At the top, a data quality score compares the volume of failed events to the volume of events successfully loaded into your data warehouse.
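
A score of this kind reduces to a simple ratio. A minimal sketch, assuming the score is the failed share of total events (the exact formula Console uses is an assumption):

```python
def failure_rate(total_events: int, failed_events: int) -> float:
    """Failed events as a percentage of total events (illustrative formula)."""
    if total_events == 0:
        return 0.0
    return 100.0 * failed_events / total_events

# Using the volumes from the example report above (~502.51m total, ~2.31k failed):
print(f"{failure_rate(502_510_000, 2_310):.2f}%")  # prints 0.00%
```

A rate that rounds to 0.00% is why the report above shows "(0.00%)" despite a couple of thousand failures: the failed volume is tiny relative to the total.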

The bar chart shows how the total number of failed events varies over time, color-coded by validation and enrichment errors. The time window is 7 days by default but can be extended to 14 or 30 days.

In the table, failed events are aggregated by the unique type of failure (validation, enrichment) and the offending schema.

Select a particular error to see more detail:

Tracking failure details page for the page_view/1-0-0 schema showing error messages for $.documentLocationUrl and $.documentPath maxLength violations, vendor com.google.analytics.measurement-protocol, tracker com.google.analytics.measurement-protocol-v1, and a 30-day trend chart with 335 total failures most recently on 12 May.

The detailed view shows the error message, as well as other useful metadata (when available), such as app_id, to help diagnose the source and root cause of the error. A bar chart indicates how the volume of failures varies over time for this particular failed event.

Console API

The API that powers the Console view and dashboard is publicly available, and can be invoked with a valid token to feed your own monitoring systems.

Before you can invoke the Failed Events API, you will need to authenticate with an API key.

A full specification of the API can be found in our Swagger docs. Note that, as in the UI, the returned data only contains schema validation errors and enrichment failures.
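
A call from your own monitoring system might be sketched as follows. The base URL and endpoint path below are illustrative placeholders, not the documented API; consult the Swagger docs for the real paths, parameters, and token exchange:

```python
import urllib.request

# Assumption: base URL and path are placeholders, not the real API surface.
BASE_URL = "https://console.snowplowanalytics.com/api"

def build_failed_events_request(token: str, org_id: str) -> urllib.request.Request:
    """Build an authenticated GET request for failed-events data.

    The token is the JWT obtained by authenticating with your API key.
    """
    url = f"{BASE_URL}/organizations/{org_id}/failed-events"  # hypothetical path
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )

req = build_failed_events_request("YOUR_JWT_TOKEN", "your-org-id")
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing the request to `urllib.request.urlopen(req)` would then return the JSON payload described in the Swagger docs.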