IAB bot detection enrichment
The IAB Spiders and Robots enrichment uses the IAB/ABC International Spiders and Bots List to determine whether an event was produced by a user or a robot/spider based on its IP address and user agent.
Spiders and bots are sometimes considered a necessary evil of the web. We want search engine crawlers to find our site, but we also don't want a lot of non-human traffic clouding our reporting.
The Interactive Advertising Bureau (IAB) is an advertising business organization that develops industry standards, conducts research, and provides legal support for the online advertising industry.
Their internationally recognized list of spiders and bots is regularly maintained to identify the IP addresses and user agent strings of known bots and spiders.
Configuration
Unsure if your enrichment configuration is correct or works as expected? You can easily test it using Snowplow Micro, either through Console or on your machine.
There are three fields that can be added to the parameters section of the enrichment configuration JSON:
- `ipFile`
- `excludeUseragentFile`
- `includeUseragentFile`
Each of these corresponds to one of the IAB/ABC database files and must contain two inner fields:
- The `database` field, containing the name of the database file.
- The `uri` field, containing the URI of the bucket in which the database file is found. This field supports the `http`, `https`, `gs` and `s3` schemes.
The table below describes the three types of database fields:
| Field name | Database description | Database filename |
|---|---|---|
| `ipFile` | Denylist of IP addresses considered to be robots or spiders | `ip_exclude_current_cidr.txt` |
| `excludeUseragentFile` | Denylist of useragent strings considered to be robots or spiders | `exclude_current.txt` |
| `includeUseragentFile` | Allowlist of useragent strings considered to be browsers | `include_current.txt` |
All three of these fields must be added to the enrichment JSON, as the IAB lookup process uses all three databases to detect robots and spiders. Note that the database files are commercial and proprietary and should not be stored publicly, for instance on an unprotected HTTPS endpoint or in a public S3 bucket.
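Putting the three fields together, a complete enrichment configuration might look as follows. The bucket path is a placeholder for your own private location, and the `schema`, `name`, and `vendor` values follow the standard Snowplow enrichment wrapper:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/iab_spiders_and_robots_enrichment/jsonschema/1-0-0",
  "data": {
    "name": "iab_spiders_and_robots_enrichment",
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "enabled": true,
    "parameters": {
      "ipFile": {
        "database": "ip_exclude_current_cidr.txt",
        "uri": "s3://my-private-bucket/iab"
      },
      "excludeUseragentFile": {
        "database": "exclude_current.txt",
        "uri": "s3://my-private-bucket/iab"
      },
      "includeUseragentFile": {
        "database": "include_current.txt",
        "uri": "s3://my-private-bucket/iab"
      }
    }
  }
}
```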
Custom user agent lists
This feature is available since version 6.8.0 of Enrich.
In addition to the IAB database files, you can provide custom lists of user agent strings to supplement the detection. Two optional fields can be added to the parameters section:
| Field name | Description |
|---|---|
| `excludeUseragents` | A list of user agent strings to be treated as robots or spiders (in addition to the `excludeUseragentFile` database) |
| `includeUseragents` | A list of user agent strings to be treated as browsers (in addition to the `includeUseragentFile` database) |
Both fields accept a JSON array of strings. They are optional and default to empty lists if omitted.
A user agent matching `excludeUseragents` produces the following output values:
| Field | Value |
|---|---|
| `spiderOrRobot` | `true` |
| `category` | `SPIDER_OR_ROBOT` |
| `reason` | `FAILED_UA_EXCLUDE` |
| `primaryImpact` | `UNKNOWN` |
A user agent matching `includeUseragents` produces the following output values:
| Field | Value |
|---|---|
| `spiderOrRobot` | `false` |
| `category` | `BROWSER` |
| `reason` | `PASSED_ALL` |
| `primaryImpact` | `NONE` |
Example:
"excludeUseragents": ["my-custom-bot/1.0", "internal-crawler"],
"includeUseragents": ["my-legitimate-app/2.0"]
This is useful when you need to flag or allowlist specific user agents without modifying the IAB database files themselves.
These lists take precedence over the IAB files. For example, a user agent added to `includeUseragents` will not be considered a robot even if it's also in the `excludeUseragentFile` database.
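The precedence described above can be sketched as follows. This is a hypothetical helper, not the actual Enrich implementation, and the relative ordering of the two custom lists when a user agent appears in both is an assumption here:

```python
# Sketch: the custom excludeUseragents and includeUseragents lists are
# consulted before falling back to the IAB database files.

def classify_ua(useragent, include_useragents, exclude_useragents, iab_db_lookup):
    """Return the enrichment output fields for a user agent string."""
    if useragent in exclude_useragents:
        # Matches the custom denylist, regardless of the IAB files
        return {"spiderOrRobot": True, "category": "SPIDER_OR_ROBOT",
                "reason": "FAILED_UA_EXCLUDE", "primaryImpact": "UNKNOWN"}
    if useragent in include_useragents:
        # Matches the custom allowlist, regardless of the IAB files
        return {"spiderOrRobot": False, "category": "BROWSER",
                "reason": "PASSED_ALL", "primaryImpact": "NONE"}
    # Neither custom list matched: defer to the IAB database lookup
    return iab_db_lookup(useragent)
```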
Input
This enrichment uses the following fields of a Snowplow event:
- `useragent`, to determine an event's user agent, which will be validated against the databases described in `excludeUseragentFile` and `includeUseragentFile`.
- `user_ipaddress`, to determine an event's IP address, which will be validated against the database described in `ipFile`.
- `derived_tstamp`, to determine an event's datetime. Some entries in the Spiders and Robots List can be considered "stale", and will be given a `category` of `INACTIVE_SPIDER_OR_ROBOT` rather than `ACTIVE_SPIDER_OR_ROBOT` based on their age.
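The role of `derived_tstamp` in the staleness check can be illustrated with a minimal sketch. The actual cutoff is defined by the IAB list itself, so the one-year threshold below is purely a placeholder:

```python
from datetime import datetime, timedelta

# Placeholder cutoff; the real threshold comes from the IAB list, not Enrich.
STALE_AFTER = timedelta(days=365)

def robot_category(entry_last_seen: datetime, derived_tstamp: datetime) -> str:
    """Classify a matched list entry as active or stale relative to the event time."""
    if derived_tstamp - entry_last_seen > STALE_AFTER:
        return "INACTIVE_SPIDER_OR_ROBOT"
    return "ACTIVE_SPIDER_OR_ROBOT"
```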
Output
This enrichment adds a new context to the enriched event with this schema.
Example:
{
"schema": "iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0",
"data": {
"spiderOrRobot": false,
"category": "BROWSER",
"reason": "PASSED_ALL",
"primaryImpact": "NONE"
}
}