IAB bot detection enrichment
The IAB Spiders and Robots enrichment uses the IAB/ABC International Spiders and Bots List to determine whether an event was produced by a user or a robot/spider based on its IP address and user agent.
The Interactive Advertising Bureau (IAB) is an advertising business organization that develops industry standards, conducts research, and provides legal support for the online advertising industry.
Its internationally recognized list of spiders and bots is regularly maintained to identify the IP addresses of known bots and spiders.
How the enrichment works
This enrichment performs several checks using the IAB database files and your custom override lists (both covered in the configuration section).
Here is the logic it uses. The values in parentheses are for the reason field in the IAB entity attached to the event.
A user agent string will match one of the lists or files if it contains a string from that list or file. The matching is case-insensitive.
For example, the user agent string Chrome Chrome MyBot Chrome will match an entry named mybot.
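The matching rule above can be illustrated with a minimal Python sketch. The function name `matches_any` is hypothetical and is not part of Enrich; it only mirrors the case-insensitive "contains" behaviour described here.

```python
def matches_any(useragent: str, entries: list[str]) -> bool:
    """Return True if the user agent contains any entry, ignoring case."""
    ua = useragent.lower()
    # Sketch-only safeguard: skip empty entries, since an empty string
    # is contained in every user agent and would match everything.
    return any(entry.lower() in ua for entry in entries if entry)


print(matches_any("Chrome Chrome MyBot Chrome", ["mybot"]))  # True
print(matches_any("Chrome Chrome Chrome", ["mybot"]))        # False
```

Because the check is a substring test, short entries can match more user agents than intended, so it is worth keeping list entries as specific as possible.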
Configuration
Unsure if your enrichment configuration is correct or works as expected? You can easily test it using Snowplow Micro, either through Console or on your machine.
IAB files
There are three configuration fields that correspond to the IAB/ABC database files:
| Field name | Description |
|---|---|
| ipFile | Denylist of IP addresses considered to be robots or spiders |
| excludeUseragentFile | Denylist of user agent strings considered to be robots or spiders |
| includeUseragentFile | Allowlist of user agent strings considered to be browsers |
All three are mandatory and must have two inner fields:
- The database field containing the name of the database file.
- The uri field containing the URI of the bucket in which the database file is found. This field supports http, https, gs, and s3 schemes.
If you use Snowplow CDI, the necessary files are already provided and updated by Snowplow. You can see the pre-configured URIs of these files in the default enrichment configuration in Console.
The database filenames must be as follows:
| Field name | Filename |
|---|---|
| ipFile.database | "ip_exclude_current_cidr.txt" |
| excludeUseragentFile.database | "exclude_current.txt" |
| includeUseragentFile.database | "include_current.txt" |
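As a rough illustration of these requirements, the following Python sketch builds the three file fields and checks them against the rules above. The helper `check_iab_files` and the bucket `s3://my-example-bucket/iab` are hypothetical and exist only for this example; they are not part of Enrich.

```python
from urllib.parse import urlparse

# Filenames required by the enrichment, keyed by configuration field.
REQUIRED_FILENAMES = {
    "ipFile": "ip_exclude_current_cidr.txt",
    "excludeUseragentFile": "exclude_current.txt",
    "includeUseragentFile": "include_current.txt",
}
SUPPORTED_SCHEMES = {"http", "https", "gs", "s3"}


def check_iab_files(parameters: dict) -> list[str]:
    """Return a list of problems found in the three IAB file fields."""
    problems = []
    for field, filename in REQUIRED_FILENAMES.items():
        entry = parameters.get(field)
        if not isinstance(entry, dict):
            problems.append(f"{field}: missing")
            continue
        if entry.get("database") != filename:
            problems.append(f"{field}.database: expected {filename!r}")
        if urlparse(str(entry.get("uri", ""))).scheme not in SUPPORTED_SCHEMES:
            problems.append(f"{field}.uri: unsupported scheme")
    return problems


# Hypothetical bucket; substitute the location of your own copies of the files.
parameters = {
    field: {"database": filename, "uri": "s3://my-example-bucket/iab"}
    for field, filename in REQUIRED_FILENAMES.items()
}
print(check_iab_files(parameters))  # []
```

A check like this only catches structural mistakes. It does not confirm that the files exist at the given URI.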
Custom user agent lists
This feature is available since version 6.8.0 of Enrich.
In addition to the IAB database files, you can provide custom lists of user agent strings to supplement or override the detection. Two optional fields can be added to the parameters section:
| Field name | Description |
|---|---|
| excludeUseragents | A list of user agent strings to be treated as robots or spiders |
| includeUseragents | A list of user agent strings to be treated as browsers |
Both fields accept a JSON array of strings. They are optional and default to empty lists if omitted.
A user agent matching excludeUseragents produces the following output values:
| Field | Value |
|---|---|
| spiderOrRobot | true |
| category | SPIDER_OR_ROBOT |
| reason | FAILED_UA_EXCLUDE |
| primaryImpact | UNKNOWN |
A user agent matching includeUseragents produces the following output values:
| Field | Value |
|---|---|
| spiderOrRobot | false |
| category | BROWSER |
| reason | PASSED_ALL |
| primaryImpact | NONE |
Example:
"excludeUseragents": ["my-custom-bot/1.0", "internal-crawler"],
"includeUseragents": ["my-legitimate-app/2.0"]
This is useful when you need to flag or allowlist specific user agents without modifying the IAB database files themselves.
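To make the two outcomes concrete, here is a minimal Python sketch of how a custom list match could map to the output values listed above. The function `check_custom_lists` is hypothetical. Checking the exclude list before the include list is an assumption of this sketch, not documented behaviour, and returning `None` simply stands for "no custom list matched", leaving the decision to the IAB database files.

```python
from typing import Optional

EXCLUDE_RESULT = {
    "spiderOrRobot": True,
    "category": "SPIDER_OR_ROBOT",
    "reason": "FAILED_UA_EXCLUDE",
    "primaryImpact": "UNKNOWN",
}
INCLUDE_RESULT = {
    "spiderOrRobot": False,
    "category": "BROWSER",
    "reason": "PASSED_ALL",
    "primaryImpact": "NONE",
}


def check_custom_lists(
    useragent: str, exclude: list[str], include: list[str]
) -> Optional[dict]:
    """Return the output values for a custom list match, or None."""
    ua = useragent.lower()
    # Assumption: the exclude list is consulted first in this sketch.
    if any(entry.lower() in ua for entry in exclude if entry):
        return dict(EXCLUDE_RESULT)
    if any(entry.lower() in ua for entry in include if entry):
        return dict(INCLUDE_RESULT)
    return None  # No custom list matched.


exclude = ["my-custom-bot/1.0", "internal-crawler"]
include = ["my-legitimate-app/2.0"]
print(check_custom_lists("Internal-Crawler (nightly)", exclude, include)["reason"])
# FAILED_UA_EXCLUDE
```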
Input
This enrichment uses the following fields of a Snowplow event:
- useragent to determine an event's user agent, which will be validated against the databases described in excludeUseragentFile and includeUseragentFile, as well as the custom lists in excludeUseragents and includeUseragents
- user_ipaddress to determine an event's IP address, which will be validated against the database described in ipFile
- derived_tstamp to determine an event's datetime. Some entries in the Spiders and Robots List can be considered "stale", and will be given a category of INACTIVE_SPIDER_OR_ROBOT rather than ACTIVE_SPIDER_OR_ROBOT based on their age
Output
This enrichment adds a new context (entity) to the enriched event, using the iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0 schema.
Example:
{
  "schema": "iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0",
  "data": {
    "spiderOrRobot": false,
    "category": "BROWSER",
    "reason": "PASSED_ALL",
    "primaryImpact": "NONE"
  }
}
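Downstream code can read this entity to decide how to treat an event. The following Python sketch parses the example above and returns the spiderOrRobot flag. The helper `is_spider_or_robot` is hypothetical, and how you extract the entity from an enriched event depends on your own pipeline and is not shown here.

```python
import json

IAB_SCHEMA_PREFIX = "iglu:com.iab.snowplow/spiders_and_robots/jsonschema/"


def is_spider_or_robot(entity: dict) -> bool:
    """Return the spiderOrRobot flag of an IAB entity."""
    if not entity.get("schema", "").startswith(IAB_SCHEMA_PREFIX):
        raise ValueError("not an IAB spiders_and_robots entity")
    return bool(entity["data"]["spiderOrRobot"])


entity = json.loads("""
{
  "schema": "iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0",
  "data": {
    "spiderOrRobot": false,
    "category": "BROWSER",
    "reason": "PASSED_ALL",
    "primaryImpact": "NONE"
  }
}
""")
print(is_spider_or_robot(entity))  # False
```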