IAB enrichment
The IAB Spiders & Robots enrichment uses the IAB/ABC International Spiders and Bots List to determine whether an event was produced by a human user or by a robot/spider, based on the event's IP address and user agent.
Spiders & bots are sometimes considered a necessary evil of the web. We want search engine crawlers to find our site, but we also don’t want a lot of non-human traffic clouding our reporting.
The Interactive Advertising Bureau (IAB) is an advertising business organization that develops industry standards, conducts research, and provides legal support for the online advertising industry.
Their internationally recognized list of spiders and bots is regularly maintained to try to identify the IP addresses of known bots and spiders.
Configuration
Unsure if your enrichment configuration is correct or works as expected? You can easily test it using Snowplow Micro on your machine. Follow the Micro usage guide to set up Micro and configure it to use your enrichment.
There are three fields that can be added to the `parameters` section of the enrichment configuration JSON:

- `ipFile`
- `excludeUseragentFile`
- `includeUseragentFile`

Each corresponds to one of the IAB/ABC database files and needs to have two inner fields:

- The `database` field, containing the name of the database file.
- The `uri` field, containing the URI of the bucket in which the database file is found. This field supports `http`, `https`, `gs` and `s3` schemes.
The table below describes the three database fields:

| Field name | Database description | Database filename |
| --- | --- | --- |
| `ipFile` | Blacklist of IP addresses considered to be robots or spiders | `ip_exclude_current_cidr.txt` |
| `excludeUseragentFile` | Blacklist of user agent strings considered to be robots or spiders | `exclude_current.txt` |
| `includeUseragentFile` | Whitelist of user agent strings considered to be browsers | `include_current.txt` |
All three of these fields must be added to the enrichment JSON, as the IAB lookup process uses all three databases in order to detect robots and spiders. Note that the database files are commercial and proprietary and should not be stored publicly – for instance, on unprotected HTTPS or in a public S3 bucket.
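Putting this together, a full enrichment configuration might look like the sketch below. It assumes the standard Snowplow enrichment wrapper fields (`name`, `vendor`, `enabled`) around the `parameters` section described above; the `s3://my-private-bucket/iab` path is a placeholder that you would replace with your own private bucket location.

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/iab_spiders_and_robots_enrichment/jsonschema/1-0-0",
  "data": {
    "name": "iab_spiders_and_robots_enrichment",
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "enabled": true,
    "parameters": {
      "ipFile": {
        "database": "ip_exclude_current_cidr.txt",
        "uri": "s3://my-private-bucket/iab"
      },
      "excludeUseragentFile": {
        "database": "exclude_current.txt",
        "uri": "s3://my-private-bucket/iab"
      },
      "includeUseragentFile": {
        "database": "include_current.txt",
        "uri": "s3://my-private-bucket/iab"
      }
    }
  }
}
```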
Input
This enrichment uses the following fields of a Snowplow event (an illustrative example follows this list):

- `useragent`, to determine the event's user agent, which will be validated against the databases described in `excludeUseragentFile` and `includeUseragentFile`.
- `user_ipaddress`, to determine the event's IP address, which will be validated against the database described in `ipFile`.
- `derived_tstamp`, to determine the event's datetime. Some entries in the Spiders & Robots List can be considered "stale" and will be given a `category` of `INACTIVE_SPIDER_OR_ROBOT` rather than `ACTIVE_SPIDER_OR_ROBOT` based on their age.
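For illustration, here is how these three input fields might look on an event produced by a well-known crawler. The values are invented for this example:

```json
{
  "useragent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  "user_ipaddress": "66.249.66.1",
  "derived_tstamp": "2021-05-01 12:34:56.789"
}
```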
Output
This enrichment adds a new context to the enriched event with this schema.
Example:
```json
{
  "schema": "iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0",
  "data": {
    "spiderOrRobot": false,
    "category": "BROWSER",
    "reason": "PASSED_ALL",
    "primaryImpact": "NONE"
  }
}
```
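For comparison, an event identified as a crawler might produce a context along these lines. The exact `reason` and `primaryImpact` values depend on which check the event failed, so the values below are an illustrative assumption rather than a definitive output:

```json
{
  "schema": "iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0",
  "data": {
    "spiderOrRobot": true,
    "category": "ACTIVE_SPIDER_OR_ROBOT",
    "reason": "FAILED_UA_INCLUDE",
    "primaryImpact": "UNKNOWN"
  }
}
```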