
IAB enrichment

The IAB Spiders & Robots enrichment uses the IAB/ABC International Spiders and Bots List to determine whether an event was produced by a human user or by a robot/spider, based on its IP address and user agent.

Spiders & bots are sometimes considered a necessary evil of the web. We want search engine crawlers to find our site, but we also don’t want a lot of non-human traffic clouding our reporting.

The Interactive Advertising Bureau (IAB) is an advertising business organization that develops industry standards, conducts research, and provides legal support for the online advertising industry.

Their internationally recognized list of spiders and bots is regularly maintained to identify known bots and spiders by their IP addresses and user agents.

Configuration

Testing with Micro

Unsure if your enrichment configuration is correct or works as expected? You can easily test it using Snowplow Micro on your machine. Follow the Micro usage guide to set up Micro and configure it to use your enrichment.

There are three fields that can be added to the parameters section of the enrichment configuration JSON:

  • ipFile
  • excludeUseragentFile
  • includeUseragentFile

Each corresponds to one of the IAB/ABC database files and needs to contain two inner fields:

  • The database field containing the name of the database file.
  • The uri field containing the URI of the bucket in which the database file is found. This field supports http, https, gs and s3 schemes.

The table below describes the three types of database fields:

Field name           | Database description                                               | Database filename
ipFile               | Blacklist of IP addresses considered to be robots or spiders       | "ip_exclude_current_cidr.txt"
excludeUseragentFile | Blacklist of useragent strings considered to be robots or spiders  | "exclude_current.txt"
includeUseragentFile | Whitelist of useragent strings considered to be browsers           | "include_current.txt"

All three of these fields must be added to the enrichment JSON, as the IAB lookup process uses all three databases to detect robots and spiders; see the sketch below. Note that the database files are commercial and proprietary and should not be stored publicly – for instance, on a publicly accessible HTTP(S) server or in a public S3 bucket.
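Putting this together, a complete enrichment configuration might look like the following. This is a sketch: the S3 bucket path is a placeholder, and the uri fields should point at wherever you privately host the IAB files.

{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/iab_spiders_and_robots_enrichment/jsonschema/1-0-0",
  "data": {
    "name": "iab_spiders_and_robots_enrichment",
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "enabled": true,
    "parameters": {
      "ipFile": {
        "database": "ip_exclude_current_cidr.txt",
        "uri": "s3://my-private-bucket/iab"
      },
      "excludeUseragentFile": {
        "database": "exclude_current.txt",
        "uri": "s3://my-private-bucket/iab"
      },
      "includeUseragentFile": {
        "database": "include_current.txt",
        "uri": "s3://my-private-bucket/iab"
      }
    }
  }
}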

Input

This enrichment uses the following fields of a Snowplow event:

  • useragent to determine an event’s user agent, which will be validated against the databases described in excludeUseragentFile and includeUseragentFile.
  • user_ipaddress to determine an event’s IP address, which will be validated against the database described in ipFile.
  • derived_tstamp to determine an event’s timestamp. Entries in the Spiders & Robots List can become “stale” with age; matches against such entries are categorized as INACTIVE_SPIDER_OR_ROBOT rather than ACTIVE_SPIDER_OR_ROBOT.
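To make the inputs concrete, the enrichment reads values such as the following from each event. The user agent, IP address, and timestamp shown here are purely illustrative and make no claim about what is actually in the IAB lists:

{
  "useragent": "Mozilla/5.0 (compatible; Examplebot/2.1; +http://www.example.com/bot.html)",
  "user_ipaddress": "192.0.2.10",
  "derived_tstamp": "2023-05-01 12:34:56.789"
}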

Output

This enrichment adds a new context to the enriched event, conforming to the com.iab.snowplow/spiders_and_robots schema.

Example:

{
  "schema": "iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0",
  "data": {
    "spiderOrRobot": false,
    "category": "BROWSER",
    "reason": "PASSED_ALL",
    "primaryImpact": "NONE"
  }
}
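For contrast, an event whose IP address matches the exclude list might produce a context along these lines. This is a sketch: the reason and primaryImpact values shown are illustrative; consult the schema for the full set of allowed values.

{
  "schema": "iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0",
  "data": {
    "spiderOrRobot": true,
    "category": "ACTIVE_SPIDER_OR_ROBOT",
    "reason": "FAILED_IP_EXCLUDE",
    "primaryImpact": "UNKNOWN"
  }
}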