IAB bot detection enrichment

The IAB Spiders and Robots enrichment uses the IAB/ABC International Spiders and Bots List to determine whether an event was produced by a user or a robot/spider based on its IP address and user agent.

The Interactive Advertising Bureau (IAB) is an advertising business organization that develops industry standards, conducts research, and provides legal support for the online advertising industry.

The IAB regularly maintains its internationally recognized list to identify the IP addresses and user agents of known bots and spiders.

How the enrichment works

This enrichment performs several checks using the IAB database files and your custom override lists.

The flowchart below shows the order of the checks. The values in parentheses are the values of the reason field in the spiders_and_robots entity attached to the event.

flowchart TD
    IncList{{User agent in the custom **include** list?}}:::nowrap
    IncList -->|Yes| NotBot1(["Not bot"])
    IncList -->|No| ExcList{{User agent in the custom **exclude** list?}}:::nowrap
    ExcList -->|Yes| Bot1("**Bot**<br/>(FAILED_UA_EXCLUDE)")
    ExcList -->|No| IpFile{{IP address in the IAB IP file?}}:::nowrap
    IpFile -->|Yes| Bot2("**Bot**<br/>(FAILED_IP_EXCLUDE)")
    IpFile -->|No| NullUA{{User agent is null?}}:::nowrap
    NullUA -->|Yes| NotBot2(["Not bot"])
    NullUA -->|No| IncFile{{User agent in the IAB **include** file?}}:::nowrap
    IncFile -->|Yes| ExcFile{{User agent in the IAB **exclude** file?}}:::nowrap
    IncFile -->|No| Bot3("**Bot**<br/>(FAILED_UA_INCLUDE)")
    ExcFile -->|Yes| Bot4("**Bot**<br/>(FAILED_UA_EXCLUDE)")
    ExcFile -->|No| NotBot3(["Not bot"])
    classDef nowrap white-space:nowrap

A user agent string will match one of the lists or files if it contains a string from that list or file. The matching is case-insensitive.

For example, the user agent string Chrome Chrome MyBot Chrome will match an entry named mybot.
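The check order and the matching rule described above can be sketched in Python. This is a minimal illustration, not the enrichment's actual implementation. The function names ua_matches and classify, the in-memory lists, and the assumption that the IP file holds CIDR ranges are all hypothetical and chosen only for this example. The sketch also assumes every "Not bot" outcome carries the reason PASSED_ALL.

```python
from ipaddress import ip_address, ip_network
from typing import Iterable, Optional, Tuple


def ua_matches(useragent: str, patterns: Iterable[str]) -> bool:
    """Case-insensitive substring match against a list of patterns."""
    ua = useragent.lower()
    return any(p.lower() in ua for p in patterns)


def classify(
    useragent: Optional[str],
    ip: str,
    custom_include: Iterable[str],
    custom_exclude: Iterable[str],
    iab_ip_ranges: Iterable[str],  # assumed to be CIDR strings
    iab_include: Iterable[str],
    iab_exclude: Iterable[str],
) -> Tuple[bool, str]:
    """Return (is_bot, reason), applying the checks in flowchart order."""
    # 1. Custom include list wins over everything else.
    if useragent is not None and ua_matches(useragent, custom_include):
        return False, "PASSED_ALL"
    # 2. Custom exclude list.
    if useragent is not None and ua_matches(useragent, custom_exclude):
        return True, "FAILED_UA_EXCLUDE"
    # 3. IAB IP file.
    addr = ip_address(ip)
    if any(addr in ip_network(cidr) for cidr in iab_ip_ranges):
        return True, "FAILED_IP_EXCLUDE"
    # 4. A null user agent that passed the IP check is not a bot.
    if useragent is None:
        return False, "PASSED_ALL"
    # 5. The user agent must appear in the IAB include file...
    if not ua_matches(useragent, iab_include):
        return True, "FAILED_UA_INCLUDE"
    # 6. ...and must not appear in the IAB exclude file.
    if ua_matches(useragent, iab_exclude):
        return True, "FAILED_UA_EXCLUDE"
    return False, "PASSED_ALL"
```

Note that the IAB include file acts as an allowlist: in this flow, a non-null user agent that matches nothing in it is classified as a bot.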

Configuration

The enrichment takes these parameters:

| Parameter | Required | Description |
| --- | --- | --- |
| ipFile | Yes | Path to IP address exclude file. Already provided for CDI customers. |
| excludeUseragentFile | Yes | Path to user agent exclude file. Already provided for CDI customers. |
| includeUseragentFile | Yes | Path to user agent include file. Already provided for CDI customers. |
| includeUseragents | No | Additional user agent patterns to classify as browsers, extending includeUseragentFile. Case-insensitive substring match. |
| excludeUseragents | No | Additional user agent patterns to classify as spiders/robots, extending excludeUseragentFile. Case-insensitive substring match. |

Configure the parameters in the Console enrichment editor. Keep the Console defaults for the uri fields. For example:

json
{
  "ipFile": {
    "database": "ip_exclude_current_cidr.txt",
    "uri": "<use default value from Console>"
  },
  "excludeUseragentFile": {
    "database": "exclude_current.txt",
    "uri": "<use default value from Console>"
  },
  "includeUseragentFile": {
    "database": "include_current.txt",
    "uri": "<use default value from Console>"
  },
  "excludeUseragents": [
    "BotNotCaughtByIAB"
  ],
  "includeUseragents": [
    "MyServerSideTrackerThatIsNotABot"
  ]
}

Testing with Micro

Unsure if your enrichment configuration is correct or works as expected? You can easily test it using Snowplow Micro, either through Console or on your machine.

IAB files

Snowplow CDI

If you're using Snowplow CDI, you don't need to configure these. Use the default values provided in Console.

There are three configuration fields that correspond to the IAB/ABC database files:

| Field name | Description |
| --- | --- |
| ipFile | Denylist of IP addresses considered to be robots or spiders |
| excludeUseragentFile | Denylist of user agent strings considered to be robots or spiders |
| includeUseragentFile | Allowlist of user agent strings considered to be browsers |

All three are mandatory and must have two inner fields:

  • The database field containing the name of the database file.
  • The uri field containing the URI of the bucket in which the database file is found. This field supports http, https, gs, and s3 schemes.

The database filenames must be as follows:

| Field name | Filename |
| --- | --- |
| ipFile.database | "ip_exclude_current_cidr.txt" |
| excludeUseragentFile.database | "exclude_current.txt" |
| includeUseragentFile.database | "include_current.txt" |
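If you manage this configuration yourself, a small pre-flight check can catch mistakes in these fields before deployment. The sketch below is a hypothetical helper, not part of Enrich or Console: the name check_iab_files and the shape of its return value are assumptions, and it only checks the rules stated above (three mandatory fields, fixed filenames, supported URI schemes).

```python
from urllib.parse import urlparse

# Required database filenames, taken from the table above.
REQUIRED_FILENAMES = {
    "ipFile": "ip_exclude_current_cidr.txt",
    "excludeUseragentFile": "exclude_current.txt",
    "includeUseragentFile": "include_current.txt",
}
# URI schemes the uri field supports.
SUPPORTED_SCHEMES = {"http", "https", "gs", "s3"}


def check_iab_files(parameters: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field, filename in REQUIRED_FILENAMES.items():
        entry = parameters.get(field)
        if not isinstance(entry, dict):
            problems.append(f"{field}: missing")
            continue
        if entry.get("database") != filename:
            problems.append(f"{field}.database: expected {filename!r}")
        scheme = urlparse(str(entry.get("uri", ""))).scheme
        if scheme not in SUPPORTED_SCHEMES:
            problems.append(f"{field}.uri: unsupported or missing scheme")
    return problems
```

For example, calling check_iab_files with an empty dictionary reports all three fields as missing. The check does not verify that the files actually exist at the given location.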

Custom user agent lists

Availability

This feature is available since version 6.8.0 of Enrich.

In addition to the IAB database files, you can provide custom lists of user agent strings to supplement or override the detection. Two optional fields can be added to the parameters section:

| Field name | Description |
| --- | --- |
| excludeUseragents | A list of user agent strings to be treated as robots or spiders |
| includeUseragents | A list of user agent strings to be treated as browsers |

Both fields accept a JSON array of strings. They are optional and default to empty lists if omitted.

A user agent matching excludeUseragents produces the following output values:

| Field | Value |
| --- | --- |
| spiderOrRobot | true |
| category | SPIDER_OR_ROBOT |
| reason | FAILED_UA_EXCLUDE |
| primaryImpact | UNKNOWN |

A user agent matching includeUseragents produces the following output values:

| Field | Value |
| --- | --- |
| spiderOrRobot | false |
| category | BROWSER |
| reason | PASSED_ALL |
| primaryImpact | NONE |

Example:

json
"excludeUseragents": ["my-custom-bot/1.0", "internal-crawler"],
"includeUseragents": ["my-legitimate-app/2.0"]

This is useful when you need to flag or allowlist specific user agents without modifying the IAB database files themselves.
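To make the effect of the two custom lists concrete, here is a hedged Python sketch that maps a user agent to the output values listed above. The function custom_list_entity is hypothetical and exists only for illustration. It assumes the include list is checked before the exclude list, and it returns None when neither list matches, meaning the IAB files would decide.

```python
from typing import Optional, Sequence


def custom_list_entity(
    useragent: str,
    include_useragents: Sequence[str] = (),  # optional, defaults to empty
    exclude_useragents: Sequence[str] = (),  # optional, defaults to empty
) -> Optional[dict]:
    """Return the entity produced by a custom-list match, or None if no match."""
    ua = useragent.lower()
    # Case-insensitive substring match against the custom include list.
    if any(p.lower() in ua for p in include_useragents):
        return {
            "spiderOrRobot": False,
            "category": "BROWSER",
            "reason": "PASSED_ALL",
            "primaryImpact": "NONE",
        }
    # Case-insensitive substring match against the custom exclude list.
    if any(p.lower() in ua for p in exclude_useragents):
        return {
            "spiderOrRobot": True,
            "category": "SPIDER_OR_ROBOT",
            "reason": "FAILED_UA_EXCLUDE",
            "primaryImpact": "UNKNOWN",
        }
    # No custom match: the IAB database files decide.
    return None
```

With the example lists above, a user agent containing both my-legitimate-app/2.0 and internal-crawler would be treated as a browser in this sketch, because the include list takes precedence.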

Input

This enrichment uses the following fields of a Snowplow event:

  • useragent to determine an event's user agent, which will be validated against the databases described in excludeUseragentFile and includeUseragentFile, as well as the custom lists in excludeUseragents and includeUseragents
  • user_ipaddress to determine an event's IP address, which will be validated against the database described in ipFile
  • derived_tstamp to determine an event's datetime. Some entries in the Spiders and Robots List can be considered "stale", and will be given a category of INACTIVE_SPIDER_OR_ROBOT rather than ACTIVE_SPIDER_OR_ROBOT based on their age

Output

This enrichment adds a spiders_and_robots entity to the enriched event.

spiders_and_robots

Entity

Schema for an entity generated by the IAB Spiders and Robots enrichment

Schema URI: iglu:com.iab.snowplow/spiders_and_robots/jsonschema/1-0-0

Example data

json
{
  "spiderOrRobot": false,
  "category": "BROWSER",
  "reason": "PASSED_ALL",
  "primaryImpact": "NONE"
}

Properties and schema

| Property | Description |
| --- | --- |
| spiderOrRobot (boolean) | Required. true if the IP address or user agent checked against the list is a spider or robot, false otherwise |
| category | Required. Category based on activity if the IP/UA is a spider or robot, BROWSER otherwise. Must be one of: SPIDER_OR_ROBOT, ACTIVE_SPIDER_OR_ROBOT, INACTIVE_SPIDER_OR_ROBOT, BROWSER |
| reason | Required. Type of failed check if the IP/UA is a spider or robot, PASSED_ALL otherwise. Must be one of: FAILED_IP_EXCLUDE, FAILED_UA_INCLUDE, FAILED_UA_EXCLUDE, PASSED_ALL |
| primaryImpact | Required. Whether the spider or robot would affect page impression measurement, ad impression measurement, both or none. Must be one of: PAGE_IMPRESSIONS, AD_IMPRESSIONS, PAGE_AND_AD_IMPRESSIONS, UNKNOWN, NONE |
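When consuming this entity downstream, it can help to check it against the allowed values listed above. The following Python sketch is an illustration only: is_valid_entity is a hypothetical helper, and it checks just the four required properties and their allowed values rather than performing full JSON Schema validation.

```python
# Allowed values, copied from the property descriptions above.
CATEGORIES = {
    "SPIDER_OR_ROBOT",
    "ACTIVE_SPIDER_OR_ROBOT",
    "INACTIVE_SPIDER_OR_ROBOT",
    "BROWSER",
}
REASONS = {
    "FAILED_IP_EXCLUDE",
    "FAILED_UA_INCLUDE",
    "FAILED_UA_EXCLUDE",
    "PASSED_ALL",
}
PRIMARY_IMPACTS = {
    "PAGE_IMPRESSIONS",
    "AD_IMPRESSIONS",
    "PAGE_AND_AD_IMPRESSIONS",
    "UNKNOWN",
    "NONE",
}


def _one_of(value, allowed: set) -> bool:
    """True if value is a string contained in the allowed set."""
    return isinstance(value, str) and value in allowed


def is_valid_entity(entity: dict) -> bool:
    """Check that all four required properties are present with allowed values."""
    return (
        isinstance(entity.get("spiderOrRobot"), bool)
        and _one_of(entity.get("category"), CATEGORIES)
        and _one_of(entity.get("reason"), REASONS)
        and _one_of(entity.get("primaryImpact"), PRIMARY_IMPACTS)
    )
```

Applied to the example data above, is_valid_entity returns True. An entity with a missing property or a value outside the listed sets returns False.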
