Web Crawlers and Scrapers

Background

Many service providers wish to restrict certain Web activities, including spambot registration, web crawling, and more. LLM-associated user agents are an increasing source of traffic: they consume network resources, drive up hosting costs, and retain the content they find indefinitely.

Some of these bots honour robots.txt directives, but it is up to each domain admin to put that file in place on every domain and sub-domain they oversee. For bots that ignore or work around robots.txt, other measures are needed.
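Where robots.txt is not honoured, blocking can be enforced server-side by matching the User-Agent request header. A minimal sketch as a Python WSGI middleware (the agent list here is illustrative, not a vetted blocklist):

```python
# Sketch: reject requests from known crawler user agents before they
# reach the application. Agent names below are examples from this page.
BLOCKED_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot")

def is_blocked(user_agent: str) -> bool:
    """Case-insensitive substring match against the blocklist."""
    ua = user_agent.lower()
    return any(agent.lower() in ua for agent in BLOCKED_AGENTS)

def blocking_middleware(app):
    """Wrap a WSGI app; return 403 for blocked agents, else pass through."""
    def wrapper(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return wrapper
```

Substring matching is deliberately loose (it catches versioned strings like GPTBot/1.0), but note that user agents are trivially spoofed, so this only deters bots that identify themselves honestly.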

Community Resources

Crawler Identification Resources

Managed Blocks

Both Cloudflare and Fastly offer managed bot detection and blocking.

Notable Bots

Amazon

Amazonbot is Amazon’s web crawler used to improve services, such as enabling Alexa to answer questions for customers. Amazonbot respects standard robots.txt rules.

To disallow Amazonbot using robots.txt:

User-agent: Amazonbot
Disallow: /
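Rules like the one above can be sanity-checked locally with Python's standard-library robots.txt parser (a quick check of matching behaviour, not Amazon's own tooling):

```python
from urllib.robotparser import RobotFileParser

# Parse the rules shown above and confirm Amazonbot is excluded
# while other agents remain unaffected.
parser = RobotFileParser()
parser.parse("""
User-agent: Amazonbot
Disallow: /
""".splitlines())

print(parser.can_fetch("Amazonbot", "https://example.com/any/page"))
print(parser.can_fetch("SomeOtherBot", "https://example.com/any/page"))
```

The first check returns False (Amazonbot is disallowed everywhere); the second returns True, because no rule addresses other agents and there is no `User-agent: *` fallback stanza.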

Anthropic AI

Anthropic is a U.S.-based AI public-benefit company that develops AI systems in order to “study their safety properties at the technological frontier” and uses that research to deploy safe, reliable models for the public. Anthropic develops the Claude family of large language models (LLMs), a competitor to OpenAI’s ChatGPT and Google’s Gemini.

To disallow ClaudeBot using robots.txt:

User-agent: ClaudeBot
User-agent: claude-web
Disallow: /

Applebot-Extended

With Applebot-Extended, web publishers can choose to opt out of their website content being used to train Apple’s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. Applebot-Extended does not crawl webpages. Webpages that disallow Applebot-Extended can still be included in search results. Applebot-Extended is only used to determine how to use the data crawled by the Applebot user agent.

To disallow Applebot-Extended using robots.txt:

User-agent: Applebot-Extended
Disallow: /

Additional information, including information about Apple’s web crawler “AppleBot”, is available at https://support.apple.com/en-us/119829

CCBot

Common Crawl is a non-profit foundation founded with the goal of democratizing access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable by anyone. The user agent is CCBot/2.0.

To disallow CCBot using robots.txt:

User-agent: CCBot
Disallow: /

FacebookBot

FacebookBot crawls public web pages to improve language models for speech recognition technology.

User agent: FacebookBot
Full user-agent string: Mozilla/5.0 (compatible; FacebookBot/1.0; +https://developers.facebook.com/docs/sharing/webmasters/facebookbot/)

To disallow FacebookBot using robots.txt:

User-agent: FacebookBot
User-agent: meta-externalagent
Disallow: /

Google

Google-Extended is the product token used to opt out of having content train Google’s Gemini (formerly Bard) and Vertex AI models. To disallow it:

User-agent: Google-Extended
Disallow: /

This does not stop all Google generative AI crawls. Google also scrapes content for AI-powered search results. To stop this, you will need to block the main Googlebot, which will also remove your site from Google Search.

User-agent: Googlebot
Disallow: /

To disallow all Google bot traffic by IP address, see this JSON file.
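Google publishes its crawler address ranges as JSON objects containing ipv4Prefix and ipv6Prefix entries. A sketch of matching a client address against a list in that shape, assuming the file has already been fetched (the ranges below are illustrative samples, not the full list):

```python
import ipaddress
import json

# Example payload in the shape Google publishes; ranges are samples only.
payload = json.loads("""
{"prefixes": [
    {"ipv4Prefix": "66.249.64.0/27"},
    {"ipv6Prefix": "2001:4860:4801:10::/64"}
]}
""")

# Each entry carries exactly one of the two prefix keys.
networks = [
    ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
    for p in payload["prefixes"]
]

def is_listed(addr: str) -> bool:
    """True if addr falls inside any published crawler range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in networks)
```

Membership tests between mismatched IP versions simply return False, so mixed IPv4/IPv6 lists can be checked in one pass.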

Microsoft

To disallow Bingbot using robots.txt (note that this also removes your site from Bing Search):

User-agent: Bingbot
Disallow: /

OpenAI

GPTBot is OpenAI’s web crawler and can be identified by the following user agent and string:

User agent: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

To disallow GPTBot using robots.txt:

User-agent: GPTBot
Disallow: /

There are two user agents: GPTBot (crawling for model training) and ChatGPT-User (user-initiated browsing via ChatGPT). Each is controlled by its own robots.txt rule.

To disallow GPTBot traffic by source IP:

52.230.152.0/24
52.233.106.0/24

To disallow ChatGPT-User traffic by IP:

23.98.142.176/28
40.84.180.224/28
13.65.240.240/28
20.97.189.96/28

(reference: https://openai.com/gptbot.json and https://platform.openai.com/docs/plugins/bot)
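CIDR lists like these can be converted to server configuration mechanically. A sketch that renders nginx deny directives from the GPTBot ranges quoted above (verify the ranges against OpenAI’s published JSON before deploying, as they change over time):

```python
import ipaddress

# CIDR ranges copied from the section above; confirm against the
# published JSON before use, as the lists change over time.
GPTBOT_RANGES = ["52.230.152.0/24", "52.233.106.0/24"]

def nginx_deny_rules(cidrs):
    """Validate each range and render an nginx `deny` directive for it."""
    lines = []
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)  # raises ValueError on bad input
        lines.append(f"deny {net};")
    return "\n".join(lines)

print(nginx_deny_rules(GPTBOT_RANGES))
```

Validating through `ip_network` first catches typos (such as a doubled prefix length) before they reach the server configuration.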

PerplexityBot

Perplexity AI is an AI chatbot-powered research and conversational search engine that answers queries using natural language predictive text.

To disallow PerplexityBot using robots.txt:

User-agent: PerplexityBot
Disallow: /

To disallow PerplexityBot traffic by IP:

54.90.207.250/32
23.22.208.105/32
54.242.1.13/32
18.208.251.246/32
34.230.5.59/32
18.207.114.171/32
54.221.7.250/32

(reference: https://docs.perplexity.ai/docs/perplexitybot and https://www.perplexity.ai/perplexitybot.json)

PetalBot

PetalBot is a web crawler developed by Huawei to index content for its Petal Search engine, which powers services like Huawei Assistant and AI Search. It is also known as AspiegelBot.

To disallow PetalBot using robots.txt:

User-agent: PetalBot
Disallow: /

YaK

The YaK crawler is operated by Linkfluence, which was acquired by Meltwater. The bot gathers and analyses data from websites and social media platforms; its main function is monitoring brand mentions and analysing market trends.

To disallow YaK using robots.txt:

User-agent: YaK
Disallow: /

Yandex

Yandex is the leading Russian search engine, and operates a number of crawlers.

To disallow all Yandex crawlers using robots.txt (all Yandex robots obey rules addressed to the shared Yandex token):

User-agent: Yandex
Disallow: /
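Rather than maintaining one robots.txt stanza per bot, the agents covered on this page can be grouped: the Robots Exclusion Protocol (RFC 9309) allows consecutive User-agent lines to share the rules that follow. A sketch that emits such a combined file (the token list is drawn from the sections above):

```python
# User-agent tokens collected from the sections above. Consecutive
# User-agent lines form one group sharing the Disallow rule below them.
AGENTS = [
    "Amazonbot", "ClaudeBot", "claude-web", "Applebot-Extended", "CCBot",
    "FacebookBot", "meta-externalagent", "Google-Extended", "GPTBot",
    "PerplexityBot", "PetalBot", "YaK", "Yandex",
]

def combined_robots_txt(agents) -> str:
    """One stanza: a User-agent line per bot, then a single Disallow."""
    lines = [f"User-agent: {a}" for a in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"

print(combined_robots_txt(AGENTS))
```

Generating the file from a list keeps the blocklist in one place, which helps when the same robots.txt must be deployed across many domains and sub-domains.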

IFTAS

Nonprofit trust and safety support for volunteer social web content moderators


IFTAS is a non-profit organisation committed to advocating for independent, sovereign technology, empowering and supporting the people who keep decentralised social platforms safe, fair, and inclusive.