I was wondering what lines I should put in my robots.txt
file to at least attempt to stop #GenAI from pillaging my content. It turns out you can get a nice list from the robots.txt
files at big web properties.
Searching on search engines just turns up a bunch of "both sides" articles that try to tell you why it might be good or bad, but not how you should realistically do it.
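One way to harvest such a list is to grab a big site's robots.txt and pull out the crawler names. A sketch (the filename is just a local copy you fetched first, e.g. with curl):

```shell
# List the crawlers named in a saved copy of a site's robots.txt
# (fetch it first, e.g.: curl -s https://example.com/robots.txt -o robots.txt)
grep -i '^user-agent' robots.txt | sort -u
```

Diffing the output from a few large properties is a quick way to spot which AI crawlers they bother to block.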
I also found this article [www.bing.com] informative: #Microsoft's #AI crawler will honour some <html>
meta tags.
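For the curious, a hedged sketch of the kind of meta tags that article describes. "noarchive" is a long-standing robots directive; "nocache" is the one Microsoft described for limiting Bing Chat use. Check the linked article for the exact current names and semantics:

```html
<head>
  <!-- generic robots directive, honoured by most crawlers -->
  <meta name="robots" content="noarchive">
  <!-- or scoped to Bing's crawler specifically -->
  <meta name="bingbot" content="nocache">
</head>
```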
In case you're wondering what I settled on, here it is: a bunch of #AI web crawlers.
Is it all of them? Who knows.
Why do all this? Well, I'm kinda OK with Google Search and Bing Search indexing my pages. I'm kinda not OK with most of the other crap.
I think I have to "opt out" instead of "opt in".
# 80 Legs
User-agent: 008
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: AwarioRssBot
Disallow: /
User-agent: AwarioSmartBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: DataForSeoBot
Disallow: /
User-agent: EtaoSpider
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: magpie-crawler
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: omgili
Disallow: /
User-agent: omgilibot
Disallow: /
User-agent: peer39_crawler
Disallow: /
User-agent: peer39_crawler/1.0
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: PiplBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: Twitterbot
Disallow: /
# 80legs' new crawler
User-agent: voltron
Disallow: /
@paco I've just been hit by "ClaudeBot" from anthropic dot com
from IPs:
3.17.150.89
18.191.88.249
18.118.31.247
18.221.208.183
I wrote this small nginx server rule, which will block the bot (and can be adapted for others). It's especially useful when a bot is less polite about robots.txt:
if ($http_user_agent ~* "ClaudeBot") {
    return 444;
}
if ($http_user_agent ~* "claudebot@anthropic.com") {
    return 444;
}
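If you end up blocking many bots this way, nginx's `map` directive keeps all the patterns in one place. A sketch, with illustrative variable names and patterns (the `map` block goes in the `http` context, the `if` in a `server` or `location` block):

```nginx
# http {} context: map user agents to a flag (names here are illustrative)
map $http_user_agent $is_ai_bot {
    default          0;
    "~*claudebot"    1;   # ~* = case-insensitive regex match
    "~*gptbot"       1;
    "~*ccbot"        1;
}

# server {} or location {} context:
if ($is_ai_bot) {
    return 444;   # nginx-specific: close the connection without a response
}
```

`map` is evaluated lazily per request, so adding patterns stays cheap compared to stacking up separate `if` blocks.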