I was wondering what lines I should put in my robots.txt
file to at least attempt to stop #GenAI from pillaging my content. It turns out you can get a nice list from the robots.txt
files at big web properties.
Searching on search engines just turns up a bunch of "both sides" articles that try to tell you why it might be good or bad, but not how you should realistically do it.
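One way to harvest such a list is to grab a big site's robots.txt and pull out the crawler names. A sketch (the filename is just a local copy you fetched first, e.g. with curl):

```shell
# List the crawlers named in a saved copy of a site's robots.txt
# (fetch it first, e.g.: curl -s https://example.com/robots.txt -o robots.txt)
grep -i '^user-agent' robots.txt | sort -u
```

Diffing the output from a few large properties is a quick way to spot which AI crawlers they bother to block.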
I also found this article [www.bing.com] informative: #Microsoft's #AI crawler will honour some <html>
meta tags.
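For the curious, a hedged sketch of the kind of meta tags that article describes. "noarchive" is a long-standing robots directive; "nocache" is the one Microsoft described for limiting Bing Chat use. Check the linked article for the exact current names and semantics:

```html
<head>
  <!-- generic robots directive, honoured by most crawlers -->
  <meta name="robots" content="noarchive">
  <!-- or scoped to Bing's crawler specifically -->
  <meta name="bingbot" content="nocache">
</head>
```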
In case you're wondering what I settled on, here it is: a bunch of #AI web crawlers.
Is it all of them? Who knows.
Why do all this? Well, I'm kinda OK with Google Search and Bing Search indexing my pages. I'm kinda not OK with most of the other crap.
I think I have to "opt out" instead of "opt in".
# 80 Legs
User-agent: 008
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: AwarioRssBot
Disallow: /
User-agent: AwarioSmartBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: DataForSeoBot
Disallow: /
User-agent: EtaoSpider
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: magpie-crawler
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: omgili
Disallow: /
User-agent: omgilibot
Disallow: /
User-agent: peer39_crawler
Disallow: /
User-agent: peer39_crawler/1.0
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: PiplBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: Twitterbot
Disallow: /
# 80legs' new crawler
User-agent: voltron
Disallow: /
@paco I've just been hit by "ClaudeBot" from anthropic dot com
from IPs:
3.17.150.89
18.191.88.249
18.118.31.247
18.221.208.183
I wrote this small nginx server rule, which will block the bot (and can be adapted for others). It's especially useful when a bot is less polite about robots.txt:
if ($http_user_agent ~* "ClaudeBot") {
    return 444;
}
if ($http_user_agent ~* "claudebot@anthropic.com") {
    return 444;
}
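If you end up blocking many bots this way, nginx's `map` directive keeps all the patterns in one place. A sketch, with illustrative variable names and patterns (the `map` block goes in the `http` context, the `if` in a `server` or `location` block):

```nginx
# http {} context: map user agents to a flag (names here are illustrative)
map $http_user_agent $is_ai_bot {
    default          0;
    "~*claudebot"    1;   # ~* = case-insensitive regex match
    "~*gptbot"       1;
    "~*ccbot"        1;
}

# server {} or location {} context:
if ($is_ai_bot) {
    return 444;   # nginx-specific: close the connection without a response
}
```

`map` is evaluated lazily per request, so adding patterns stays cheap compared to stacking up separate `if` blocks.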