AI bots everywhere – does anyone have a good whitelist for robots.txt?
My niche little site, http://golfcourse.wiki, seems to be very popular with AI bots. They've basically become most of my traffic. Most of them follow robots.txt, which is nice and all, but they're costing me non-trivial amounts of money.
I don't want to block most search engines. I don't want to block legitimate institutions like archive.org. Is there a whitelist that I could crib instead of pretty much having to update my robots file every damn day?
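For reference, the rough shape I'm after is a default-deny file with a short allow list, something like the sketch below (the bot tokens are just examples I'd want to double-check against each crawler's docs, and of course none of this stops a bot that ignores robots.txt):

    # Allow a handful of crawlers I actually want
    User-agent: Googlebot
    User-agent: Bingbot
    User-agent: archive.org_bot
    Disallow:

    # Everyone else stays out
    User-agent: *
    Disallow: /

Grouping several User-agent lines over one rule set and leaving Disallow empty ("allow everything") is standard robots.txt, so the file stays short even with a longer allow list.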
I operate under the assumption that Google, OpenAI, Anthropic, ByteDance et al. either ignore it entirely or follow it only selectively. I haven't touched my robots.txt in a while. Instead, I have nginx return empty and/or bogus responses when it sees those UA substrings.
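The nginx side of that is roughly the following; the UA substrings are just examples of tokens these crawlers advertise, so adjust the list to whatever actually shows up in your logs:

    # http {} context: flag requests whose UA matches a known bot substring
    map $http_user_agent $block_ai_bot {
        default          0;
        ~*gptbot         1;
        ~*claudebot      1;
        ~*bytespider     1;
        ~*ccbot          1;
    }

    # server {} context: 444 makes nginx close the connection without
    # sending any response at all
    if ($block_ai_bot) {
        return 444;
    }

Returning 444 is the cheapest option; serving a tiny static "bogus" page instead also works if you'd rather not tip them off with dropped connections.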
Robots.txt is a joke. If you're on Cloudflare, use its Block AI Bots feature.
robots.txt will just be ignored by bad actors.
And the user-agent sent is completely under the bad actor's control. They can send their user-agent as "Googlebot" if they feel like it.
You would need something like a WAF from https://www.cloudflare.com/ or AWS.
I found that robots.txt does an OK job. It doesn't block everything, but I wouldn't run without one because many really busy bots do follow the rules. It's simple, cheap, and knocks out a bunch of traffic. The AWS WAF has some challenge rules that I found work great at stopping some of the other crappy and aggressive bots. And then the last line of defense is some additional nginx blocks, sketched below. Those three layers really got things under control.
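The nginx layer can be as simple as dropping the worst networks outright with the access module (the CIDRs below are placeholders, not real offenders):

    # server {} or http {} context: refuse known-bad ranges before they
    # ever reach the application
    deny 203.0.113.0/24;
    deny 198.51.100.0/24;
    allow all;

Pair that with the log analysis mentioned elsewhere in the thread to decide which ranges are actually worth listing.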
Blocking Amazon, Huawei, China Telecom and a few other ASNs did it for me. You should be able to cut and sort -u your log files to find the biggest offenders by user-agent, at least for the bots that identify themselves honestly enough to obey robots.txt.
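With the default "combined" log format the user-agent is the sixth double-quote-delimited field, so a one-liner like this gives you a ranked list (the log path is whatever yours actually is):

    # count requests per user-agent, busiest first
    cut -d'"' -f6 /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20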
It's so cute how people think anyone obeys robots.txt :)
Use Cloudflare (free) and just disable bots or add a captcha.
You'll probably also save money if you enable and configure free caching.
Most of the major players seem to be actively ignoring robots.txt.
They claim they don't do this, but those claims amount to lies. The access logs show what they actually do, and it's gotten so egregious that some sites have buckled under the resource exhaustion these bad actors cause, effectively a DDoS attack.
People pointing at AI are right, but also: I've done a lot of scraping for personal sites and small side hustles, and not once have I even bothered to check for robots.txt.
An old trick is to add a page to the robots.txt disallow list but still link to it somewhere crawlers will find it (and humans won't click). If a bot visits that page, you know it's a bad actor.
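Roughly like this; the path name is made up, and the nginx part is just one way to act on the hits:

    # robots.txt: no well-behaved bot should ever request this path
    User-agent: *
    Disallow: /trap/

    # nginx, inside the server {} block: log whoever requests it anyway,
    # then feed that log into fail2ban or a deny list
    location /trap/ {
        access_log /var/log/nginx/trap.log;
        return 403;
    }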
Takes balls to put the URL in an HN post if you're looking to reduce traffic costs... Of course it's legit traffic for once, but I assume curiosity just got the better of half of them.
On the constructive side: shield it with Cloudflare.
Sorry, can't help - but really cool site!