How to stop an AWS bot sending 2B requests/month
I have been struggling with a bot ('Mozilla/5.0 (compatible; crawler)', coming from AWS Singapore) that is sending an absurd number of requests to a domain of mine, averaging over 700 requests/second for several months now.
Thankfully, CloudFlare is able to handle the traffic with a simple WAF rule and 444 response to reduce the outbound traffic.
I've submitted several complaints to AWS to get this traffic stopped; their typical follow-up is:
> We have engaged with our customer, and based on this engagement have determined that the reported activity does not require further action from AWS at this time.
I've tried various 4XX responses to see if the bot will back off, and I've tried 30X redirects (which it follows), all to no avail.
The traffic is hitting numbers that require me to renegotiate my contract with CloudFlare, and it's otherwise a nuisance when reviewing analytics/logs.
I've considered redirecting the entirety of the traffic to the AWS abuse report page, but at this scale it's essentially a small DDoS network, and sending it anywhere could be considered abuse in itself.
Are there others who have had a similar experience?
Hire a lawyer and have him send the bill for his services to them immediately with a note on the consequences of ignoring his notices. Bill them aggressively.
Do you receive, or expect to receive any legitimate traffic from AWS Singapore? If not, why not blackhole the whole thing?
Making the obviously-abusive bot prohibitively expensive is one way to go, if you control the terminating server.
A gzip bomb is good if the bot happens to be vulnerable, but even just slowing down their connection rate is often sufficient: at 700 requests/second, waiting just 10 seconds before responding with your 404 will tie up ~7,000 concurrent connections on their box, which should be enough to exhaust the file-descriptor limits of most Linux processes (nginx plus the third-party echo module, ngx_http_echo_module, is a really easy way to set this up).
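A minimal sketch of that nginx setup, assuming ngx_http_echo_module is compiled in and using the user-agent string from the post (everything here is illustrative, not tested against your config):

    # Inside the server block: shunt the bot's user agent to a tarpit.
    if ($http_user_agent = "Mozilla/5.0 (compatible; crawler)") {
        rewrite ^ /tarpit last;
    }

    location = /tarpit {
        internal;          # only reachable via the rewrite above
        echo_status 404;   # eventually report a 404...
        echo_sleep 10;     # ...but hold the socket open for 10 s first
        echo "not found";
    }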
If it follows redirects, have you tried redirecting it to its own domain?
If it follows redirects, redirect it to a 10 GB gzip bomb.
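For reference, a sketch of generating such a bomb in Python. It relies on zeros compressing at roughly 1000:1, and assumes you serve the file with a Content-Encoding: gzip header so the client inflates it:

    import gzip

    # Write ~10 MB on disk that inflates to ~10 GB of zeros when the
    # client honours Content-Encoding: gzip and decompresses the body.
    with open("bomb.gz", "wb") as f:
        with gzip.GzipFile(fileobj=f, mode="wb", compresslevel=9) as gz:
            chunk = b"\0" * (1 << 20)   # 1 MiB of zeros
            for _ in range(10 * 1024):  # 10 GiB total, uncompressed
                gz.write(chunk)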
Null-route the entirety of AWS IP space.
You don't even need to send a response. Just block the traffic and move on.
Just find a host with low egress costs, reverse-proxy normal traffic to Cloudflare, and reply with 2 GB files for the bot. They annoy you and cost you money; make them pay.
Block the IPs, or set up a WAF on AWS if you cannot be on Cloudflare.
I had this issue on one of my personal sites. It was a blog I used to write maybe 7-8 years ago. All of a sudden, I see insane traffic spikes in analytics. I thought some article went viral, but realized it was too robotic to be true.
And so I narrowed it down to some developer trying to test their bot/crawler on my site. I tried asking nicely, several times, over several months.
I was so pissed off that I set up a redirect rule to send them over to random porn sites. That actually stopped it.
Take a look at Pingoo (https://pingoo.io). It's a reverse proxy / load balancer with a built-in firewall and automatic HTTPS. You will be able to easily block the annoying bots with rules (https://pingoo.io/docs/rules)
Have ChatGPT write you a sternly worded cease and desist letter and send it to Amazon legal via registered mail.
AWS has become rather large and bloated and does stupid things sometimes, but they do still respond when you get their lawyers involved.
What kind of content do you serve? 700 RPS is not a big number at all, for sure not enough to qualify as a DoS. I'm not surprised AWS did not take any action.
> Thankfully, CloudFlare is able to handle the traffic with a simple WAF rule and 444 response to reduce the outbound traffic.
This is from your own post, and is almost the best answer I know of.
I recommend you configure a Cloudflare WAF rule to block the bot, and then move on with your life.
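For illustration, a sketch of a Cloudflare custom rule expression along these lines, assuming the user-agent string from the post; AS16509 is one of Amazon's ASNs (double-check the field names against Cloudflare's rules-language docs):

    (http.user_agent eq "Mozilla/5.0 (compatible; crawler)" and ip.geoip.asnum eq 16509)

Deployed with the Block action, this stops the requests at Cloudflare's edge before they ever reach your origin.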
As others have suggested, you can try to fight back, depending on the capabilities of your infrastructure. All crawlers have some kind of queuing system; if you manage to make the queues fill up, the crawler won't be able to send as many requests. For example, you can allow the crawler to open the socket but send the data very slowly, quickly filling the queues with busy workers.
Depending on how the crawler is designed, this may or may not work. If they are using SQS with Lambda then that will obviously not work, but it will still backfire on them, because the serverless functions will run for longer (5-15 minutes).
Another technique that comes to mind is to force the client to upgrade the connection (e.g. to a WebSocket) and see what happens. Mostly it will fail, but even if it stalls for 30 seconds, that is a win.
> I've tried 30X redirects (which it follows) to no avail
Make it follow redirects to some kind of illegal website. Be creative, I guess.
The reasoning being that if you can get AWS to trigger security measures on their side, maybe AWS will shut down their whole account.
If they have some service up on the machines the bot connects from, you can redirect it to themselves.
Otherwise, maybe redirect to the AWS customer portal or something -_- maybe they will stop it if it hits themselves...
If it follows the redirect I would redirect it to random binary files hosted by Amazon, then see if it continues to not require any further action
What kind of website is this that makes it so lucrative to run so many requests?
I am dealing with a similar situation and kind of screwed up: I managed to get Google Ads suspended by blocking Singapore. I see a mix of traffic from AWS, Tencent and Huawei Cloud at the moment.
Currently I'm just scanning server logs and blocking IP ranges.
Silly suggestion: feed them bogus DNS info. See if you can figure out where their DNS requests are coming from.
Use a simple block rule, not a WAF rule; the simple block rules are free.
An idea I had was a custom kernel that replied ACK (or SYN+ACK) to every TCP packet. All connections would appear to stay open forever, eating all incoming traffic, and never replying, all while using zero resources of the device. Bots might wait minutes (or even forever) per connection.
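For what it's worth, the TARPIT target from xtables-addons gets close to this without a custom kernel: it completes the handshake, then pins the connection open with a zero-size TCP window so the client can never send data or close cleanly. A sketch, assuming xtables-addons is installed (the CIDR is a documentation-range placeholder, not the bot's real range):

    # Complete the handshake, then pin every connection from the bot's
    # range open indefinitely (substitute the real AWS source range).
    iptables -A INPUT -s 203.0.113.0/24 -p tcp --dport 443 -j TARPIT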
I redirect such traffic to a subdomain whose address isn't assigned (or legally assignable). The bots just wait for a response to their connection requests but never get one, which seems to typically cost them ~10 s of waiting per attempt. The traffic never reaches my servers, and it doesn't risk legitimate users who might hit it by mistake.
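A sketch of such a DNS entry, using a hypothetical subdomain name and an address from the RFC 5737 TEST-NET-1 documentation range (reserved, so it should never be routed):

    ; Zone file entry: "sink" resolves, but 192.0.2.1 sits in a
    ; documentation-only range and will never answer.
    sink    3600    IN    A    192.0.2.1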
> I've tried 30X redirects (which it follows)
301 response to a selection of very large files hosted by companies you don't like.
When their AWS instances start downloading 70,000 Windows ISOs in parallel, they might notice.
Hard to do with Cloudflare, but you can also tarpit them: accept the request and send the response one character at a time (make sure you uncork and flush buffers, etc.), with a 30-second delay between characters.
700 requests/second with, say, 10 KB of headers/response. Sure is a shame your server is so slow.
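A minimal sketch of such a tarpit in Python, under the assumptions that the bot's requests get routed to a standalone listener on port 8080 and that one thread per connection is acceptable:

    import socket
    import threading
    import time

    # Per connection: advertise a large body, then drip it one byte
    # every 30 seconds, keeping one of the bot's workers busy.
    def drip(conn):
        try:
            # Disable Nagle so each single byte is flushed immediately.
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 100000\r\n\r\n")
            while True:
                conn.sendall(b".")  # one character at a time
                time.sleep(30)
        except OSError:
            pass                    # the bot gave up
        finally:
            conn.close()

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 8080))
    srv.listen(4096)
    while True:
        conn, _ = srv.accept()
        threading.Thread(target=drip, args=(conn,), daemon=True).start()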
Main author of Anubis here. Have CloudFlare return an HTTP 200 response instead of a non-200 rejection. These bots keep hammering until they get a 200 response, so giving them one makes them stop.
Dumb question, but just because I didn't see it mentioned: have you tried using a Disallow: / in your robots.txt? Or Crawl-delay: 10? That would be the first thing I would try (see the robots.txt sketch below).
Sometimes these crawlers are just poorly written, not malicious. Sometimes it's both.
I would try a zip bomb next. I know there’s one that is 10 MB over the network and unzips to ~200TB.
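For reference, the robots.txt mentioned above would look something like this; in practice you'd pick one directive or the other, since Disallow: / already forbids everything for compliant crawlers, and Crawl-delay is non-standard and only honoured by some bots:

    User-agent: *
    Disallow: /
    Crawl-delay: 10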
Blocking before the traffic reaches the application servers (what you're doing) is the most effective and the most cost- and time-efficient option.
It sounds like the bot operator is spending enough on AWS to withstand the current level of abuse reports.
If you really wanted to retaliate, you could try getting a warrant to force AWS to disclose the owners of that AWS instance.
Tell Cloudflare it's abusive, and they will block it outside your account so it doesn't count against you.
tirreno(1) guy here.
I'd suggest taking a look at the patterns and IP rotation (if any), and perhaps blocking the IP CIDR at the web server level if the range is small.
Why wouldn't a simple "deny from 12.123.0.0/16" (Apache) work for you?
1. https://github.com/tirrenotechnologies/tirreno
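Note that "deny from" is Apache 2.2 syntax; a sketch of the Apache 2.4 equivalent using mod_authz_core, with the example CIDR above standing in for the bot's real range:

    <RequireAll>
        # Allow everyone except the bot's address range.
        Require all granted
        Require not ip 12.123.0.0/16
    </RequireAll>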