For over a week I have been getting massive traffic from bots spread across multiple IP ranges, which are difficult to ban via CSF and at times slow the forum down significantly. For example, I get 2000+ concurrent visitors where normal traffic is about 100. Any advice on that would be appreciated.
Block in htaccess by bot type
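For example, something along these lines (just a rough sketch, assuming Apache 2.4 with mod_setenvif and mod_authz_core; the user-agent strings below are only examples, swap in the bots you actually see):
# Tag unwanted crawlers by user agent...
SetEnvIfNoCase User-Agent "MJ12bot|AhrefsBot|SemrushBot" bad_bot
# ...then refuse them while still allowing everyone else
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>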
If it helps, feel free to borrow from my robots.txt & .htaccess.
Since I change them regularly, I now keep them up on GitHub here:
https://github.com/sbulen/SMF-bot-hygiene
They may not suit your needs; consider it a starter pack.
Yes, things have gotten pretty bad with the bots... It's the wild wild west out there...
That's great, Shawn.
I ended up blocking all of China and Russia yesterday on a hardware firewall; it was just too much...
Also, note that you can find lists with Apache/other web server configs for blocking at the country level at https://www.ip2location.com/free/visitor-blocker
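Those lists are essentially long runs of deny rules. On Apache 2.4 the idea looks roughly like this (the ranges below are placeholders, not real country data, and the downloaded lists may use the older Order/Deny syntax instead):
<RequireAll>
    Require all granted
    Require not ip 203.0.113.0/24
    Require not ip 198.51.100.0/22
</RequireAll>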
I was getting smothered by activity from some US and European networks; I am assuming these are ISPs used by unscrupulous corporate entities trying to hide their identities. I.e., plagiarism engines (some call them AI bots)...
I ran a quick script to check, & confirmed none of my users were in their IP ranges.
Once I confirmed that, I basically cut off LOTS of IP ranges for FastPlanet, TrafficTransit, Fine Group & a couple others.
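In case anyone wants to run a similar check, a rough sketch of the idea (IPv4 only; the ranges and member IPs below are placeholders - export your members' IPs from your own database however suits you):
<?php
// Does any member IP fall inside one of the suspect CIDR ranges?
function ip_in_cidr($ip, $cidr)
{
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

$suspect_ranges = array('203.0.113.0/24', '198.51.100.0/22');  // placeholder ranges
$member_ips     = array('192.0.2.10', '203.0.113.77');         // placeholder member IPs

foreach ($member_ips as $ip)
    foreach ($suspect_ranges as $range)
        if (ip_in_cidr($ip, $range))
            echo "$ip is inside $range\n";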
I didn't want to cut off Russia, because frankly, I have a fair amount of actual Russian members in my forum.
China is odd... A lot of the funky Alibaba/Huawei activity actually reports out as HK, not China. I've blocked a lot of HK IPs.
OTOH, I have about 3 or 4 valid users who participate in forum discussions with actual Chinese IP addresses, not HK...
Thank you all, guys. I switched to DirectAdmin with the nginx_apache setup, so some things are not as simple as editing .htaccess anymore.
@shawnb61 methinks googlebot does not accept Crawl-delay (https://support.google.com/webmasters/thread/251817470/putting-crawl-delay-in-robots-txt-file-is-good?hl=en) and the Crawl Rate Limiter Tool in Search Console has been deprecated (https://developers.google.com/search/blog/2023/11/sc-crawl-limiter-byebye?utm_source=wmx&utm_medium=deprecation-pane&utm_content=settings).
Quote from: spiros on February 13, 2025, 01:31:44 PM
methinks googlebot does not accept Crawl-delay and the Crawl Rate Limiter Tool in Search Console has been deprecated.
Yep: https://www.simplemachines.org/community/index.php?topic=590038.0
I do see that they honor the disallows though. I've been checking what they're linking to, and altering robots.txt accordingly.
Disallowing msg-level links helps a lot, I believe. It's a waste anyway; they end up loading the same page over & over.
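Something along these lines, for instance (just a sketch; the patterns assume default SMF URLs, and wildcard matching is only honored by the bigger crawlers):
User-agent: *
# message permalinks like index.php?msg=12345
Disallow: /index.php?msg=
# topic links that point at a specific message, like index.php?topic=1.msg12345
Disallow: /*.msg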
Admin > Server Settings > Disable hostname lookups
Check that. If bot traffic is heavy enough, hostname lookups can crash your forum.
Shawn,
question on your htaccess --
You use
BrowserMatchNoCase 01h4x.com bad_bot
My htaccess is using
SetEnvIfNoCase User-Agent "01h4x.com" bad_bot
Is there an appreciable difference in the approach?
My MySQL usage has tripled in the last few days (on a 2.0.19 forum), but there's no influx of actual users or posts - and the "bots visiting" count in Who's Online seems constant at 125+ at any point in any day.
Yes, they're the same. You're good.
It'd help to have a sense of the activity they're doing... If you see a lot of message-level GETs in the web access logs, it really helped me to disallow those in robots.txt.
If the query report from your host shows a lot of session writes that cumulatively add up to a significant number, you can make a 2.0 version of this in the session write function, near the top:
Quote
// Don't bother writing the session if cookies are disabled; no way to retrieve it later
if (empty($_COOKIE))
    return true;
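For 2.0, that would land roughly like this (assuming the stock session handler, sess_write() in Sources/Load.php; verify against your own files before editing):
function sess_write($session_id, $data)
{
    global $smcFunc;

    // Don't bother writing the session if cookies are disabled; no way to retrieve it later
    if (empty($_COOKIE))
        return true;

    // ... the existing UPDATE/INSERT into {db_prefix}sessions stays below ...
}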
Finally, I added a lot of IP bans for groups like FastPlanet, TransitTraffic, Fine Group, etc. I couldn't find any users in those IP ranges, and they were hitting me very hard.
For further clues that might help: https://github.com/sbulen/SMF-bot-hygiene
Despite having
Disallow: /forum/index.php?action=printpage
in https://www.translatum.gr/robots.txt, I get many bots visiting print pages; is there a way to eliminate that? Perhaps some sort of JavaScript to create the print link?
Also, is this syntax acceptable? I.e., multiple user agents with a single Disallow at the end:
User-agent: Zeus
User-agent: ZumBot
Disallow: /