Bots, crawlers, spiders vs. PHPSESSID

Started by MegaBrutal, November 06, 2024, 09:00:00 PM


MegaBrutal

From time to time, bots bombard my site, downloading the same pages over and over with different PHPSESSID values in the query string. It's quite noticeable because "Who's online" fills up with hundreds of guest visitors from various IPs. Recently it's been Amazonbot. Adding "Disallow: /*PHPSESSID=*" to robots.txt doesn't help.

Searching the net and this forum, I found the following advice from various sources. None of it helps, and some of it is even contradictory:

  • Enable session.use_only_cookies in php.ini, so PHPSESSID won't get appended to URLs for cookieless requests. It turns out this is enabled by default, and it is enabled in my PHP environment.
  • There is no way around appending PHPSESSID to URLs; it happens either way. Yet this very forum doesn't seem to append PHPSESSIDs, as a simple check with "curl https://www.simplemachines.org/community/ | grep PHPSESSID" reveals. So there must be a way somehow.
  • Block the bots by IP range or User-Agent with Rewrite rules. I could certainly do this, and have done so for several other bots, but for Amazon I'm quite hesitant. Some bots might be useful for indexing the site and perhaps bringing in more legitimate visitors.

Could we have a thread here that finally provides an optimal solution? :) I'd rather guide the supposedly benevolent bots to behave correctly, instead of blocking them altogether.

Arantor

SMF adds PHPSESSID (even here) if the request doesn't come with cookies, and the presumption is made that the lack of cookies requires passing the session via URL instead.

You could remove that code, but honestly I don't think it would make any difference for you. You seem to have the same situation the rest of us have seen: you're bombarded by bots that keep starting new sessions, both because they don't carry cookies and because they discard some of the PHPSESSID values they were given.

The main solution really is to block them, as *legitimate* bots don't actually behave this way. What bots are you seeing? (Check your server logs.)
Holder of controversial views, all of which my own.


MegaBrutal

It's Amazonbot, coming from hundreds of different IPs.

34.195.60.66 - - [06/Nov/2024:04:30:03 +0100] "GET /index.php?cat=1;PHPSESSID%3Dkdgur4a2kppeloh5komneuf9ii HTTP/1.1" 200 7576 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)"
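
To see how much of the traffic is the same page fetched under different session IDs, the access log can be grouped by URL with the PHPSESSID parameter removed. The following is only a rough sketch, assuming the combined log format shown above; the names `strip_session` and `summarize` are made up for illustration and are not part of SMF or any server tool.

```python
import re
from collections import Counter
from urllib.parse import unquote

# Combined log format, as in the sample line above (assumption: no custom LogFormat).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def strip_session(target):
    """Return the request target without any PHPSESSID parameter.

    The target is percent-decoded first, because the log above shows the
    session parameter as "PHPSESSID%3D..." rather than "PHPSESSID=...".
    Remaining parameters are re-joined with ";" purely for grouping.
    """
    decoded = unquote(target)
    path, sep, query = decoded.partition('?')
    if not sep:
        return path
    parts = [p for p in re.split(r'[;&]', query)
             if p and not p.startswith('PHPSESSID=')]
    return path + ('?' + ';'.join(parts) if parts else '')


def summarize(lines):
    """Map each session-stripped page to (request count, distinct IP count)."""
    hits = Counter()
    ips = {}
    for line in lines:
        m = LOG_RE.match(line)
        if not m or 'PHPSESSID' not in unquote(m['target']):
            continue  # skip unparsable lines and requests without a session ID
        page = strip_session(m['target'])
        hits[page] += 1
        ips.setdefault(page, set()).add(m['ip'])
    return {page: (count, len(ips[page])) for page, count in hits.items()}
```

Feeding it the lines of an access log (for example `summarize(open(path))`, with `path` pointing at your own log file) gives a per-page count; a page with many requests from many IPs, all differing only in PHPSESSID, matches the pattern described in this thread.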

The IPs are forward-confirmed, so it's not just someone pretending to be Amazon, it's really them.

$ host 34.195.60.66
66.60.195.34.in-addr.arpa domain name pointer 34-195-60-66.crawl.amazonbot.amazon.
$ host 34-195-60-66.crawl.amazonbot.amazon.
34-195-60-66.crawl.amazonbot.amazon has address 34.195.60.66
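
The two `host` lookups above amount to a forward-confirmed reverse DNS check, which could be automated roughly as follows. This is a minimal sketch: the function name `forward_confirmed` is hypothetical, the `.crawl.amazonbot.amazon` suffix is simply taken from the output above (an assumption about what Amazon's crawler hosts look like), and `socket.gethostbyname_ex` only covers IPv4.

```python
import socket


def forward_confirmed(ip,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex,
                      suffix='.crawl.amazonbot.amazon'):
    """Return True if ip reverse-resolves to a host under suffix
    and that host resolves back to the same ip.

    reverse and forward are injectable so the check can be exercised
    without real DNS lookups.
    """
    try:
        hostname = reverse(ip)[0]          # PTR lookup: ip -> hostname
        addresses = forward(hostname)[2]   # A lookup: hostname -> list of IPs
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses
        return False
    return hostname.rstrip('.').endswith(suffix) and ip in addresses
```

Calling `forward_confirmed('34.195.60.66')` performs the same two lookups as the `host` commands; a spoofed User-Agent from an unrelated IP would fail either the suffix test or the forward lookup.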

To be honest, I'm also puzzled why it makes so many requests, like it was malicious, so I wonder whether it's a bug on their side. They're supposed to know how to deal with PHPSESSID, I guess.

Previously I had similar problems with other bots, notably Bytedance, but I found that lots of admins consider them malicious, so I didn't have any concern about blocking them.

I also wonder whether hxxp:asperger.hu/ is misconfigured somehow, but I don't think so. I'm considering upgrading to SMF 2.1 in the coming months, and I'm curious whether it changes anything regarding PHPSESSID links.

Arantor

2.1 doesn't change the behaviour with PHPSESSID; it still forcibly includes it if the request doesn't have cookies. And it's not misconfigured; this is how SMF has always behaved.

(The only difference is that this site doesn't call it PHPSESSID; IIRC it uses just P.)

I'd personally just block Amazonbot by user agent: you're not getting search traffic from them, so there's no point in letting them freeload off you.

