PHPSESSID in URL when fetched as crawler

Started by Pyrhel, June 02, 2025, 09:18:57 AM

Pyrhel

Hello everyone,

I noticed that the count of pages blocked by robots.txt keeps growing (currently 200k). For years I've used the following robots.txt (https://fordbg.com/robots.txt) to prevent multiple indexing of the same content, and everything was OK:
User-agent: *
Disallow: /*wap*
Disallow: /*wap2*
Disallow: /*imode*
Disallow: /*action=dlattach*
Disallow: /*action=profile*
Disallow: /*action=calendar*
Disallow: /*action=printpage*
Disallow: /forum/sitemap/
Disallow: /forum/Themes/
Disallow: /tmp/
Disallow: /cron.php
Disallow: /*?PHPSESSID=* <=== IMPORTANT LINE

When I checked one of the blocked URLs in Google Search Console (https://fordbg.com/forum/index.php?msg=36314), it seemed strange that the crawler marked it as blocked by the robots file, so I ran
curl -i -A "Googlebot" https://fordbg.com/forum/index.php?msg=36314 and got the following result:
HTTP/1.1 302 Found <=== IMPORTANT LINE
Date: Mon, 02 Jun 2025 13:04:29 GMT
Server: Apache
X-Powered-By: PHP/8.3.21
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
cache-control: private
x-frame-options: DENY
x-xss-protection: 1
x-content-type-options: nosniff
Vary: Accept-Encoding
Set-Cookie: PHPSESSID=pohbhcd0u9sls464kdj4qr49d8; path=/
Strict-Transport-Security: max-age=15552000; includeSubDomains
Upgrade: h2,h2c
Connection: Upgrade
location: https://fordbg.com/forum/index.php?PHPSESSID=pohbhcd0u9sls464kdj4qr49d8;topic=3088.msg36314#msg36314 <=== IMPORTANT LINE
Content-Length: 0
Content-Type: text/html; charset=UTF-8

Here's what's interesting: I'm making a request with a Googlebot user-agent and I'm getting a 302 redirect to a PHPSESSID URL.

I started poking around and found that even though I put "session.use_cookies = 1", "session.use_only_cookies = 1" and "session.use_trans_sid = 0" in the php.ini file, the forum's PHP Info page says that "use_only_cookies" is off, while my standalone phpinfo() test page says it's on.
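
A quick way to see both values side by side is ini_get_all(): run standalone it mirrors php.ini, but run inside an SMF page load (for example, after including SMF's SSI.php) the "local" value shows the runtime override. Illustrative snippet, not SMF code:

<?php
// Hypothetical diagnostic: compare the php.ini (global) value with the
// effective runtime (local) value of the directive in question.
require_once __DIR__ . '/SSI.php'; // bootstrap SMF so its ini_set() calls run
$directive = ini_get_all('session')['session.use_only_cookies'];
echo 'global (php.ini): ', $directive['global_value'], "\n";
echo 'local (runtime):  ', $directive['local_value'], "\n";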

So I searched in the code and found the following in the Session.php file:
// Attempt to change a few PHP settings.
@ini_set('session.use_cookies', true);
@ini_set('session.use_only_cookies', false); <=== IMPORTANT LINE
@ini_set('url_rewriter.tags', '');
@ini_set('session.use_trans_sid', false);
@ini_set('arg_separator.output', '&amp;');

When I commented out the three cookie-related lines, the curl command returned a URL without PHPSESSID in it, but Google marked the new location as no-index (and I think this is normal, because the link is to a specific post - location: https://fordbg.com/forum/index.php?topic=3088.msg36314#msg36314).

And finally, the question:
Am I misremembering that if a page is fetched with a user-agent matching one of the bots in the admin panel's list, the forum serves the page as stateless? If I'm wrong, what could have happened in the last few months for PHPSESSID to be set in the URL? My robots.txt has had the PHPSESSID disallow rule for years, and I've never seen such a massive spike in blocked pages. I'm on shared hosting which had a major update a few months ago, but I don't see what could be wrong with the PHP.

Thank you and sorry for the long post!

shawnb61

Quote from: Pyrhel on June 02, 2025, 09:18:57 AMI noticed that the count of pages blocked by robots.txt keeps growing (currently 200k).
Where are you getting that measurement?  Google Search Console?  I.e., is this post about Google, or is it a generic question about search engines in general? 

Important distinction, because Google has some known behaviors.  Crawlers at large do not...

Quote from: Pyrhel on June 02, 2025, 09:18:57 AMI'm making a request with a Googlebot user-agent and I'm getting a 302 redirect to a PHPSESSID URL.
I don't think it's tied to the Google user agent - it's probably tied to your browser settings. 

This normally happens when cookies are disallowed. SMF will still try to store a session, and with cookies unavailable, instead of passing the session info via cookie, it passes it via the URL. I.e., try loosening your security settings a little & see if that goes away.
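
A standalone sketch of that fallback (illustrating the mechanism only - this is not SMF's actual code): when the client returns no session cookie, the only place left to carry the session ID across a redirect is the URL, which is exactly the 302 captured above:

<?php
// Standalone illustration, not SMF source. With use_only_cookies off,
// PHP will also accept a session ID arriving in the URL.
ini_set('session.use_only_cookies', '0');
session_start();

if (!isset($_COOKIE[session_name()])) {
    // The client sent no session cookie, so carry the ID in the
    // redirect target instead - producing a PHPSESSID URL.
    header('Location: /forum/index.php?' . session_name() . '=' . session_id()
        . ';topic=3088.msg36314#msg36314', true, 302);
    exit;
}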

Quote from: Pyrhel on June 02, 2025, 09:18:57 AMI started poking around and found that even though I put "session.use_cookies = 1", "session.use_only_cookies = 1" and "session.use_trans_sid = 0" in the php.ini file, the forum's PHP Info page says that "use_only_cookies" is off, while my standalone phpinfo() test page says it's on.
That's because SMF overrides those settings at runtime - the ini_set() calls you found in Session.php run on every page load. Your php.ini change does take effect, but it's then overwritten by a - very - temporary tweak that applies to that page load only.

Quote from: Pyrhel on June 02, 2025, 09:18:57 AMwhat could have happened in the last few months for PHPSESSID to be set in the URL?
The real question is for whom.  SMF will do that if cookies are disallowed.  It's trying very hard to support users who wish to browse the site & remain anonymous by disallowing cookies. 

Further reading that might help:
https://www.simplemachines.org/community/index.php?msg=4179600
https://github.com/SimpleMachines/SMF/pull/8394

Pyrhel

The measurements are from Google Search Console, but the question is generic, because I block all crawlers with the rules in my robots.txt.

For my requests I use curl with cookies disabled; the -A parameter sets the user-agent to simulate crawling by Google.

Quote from: shawnb61 on June 02, 2025, 10:12:29 PMThe real question is for whom.  SMF will do that if cookies are disallowed.  It's trying very hard to support users who wish to browse the site & remain anonymous by disallowing cookies. 
For Google/other bots. Last night I went through the browser-detection code, is_robot in user_settings, etc., and tested how it works with my curl requests - it behaves as expected: "curl -A Googlebot" is detected as a robot, and so on. I'm thinking about writing a simple check before the start_session() call: if the detected client is a robot, the session part is skipped. I don't think bots need sessions - or am I wrong? After that I'll restore @ini_set('session.use_only_cookies', true); to its original state.
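
A rough sketch of that check (the user-agent pattern below is a placeholder - the real spider list lives in the admin panel):

<?php
// Hypothetical guard before SMF's session_start() call; the pattern is
// illustrative only, not SMF's actual robot detection.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (preg_match('~bot|crawl|spider|slurp~i', $ua)) {
    // Looks like a crawler: skip the session entirely, so no PHPSESSID
    // cookie is set and nothing is appended to redirect URLs.
    return;
}
session_start();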

The strange thing is that last year Google Search Console didn't report so many blocked URLs with the same robots.txt... I'm trying to figure out what has changed in the last few months, and the only thing that comes to mind is
Quote from: Pyrhel on June 02, 2025, 09:18:57 AMI'm on shared hosting which had a major update a few months ago
And that brought a new PHP version, but I still can't find ANY relation whatsoever...

Kindred

But shouldn't the canonical URL definition in the system mean that the PHPSESSID doesn't matter, because the canonical URL is the one that is tracked?

Pyrhel

For SEO purposes you're absolutely right - Google and other search engines will index only "clean" (canonical) URLs.
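
For reference, SMF themes declare the canonical URL in the page head, roughly like this (a sketch assuming the SMF 2.x $context['canonical_url'] convention - check your own index.template.php):

// In the theme's <head>: tells crawlers which URL is authoritative,
// whatever PHPSESSID or other query-string noise the request carried.
echo '
	<link rel="canonical" href="', $context['canonical_url'], '">';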

But there's no reason to serve non-canonical URLs (at least ones with parameters like session IDs) to crawlers - they "pollute" the statistics with multiple "unique" pages even though those won't be indexed separately, they generate extra traffic (a minor reason), and they waste "crawl budget": if the crawler checks 1,000 pages daily on your forum (just an example number), those 200k pages reported as blocked will keep it busy for the next two years... ;D

I just love clean, real statistics in the console, and I don't see any downsides to disabling sessions for crawlers with a simple patch O:)

shawnb61

Quote from: Pyrhel on June 03, 2025, 03:42:52 AMThe measurements are from Google Search Console, but the question is generic, because I block all crawlers with the rules in my robots.txt.
A reminder: robots.txt doesn't stop anything.  Good bots do honor it.  The other 99% ignore it entirely.
https://www.simplemachines.org/community/index.php?msg=4179600

Quote from: Pyrhel on June 03, 2025, 03:42:52 AMFor my requests I use curl with cookies disabled; the -A parameter sets the user-agent to simulate crawling by Google.
As pointed out above, the reason you saw PHPSESSID in your URL was because you had cookies blocked.  It had nothing to do with emulating Google.

Quote from: Pyrhel on June 03, 2025, 03:42:52 AMI'm thinking about writing a simple check before the start_session() call: if the detected client is a robot, the session part is skipped. I don't think bots need sessions - or am I wrong?
Sessions are helpful and make the startup of any page load a little more efficient. Without sessions, SMF would need to reload & recheck everything about that user on every page load - even for bots. So the bots don't need them, but they save YOU a small amount of resources on each page load.

OTOH, I do a similar thing, where cookies are not detected:
https://www.simplemachines.org/community/index.php?msg=4190204 

The reason is that there is a small set of crawlers that crawl without cookies and, further, without using PHPSESSID. SMF can't find the session if it isn't passed... so writing the session is a waste. What's really bad is that when those bots crawl aggressively, it can cause real problems - SMF & MySQL can't keep up with all the session writes. So, for that particular kind of bot, I do recommend not writing the session.

It's included in this PR, which also removes all SMF PHPSESSID processing:
https://github.com/SimpleMachines/SMF/pull/8394
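
The gist of that check, as a sketch (the linked PR is the authoritative version - this only shows the idea): a client that presents neither a session cookie nor a PHPSESSID can never resume a session, so writing one is pure waste.

<?php
// Sketch only - see the PR above for the real implementation. Skip
// session creation for clients that cannot possibly resume one.
$hasCookie = isset($_COOKIE[session_name()]);
$hasUrlId  = isset($_GET['PHPSESSID']) || isset($_POST['PHPSESSID']);
if (!$hasCookie && !$hasUrlId) {
    return; // stateless request: no session row is written to the database
}
session_start();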

Quote from: Pyrhel on June 03, 2025, 03:42:52 AMThe strange thing is that last year Google Search Console didn't report so many blocked URLs with the same robots.txt...
There is nothing wrong with seeing a lot of URLs blocked due to robots.txt in Google Search Console.  It's actually a good thing.

Google's search behavior has been changing dramatically for the worse in my experience.  It's like it forgot how to deal with forums...  They are checking many links they never checked before - and shouldn't be checking. 

One example is message-level links... My Search Console stat for non-canonical URLs shot up dramatically a while back, at about the same time Google hits to the site rose by A LOT - literally a minimum of 3x, some days MUCH higher. Google had started loading a page and then, upon seeing message links all over it, crawling every message link. But message links come right back to the same page, just with a non-canonical URL. So I blocked message-level links in my robots.txt. Now I get fewer non-canonical URL reports and more 'blocked by robots.txt' reports - BUT - dramatically fewer unnecessary hits on my forum by Google. Actual page loads by Google returned to normal. Why did their behavior change? I dunno...
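
In robots.txt terms that kind of block is a one-line rule; a hypothetical pattern (verify it against your own message links before relying on it):

# added to the existing "User-agent: *" group
Disallow: /*msg=*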

The real question isn't whether the number is too high or not.  It's whether you want Google to index those URLs or not.  If you don't want those particular URLs indexed, seeing them on the 'blocked by robots.txt' is a very good thing.  You don't want Google wasting YOUR resources following those links unnecessarily.

Quote from: Pyrhel on June 03, 2025, 03:42:52 AMAnd that brought a new PHP version, but I still can't find ANY relation whatsoever...
Correct.  PHP version is irrelevant here.

shawnb61

If you're trying to see what Google sees, I wouldn't use curl.  I'm pretty sure Google doesn't use curl.  That might be some of the confusion here.

What I normally do when emulating bots, e.g., to test .htaccess, is override the user agent in a browser & just use a browser:
https://developer.chrome.com/docs/devtools/device-mode/override-user-agent
https://www.whatismybrowser.com/guides/how-to-change-your-user-agent/firefox

Pretty sure Google is using custom software that just issues normal GETs - which is pretty much all your browser does. So I think a browser with a user-agent override is a better test than curl.
