Disable cookies for guests

Started by GigaWatt, May 28, 2018, 06:36:33 PM

GigaWatt

Is there any way to disable cookies for guests, crawlers in particular? I wanted archive.org to crawl my site, but the session ID in the URL (PHPSESSID) is preventing their crawler from crawling it. If I load the URL manually (without PHPSESSID=32_HEX_DIGITS in the URL), the crawler crawls it just fine, but I can't do this for every single topic and/or post :D.

Is there a way to disable this via .htaccess?
"This is really a generic concept about human thinking - when faced with large tasks we're naturally inclined to try to break them down into a bunch of smaller tasks that together make up the whole."

"A 500 error loosely translates to the webserver saying, "WTF?"..."

Kindred

Not that I am aware of.   The system needs a cookie.
Слaва
Украинi

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."

GigaWatt

So, what you're saying is that if the system, for some reason, can't grant me a cookie, I can't visit the site/forum at all? Not even as a guest?

Aleksi "Lex" Kilpinen

Archive.org has crawled my forum many times on its own, and I haven't run into issues like that. Are you manually submitting the URL somehow?
Slava
Ukraini!
"Before you allow people access to your forum, especially in an administrative position, you must be aware that that person can seriously damage your forum. Therefore, you should only allow people that you trust, implicitly, to have such access." -Douglas

shawnb61

You can definitely disable PHP session cookies via .htaccess.

The real question is what would happen as a result.  I'm speculating aloud here... 

Cookies enable you to avoid logging in every time you query a site.  Sessions enable you to avoid a bunch of I/O re-establishing your credentials & user state/preferences.  No cookies = no sessions.  I.e., with no concept of logging in, you're always a guest. 

For all crawlers/guests, I expect every single hit would look like a brand-new guest. 

For forum members?  No such thing, as everybody is a guest with no logging in allowed. 

Even if you didn't mind everyone being a guest, I would worry about a fairly dramatic increase in I/O.

You can experiment with it yourself - turn off cookies in your browser!  Try it!   
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

shawnb61

Hmmm...  I just tried it...   And it seemed to work OK...   My speculation was incorrect...

Since it couldn't store the session-id in a cookie, it kept it in the URL.  Once I lost the session ID in the URL (by deleting it) I had to login again.  But as long as the session ID was in the URL, I seemed to be OK. 

I'd still be concerned about I/O.  But it seems quite functional... 

GigaWatt

Quote from: Aleksi "Lex" Kilpinen on May 28, 2018, 11:46:30 PM
Archive.org has crawled my forum many times on its own, and I haven't run into issues like that. Are you manually submitting the URL somehow?

Yeah, I manually submitted the main (index) page and then clicked on a board; archive.org crawled the board with no problems. Then I clicked on a topic, and just when it started to crawl the topic, it reported that the topic is not in their database (naturally). The funny thing was that the URL of the topic had the PHPSESSID string right after index.php (...index.php?PHPSESSID=...), which doesn't appear if you're viewing the forum as a guest or if you're logged in.

I should add that all of this was done inside the Wayback Machine interface. If you submit a page, it crawls it and then loads the page from its server. Click on a topic (in the Wayback Machine interface) and, if it hasn't been crawled, it crawls it, saves it and, again, loads it from its server... that is, it would do that if it didn't automatically add the session ID to the URL and then report that the URL doesn't exist.

Quote from: shawnb61 on May 29, 2018, 12:18:34 AM
Since it couldn't store the session-id in a cookie, it kept it in the URL.

This was also my presumption; that's why I thought about disabling cookies for guests... didn't know that you couldn't disable cookies at all, even if you're a guest.

Quote from: shawnb61 on May 29, 2018, 12:18:34 AM
Once I lost the session ID in the URL (by deleting it) I had to login again. But as long as the session ID was in the URL, I seemed to be OK.

Yeah, that's true if you're on a real browser. But what if you're a bot? How come Google bots have no problem crawling the forum and report the links without a session ID in the URL (it should be in the URL since they can't store cookies), but the crawler from archive.org doesn't automatically delete the session ID from the URL before or after it crawls the URL? (I'm guessing that's what Google bots do.) Not only that, the crawler from archive.org reports that there is no such URL if the session ID field is present in the URL. Delete the session ID, hit Enter, presto, problem gone: it crawls the topic without a problem.
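
For illustration only, here is a minimal Python sketch of the manual cleanup described above: dropping the PHPSESSID parameter from a topic URL before submitting it to a crawler. The function name `strip_session_id` and the `forum.example` host are hypothetical, and this is not how archive.org or Google actually handle session IDs. The sketch assumes the query string may use either `&` or `;` as a separator, and it leaves every other parameter untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches a PHPSESSID parameter at the start of the query string
# or right after an "&" or ";" separator (assumed separators).
_SESSION_RE = re.compile(r"(^|[&;])PHPSESSID=[^&;]*")


def strip_session_id(url: str) -> str:
    """Return the URL with any PHPSESSID query parameter removed."""
    parts = urlsplit(url)
    query = _SESSION_RE.sub("", parts.query)
    # If the parameter came first, a leading separator is left behind.
    query = query.lstrip("&;")
    return urlunsplit(parts._replace(query=query))


if __name__ == "__main__":
    # Hypothetical URL in the shape described in the post above.
    dirty = ("https://forum.example/index.php"
             "?PHPSESSID=0123456789abcdef0123456789abcdef&topic=42.0")
    print(strip_session_id(dirty))
```

Running a list of topic URLs through a helper like this would at least avoid editing each one by hand, although it does not stop the forum from adding the session ID to the links inside the crawled pages.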
