Still not search engine friendly - SESSID URL

Started by pollyx, March 03, 2005, 02:42:11 AM

pollyx

After moving my forum from phpbb to SMF (better coded, more cleanly designed in my opinion) I found that search engines cannot spider my subpages any more, even though I can see that Googlebot, MSN and many of the other bots visit my site daily.

Strange thing is, sometimes my subpages are correctly shown to guests without a Session ID, and at other times (or at the same time from other PCs!) the URL contains a Session ID, which is bad for spiders:

http://ip-forum.net/forum/index.php/board,4.0.html
http://ip-forum.net/forum/index.php?PHPSESSID=b47b560d719e50c94306faa08c1ec3dc&board=4.0
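
To tell the two forms apart when checking what a crawler or an emulator was served, it is enough to look for the PHPSESSID query parameter. What follows is a minimal Python sketch, not part of SMF; the helper names has_session_id and strip_session_id are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameter name taken from the URL quoted above (PHP's default session name).
SESSION_PARAM = "PHPSESSID"


def has_session_id(url: str) -> bool:
    """Return True if the URL's query string carries a PHPSESSID parameter."""
    query = urlsplit(url).query
    return any(name == SESSION_PARAM
               for name, _ in parse_qsl(query, keep_blank_values=True))


def strip_session_id(url: str) -> str:
    """Return the URL without PHPSESSID; other parameters keep their order."""
    parts = urlsplit(url)
    kept = [(name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name != SESSION_PARAM]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Stripping the parameter only tidies a URL for comparison; it does not turn the dynamic form into the "friendly" board,4.0.html form.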

This can't be the only problem, however, since this way at least some pages should have been spidered.  It's not a theme problem either, because it occurs with the YABB classic theme as well, and the SMF 1.0.2 installation is unmodified. I even did a clean new 1.0.2 install and installed the themes again: same effect.

I am willing to program around this issue, or even write a rewrite rule, but I have no idea where to start, since the problem is so intermittent.

Any ideas?

PollyX
IP Security Forum


Trekkie101


Ben_S

SMF is spiderable, I imagine it hasn't spidered your board as you currently have no page rank, you need to get some inbound links.
Liverpool FC Forum with 14 million+ posts.

pollyx

Quote from: Trekkie101 - March 03, 2005, 11:34:42 AM
Try doing it with the www. at the beginning.

With the "www." prefix the long "..?PHPSESSID..." URL is still there.

Ben_S

The session id isn't shown to search engines and is only there for the first few clicks anyway whilst SMF decides if cookies are usable.
Liverpool FC Forum with 14 million+ posts.

pollyx

Quote from: Ben_S - March 03, 2005, 11:37:47 AM
SMF is spiderable, I imagine it hasn't spidered your board as you currently have no page rank, you need to get some inbound links.

No. You won't see a page rank yet, because the forum moved to a new domain yesterday. Only hours after setting up the redirects I saw Googlebot coming in, but even Google needs at least another day or so until the pages show up in the search index.

Until yesterday, the forum was on a heavily spidered site for 6 weeks, but the spiders never got past the index page. Until January I had the previous phpbb forum at the same URL, and it took only 3 days to spider my site completely. Every new page needed a maximum of three days to show up in the index, so there is definitely something severely wrong.

The question is not whether SMF is spiderable or not (It obviously is), but whether it is search engine friendly. A PR7 site will probably be spidered even if the URL contains question marks and Session URLs, but for all lower ranked sites indexing may take forever (if it occurs at all).

Also, if there is an option to show URLs without question marks, it should work. I rechecked my site today with a search engine emulator software, and it also showed these ugly URLs.

PollyX
IP Security Forum

[Unknown]

Quote from: pollyx - March 03, 2005, 02:42:11 AM
After moving my forum from phpbb to SMF (better coded, more cleanly designed in my opinion) I found that search engines cannot spider my subpages any more, even though I can see that Googlebot, MSN and many of the other bots visit my site daily.

Search engines will DROP YOUR FORUM OFF THE MAP, for a period.  Don't expect to be listed for about a month.  I'm sorry, but this is how it is - search engines don't like you changing 99% of your links, and so they kick you to the curb for one full rotation.

Quote
Strange thing is, sometimes my subpages are correctly shown to guests without a Session ID, and at other times (or at the same time from other PCs!) the URL contains a Session ID, which is bad for spiders:

http://ip-forum.net/forum/index.php/board,4.0.html
http://ip-forum.net/forum/index.php?PHPSESSID=b47b560d719e50c94306faa08c1ec3dc&board=4.0

As Ben_S said, the session ID is not shown to bots.  Go search Google for this forum, you won't see a single link with a session id.

http://www.google.com/search?q=allinurl%3Aip-forum.net
http://www.google.com/search?q=site%3Asimplemachines.org

Quote from: pollyx - March 03, 2005, 12:43:15 PM
No. You won't see a page rank yet, because the forum moved to a new domain yesterday. Only hours after setting up the redirects I saw Googlebot coming in, but even Google needs at least another day or so until the pages show up in the search index.

Googlebot, at least, doesn't work as you assert.  It's not immediate, and just because you say Googlebots aspiderin' doesn't mean you should be listed quite yet :P.  Especially if you changed your domain too!

Also remember that Google only lists the top of the barrel.  Sites not "good enough" won't even be put on its index.

Quote
it took only 3 days to spider my site completely.

You were just lucky enough to get the right rotation, I suspect.

Quote
The question is not whether SMF is spiderable or not (It obviously is), but whether it is search engine friendly.

Also, if there is an option to show URLs without question marks, it should work. I rechecked my site today with a search engine emulator software, and it also showed these ugly URLs.

If you're really concerned, it's good you have search engine "friendly" URLs on.  SMF won't use them while a session id is in the URL, but it only shows the session id to common browsers anyway.

I know for a fact that Googlebot, MSN, Inktomi, etc. do not see the session ids.  If you think there are other issues, maybe there are, but believe me here when I tell you the session id is not one of them.

-[Unknown]

pollyx

Thank you, unknown, for your explanation, but I still feel very, very uncomfortable.

I just tried the following SEO Spider emulators:

http://www.searchengineworld.com/cgi-bin/sim_spider.cgi
IBP 8 Search Engine Emulator (Windows program)
http://www.webmaster-toolkit.com/link-checker.shtml

They all showed ugly URLs like this one:

http://www.simplemachines.org/community/index.php?PHPSESSID=f50cbed30de5aa5fd71c25dc0911afa0&board=13.0

It is hard for me to believe that SMF correctly identifies all real spiders, while all spider emulators are failing.

The fact that Google results don't show SESSID tags is not proof that the spiders never see Session IDs. Since those session IDs are not always there, it may take bots several visits until they happen to get friendly URLs and index them. But this would definitely not be search engine friendly!

To find out more about what's going on, I have just added a static link in the template_main_below() function. If, over the next few days, Google indexes this page but still refuses to swallow my SMF content, then we're in trouble.

pollyx
IP Security Forum




[Unknown]

Quote from: pollyx - March 03, 2005, 02:25:01 PM
It is hard for me to believe that SMF correctly identifies all real spiders, while all spider emulators are failing.

I have little faith in these "spider emulators".  I bet they don't even hit robots.txt either.

Quote
The fact that Google results don't show SESSID tags is not proof that the spiders never see Session IDs. Since those session IDs are not always there, it may take bots several visits until they happen to get friendly URLs and index them. But this would definitely not be search engine friendly!

You see, this is where a certain piece of information becomes important: I wrote the code that shows the session ids, all of it!  This code has two major points:
  - it's only activated when SMF detects that no cookies/cookie could be set AND you're using a common browser.
  - it *ALWAYS* shows the first time you view a page, given the above conditions.

This means, if you clear all your cookies, and use Opera, Mozilla, Firefox, Safari, Internet Explorer, Mac IE, Omniweb, etc., etc. to view this forum, you will see the session ids.  Every time.  It is not unpredictable, as you seem to think, but TOTALLY and UTTERLY predictable.  They aren't "not always there" in the way you imply... that's incredible!
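
The two conditions above can be sketched as a small decision function. This is a Python illustration of the rule as described in this post, not SMF's actual code; the token lists and the function names is_common_browser and should_show_session_id are assumptions chosen for the example.

```python
# Hypothetical token lists, for illustration only; SMF's real detection may differ.
BOT_TOKENS = ("googlebot", "msnbot", "slurp")
BROWSER_TOKENS = ("opera", "firefox", "safari", "msie", "omniweb", "gecko")


def is_common_browser(user_agent: str) -> bool:
    """Treat a client as a common browser unless it identifies itself as a bot."""
    ua = user_agent.lower()
    if any(token in ua for token in BOT_TOKENS):
        return False
    return any(token in ua for token in BROWSER_TOKENS)


def should_show_session_id(has_session_cookie: bool, user_agent: str) -> bool:
    """Session ids go into URLs only when no cookie is set AND the client is a common browser."""
    return (not has_session_cookie) and is_common_browser(user_agent)
```

Under this rule the outcome depends only on the cookie state and the User-Agent string, which is why the behaviour is repeatable rather than intermittent.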

The fact that your "emulators" are seeing the URLs shows two things: one, they aren't very good; and two, they are sending a generic user agent.  Bots should not, under any circumstance except fraud detection, pretend to be a real user agent.  Thus, these emulators are not worth the effort you're expending on them.

Try it.  Since you're such an expert, I'll assume you know which User-Agent header Googlebot sends.  Use Opera or Firefox with the user agent switcher installed, and change your user agent.  Clear your cookies as you wish, and see if you see any session ids.
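
The same check can be scripted instead of done in a browser. A minimal Python sketch follows, assuming the two User-Agent strings below are representative (they are illustrative values, so check the crawler's own documentation for current ones); fetch_as and count_session_links are made-up helper names.

```python
import urllib.request

# Assumed User-Agent strings, for illustration only.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0"


def fetch_as(url: str, user_agent: str, timeout: float = 10.0) -> str:
    """Fetch a page with the given User-Agent; the default opener sends no cookies."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def count_session_links(html: str) -> int:
    """Count how many times a PHPSESSID parameter appears in the markup."""
    return html.count("PHPSESSID=")


# Example use (performs network requests, so it is left commented out):
# for label, ua in (("bot", GOOGLEBOT_UA), ("browser", BROWSER_UA)):
#     page = fetch_as("http://ip-forum.net/forum/index.php", ua)
#     print(label, count_session_links(page))
```

If the description above holds, the count should be zero for the bot User-Agent and non-zero for a cookieless browser User-Agent.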

-[Unknown]

pollyx

O.K., that explanation clears things up!

You're right, I tried two different Google user agent strings, and - voila - no PHPSESSID tags any more. So this apparently isn't the reason for my site not being indexed.

I think I'll sit back and wait for the outcome of my spider test page!  :)

Thanks for your patience!

pollyx
IP Security Forum

[Unknown]

Now that you've accepted that, I'm not trying to say there may not be a problem; even, possibly, one SMF is causing/aggravating.

All I'm saying is it's not the session ids.  Please tell us if any new information becomes available... as of now, there's little to go on, and I'm hopeful Google will put you up next cycle.

-[Unknown]

pollyx

Just giving some final feedback:

Last night Google started indexing the first 20 pages of my SMF forum, including my pure HTML test page with a short URL.

The fact that the short-URL test page took exactly as long to be indexed as the other SMF pages also means that there is no need to apply further URL rewriting for the sake of SEO. As far as Google is concerned, I think it is search engine friendly!

pollyx
IP-Security-Forum

pollyx

Just found a minor, but annoying issue:

When showing the latest topics on another server via

include 'http://ip-forum.net/forum/SSI.php?ssi_function=recentPosts';

there are the "?"s again in all URLs, even if they have been shut off in the forum itself. (here is an example)

This is bad, because Google doesn't like to see identical content under different URLs. Is it possible to change this?

pollyx
IP Security Forum

[Unknown]

After including SSI.php, put this:

ob_start('ob_sessrewrite');

Meaning:

<?php require_once('/path/to/SSI.php'); ob_start('ob_sessrewrite'); ?>

-[Unknown]

pollyx

O.K., this gets rid of the "?"  :)  but adds an extra ".html"  :(

In my forum, any posting shows up with at least three different URLs:

/index.php/topic,66.0.html       (Board View)
/index.php/topic,66.msg150.html#msg150     (ssi_recentTopics)
/index.php/topic,66.msg150/topicseen.html#msg150    (external include with ob_start)

This is not good for SEO. Is it possible to modify the internal and external include functions, so that they consistently show up only with the short (Board View) URL?
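
One way to picture the request is a link rewriter applied to the included output. The Python sketch below collapses the second and third forms into the short one; it is an illustration only, not something SSI.php does, and the name to_board_view_url is hypothetical. It deliberately drops the msgNNN target and the topicseen action, so the rewritten link opens the first page of the topic rather than the specific post.

```python
import re

# Matches .../index.php/topic,<id>.msg<id>[/topicseen].html[#anchor]
MSG_TOPIC_URL = re.compile(
    r"(?P<prefix>.*/index\.php/topic,\d+)"  # everything up to the topic id
    r"\.msg\d+"                             # link to a specific message
    r"(?:/topicseen)?"                      # optional "mark as seen" action
    r"\.html(?:#\w+)?$"                     # suffix and optional anchor
)


def to_board_view_url(url: str) -> str:
    """Rewrite message-specific topic URLs to the short board-view form."""
    match = MSG_TOPIC_URL.match(url)
    if match is None:
        return url  # already short, or not a topic URL at all
    return match.group("prefix") + ".0.html"
```

URLs with a numeric page offset (such as topic,66.0.html) are left untouched, so separate pages of a long topic stay distinct.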

This would be fantastic because the recentTopics and recentPosts functions are an excellent way to speed up promotion of new postings into search engines when called from another site, or even from the forum start page.

pollyx
IP Security Forum






[Unknown]

Quote from: pollyx - March 12, 2005, 05:47:47 AM
/index.php/topic,66.0.html       (Board View)

This is for the listing of posts in a topic.

Quote
/index.php/topic,66.msg150.html#msg150     (ssi_recentTopics)

This is a link to a specific post in a topic.

Quote
/index.php/topic,66.msg150/topicseen.html#msg150    (external include with ob_start)

This is a link to a post in a topic which should mark the board it's in as read if it's the only topic...

-[Unknown]

pollyx

I'm aware that there is reasoning behind those URLs, but does this mean there is no way to make the second and third URLs look identical to the first, so that search engines won't treat these as duplicate content?

Btw, "..mark(ing) the board it's in as read if it's the only topic." doesn't really make sense anyway when including anonymously from a foreign website, does it?

pollyx
IP Security Forum


[Unknown]

Sure it does, otherwise the board it's in might show as new even if it wasn't.

It's possible to make them look the same, I suppose, by editing SSI.php... but the whole point is that they do link to different things.

destalk

pollyx

I think [Unknown]'s point about the URLs serving different functions is correct. It's a logical progression. Your best option to avoid duplication might be to use a robots.txt file to exclude certain parameters. Google is easy, as it recognises wildcards, so excluding URLs with a *msg type of pattern, for example, should be straightforward. The other search engines seem to support only the original robots.txt standard, which makes this a bit more tricky, if not impossible.
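
To see which paths a wildcard rule would catch before putting it in robots.txt, the matching can be approximated in a few lines. This Python sketch assumes Google-style semantics (`*` matches any run of characters, a trailing `$` anchors the end, otherwise the pattern is a prefix match); it is an approximation for experimenting, not Google's implementation, and the patterns and function names are examples.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a wildcard Disallow pattern into an anchored regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where "*" stood.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_patterns: list) -> bool:
    """Return True if any Disallow pattern matches the path (with query string)."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)


# Example patterns: message-specific topic links and all dynamic (query-string) URLs.
EXAMPLE_DISALLOW = ["/*.msg", "/*?"]
```

With these example patterns, the short topic,66.0.html form stays crawlable while the msgNNN variants and index.php?... URLs are excluded.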

I also use the wildcard command in robots.txt file to exclude all dynamic urls from Google. This is especially useful as, even with SE friendly options enabled in SMF, the pull down menus for each board still point to their dynamic url.

As for the SEs not including the SESSID urls, I'm afraid that they do. I have checked on several SMF forums of various versions and they all appear to have pages indexed with the sessid style url. Interestingly, Google seems to be the worst for this. I've given up trying to stop this happening.

