Do SMF posts appear in search engines? (Something to be concerned about)

Started by geezmo, September 06, 2006, 09:16:08 PM

青山 素子

Quote from: Witte on December 12, 2006, 06:17:47 AM
I am on 1.1, and decided to remove the SEO mod...because of the robots-noindex that was added in final. That had me worried at first, it looked like it would just tell all the robots not to index the page...I hope I am right. I also have the archive mod from Nikolas, with the link to it as the first link on the page...

Actually, it does tell the search engine robots not to index the page, but only on pages that are reached through links pointing to specific posts. This keeps the engine from indexing duplicate content, which might have been causing Google not to index the forums. For main thread links and page links, things are normal.
Motoko-chan
Director, Simple Machines

Note: Unless otherwise stated, my posts are not representative of any official position or opinion of Simple Machines.


destalk

It's interesting reading this discussion. I have read similar discussions on vBulletin forums about how search-engine-unfriendly vBulletin is. In the scheme of things, SMF is pretty good out of the box, although I've noticed that some themes are less helpful than others when it comes to indexing, and the use of a good robots.txt file is recommended.

Quote
SMF 1.1 Final will have a meta robots noindex tag on the .msg, print pages etc which should have the same effect as the robots.txt.

I wish this kind of stuff was optional. If there is one thing that Google does dislike, it is constantly changing 'behind the scenes' stuff. It has the problem of potentially looking like spam or 'artificial manipulation'. I have spent time getting my robots.txt file right and I'm not sure that I would like this to have to 'compete' with SMF's own set of noindex parameters. Is there a way to edit these noindex tags in SMF 1.1? Also, noindex works slightly differently from the way that a robots.txt Disallow rule does.

destalk

Some thoughts on the duplicate links issue.

I'm no expert. But from what I've read, it seems that not getting pages indexed is rarely a duplicate link issue. Google simply decides to choose one set of content and relegates the rest to the supplemental index.

My main issue with SMF from a SE friendly and consistency point of view, is that even with SE friendly URLs enabled, it still gives out the dynamic urls to search engines. Using 301 redirects is the only way to keep consistency, but it's not an ideal solution.

青山 素子

Quote from: destalk on December 12, 2006, 02:02:13 PM
I'm no expert. But from what I've read, it seems that not getting pages indexed is rarely a duplicate link issue. Google simply decides to choose one set of content and relegates the rest to the supplemental index.

Perhaps not, but Ben_S actually did some trial on his rather large board with the changes and saw a huge increase in index count after making the change.

Quote from: destalk on December 12, 2006, 02:02:13 PM
My main issue with SMF from a SE friendly and consistency point of view, is that even with SE friendly URLs enabled, it still gives out the dynamic urls to search engines. Using 301 redirects is the only way to keep consistency, but it's not an ideal solution.

It shouldn't be doing that, but I don't have a public board that can do SEO links to play with right now and find the cause.


Farmacija

I just checked how many pages of my forum Google has indexed:
http://www.google.com/search?q=site:farmaceuti.com+&hl=en&lr=&start=0&sa=N
Most of the pages listed are login, registration and error pages that tell guests they cannot come in without registering...
www.farmaceuti.com
www.farmaceuti.com/tekstovi

Dannii

"Never imagine yourself not to be otherwise than what it might appear to others that what you were or might have been was not otherwise than what you had been would have appeared to them to be otherwise."

Farmacija

Quote from: Ben_S on October 26, 2006, 08:49:00 AM
The only reason I can think of for topics not doing too well is the duplicate links to the same content. Google supports wildcards in robots.txt, so something like this may help:

User-agent: Googlebot
Disallow: /forum/*.msg*
Disallow: /forum/*sa=showPosts*
Disallow: /forum/*prev_next*
Disallow: /forum/*action=printpage*
Disallow: /forum/*action=recent*


I put this in my forum's directory two weeks ago. I hope it will help ;)

destalk

Quote from: Motoko-chan on December 12, 2006, 04:27:53 PM
Quote from: destalk on December 12, 2006, 02:02:13 PM
I'm no expert. But from what I've read, it seems that not getting pages indexed is rarely a duplicate link issue. Google simply decides to choose one set of content and relegates the rest to the supplemental index.

Perhaps not, but Ben_S actually did some trial on his rather large board with the changes and saw a huge increase in index count after making the change.

Indeed. But it's impossible to be sure that one factor has affected the other. I had a forum that had only three pages indexed by Google for over a year. Suddenly, one day, it had several thousand pages indexed. Google is a funny beast. But I agree, it is best to try and reduce duplicate links as much as possible.

Quote from: destalk on December 12, 2006, 02:02:13 PM
My main issue with SMF from a SE friendly and consistency point of view, is that even with SE friendly URLs enabled, it still gives out the dynamic urls to search engines. Using 301 redirects is the only way to keep consistency, but it's not an ideal solution.

It shouldn't be doing that, but I don't have a public board that can do SEO links to play with right now and find the cause.

It is a strange one. But if you check the Google cache of SMF sites that use SE friendly URLs, and hover over the links, you will see that Google has indexed the original 'dynamic' version of the links. Interestingly, Google appears to index both sets of URLs. Some search results appear as dynamic PHP URLs and some as SE friendly. This is one reason why I leave the default URLs on for any new forums that I start. I don't see any problem with the default SMF URLs generally, but it's easier to disallow using a robots.txt file if you use the SE URLs.

I originally thought it had something to do with the PHPSESSID issue. But that has been fixed since version RC2, so I wonder... I know that the drop down menus of SMF forums still use the dynamic URL, but I didn't think that search engines could index those.


Ben_S

If you don't want the meta noarchive tag, you can easily remove it from the index.template.php file.
Liverpool FC Forum with 14 million+ posts.


Daniel Hofverberg

I've checked, and destalk seems to be right - for some strange reason, my forum is indexed with a combination of the regular URLs and search-engine friendly URLs (although the majority of topics are indexed with the former), even though I have SE-friendly URLs enabled, and it displays correctly for members (with SE URLs exclusively).

I also thought that it had something to do with PHPSESSID, but apparently not: that's fixed, yet most topics are still listed in search engines with the regular URLs.

Anyone have any ideas why?

青山 素子

Quote from: Daniel Hofverberg on December 13, 2006, 07:46:05 AM
I've checked, and destalk seems to be right - for some strange reason, my forum is indexed with a combination of the regular URLs and search-engine friendly URLs (although the majority of topics are indexed with the former), even though I have SE-friendly URLs enabled, and it displays correctly for members (with SE URLs exclusively).

I also thought that it had something to do with PHPSESSID, but apparently not: that's fixed, yet most topics are still listed in search engines with the regular URLs.

Anyone have any ideas why?


How long have you had friendly URLs enabled? Google doesn't usually de-index a page unless it gets a 404, so if it indexed an old-style link, it will continue to keep it unless it has some reason to remove it.


Quote from: destalk on December 13, 2006, 06:21:05 AM
Indeed. But it's impossible to be sure that one factor has affected the other. I had a forum that had only three pages indexed by Google for over a year. Suddenly, one day, it had several thousand pages indexed. Google is a funny beast. But I agree, it is best to try and reduce duplicate links as much as possible.

This is why I hold that most SEO techniques are superstition, with the rest being mostly simple things that are just general good practice. Now, some of that superstition is based on behavior of certain search engines, but the problem then is that they often tweak the internals, nullifying the benefit of certain things (long before the use is dropped - friendly URLs anyone?).

My thought is to follow good practices, make use of tools and public hints from the engines (sitemaps are a great idea) and then sit back and see how things go. If your site is not being indexed well, something is wrong on the pages and you should figure out what it is and work on that.

In the latest case of SMF indexing, it was determined that multiple links to the same pages might be causing problems, as they seemed to match certain other behaviors that have been considered spammy. As such, these paths were marked to not index (a good behavior in general).


Daniel Hofverberg

Quote from: Motoko-chan on December 13, 2006, 10:32:01 AM
How long have you had friendly URLs enabled? Google doesn't usually de-index a page unless it gets a 404, so if it indexed an old-style link, it will continue to keep it unless it has some reason to remove it.

I've had friendly URLs enabled pretty much since I first started the forum.

But as topics started long after I enabled friendly URLs are still listed with regular URLs (i.e. with index.php? etc.), obviously something must be wrong that causes search engines to receive those URLs instead of the friendly ones.

青山 素子

Quote from: Daniel Hofverberg on December 13, 2006, 11:51:12 AM
Quote from: Motoko-chan on December 13, 2006, 10:32:01 AM
How long have you had friendly URLs enabled? Google doesn't usually de-index a page unless it gets a 404, so if it indexed an old-style link, it will continue to keep it unless it has some reason to remove it.
I've had friendly URLs enabled pretty much since I first started the forum.

But as topics started long after I enabled friendly URLs are still listed with regular URLs (i.e. with index.php? etc.), obviously something must be wrong that causes search engines to receive those URLs instead of the friendly ones.


Yeah, that shouldn't be happening. I'm going to have to poke around when I get the time and see if I can replicate it to determine the cause.


destalk

Quote
This is why I hold that most SEO techniques are superstition, with the rest being mostly simple things that are just general good practice.

Agreed. My view on this is that good SEO can be as much about usability for human users as for search engines. Having more than one URL pointing to the same place can be as confusing to humans trying to add backlinks to a page as it may be for search engines. Obviously, when developing an interactive site, that can be easier said than done.

Quote
In the latest case of SMF indexing, it was determined that multiple links to the same pages might be causing problems, as they seemed to match certain other behaviors that have been considered spammy. As such, these paths were marked to not index (a good behavior in general).

Would you be able to list exactly which URLs the noindex applies to? I'm not able to work that out, so that would be really helpful.

Also, if I want to rely on my robots.txt file instead, do I just strip out the line    <meta name="robots" content="noindex" />', ' from index.template.php? Or do I need to strip out any of the other code?

Thanks.

青山 素子

Specifically, the previous and next topic links at the bottom of each page have that added, as well as any topic link that specifies a specific message.

It won't hurt to have both a robots.txt and the meta tag, but if you want to remove the tag, don't delete the whole line, just delete up to the "  ', '  " portion, leaving that intact. This is in all index.template.php files for 1.1, so if you are using a non-default theme, you will want to check it too.


destalk

Sorry, I'm a bit confused (it happens easily ;) ).
Quote
don't delete the whole line, just delete up to the "  ', '  " portion, leaving that intact

Could you please give me an example of what I should delete?

Thanks.

Quote
It won't hurt to have both a robots.txt and the meta tag

True. The only thing is that I have had no issues with leaving the .msg messages for indexing, so far. Google simply chooses one version of the URL over the other and I'm quite a big believer in not fixing what isn't broken. If it all goes horribly wrong one day, I'll know what steps to take.   :-\

If I suddenly added the noindex, I might lose thousands of indexed URLs. Although it may only be temporary, I would not like to do this without implementing some kind of 301 redirect of the .msg urls to the 'root' urls. But that is far too complex for me.  :o

With the robots.txt I have simply excluded everything with 'action=' in it plus a couple of others, which does the trick.

Also, MSN and Yahoo/Slurp bots now accept wildcards, which makes everything so much easier (it used to only be Google).
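
A minimal sketch of a robots.txt along the lines described above, assuming the forum sits at the site root and that the crawlers reading it support the `*` wildcard extension. The "couple of others" mentioned in the post are not listed there, so they are left out here.

```
# Assumption: forum installed at the site root; add the path prefix otherwise.
# Wildcards are an extension to the original robots.txt convention, so crawlers
# that do not support them may not match this rule as intended.
User-agent: *
# Block any URL whose path or query string contains "action="
Disallow: /*action=
```

Keep in mind that a Disallow rule stops compliant crawlers from fetching the matching URLs, whereas a noindex meta tag is only seen when the page is actually fetched. A URL blocked in robots.txt therefore never shows its noindex tag to the crawler.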

Toadmund

If I want to disallow my 'login' page and my 'theme directory' pages from being indexed, what do I do?
Do I add this to the appropriate spot?
$context['robot_no_index'] = true;

If so, whereabouts do I insert this code?
For the second one I assume it's in '$themes_dir'?

destalk said:
Quote
With the robots.txt I have simply excluded everything with 'action=' in it plus a couple of others, which does the trick.

More specifically, I would like to stop Google and others from indexing 'actions'. How do I go about doing that?

Dannii

Toadmund, put *action=login* and /Themes/ in your robots.txt file.
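
As a rough sketch, those two entries could look like this in robots.txt, assuming the forum is installed at the site root (with a /forum/ install the paths would need that prefix) and that the rules should apply to every crawler:

```
# Assumption: forum at the site root, rules applied to all crawlers.
User-agent: *
# Any URL containing action=login (needs wildcard support in the crawler)
Disallow: /*action=login
# Everything under the theme directory (plain prefix match, no wildcard needed)
Disallow: /Themes/
```

A broader `Disallow: /*action=` rule would cover the login page along with every other action URL, for crawlers that honour wildcards.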
