Creating a good ROBOTS.TXT for SMF (search engine friendly)

Started by Mr. Jinx, January 27, 2006, 05:32:03 AM

Mr. Jinx

In an attempt to optimize my site's search results in Google, I discovered there were a lot of pages that shouldn't be indexed.

For example, Google indexed all of my user profiles without any extra information, because Google is a guest on the forum. That's more than 1,000 URLs indexed without any important content!
It also indexed a lot of the pages in WAP/i-mode or printer format, so most topics were indexed in three or more different formats, all containing the same content.
This is such a waste of crawling time...

That's why I created the following robots.txt file, which tells Google and other search engines not to crawl these special URLs. This will make your website "search engine friendly"!

Put the following content in a file called robots.txt and place it in your website's root directory:

User-agent: *
Disallow: /forum/*action*
Disallow: /forum/*sort*
Disallow: /forum/*msg*
Disallow: /forum/*prev_next*
Disallow: /forum/*;all
Disallow: /forum/*;wap
Disallow: /forum/*;wap2
Disallow: /forum/*;imode
Disallow: /forum/Themes/


If you have installed your forum in another directory, replace "/forum/" with that directory's path.

I've been using this robots.txt for a few months now, and currently everything is indexed the way I like it. You may have to use Google's URL removal tool to speed things up. (Be sure you know what you're doing: a wrong wildcard could remove your complete site.)
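For readers who want to preview which URLs these rules would catch, here is a minimal sketch of Google-style wildcard matching in Python. The matcher and the sample URLs are illustrative assumptions about crawler behaviour, not Google's actual code:

```python
import re

def google_style_match(rule: str, path: str) -> bool:
    # Assumed Google-style semantics: '*' matches any run of
    # characters and the rule is anchored at the start of the path.
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.match(pattern, path) is not None

# A few of the Disallow rules from the robots.txt above:
rules = ["/forum/*action*", "/forum/*sort*", "/forum/*;wap2", "/forum/Themes/"]

def blocked(path: str) -> bool:
    return any(google_style_match(r, path) for r in rules)

# A profile link is caught, a plain topic link is not:
print(blocked("/forum/index.php?action=profile;u=42"))  # True
print(blocked("/forum/index.php?topic=123.0"))          # False
```

This is only a preview tool for checking your rules locally; what matters in the end is how each engine actually interprets the file.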

krustyop

Thanks for posting this,

How can we do something similar for all other bots?

vbgamer45

To get the same effect for all search engines, change

User-agent: Googlebot

to

User-agent: *
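A single robots.txt can also carry one group of rules for Googlebot and a fallback group for every other bot; a sketch (the paths are just placeholders):

```
# Rules only Googlebot follows:
User-agent: Googlebot
Disallow: /forum/*action*

# Fallback rules for every other bot:
User-agent: *
Disallow: /forum/Themes/
```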
Community Suite for SMF - Take your forum to the next level, built for SMF: Gallery, Store, Classifieds, Downloads, more!

SMFHacks.com - Paid Modifications for SMF

Mods:
EzPortal - Portal System for SMF
SMF Gallery Pro
SMF Store
SMF Classifieds
Ad Seller Pro

H

Quote from: Mr. Jinx on January 27, 2006, 05:32:03 AM
Google is the only searchbot that handles wildcards.

Where is your evidence for this?
-H
Former Support Team Lead
I recommend:
Namecheap (domains)
Fastmail (e-mail)
Linode (VPS)

krustyop

They say here:

Quote: Note also that regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".

H

Quote from: krustyop on March 02, 2006, 02:27:38 PM
They say here:

Quote: Note also that regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".

That is the official standard, but I highly doubt the big search engines (Yahoo, MSN, Google) actually stick to it. They probably all support wildcards.
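The prefix-only behaviour of the original standard quoted above can actually be seen in code: Python's standard library robots.txt parser follows it, so a wildcard rule never fires. A sketch (example.com and the URLs are made-up placeholders):

```python
from urllib.robotparser import RobotFileParser

# Python's stdlib parser implements the original standard:
# Disallow paths are plain prefixes, and '*' in them is literal.
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /forum/*action*
Disallow: /forum/Themes/
""".splitlines())

# Under prefix-only matching the wildcard rule never matches, so a
# profile URL still counts as fetchable:
print(rp.can_fetch("*", "http://example.com/forum/index.php?action=profile"))  # True

# A plain prefix rule works as expected:
print(rp.can_fetch("*", "http://example.com/forum/Themes/default/style.css"))  # False
```

Whether a given search engine behaves like this parser or expands wildcards is exactly the open question in this thread.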
-H

Mr. Jinx

That's right, this should also work with other search engines. I've changed it.

statornic

User-agent: *
Disallow: index.php?action=search*
Disallow: index.php?action=calendar*
Disallow: index.php?action=login*
Disallow: index.php?action=register*
Disallow: index.php?action=profile*
Disallow: index.php?action=stats*
Disallow: index.php?action=arcade*
Disallow: index.php?action=printpage*
Disallow: index.php?PHPSESSID=*
Disallow: index.php?*rss*
Disallow: index.php?*wap*
Disallow: index.php?*wap2*
Disallow: index.php?*imode*


Will this code work for a forum located in the root of a subdomain?

Like: http://forum.crestini.com
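For a forum in the root of a subdomain, the file must live at that subdomain's own root (here http://forum.crestini.com/robots.txt, since each host gets its own robots.txt), and each rule should begin with a leading "/" so it is matched against the path from the root. A sketch, trimmed to a few of the actions above (the trailing "*" is redundant under prefix matching):

```
User-agent: *
Disallow: /index.php?action=search
Disallow: /index.php?action=profile
Disallow: /index.php?action=printpage
```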

statornic

Like this:
User-agent: *
Disallow: index.php?action=help*
Disallow: index.php?action=search*
Disallow: index.php?action=calendar*
Disallow: index.php?action=login*
Disallow: index.php?action=register*
Disallow: index.php?action=profile*
Disallow: index.php?action=stats*
Disallow: index.php?action=arcade*
Disallow: index.php?action=printpage*
Disallow: index.php?PHPSESSID=*
Disallow: index.php?*rss*
Disallow: index.php?*wap*
Disallow: index.php?*wap2*
Disallow: index.php?*imode*


Thanks, huwnet.

Dannii

User-agent: *
Disallow: index.php?action=help*
Disallow: index.php?action=search*
Disallow: index.php?action=login*
Disallow: index.php?action=register*
Disallow: index.php?action=admin*
Disallow: index.php?action=post*
Disallow: index.php?action=who*
Those are the only ones I'd block.
"Never imagine yourself not to be otherwise than what it might appear to others that what you were or might have been was not otherwise than what you had been would have appeared to them to be otherwise."

statornic


Dannii

Because, with the exception of 'who', those pages will be the same in most SMF forums, and duplicate information is useless for search engines. The 'who' page is blocked because it changes constantly and will never be the same as what was indexed. I would unblock the calendar, stats and RSS pages etc., because they contain unique, useful information that should be indexed.
"Never imagine yourself not to be otherwise than what it might appear to others that what you were or might have been was not otherwise than what you had been would have appeared to them to be otherwise."

statornic

Thank you.

What about this code? Will it block PHPSESSIDs, or is this something else?

Disallow: index.php?PHPSESSID=*

Dannii

As PHPSESSID is placed in the URL for the first 3 pages you view (and all of them if cookies aren't working, AFAIK), that rule would block every page in your forum, I think. Google doesn't have a problem with sessions anyway, and probably most of the modern search engines don't either.
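If the underlying worry is session IDs leaking into URLs at all, PHP itself can be told to keep sessions in cookies only. A sketch of the relevant php.ini directives; these are generic PHP settings, not SMF-specific configuration, and SMF manages its own sessions on top of them:

```ini
; Keep the session ID out of URLs entirely:
session.use_only_cookies = 1
; Never rewrite links to carry the session ID:
session.use_trans_sid = 0
```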
"Never imagine yourself not to be otherwise than what it might appear to others that what you were or might have been was not otherwise than what you had been would have appeared to them to be otherwise."

Mr. Jinx

Quote from: eldacar on March 09, 2006, 03:36:44 AM
User-agent: *
Disallow: index.php?action=help*
Disallow: index.php?action=search*
Disallow: index.php?action=login*
Disallow: index.php?action=register*
Disallow: index.php?action=admin*
Disallow: index.php?action=post*
Disallow: index.php?action=who*
Those are the only ones i'd block.

What about "profile"? If you don't include that, Google will index all your users. Even if guests like Google don't have rights to look at profiles, Google will still index all users (with a page saying you don't have the rights...).
Also, I think "printpage" and "rss" etc. are important to block. I don't want those to be indexed; only the real content should be indexed.

Dannii

Well, if you don't allow guests to view profiles, then of course block that too. If you do, let the engines index the profiles.
"Never imagine yourself not to be otherwise than what it might appear to others that what you were or might have been was not otherwise than what you had been would have appeared to them to be otherwise."

geoffs

Hmm, I just tested the additions to robots.txt described in the previous replies and found that Google will still accept URLs that the proposed Disallow rules were meant to block. For example:

robots.txt --> Disallow: index.php?action=admin*
test against --> http://somedomain/forum/index.php?action=admin
result --> Allowed

I checked this in the online verifier that Google provides for robots.txt validation.

However, if you form the robots.txt rule as follows then things work correctly:

robots.txt --> Disallow: *index.php?action=admin*

One thing that bothers me, though, is that the rules proposed here restrict any URL that points to an index.php. Other installed apps may use the same URL format as SMF (i.e. index.php?action=something), and I might not want those apps' URLs restricted in the same way. I ended up changing the robots.txt rules to:

Disallow: *forum/index.php?action=admin*
Disallow: *forum/index.php?action=help*
.
.
.

This works as expected when tested on Google's robots.txt validation page.
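The behaviour found above follows if rules are anchored at the start of the URL path. A minimal sketch of that anchoring (an assumption about how Google-style matching works, not Google's actual validator):

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    # Assumed Google-style semantics: '*' is a wildcard and the rule
    # is anchored at the first character of the URL path.
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.match(pattern, path) is not None

path = "/forum/index.php?action=admin"

# A rule with no leading '/' or '*' can never match an anchored path,
# which is why the validator reported "Allowed":
print(rule_matches("index.php?action=admin", path))          # False
print(rule_matches("*index.php?action=admin", path))         # True
print(rule_matches("*forum/index.php?action=admin*", path))  # True
```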

Advertisement: