Include an example robots.txt with SMF2 (I posted my example)

Started by periscope, June 09, 2009, 10:50:47 AM


periscope

I made a special robots.txt for my forum. It keeps search engines from viewing the calendar (which goes on forever) and from doing various other things: viewing the sign-up page, following the "post reply" and "new topic" links, and hitting the "delete post" link. That last one matters if, for instance, a certain board on your forum is for testing new features where guest posting and moderation (delete post and delete topic) are allowed; a crawler could end up deleting all the posts.

The purpose is to ease the load on your server and on the search engine, and most importantly, to prevent search engines from getting bogged down crawling hundreds or thousands of unnecessary pages on the forum, which can cause the search engine to wait a very long time before revisiting the important pages, like the boards and threads.

For instance, my forum got 330 page visits from Google. More than half of those visits were to the calendar (which has nothing on it) or to the "post reply" and "new topic" pages, which no search engine needs to see.

For example, Google used to visit the front page of my site every other day. Then I installed SMF for fun, and Google found it. Google started crawling hundreds of pages on the forum and got bogged down with that. Now, a week after I removed the forum from my site, Google still hasn't revisited the front page or any of the other pages on my site.

Here is a robots.txt that I made. Since SMF is free software, I feel like I should contribute :)

## ******* THE FOLLOWING IS AN EXAMPLE robots.txt FOR SIMPLE MACHINES FORUMS V2
## ******* TO LET SEARCH ENGINES SEE BOARDS AND THREADS, BUT NOT ANY OF THE OTHER PAGES
## ******* THAT IT DOESN'T NEED TO SEE, LIKE THE ENDLESS CALENDAR OR THE HELP SECTION OR POST REPLY.
User-agent: *

## Uncomment the following line to block user profiles.
# Disallow: /smf2/index.php?action=profile
## The calendar causes search engines to get bogged down with a large number of
## links and then take a long time to revisit the important pages like new threads.
Disallow: /smf2/index.php?action=calendar
Disallow: /smf2/index.php?action=register
Disallow: /smf2/index.php?action=login
Disallow: /smf2/index.php?action=search
Disallow: /smf2/index.php?action=help
Disallow: /smf2/index.php?action=post
Disallow: /smf2/index.php?action=emailuser
Disallow: /smf2/index.php?action=printpage
Disallow: /smf2/index.php?action=reporttm
## If you happen to have a testing area on your forum where guests are allowed to
## try out the forum's moderation features, this prevents a robot from hitting
## the "delete post" button and other things on each thread.
Disallow: /smf2/index.php?action=editpoll
Disallow: /smf2/index.php?action=deletemsg
Disallow: /smf2/index.php?action=splittopics
Disallow: /smf2/index.php?action=move
Disallow: /smf2/index.php?action=removetopic2
Disallow: /smf2/index.php?action=lock
Disallow: /smf2/index.php?action=sticky
Disallow: /smf2/index.php?action=mergetopics
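As a side note, one way to sanity-check rules like these before uploading them is Python's standard urllib.robotparser module. This is just a sketch; the /smf2/ path and the example URLs are assumptions matching the example above:

```python
from urllib import robotparser

# Two of the rules from the example above, reduced for brevity.
rules = """\
User-agent: *
Disallow: /smf2/index.php?action=calendar
Disallow: /smf2/index.php?action=post
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Blocked actions are refused; ordinary topic links stay crawlable.
print(rp.can_fetch("*", "http://example.com/smf2/index.php?action=calendar"))  # False
print(rp.can_fetch("*", "http://example.com/smf2/index.php?topic=123.0"))      # True
```

This lets you confirm that an obedient robot would skip the calendar and post actions while still being free to crawl the threads.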


edit:
Fixed "?" and spaces.
Here is another related thread: http://www.simplemachines.org/community/index.php?topic=67944.0
and a blog entry about it:
http://forumsblogswikis.com/2007/06/03/my-robotstxt-file-for-simple-machines/

karlbenson

After all the SEO work I did for SMF, I've come to the conclusion that I actually recommend NOT using robots.txt (especially for Google),
UNLESS your server is suffering from excessive load.

I noticed a few bugs with the above:
1) All & should be ?
2) There should be NO blank lines between the User-agent line and the first Disallow line, and none between Disallow lines.

The best advice I can give you is to register with google webmasters and submit an xml sitemap.
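For reference, a minimal sitemap file looks like the sketch below. The URL and date are made up for illustration; for SMF you would normally generate the sitemap with a mod rather than write it by hand:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/smf2/index.php?topic=1.0</loc>
    <lastmod>2009-06-09</lastmod>
  </url>
</urlset>
```

Each thread you want indexed gets its own &lt;url&gt; entry, and you submit the file's address in Google Webmaster Tools.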

periscope

Thanks, I don't know why the ?'s pasted as &'s. I also didn't know about not having empty lines, thanks.

Why is it that robots.txt doesn't play well with Google, do you know? Just curious.

karlbenson

robots.txt rules are crawling directives, NOT indexing directives.

- So they do not prevent duplicate-content pages from ending up in Google IF there are enough internal or external backlinks to that page.
(Although Google will index the URL into its search engine without ever crawling the page, so it would be without title/meta, cache, etc.)

- They will not remove any existing pages from Google's index.

periscope

Are you sure? A disallow in robots.txt makes any obedient web robot, including Google, completely refuse to visit any page under the "Disallow:". I believe Google will also remove any page found to be disallowed by robots.txt the next time it crawls it.

Are you confusing the sitemap XML files (suggestions of pages to crawl) with robots.txt (the list of do-not-visit pages)?

karlbenson

No, I'm not confusing them.  I am very certain.

Yes, Google will not attempt to crawl any page you specify.
BUT
Google can still add the URL to its search index without ever crawling it. It may choose to do so if that URL has backlinks on other pages.
And robots.txt does not cause removal of existing pages from Google.

That's why they are CRAWLING and NOT INDEXING directives.

Indexing directives are controlled by the meta tag
ONLY <meta name="robots" content="noindex" /> can tell a spider NOT to index the page, and remove the page from the search index
(or x-robots tag)
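For completeness, those two indexing directives look like this. The Apache directive is only a sketch of where the header variant could go; adjust for your own server setup:

```html
<!-- In the page's <head>: tells spiders not to index this page -->
<meta name="robots" content="noindex" />

<!-- Equivalent HTTP header (useful for non-HTML files such as attachments),
     e.g. via an Apache config line:
     Header set X-Robots-Tag "noindex" -->
```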

If you block a page with both robots.txt and the meta tag, Google will never crawl the page to see the meta tag, so only the robots.txt directive is applied.

Note: There is not currently a way to tell spiders DO NOT CRAWL, and DO NOT INDEX at the same time.
This is why several people are proposing an extension to the robots.txt protocol, so that as well as Disallow: lines, we could add a Noindex: parameter at the same time. But that doesn't appear to be getting any traction at the moment.

periscope

Ahhh, I get it: when Google crawls a page it adds new links to the queue to be crawled later, even if they are disallowed in robots.txt. Later on, when it comes time to crawl those links, it ends up dropping them because of robots.txt.

But it still doesn't hurt anything to put those lines in robots.txt, does it? It saves bandwidth and server time, and saves the search engine the trouble of indexing the page: even if it still adds the URL to its database, it doesn't spend the time to download and index it. And it prevents trouble for search engine users by keeping duplicate content out of search results, right?

Just a thought: when the crawler is crawling a page looking for new links, there's nothing preventing it from ignoring links that point to pages disallowed in robots.txt. How do we know that Google isn't already doing this to make its crawler more efficient? I know this does not apply to links from other sites, which are under a different robots.txt, but the vast majority of links are from the same site.

I guess as an experiment you could put up a robots.txt disallowing something, wait for Google to crawl a page that links to the disallowed pages, then remove the robots.txt and see whether, the next day when Google comes back for the rest, it crawls the pages that were disallowed the day before.

Another way to tell would be if you see Google fetch robots.txt from your site and then not crawl any pages on your site; that would mean what you say about the crawler is true.

I could change the topic of this to include an example robots.txt AND an example site map XML file for Google :)

karlbenson

Note, a spider only fetches a copy of your robots.txt once every 24 hours (it caches it in between).
So usually you have to wait up to 24 hours for any changes you make to take effect.

When I was doing SEO stuff for SMF, I blocked .msg links with robots.txt.
However, they were still showing up in Google results when I used site:
Every day I kept removing them with Google Webmaster Tools.
However, new ones kept showing up every day.
So that's when I contacted Google, who explained it to me.

If I didn't want those URLs in Google, I would have to remove the robots.txt block and let Google see the meta noindex tag on those pages.

periscope

That's interesting.

I still have a hard time believing that. I have never seen a case where something appeared on Google, either in the search results or as being fetched by Google in the webserver log, that was disallowed by robots.txt.

Google will always fetch robots.txt before it downloads anything else on your site, I have never seen it download any disallowed content, and how can content that has not been downloaded show up in search results?

Are you sure you didn't make a typo on robots.txt?

karlbenson

I'm sure.

Whilst they don't normally show up in search results (because they don't have any information associated with them; they are JUST a URL), they will show up if you search for site:http://mysite.com

Please don't get confused between crawling and indexing.
Crawling is the process of browsing/downloading your page and parsing out the bits it wants (and caching it).
Indexing is the adding of the URL to its search index (with information, if it has been crawled).

Quote: "Google will always fetch robots.txt before it downloads anything else on your site, I have never seen it download any disallowed content, and how can content that has not been downloaded show up in search results?"
No, search engines usually keep a list of URLs yet to crawl. Google won't download a new robots.txt before requesting every page.
It usually requests a new one ONCE every 24 hours, caches it, then uses it to decide which links out of its list of URLs it can crawl. In the meantime, if you change your robots.txt AFTER Google has cached it, Google may attempt to crawl any of the links that weren't blocked by the original robots.txt, UNTIL 24 hours later when it fetches a fresh copy.

periscope

Well, in this case I am just thinking of general use where robots.txt stays the same all the time, so if Google caches it for 24 hours it doesn't matter.

Yeah, I understand what you're saying about the empty URLs that are yet to be crawled showing up if you search for site:mysite. You mean those empty URLs will show up as "displaying 1 of xxxx pages" but those xxxx may be empty? I know that Google will never ever show an empty URL in the search results list.

Just from my perspective as a webmaster, it is still good to use robots.txt, because it reduces server load and bandwidth, and duplicate pages that I don't want people to see in a search won't show up in a search engine results page. Even if internally the search engine holds an empty URL for a page that is disallowed by robots.txt, it doesn't matter to me.
