Robots.txt or Not Robots.txt - That is the question

Started by karlbenson, April 16, 2008, 05:13:44 AM


karlbenson

I thought I'd post this up and see what you guys think of this dilemma.
I've been SEO'ing my forum as much as I can (apart from SEO URLs, since it's a waste of time).

When I started the process, of the 15,000 requests a day from spiders for my forum pages, I estimate 95%+ were for pages that I wouldn't want indexed or that were duplicate content.

Examples are URLs with .msg, topicseen, boardseen or prev_next in them.

One of the things I did was set up a robots.txt.
That alone has prevented thousands of requests a day, and it has more than reversed the percentage, so that now 99.99% of spider requests are for pages I do want indexed. Great news, I think.
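
For reference, the sort of rules I mean look roughly like this. Very much a sketch: the exact patterns depend on where your forum lives and on your URL format, and the wildcard lines are only honoured by the major spiders.

# Illustrative only - assumes default SMF query-string URLs
# and a forum installed at the site root.
User-agent: *
Disallow: /*.msg        # per-message URLs (duplicate topic content)
Disallow: /*topicseen   # "mark topic seen" variants
Disallow: /*boardseen   # "mark board seen" variants
Disallow: /*prev_next   # previous/next navigation duplicates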

However, and this is where the catch comes in: a robots.txt only prevents a page from being crawled; it doesn't prevent the URL from being added to the search engine's index.

Google says
"If you have a noindex meta tag on a particular page, that page needs to be allowed in your robots.txt file; otherwise we won't be able to crawl the page, and thus won't be able to see the meta tag on the page. Note that disallowing a URL from crawling (in your robots.txt file) does not necessarily mean it won't be indexed."

So I must either waste bandwidth and resources letting search engines crawl a page just so they can be told not to index it, OR continue to use a robots.txt and save those resources and that bandwidth, but end up with bare, contentless URLs being added to Google's index whenever it comes across a link to them.
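
For completeness, the "let them crawl it, then tell them not to index it" route Google describes above means putting something like this in the head of each unwanted page:

<meta name="robots" content="noindex, follow" />

or sending the equivalent HTTP header (X-Robots-Tag: noindex). Either way the spider still has to fetch the page before it sees the instruction, which is exactly the bandwidth cost I'm complaining about.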

I've described this as both inherently flawed and stupid. But that's SEO, folks!

Thoughts?

Ryan

#1
Well, basically they only go into the page to take the content and cache it in Google, as a Google cached page.

And when Google displays links to pages, it usually shows a portion of text from the page underneath.

I suppose if you block it from actually going into the page, then yes, it would to a degree decrease your SEO.

In simple terms  ;)

Although while we're on this subject, I did actually have a thought the other day: why doesn't SMF code things so that these crawlers crawl through the WAP wireless modes? That would save bandwidth, right?

Couldn't you script something to divert all crawlers into WAP mode, or would that not be possible?
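
Something roughly like this is what I had in mind. Completely untested, and I'm only guessing that SMF's lightweight output is reachable by tacking wap2 onto the query string:

// Untested sketch: near the top of index.php, bounce anything that looks
// like a big crawler onto the lightweight WAP2 output. The user-agent list
// and the ?wap2 parameter are guesses, not something SMF documents for this.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (preg_match('/Googlebot|Slurp|msnbot/i', $ua) && !isset($_GET['wap2']))
{
	// Keep the original query string (SMF separates parameters with ';').
	$qs = $_SERVER['QUERY_STRING'];
	header('Location: ' . $_SERVER['PHP_SELF'] . '?' . ($qs === '' ? 'wap2' : $qs . ';wap2'));
	exit;
}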

karlbenson

Yes, they cache the content. However, I've told them not to crawl those pages through a robots.txt, so the spider never visits them. But if it comes across a link to one of them on another page, it just adds the bare link to its index.

Basically, there is no way to block a spider from both crawling AND indexing unless you let the spiders visit the pages to see the noindex tag. Robots.txt only stops crawling, not the indexing of the link.

WAP pages are useless.
One of the first things I did was block the WAP pages.
(I believe SMF has added a noindex to these in the latest SVN.)

SEO-wise, you're going to get punished more by the duplicated content than by the cost of reducing it.

The solution I'm having to settle on (rough robots.txt sketch below) is:
- allowing Google to crawl ANYTHING, and letting it see the noindex tag
- blocking everything but topic/board pages for Yahoo (it's too aggressive to be allowed to attempt to index anything else)
- blocking a few filetypes from msnbot and a generic list (this spider isn't very bright, so it doesn't matter much)
- a generic 'safe', valid set of rules (should be compliant for every spider, since it doesn't use wildcards or end-anchors)
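
Roughly what that ends up looking like. Again just a sketch: paths are simplified for illustration, and the wildcard/end-anchor lines are kept out of the generic section because not every spider understands them.

# Googlebot: free to crawl everything so it can see the noindex tags
User-agent: Googlebot
Disallow:

# Yahoo Slurp: only topic and board pages (assumes Slurp gives Allow priority)
User-agent: Slurp
Allow: /index.php?topic=
Allow: /index.php?board=
Disallow: /index.php

# msnbot: keep it away from a few filetypes (assumes it honours wildcards)
User-agent: msnbot
Disallow: /*.gif$
Disallow: /*.jpg$

# Everyone else: plain prefix rules only, no wildcards or end-anchors
User-agent: *
Disallow: /Sources/
Disallow: /attachments/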
