Allow search engines to index topics and keep them out of everything else

Started by DocElectro, January 09, 2012, 06:30:17 AM


Arantor

Oh, that robots.txt is awesomely missing the point, assuming it's the one for the site in your profile.

Let's break it down and see why it does precisely what I've been saying all along.
Quote
User-agent: *
Allow: /index.php?topic=
Disallow: *.msg*
Disallow: *;sort*
Disallow: /

So you allow topics, good start.

You disallow entries with msg in them, even though Google doesn't consider them to be duplicates or anything anyway because of the canonical tag.

The sort disallow means you prevent them from reordering the contents of boards, which is better than nothing I suppose - but the boards are still going to be indexed, because there's no disallow entry for them. Sensible precaution, if not really necessary.
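(For reference, and roughly speaking: the *.msg* pattern is aimed at message permalinks like index.php?topic=123.msg456#msg456 - the ones the canonical tag already takes care of - and the *;sort* pattern at re-sorted board listings like index.php?board=1.0;sort=subject.)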

Then we have the punchline.

Disallow: / means disallow THE ENTIRE SITE, as per: http://www.robotstxt.org/robotstxt.html

Quote
User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

DocElectro

I like this coding argument and it's really interesting. Actually, you were right. What about this:

User-agent: *
Allow: /index.php?topic=
Disallow: /


Or


User-agent: *
Disallow: *board=*
Disallow: *action=*

Or let me have yours lol.

Arantor

Quote
I like this coding argument and it's really interesting. Actually, you were right.

I don't like it; it's pissing me off to have to explain the same thing several times over, growing more frustrated each time while you continue to ignore what I'm getting at.

Seriously, you're just not getting the fundamentals here. Spiders are so named because the web is exactly that - a web of pages that link to each other.

A forum is hierarchical content - you have site -> containing boards -> containing topics. If a spider only gets to a topic somehow, there's no guarantee it'll find the rest of the content at all. In fact, it actively won't find the rest of the content if you restrict the boards.

What will happen is that, assuming you manage to get a topic into the search engines at all - which frankly will be a miracle - the rest of the topics in that board will hopefully get indexed through the prev/next links between topics. But those links only run between the topics of one board, not between the other boards.

Which means you somehow have to get outside sites linking to a topic in each board on your forum and hope that the search engines follow instructions precisely, which they don't.

If you don't have boards indexed, they're not going to find topics, simple as that, and if you block out /, they won't index anything at all anyway.
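To make the discovery problem concrete, here's a toy illustration - a hand-written link graph with hypothetical URLs, not a real crawler - of a breadth-first crawl that skips board pages. It never reaches a single topic from the inside, and a topic reached via an outside link only drags in the other topics of that one board.

# Toy illustration only: a hand-written link graph standing in for a small
# forum (index -> boards -> topics), crawled breadth-first while skipping
# anything that looks like a board page. All URLs are hypothetical.
from collections import deque

links = {
    "/index.php": ["/index.php?board=1.0", "/index.php?board=2.0"],
    "/index.php?board=1.0": ["/index.php?topic=1.0", "/index.php?topic=2.0"],
    "/index.php?board=2.0": ["/index.php?topic=3.0"],
    "/index.php?topic=1.0": ["/index.php?topic=2.0"],  # prev/next link
    "/index.php?topic=2.0": ["/index.php?topic=1.0"],
    "/index.php?topic=3.0": [],
}

def boards_blocked(url):
    # Stand-in for a "boards are off limits" robots rule.
    return "board=" not in url

def crawl(start, allowed):
    """Breadth-first crawl that only fetches URLs the rules allow."""
    seen, queue = set(), deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or not allowed(url):
            continue
        seen.add(url)
        queue.extend(links.get(url, []))
    return seen

# Starting from the index with boards blocked: only the index itself gets
# fetched, so no topic is ever discovered from inside the site.
print(crawl("/index.php", boards_blocked))

# Starting from a topic an outside site happened to link to: the prev/next
# links pull in the other topic of that board, but never the other boards.
print(crawl("/index.php?topic=1.0", boards_blocked))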

Still, I've explained this point at least four times now; do as you please. I'm done trying to explain to you why your approach is fundamentally flawed.

Quote
Or let me have yours lol.

Mine is quite simply empty, on almost all the sites I manage. I have no need to restrict the content from search engines, since the per-page rules are more than adequate; I only have the file there to prevent 404 errors from the server and in the logs. On the sites where it's not set up like that, it's because the entire site is private and not meant to be search engine visible anyway, so I simply use Disallow: / for all user-agents to save them the effort of trying to index the site only to hit lots of 'You need to be logged in' messages.
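(The "per page rules" there are, roughly, the <meta name="robots" content="noindex" /> tags SMF puts on pages that shouldn't end up in search engines - profile pages, search results and the like - which is why an empty robots.txt is normally enough.)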
