Allow search engines to index topics and keep them out of everything else

Started by DocElectro, January 09, 2012, 06:30:17 AM


DocElectro

Recently, I set up a forum and imported an old database into it. Barely two weeks after the setup, Google has been indexing boards. Please refer to the search results here.

Using robots.txt, can someone please point me in the right direction for coding this? That is, nothing should be crawled/indexed other than TOPICS.
Thanks

Ricky.

The simplest way would be:
In your robots.txt

User-agent: *
#First allow all
Allow: /*
# Now denying any url with board in it and anything after that..
Disallow: /*board


But test it using Google Webmaster Tools.

Illori

You could also search the forum for robots.txt; there are many posts that can help you.

Arantor

If it doesn't crawl boards, how do you expect it to find topics?

DocElectro

OK, I tried this on my own:


User-agent: *
Allow: /index.php?topic=
Disallow: /

Unfortunately, with this code the boards still get indexed.


While the one below completely blocked search engines from crawling/indexing the forum:

User-agent: *
#First allow all
Allow: /*
# Now denying any url with board in it and anything after that..
Disallow: /*board


What I really want is for only TOPICS to be crawled/indexed by search engines, WHILE stuff like boards, child boards, categories, profiles, print pages and many others must not be indexed.

What else should I do to correct this ASAP? Thanks.
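(As an aside, a wildcard-free file like the one above can be sanity-checked locally with Python's urllib.robotparser before relying on a live crawl. This is only a rough sketch: the example.com base URL and the IDs are made-up placeholders, and Python's parser follows the original robots.txt convention of plain prefix rules evaluated in file order, so it does not model Google's wildcard or longest-match extensions - the Google Webmaster Tools tester remains the authoritative check.)

from urllib.robotparser import RobotFileParser

# The wildcard-free rules quoted above.
rules = """User-agent: *
Allow: /index.php?topic=
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical URLs on a made-up domain, just to see which rule matches first.
base = "http://example.com"
for path in ("/index.php?topic=123.0",
             "/index.php?board=5.0",
             "/index.php?action=stats",
             "/"):
    status = "allowed" if parser.can_fetch("*", base + path) else "blocked"
    print(path, "->", status)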

Arantor

Quote: What I really want is for only TOPICS to be crawled/indexed by search engines, WHILE stuff like boards, child boards, categories, profiles, print pages and many others must not be indexed.

How are the search engines going to find the topics if they can't see the boards that the topics are in?!

Ricky.

Yes, arrowtotheknee is right; then make it:

User-agent: *
#First allow all
Allow: /*
Allow: /*topic
# Now denying any url with board in it and anything after that..
Disallow: /*board


Or maybe you can use a sitemap to submit links to topics only.

Arantor

That doesn't solve the problem. The disallow rule will still prevent boards being indexed - and by definition it will prevent all the topics being indexed too...

Though I suppose recent posts are still technically accessible at this point, as is the very last post in each board, but that's going to seriously hamper search engines actually getting at the threads.

Ricky.

As I said, if you can give direct access to all topics using some HTML sitemap, that can also solve the problem.
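(For what it's worth, a topics-only sitemap along those lines could be generated with something like the sketch below, using only Python's standard library. The base URL and topic IDs are hypothetical placeholders; in practice the IDs would have to come from the forum's database or a sitemap mod, which is outside this sketch.)

# A minimal sketch of an XML sitemap listing only topic URLs.
# The base URL and topic IDs are made-up placeholders; a real list
# would come from the forum's database or a sitemap mod.
import xml.etree.ElementTree as ET

BASE = "http://example.com/index.php?topic="
topic_ids = ["123.0", "124.0", "125.0"]

urlset = ET.Element("urlset",
                    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for tid in topic_ids:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = BASE + tid

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8",
                             xml_declaration=True)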

Arantor

Not really, sitemaps don't guarantee indexing, and search engines have been known to not index pages that aren't linked to anywhere else in their index...

DocElectro

I checked 'Who's Online' and I notice search engines have only crawled the board index of the forum since the site went live two weeks ago (with official submission to the search engines). I have removed robots.txt and the issue is still there.

Arantor

They could be trying all sorts of things. You see, if they try anything in index.php that isn't a known action, e.g. index.php?action=blahblahblah, it will fall through to the board index...

DocElectro

Like I said in one of my threads some time ago, before submitting my new forum URL to Google they had already indexed more than 300 bad pages (all useless - boards and their descriptions, forum stats, member profiles, etc.).
I went a step further and placed the following in robots.txt:

User-agent: *
Disallow: *board=*
Disallow: *;sort*
Disallow: *action=*
Disallow: *.msg*
to block all pages and allow only TOPICS in the form http://mysite.com/index.php?topic=##.# (without msg and topicseen as part of the URL), and it worked.
Unfortunately, when the topics are clicked in the search engine results, they go back to the homepage, as in this example at No. 9 on the list: http://www.google.com.ng/m?q=How+To+Take+Screen+Shot+and+subsequently+uploading+or+saving+it.

To solve all this mess, I discovered the best way to index content better is by placing


<meta name="robots"
content="noindex">

In light of the above, how can I use the meta noindex to block the following from being indexed:
a. Profiles
b. Topicseen/msg
c. Homepage
d. Stats
e. Boards
f. Etc

And allow ONLY topics to be indexed - for example, answerarena.com/index.php?topic=000.0.
Someone please assist me in achieving this. Thanks.

Arantor

As you have been told before, it is not possible to block boards access and still have topics indexable - if the boards are not indexed, the search engines won't even get to the topics.

Profiles, stats should be covered by the action= part.

If you block the homepage, NOTHING WILL BE INDEXED AT ALL.

Illori

2 threads merged; please don't open more than one thread on the same issue.

DocElectro

Quote from: arrowtotheknee on January 16, 2012, 06:14:16 AM
As you have been told before, it is not possible to block boards access and still have topics indexable - if the boards are not indexed, the search engines won't even get to the topics.

Profiles, stats should be covered by the action= part.

If you block the homepage, NOTHING WILL BE INDEXED AT ALL.
Very well said. What is the way out then?

Arantor

Um.... not blocking anything works fairly well in most cases, actually.

If you don't want search engines to go somewhere, disable that area for guests and job done.

The general navigation is set up pretty well for indexing purposes; stuff like topicseen should generally be taken care of by the canonical tags embedded in all the publicly visible pages.

A lot of pages already do their own embedding of noindex when they're not supposed to be indexed, like the stats pages...
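(That claim is easy to spot-check: fetch a page and look for the robots meta tag and the canonical link in its HTML. The sketch below does that with Python's standard library only; the URL is a placeholder and the script needs network access, so treat it as a rough check rather than anything definitive.)

# Rough check: does a page carry <meta name="robots" content="noindex">
# and/or a <link rel="canonical" href="..."> tag?
from html.parser import HTMLParser
from urllib.request import urlopen

class HeadTagScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = attrs.get("content")
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")

url = "http://example.com/index.php?action=stats"  # placeholder URL
html = urlopen(url).read().decode("utf-8", errors="replace")

scanner = HeadTagScanner()
scanner.feed(html)
print("robots meta:", scanner.robots)
print("canonical:  ", scanner.canonical)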

DocElectro

Before I end this topic: more than 300 bad pages have been indexed out of 305. What's the best way to remove them all and then initiate a fresh submission?
As for stats, help, member profiles, search and boards - how can I disable all of these for guests so search engines stop indexing those areas, especially boards and profiles?

Arantor

For disabling these things... go look in permissions. They're all in there. At least, things like profiles are.

But for the umpteenth time, YOU CANNOT HIDE BOARDS FROM GUESTS AND STILL HAVE TOPICS INDEXED. If the boards are hidden, TOPICS WILL NOT BE IN THE INDEX. You cannot GET TO THE TOPICS WITHOUT HAVING THE BOARDS.

I wouldn't normally shout so much but you seem so determined to screw your forum up totally.

Also, what 'bad pages' are you talking about? Where are you getting that information from? You can't 'remove your site from search engines'; they will re-crawl your site over time, and if you unbreak the robots.txt file, they'll find your content.

DocElectro

Wow, I finally nailed the problem myself. Only topics are indexed, without boards.
If anyone cares, I will post the code - or just look at /robots.txt on my URL. Good luck.

Arantor

Oh, that robots.txt is awesomely missing the point, assuming it's the one for the site in your profile.

Let's break it down and see why it does precisely what I've been saying all along.
Quote:
User-agent: *
Allow: /index.php?topic=
Disallow: *.msg*
Disallow: *;sort*
Disallow: /

So you allow topics, good start.

You disallow entries with msg in, even though Google doesn't consider them to be duplicates or anything anyway because of the canonical tag.

The sort disallow means you prevent them reordering the contents of boards, which is better than nothing I suppose - but the boards themselves are still going to be indexed, because there's no disallow entry for them. Sensible precaution, if not really necessary.

Then we have the punchline.

Disallow: / means disallow THE ENTIRE SITE, as per: http://www.robotstxt.org/robotstxt.html

Quote:
User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

DocElectro

I like this coding argument and it's really interesting. Actually, you were right. What about this:

User-agent: *
Allow: /index.php?topic=
Disallow: /


Or


User-agent: *
Disallow: *board=*
Disallow: *action=*

Or let me have yours lol.

Arantor

Quote: I like this coding argument and it's really interesting. Actually, you were right.

I don't like it, it's pissing me off to have to explain the same thing several times over, growing more frustrated each time when you're continuing to ignore what I'm getting at.

Seriously, you're just not getting the fundamentals here. Spiders are so named because things are webs - things that link together.

A forum is hierarchical content - you have site -> containing boards -> containing topics. If a spider only gets to a topic somehow, there's no guarantee it'll find the rest of the content at all. In fact, it actively won't find the rest of the content if you restrict the boards.

What will happen is that assuming you manage to get a topic in the search engines at all, which frankly will be a miracle, the rest of the topics in that board will hopefully get indexed through the prev/next links between topics, but that's only between the topics of a board, not between the other boards.

Which means you somehow have to get outside sites linking to a topic in each board on your forum and hope that the search engines follow instructions precisely, which they don't.

If you don't have boards indexed, they're not going to find topics, simple as that, and if you block out /, they won't index anything at all anyway.

Still, I've explained this point at least 4 times, do as you please, I'm done trying to explain to you why your approach is fundamentally flawed.

Quote: Or let me have yours lol.

Mine is quite simply empty, on almost all the sites I manage. I have no need of restricting the content from search engines, since the per-page rules are more than adequate; I only have the file there to prevent 404 errors from the server and in the logs. On the sites where it's not set up like that, it's because the entire site is private and not meant to be search-engine visible anyway, so I simply use Disallow: / for all user-agents to save them the effort of trying to index the site only to hit lots of 'You need to be logged in' messages.
