Hi folks -
I'm loving the new forum software, but I noticed in my index files that Google is indexing print pages, which is dupe content and not good. What commands can I use in robots.txt to stop Google from indexing all the dupe content?
My forum is not installed in the root folder, but rather a secondary folder:
www.example.com/forums
Would I need two robots.txt files (one in the forum folder), or should I just put the new commands in the root robots.txt?
Thanks!
Don't be afraid of the search. I promise, it won't bite. :) Maybe a nibble.... :P
http://www.simplemachines.org/community/index.php?topic=251309.0
You can only have ONE robots.txt per domain.
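To illustrate the one-file rule: crawlers look for robots.txt at the root of the host, never inside a subfolder, so rules for a forum under /forums belong in the root file with the /forums prefix on each path. A minimal Python sketch (the function name is made up for illustration):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location a crawler would check for page_url."""
    parts = urlsplit(page_url)
    # Only the scheme and host matter; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/forums/index.php?topic=1.0"))
# prints http://www.example.com/robots.txt
```

A robots.txt placed at www.example.com/forums/robots.txt would simply never be requested.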
Note: in SMF 2.x, the print pages (and other duplicate-content pages) are noindexed.
Also remember that robots.txt only prevents spidering; it does not prevent search engines from adding those links to their search index. (And note: if you block a page with robots.txt, the search engine will never spider the page to see the meta noindex.)
Robots.txt is only useful for saving bandwidth. If your concern is SEO/duplicate-content indexing, then don't use robots.txt; instead, meta noindex any duplicate-content pages (which aren't already noindexed).
P.S. SMF 2.0 noindexes duplicate content much better than SMF 1.1.x.
It definitely doesn't cover everything, though. Wap and Wap2 pages for instance shouldn't be indexed since they have duplicate content, but you can even see that Google indexes them on the official forums here.
http://www.google.com/search?q=robots.txt+seo&domains=simplemachines.org&sitesearch=simplemachines.org
That would be an easy fix though, I'd think.
EDIT: in fact, it might be a good idea for someone to log that in Mantis.
Thing is, a lot of people want the wireless pages indexed by the mobile spiders.
We've got to find a way of allowing mobile spiders to spider them whilst noindexing them for normal spiders.
In RC2, though, we add canonical tag support for Yahoo/Google/Microsoft on topics/boards as well as wap2/imode.
The WAP format supports neither the canonical tag nor meta noindex, so we might have to look at alternative ways of noindexing it.
Just for the record, Google claims not to have any trouble finding and ignoring print pages, WAP pages, etc. (so long as the content is substantially the same as for the normal pages). Take that with as many grains of NaCl as you wish... if Google reports that you're being penalized for duplicate content, either you've got too many differences in content, or their spiders aren't as good as they claim. I wouldn't go overboard worrying about print pages and such, in advance.
OK, so I'm a little confused. I'm not correcting anyone in a hostile way, but one of the developers in this thread stated that robots.txt only stops Google from crawling a page, not from indexing it. This is totally incorrect.
robots.txt is the equivalent of the noindex meta tag, not the nofollow tag. Granted, it was originally meant to stop crawling, but the robots have since adapted it to how users use it. For example, if you remove a page from Google via the webmaster dashboard, it states quite clearly that for the page to be removed, the next crawl should either find a 404 or find the page excluded via the robots file. Also, AdSense uses similar spiders and remains unaffected by robots.txt, even when a rule is directed specifically at the user agent for the AdSense spiders.
I have also tested this recently: I put "disallow: index.php?board*" in place, as I don't need the board index to be indexed, and Google still follows the board links to get to the topics. Google's dashboard also logs what is and isn't restricted. Sorry to correct you, but Google will still crawl the page for links; it will just ignore the page when indexing.
OK, now that bit's out of the way, lol... The reason I ended up at this topic, after searching through millions of other SMF SEO-specific threads on here (which have been very useful), is that I have come across an issue similar to the original poster's, where Google is indexing the print pages. However, mine's not a duplication issue, and I have now blocked the print pages. My issue is that Google seems to be indexing ONLY the printable pages and not the normal topics. Has anyone else come across this?
Quote from: retis on March 25, 2009, 09:38:46 PM
OK, so I'm a little confused. I'm not correcting anyone in a hostile way, but one of the developers in this thread stated that robots.txt only stops Google from crawling a page, not from indexing it. This is totally incorrect.
Actually, they said it doesn't stop search engines from indexing the page:
Quote from: regularexpression on March 04, 2009, 02:52:52 PM
Also remember that robots.txt only prevents spidering; it does not prevent search engines from adding those links to their search index. (And note: if you block a page with robots.txt, the search engine will never spider the page to see the meta noindex.)
Unless Yahoo!, Live, Ask, Baidu, Yandex, etc have all shut down without me knowing it, this statement still holds true. I believe the statement is specifically true with Yahoo!'s behavior.
Quote from: retis on March 25, 2009, 09:38:46 PM
OK, now that bit's out of the way, lol... The reason I ended up at this topic, after searching through millions of other SMF SEO-specific threads on here (which have been very useful), is that I have come across an issue similar to the original poster's, where Google is indexing the print pages. However, mine's not a duplication issue, and I have now blocked the print pages. My issue is that Google seems to be indexing ONLY the printable pages and not the normal topics. Has anyone else come across this?
I personally haven't seen that issue.
Quote from: Motoko-chan on March 26, 2009, 02:30:45 AM
Actually, they said it doesn't stop search engines from indexing the page:
How's that level of pedantry working out for you? I thought this was a support forum for decent software, not a kids' bickering forum like the rest of the internet. And let's be totally honest: when people talk about SEO, it usually relates mainly to Google. I can't say I've ever heard someone say "ooh, let's Yandex it to find out".
Quote from: Motoko-chan on March 26, 2009, 02:30:45 AM
Unless Yahoo!, Live, Ask, Baidu, Yandex, etc have all shut down without me knowing it, this statement still holds true. I believe the statement is specifically true with Yahoo!'s behavior.
Well, since you enjoy being picky about things... which is it? The statement holds true, or the statement holds true for Yahoo? You can't be pedantic in one sentence and then contradict yourself in another. Don't get me wrong, you're a developer here and clearly smarter than me where coding is concerned, but intelligence isn't just measured by how well a person can code. As you may have gathered, I don't appreciate someone attempting to talk down to me when they know nothing about the type of person I am, especially when all I was giving was accurate information as far as I knew it, based on research and actual tests.
For the record, I AM correct in what I said, as I was talking directly about Google.
However, since I am not a pedantic little boy, I will say that after researching Yahoo more, I have found that the Yahoo spider is now pretty much identical to Google's and Live's, the exception being, as you correctly pointed out, that if a page is disallowed in robots.txt, then Yahoo will not crawl it.
What people need to realise is that MSN, Google and Live collaborate whenever they update their bots.
Oh, one last thing: since you have never seen the issue I was talking about, your whole post was fairly juvenile and pointless anyway, giving no real insight into anything other than pointing out something about Yahoo when I was talking about Google... Grow up, think before you speak, and don't speak down to people you don't know just because you design the software!
First, sorry if I offended you.
Quote from: retis on March 26, 2009, 04:30:56 AM
How's that level of pedantry working out for you? I thought this was a support forum for decent software, not a kids' bickering forum like the rest of the internet. And let's be totally honest: when people talk about SEO, it usually relates mainly to Google. I can't say I've ever heard someone say "ooh, let's Yandex it to find out".
Last I checked this was a serious forum.
I was simply trying to make sure you understood that the statement you mentioned was meant as a general behavior remark, not specific to one search engine. As you probably know, general remarks do often have exceptions.
Most of the places I follow for SEO discuss placing on other search engines as well, not just Google. I guess I'm just used to that being the case when bringing up the subject.
As for not hearing about Yandex, well, it's popular in Russia. Likewise Baidu in China. They also both index English-language sites. I added them as examples of very popular search engines in other parts of the globe, and possibly desirable engines to rank well in, depending on what kind of audience you want.
Quote from: retis on March 26, 2009, 04:30:56 AM
Well, since you enjoy being picky about things... which is it? The statement holds true, or the statement holds true for Yahoo?
I cannot say with certainty. I wasn't the person who did the huge survey of search engines. I do recall them saying that Y! did behave quite differently in indexing, so that is where I had my recollection.
Quote from: retis on March 26, 2009, 04:30:56 AM
You can't be pedantic in one sentence and then contradict yourself in another. Don't get me wrong, you're a developer here and clearly smarter than me where coding is concerned, but intelligence isn't just measured by how well a person can code.
Actually, I'm not wearing a developer badge. I just help manage the project, so I haven't done any core coding.
Quote from: retis on March 26, 2009, 04:30:56 AM
As you may have gathered, I don't appreciate someone attempting to talk down to me when they know nothing about the type of person I am, especially when all I was giving was accurate information as far as I knew it, based on research and actual tests.
Sorry if you felt I was offending you personally.
Quote from: retis on March 26, 2009, 04:30:56 AM
Oh, one last thing: since you have never seen the issue I was talking about, your whole post was fairly juvenile and pointless anyway, giving no real insight into anything other than pointing out something about Yahoo when I was talking about Google... Grow up, think before you speak, and don't speak down to people you don't know just because you design the software!
Once again sorry. I thought that perhaps I was being helpful. I guess not. I also figured that you would like evidence if someone hadn't seen your particular issue as well. I'll just remain silent now.
Apologies accepted. And I apologise for my rant; it was early and I was very tired, and I probably went a little overboard. I appreciate that you were trying to help.
I have confirmed this with Google in the past.
Robots.txt only prevents/stops crawling.
- It will NOT remove any existing links from search engine indexes (only meta noindex or X-Robots-Tag noindex can do that).
- IF there are sufficient backlinks, internal or external, pointing to certain links (that are blocked with robots.txt), Google & other search engines WILL add those links to their search index (WITHOUT ever crawling them, so each will be a link without a title/description).
This is a particular problem for dynamic scripts like SMF with .msg links.
IF a page is meta noindexed and blocked by robots.txt, Google & other search engines will NEVER crawl the page, and so will never discover the meta noindex.
This is why for SEO reasons, it is BEST to let Google crawl your site, and let it discover the meta noindex.
I'd only use robots.txt for bandwidth savings, and/or blocking directories.
This situation between robots.txt vs meta noindex is often described as backwards.
Several sites have PROPOSED a Noindex: directive, similar to Disallow:, for a future robots.txt standard:
http://sebastians-pamphlets.com/standardization-of-rep-tags-as-robots-txt-directives/
No offence, but the phrase "in the past" is correct. I have done tests over the past 5 weeks, and I keep up to date with the official Google blogs and announcements.
Google WILL still crawl pages listed in robots.txt; it will NOT index them. The only way to successfully remove pages from Google is to post a removal request and then block them with noindex, robots.txt or a 404.
Apologies if I come across as argumentative, but you guys designed the software, NOT the search engine spiders, and clearly haven't properly researched what the spiders do and do not do. A 2-second search on Google or Yahoo brings up this information direct from the horse's mouth. If you wish to reply again and say I'm wrong, well, then clearly I'm getting nowhere and will let you continue to misinform people, and this thread can join the other misinformative threads throughout the internet.
Retis,
I'd recommend trying it for yourself:
1. Block .msg links with robots.txt:
Disallow: /index.php?topic=*.msg
2. If you have spider-logging scripts that record each spider and the page it requested, you will see that once the rule has been in place for 24 hours, Google won't ever spider another .msg link again.
Let some time pass, e.g. 24hrs.
3. Using site:{site_name} AND inurl:".msg" you WILL be able to find .msg links that have been added to their index in the last 24hrs, but without a title and description.
Having discovered this issue, I contacted Google via their Google Webmasters forum and was told this fact.
Indexing and crawling are two separate things (a lot of people get confused or believe them to be the same thing when they are not).
Crawling - the spider browsing your site
Indexing - adding of a link to the search engine db (IF the page wasn't crawled, the link ONLY will be added to the search engine db).
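To make the crawling/indexing distinction concrete, here is a rough Python sketch of the behaviour as described in this post. It is a toy model of the claims above, not a statement of what any search engine is guaranteed to do, and the function name and outcome strings are made up for illustration:

```python
def predicted_outcome(blocked_by_robots: bool,
                      has_meta_noindex: bool,
                      has_backlinks: bool) -> str:
    """Toy model of the crawl/index behaviour described in the post above."""
    if blocked_by_robots:
        # The page is never fetched, so a meta noindex on it is never seen.
        if has_backlinks:
            return "not crawled, indexed as a bare link (no title/description)"
        return "not crawled, not indexed"
    if has_meta_noindex:
        # The spider fetches the page, reads the tag, and leaves it out.
        return "crawled, not indexed"
    return "crawled and indexed"


# A page that is both blocked and noindexed: the noindex has no effect.
print(predicted_outcome(blocked_by_robots=True,
                        has_meta_noindex=True,
                        has_backlinks=True))
```

Under this model, the only combination that reliably keeps a linked page out of the index is "not blocked, meta noindexed".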
I started to write a proper reply but stopped and decided to just give up and go to one of the unofficial help forums for SMF. As for trying this for myself... apologies if this comes across as rude, but I have no interest in constantly repeating myself when I have already said 3 times THIS IS BASED ON TESTS I HAVE DONE OVER THE PAST 5 WEEKS.
I understand it's probably inappropriate to use caps and I may get banned etc. etc., but it seems that people skim through the posts and take out the bits they need to make a reply that tries to force people to the same flawed conclusion.
Bottom line... I have tested, and I know what I know. Since nobody is even bothering to respond in any way to my actual query, I'll find the answers elsewhere and won't be back on the forum.
I do, however, appreciate the SMF software, find it very user friendly, and will be starting my second site using it... assuming I can get Google to index posts instead of profiles, boards, printable versions, my gran's dog etc.
umm...thanks anyway?! I guess ::)
With respect Retis
- I've done extensive SEO testing with SMF and search engines (more than a year's worth, in fact).
- I've contacted Google and I've told you what they explained to me.
- I regularly read Google and other respected SEO blogs (and have posted on them, in fact).
As with most SEO stuff, you really have to try it for yourself, as I have done.
5 weeks in SEO terms isn't very long. With all the changes I performed on my forum, it was nearer 3 months before I gained most benefit.
I'm not saying you haven't done SEO on SMF. What I am saying is that I have done tests for myself and the results are clear.
You also said the test should take 24 hours to show results:
"Let some time pass eg 24hrs
3. Using site:{site_name} AND inurl:".msg" you WILL be able to find .msg links that have been added to their index in the last 24hrs, but without a title and description."
Then you say 5 weeks isn't exactly long?! Surely you can see my frustration at the constant sidestepping, backtracking and contradictions in people's posts in this topic?
I am by no means an expert at SEO. However, what I can say is that 2 years ago I ran a business online and may as well have been non-existent on Google for my chosen search terms, showing up on page 200 million. When I optimised keywords, meta tags, alt tags and H1 tags, within 2 weeks I was second on Google and first on Yahoo and MSN. It's not really rocket science, and I wasn't asking for help with robots.txt in the first place.
My question was why only the print-page topics get indexed and not the normal posts, especially since they have the noindex meta tag. If someone can give even an educated guess, that would be great. If not, that's fine; I really don't care about anything else. I'll just fiddle with it myself, see how it goes for a few more weeks, and if I get no joy then I'll switch forum software, which I don't particularly want to do as I truly like SMF.
This whole debate is wasting both our time and is, quite frankly, irrelevant. You have your views based on whatever, and I have mine based on very recent results.
OK, well, I've figured out my own problem. However, now I have noticed something unusual.
The topics are now being indexed, but some of them are being indexed twice. This is also the case when making a sitemap. I get:
http://www.xtremegamernet.com/index.php?topic=153
AND
http://www.xtremegamernet.com/index.php?topic=153.0
Any ideas why it's picking up 2? I've put the mouse over every link I can find and they all say 153.0. Where are the ones without the decimal point coming from?
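For the sitemap side of this, one possible workaround is to normalise the URLs before deduplicating them. The Python sketch below assumes that topic=153 and topic=153.0 serve the same page (which is what the duplicate listing suggests), and the function name is made up for illustration. It edits the query string with a regular expression rather than re-encoding it, so SMF-style ";" separators are left alone:

```python
import re
from urllib.parse import urlsplit, urlunsplit


def canonical_topic_url(url: str) -> str:
    """Rewrite topic=N as topic=N.0, leaving every other URL untouched."""
    parts = urlsplit(url)
    # Match "topic=<digits>" only when it is a whole parameter, i.e. it is
    # followed by the end of the query or by a ";" or "&" separator.
    query = re.sub(r"(^|[;&])topic=(\d+)(?=$|[;&])", r"\1topic=\2.0", parts.query)
    return urlunsplit(parts._replace(query=query))


urls = [
    "http://www.xtremegamernet.com/index.php?topic=153",
    "http://www.xtremegamernet.com/index.php?topic=153.0",
]
unique = sorted({canonical_topic_url(u) for u in urls})
print(unique)
# prints ['http://www.xtremegamernet.com/index.php?topic=153.0']
```

This only tidies the sitemap; it does not explain where the links without the ".0" come from.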
Quote from: karlbenson on March 04, 2009, 02:52:52 PM
Robots.txt is only useful for saving bandwidth. If your concern is SEO/duplicate-content indexing, then don't use robots.txt; instead, meta noindex any duplicate-content pages (which aren't already noindexed).
P.S. SMF 2.0 noindexes duplicate content much better than SMF 1.1.x.
I installed SMF on a new domain only yesterday and Google has indexed 52 pages of it: mostly stuff like action=register, action=search, action=reply etc.
I have added robots.txt.
Posts can be noindexed, but how do I noindex these actions, which are affecting SEO?
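The general idea suggested earlier in the thread (meta noindex rather than robots.txt) could be applied to action pages along these lines. This is only a Python sketch of the decision logic, not SMF code (SMF itself is PHP); the function name is hypothetical, and the action list is just the three examples mentioned above:

```python
import re

# Hypothetical list: the three actions named in the post above.
NOINDEX_ACTIONS = {"register", "search", "reply"}


def robots_meta_tag(query_string: str) -> str:
    """Return a noindex meta tag for listed actions, else an empty string."""
    # SMF-style URLs may separate parameters with ";" as well as "&".
    match = re.search(r"(?:^|[;&])action=([^;&]*)", query_string)
    if match and match.group(1) in NOINDEX_ACTIONS:
        return '<meta name="robots" content="noindex" />'
    return ""


print(robots_meta_tag("action=register"))
print(repr(robots_meta_tag("topic=153.0")))
```

The tag would need to be emitted in the page head by the forum's template, and the pages must stay crawlable (not blocked in robots.txt) for the spider to see it.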
Quote from: retis on March 25, 2009, 09:38:46 PM
I have also tested this recently: I put "disallow: index.php?board*" in place, as I don't need the board index to be indexed, and Google still follows the board links to get to the topics. Google's dashboard also logs what is and isn't restricted. Sorry to correct you, but Google will still crawl the page for links; it will just ignore the page when indexing.
First of all, thanks for the robots.txt clarification. I was misled for some time into thinking that robots.txt is just a nofollow meta tag (and started making desperate searches on how to noindex directories).
Now, I have a question: why would you do this: "disallow: index.php?board*"?
Is this some sort of page sculpting to optimise the flow of link juice?
Is it beneficial for improving the SEO of a forum?
Sorry, I'm a total noob.
Well, I'd probably allow ?boards or ?topic.
Remember, the * at the end is implied and unnecessary.
Remember the most efficient way for google to spider your site is with an xml sitemap.
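On the point about the trailing *: robots.txt rules are prefix matches, so a rule already covers everything that starts with it. The Python sketch below is a simplified take on the wildcard rules ("*" for any run of characters, "$" to anchor the end) as documented by Google and in RFC 9309. It ignores percent-encoding and Allow/Disallow precedence, and the function name is made up for illustration:

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check whether a robots.txt rule path matches a URL path plus query."""
    # A trailing "$" anchors the rule to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then let "*" stand for any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only anchors at the start, which gives the prefix behaviour.
    return re.match(regex, path) is not None


url = "/index.php?board=3.0"
print(rule_matches("/index.php?board", url))
print(rule_matches("/index.php?board*", url))
print(rule_matches("/index.php?topic=*.msg", "/index.php?topic=153.msg12"))
```

Both board rules give the same answer for any URL, which is why the trailing * adds nothing. Note the leading "/" on each rule: rule paths are matched from the start of the URL path.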