Simple Machines Community Forum

SMF Support => SMF 1.1.x Support => Topic started by: frostipuff - June 20, 2011, 06:57:54 AM

Title: Why is googlebot ignoring my robots.txt file?
Posted by: frostipuff - June 20, 2011, 06:57:54 AM
I'd been having issues with my forum content being indexed as Print Page (instead of display page), so I researched here and discovered I should modify my robots.txt file.

Here's what it says now:

User-agent: *
Disallow: /forum/Sources
Disallow: /forum/Smileys
Disallow: /forum/Packages
Disallow: /forum/avatars
Disallow: /forum/attachments
Disallow: /forum/Themes
Disallow: /forum/index.php?action=printpage
Disallow: /forum/index.php?action=stats
Disallow: /forum/index.php?action=help
Disallow: /forum/index.php?action=search
Disallow: /forum/index.php?action=mlist
Disallow: /forum/index.php?action=post
Disallow: /forum/index.php?action=profile;area=showposts;u=*
Disallow: /forum/index.php?action=profile;area=showposts;sa=attach;u=*
Disallow: /forum/index.php?wap2


My data resides at /public_html/forum and the robot.txt and robots.txt files are in public_html. Should I also have a robots.txt file at the /forum level?

I made these changes over a week ago, and I realize it will take time before I notice the Display Page coming up in searches instead of Print Page, but shouldn't these indexers be paying attention to my robots file? Googlebot, in particular, is still "Printing blah blah topic" when I monitor who's online.

Is there anything else I can do? I'd prefer not to disable the Print Topic functionality, as I find it really useful for archiving long posts.
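
For what it's worth, the printpage rule itself can be sanity-checked offline. Below is a minimal sketch using Python's standard-library urllib.robotparser, which does plain prefix matching. The example.com host and the topic number are placeholders I made up, not the real forum, and only a few of the rules are included.

```python
from urllib.robotparser import RobotFileParser

# A few of the rules from the robots.txt quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/Sources
Disallow: /forum/index.php?action=printpage
Disallow: /forum/index.php?action=search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs on a placeholder host; the topic number is invented.
print_url = "http://example.com/forum/index.php?action=printpage;topic=12.0"
topic_url = "http://example.com/forum/index.php/topic,12.0.html"

print_allowed = parser.can_fetch("Googlebot", print_url)
topic_allowed = parser.can_fetch("Googlebot", topic_url)
print("print page allowed:", print_allowed)
print("topic page allowed:", topic_allowed)
```

If the print URL comes back as disallowed and the topic URL as allowed, the rule text is probably not the problem, and it may be worth checking where the robots.txt file is actually being served from.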
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: kat - June 20, 2011, 07:21:11 AM
Hmmm... I think you need to make a slight alteration to each line, adding a "/" to the end.

Although, I believe that the paths have to be absolute.

So, "Disallow: /forum/index.php?action=printpage" won't work, whereas:

"Disallow: /forum/index.php/" will.

More info:

http://www.robotstxt.org/robotstxt.html
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: frostipuff - June 20, 2011, 07:33:54 AM
Ah! Thanks, I will try that and report back. Don't want to close this topic yet, just in case.
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: kat - June 20, 2011, 07:37:33 AM
I'm not 100% sure, myself, to be honest.

So, yeah, try it out, before you mark this "Solved". ;)
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: frostipuff - June 20, 2011, 07:55:58 AM
So do I lop off everything after /forum/index.php?

In other words, do this:

/forum/index.php/

Not this:

/forum/index.php/?action=printpage

I'm guessing the former 1) because the second isn't a valid URL and 2) because adding the forward slash after php disallows all the actions on index.php. Therefore, I could shorten my robots.txt file to look like this:

Disallow: /forum/Sources/
Disallow: /forum/Smileys/
Disallow: /forum/Packages/
Disallow: /forum/avatars/
Disallow: /forum/attachments/
Disallow: /forum/Themes/
Disallow: /forum/index.php/
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: kat - June 20, 2011, 07:59:43 AM
Yeah, that looks fine, to me.

All that "action=" stuff is SMF stuff, not true paths.
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: frostipuff - June 20, 2011, 08:07:39 AM
I am more into databases than web stuffs, and I only know the basics of php (haven't had to learn too much because SMF does what I need), but this change looks like it might be too aggressive.

If I include this in my robots file:

Disallow: /forum/index.php/

Won't that prevent good spiders from indexing anything on my forum?

Here's a typical URL, where in fact, all topics come after index.php:

theeverydaybeauty.com/forum/index.php/board,3.0.html

I am changing the file to say this and see how it goes:

User-agent: *
Disallow: /forum/Sources/
Disallow: /forum/Smileys/
Disallow: /forum/Packages/
Disallow: /forum/avatars/
Disallow: /forum/attachments/
Disallow: /forum/Themes/
Disallow: /forum/index.php?action=printpage/
Disallow: /forum/index.php?action=stats/
Disallow: /forum/index.php?action=help/
Disallow: /forum/index.php?action=search/
Disallow: /forum/index.php?action=mlist/
Disallow: /forum/index.php?action=post/
Disallow: /forum/index.php?action=profile;area=showposts;u=*/
Disallow: /forum/index.php?action=profile;area=showposts;sa=attach;u=*/
Disallow: /forum/index.php?wap2/


If anyone sees a problem with those URLs, please let me know.
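
To compare the candidate rules side by side, here is a rough sketch that assumes the classic prefix matching described on robotstxt.org (Google also understands the * and $ wildcards, which this ignores). The is_blocked helper and the print URL are made up for illustration; the board URL is the style shown above.

```python
def is_blocked(path, disallow_rules):
    """Return True if any non-empty Disallow value is a prefix of the path."""
    return any(path.startswith(rule) for rule in disallow_rules if rule)

# URL style from the post, minus the host name.
board_path = "/forum/index.php/board,3.0.html"
# Hypothetical print page URL; the topic number is invented.
print_path = "/forum/index.php?action=printpage;topic=3.0"

broad_rule = ["/forum/index.php/"]
slash_rule = ["/forum/index.php?action=printpage/"]
plain_rule = ["/forum/index.php?action=printpage"]

broad_blocks_board = is_blocked(board_path, broad_rule)   # whole forum caught
slash_blocks_print = is_blocked(print_path, slash_rule)   # trailing "/" breaks the prefix
plain_blocks_print = is_blocked(print_path, plain_rule)   # original form of the rule
plain_blocks_board = is_blocked(board_path, plain_rule)   # normal pages left alone

print(broad_blocks_board, slash_blocks_print, plain_blocks_print, plain_blocks_board)
```

Under that assumption the broad index.php/ rule catches the board page, while the version with a slash appended after printpage no longer matches a print URL that continues with ";topic=...", so the trailing slash looks like it does more harm than good on the action lines.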
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: kat - June 20, 2011, 08:11:32 AM
OOPS! Yes, the index.php one will block your whole forum.

Sorry, for that!

I don't think you need all the ones with "action" in.
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: frostipuff - June 20, 2011, 08:17:29 AM
Quote from: K@ - June 20, 2011, 08:11:32 AM
OOPS! Yes, the index.php one will block your whole forum.

Sorry, for that!

Heh, no problem. I am fully caffeinated and on my toes.

Quote:
I don't think you need all the ones with "action" in.

OK, fair enough, but how will I keep the bots from accessing the Print Page utility if I don't include this:

Disallow: /forum/index.php?action=printpage/

Is there another way?

I'm trying to prevent this in search results:
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: kat - June 20, 2011, 08:23:46 AM
Why worry?

They won't be printing anything, after all. ;)
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: frostipuff - June 20, 2011, 08:46:42 AM
I'm not worried about what these bots print. My concern is that they are indexing the Print Page results instead of the display page, so when the person issuing the search gets results, he or she cannot click on a result and go directly to the post. They see a flat and static page.
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: kat - June 20, 2011, 08:48:34 AM
Ah.

I think that might be best dealt with using .htaccess.

I don't know how, though, to be honest.

I'll do some research. :)
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: kat - June 20, 2011, 09:27:55 AM
Looks like I may have been totally wrong. :(

http://www.veign.com/blog/2007/10/06/robots-txt-file-for-an-smf-forum/
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: frostipuff - June 20, 2011, 11:15:52 AM
Interesting.

It's very close to what I had before, so I wonder why the bots were ignoring the "don't print page" line.

Thanks for the link.
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: frostipuff - June 20, 2011, 11:55:57 AM
OK, I think I might see where I screwed up, but I am not 100% sure this will fix it.

I've been disallowing /forum/action-stuff

but the forum I'm talking about here is a subdomain; thus, content is one level deeper, in /blahblahblah.com/forum/  <-- fake domain, obviously

So here's the dumb question: should I add that domain before each instance of /forum in the disallow list? Like this:

User-agent: *
Disallow: /blah.com/forum/index.php?action=activate
Disallow: /blah.com/forum/index.php?action=admin
Disallow: /blah.com/forum/index.php?action=arcade
Disallow: /blah.com/forum/index.php?action=calendar
Disallow: /blah.com/forum/index.php?action=collapse
Disallow: /blah.com/forum/index.php?action=deletemsg
Disallow: /blah.com/forum/index.php?action=editpoll
Disallow: /blah.com/forum/index.php?action=help
Disallow: /blah.com/forum/index.php?action=helpadmin
Disallow: /blah.com/forum/index.php?action=lock
Disallow: /blah.com/forum/index.php?action=login
Disallow: /blah.com/forum/index.php?action=logout
Disallow: /blah.com/forum/index.php?action=markasread
Disallow: /blah.com/forum/index.php?action=mergetopics
Disallow: /blah.com/forum/index.php?action=mlist
Disallow: /blah.com/forum/index.php?action=modifykarma
Disallow: /blah.com/forum/index.php?action=movetopic
Disallow: /blah.com/forum/index.php?action=notify
Disallow: /blah.com/forum/index.php?action=notifyboard
Disallow: /blah.com/forum/index.php?action=pm
Disallow: /blah.com/forum/index.php?action=post
Disallow: /blah.com/forum/index.php?action=printpage
Disallow: /blah.com/forum/index.php?action=profile
Disallow: /blah.com/forum/index.php?action=profile;area=showposts;u=*
Disallow: /blah.com/forum/index.php?action=profile;area=showposts;sa=attach;u=*
Disallow: /blah.com/forum/index.php?action=register
Disallow: /blah.com/forum/index.php?action=removetopic2
Disallow: /blah.com/forum/index.php?action=reporttm
Disallow: /blah.com/forum/index.php?action=search
Disallow: /blah.com/forum/index.php?action=sendtopic
Disallow: /blah.com/forum/index.php?action=splittopics
Disallow: /blah.com/forum/index.php?action=stats
Disallow: /blah.com/forum/index.php?action=sticky
Disallow: /blah.com/forum/index.php?action=trackip
Disallow: /blah.com/forum/index.php?action=unread
Disallow: /blah.com/forum/index.php?action=unreadreplies
Disallow: /blah.com/forum/index.php?wap2
Disallow: /blah.com/forum/index.php?action=who
Disallow: /blah.com/forum/attachments/
Disallow: /blah.com/forum/avatars/
Disallow: /blah.com/forum/Packages/
Disallow: /blah.com/forum/Smileys/
Disallow: /blah.com/forum/Sources/
Disallow: /blah.com/forum/Themes/
Disallow: /blah.com/forum/*.msg


If someone needs the real domain to help me troubleshoot, just let me know.
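
As far as I understand robotstxt.org, a Disallow value is a URL path on the host rather than a folder path on the server, and each host name (a subdomain included) is asked for its own /robots.txt at its root. A small sketch of that mapping, with blah.example standing in for the real domain and both helper names invented for illustration:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url_for(page_url):
    """URL where a crawler would look for robots.txt for this page's host."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

def rule_path_for(page_url):
    """Path-plus-query part of the URL that Disallow values are compared with."""
    parts = urlsplit(page_url)
    return parts.path + ("?" + parts.query if parts.query else "")

# Placeholder subdomain URL; not the real forum address.
page = "http://blah.example/forum/index.php?action=printpage;topic=1.0"

print(robots_url_for(page))
print(rule_path_for(page))
```

If that reading is right, the rules would keep starting with /forum/ and the domain name would not appear in them; what matters is that the file is reachable at the root of the subdomain itself.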
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: kat - June 20, 2011, 12:17:36 PM
I would've thought that they needed the absolute paths, yes.

Looking at that page, that's how I read it, anyway.

I'm sorry that I led you astray, at first.

I was relying on experience, at the time.

It appears that my experiences, of this, weren't as wide-reaching as I'd thought. ;)
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: frostipuff - June 20, 2011, 12:29:25 PM
No problem. In my day job, I am used to questioning developers. ;)

I will keep an eye on the who's online page and hope I don't see any of the bots printing! If yes, I will resolve this bug issue.  <-- See? I'm in a bug bashing frame of mind, lol.

Edit.
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: kat - June 20, 2011, 12:30:33 PM
Sounds good.

Love the username, by the way.

"Different". :)
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: frostipuff - June 20, 2011, 05:02:37 PM
Modifying the robots.txt isn't stopping the bots from using the Print Page action to index content on my forum.

Right now I have Guest IP addresses that resolve to AT&T Internet Services and Comcast. I suppose those could be real lurkers actually printing topics, but I doubt it because one of the topics shows up as Print Page results in a Google search.

I really want to stop this.

Any other ideas?
Title: Re: Why is googlebot ignoring my robots.txt file?
Posted by: kat - June 21, 2011, 06:26:13 AM
Yep! :)

http://www.robotstxt.org/meta.html

Quote:
The most explicit way to disallow all references to the print pages is to use the robots noindex meta tag, but not also to use robots.txt. If you use robots.txt to block a print page, it is true that Google won't spider the content of the page, but that prevents Google from reading the robots meta nofollow on the page. In that event, Google will index the url of the print page if the link is publicly available, albeit it won't show any content from the page. This url may or may not end up appearing in a Google search.

By itself, the noindex attribute in the link to the page won't prevent the page from being indexed by Google, because it's possible for Google to follow another link to this page, or to spider it via urls it finds in publicly accessible log files.

And... if you had the noindex meta on the print page, but you had a noindex attribute in the link to the print page, and that was the only link to the print page, it's not clear how that would affect Google's indexing the url. ;) Google wouldn't spider the print page and therefore wouldn't see the noindex meta. Google would index the link anchor text as part of the source page content, but opinions and logic varies on whether Google would index that text as a link anchor.

http://www.webmasterworld.com/google/3546100.htm
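
One way to see whether a print page already carries the robots noindex meta tag the quote talks about is to scan its HTML. This is only a sketch using Python's standard html.parser; the class and function names are invented, and the sample markup is hypothetical, not actual SMF template output.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())

def has_noindex(html):
    """Return True if the page asks robots not to index it."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives

# Hypothetical print-page markup, with and without the tag.
with_tag = '<html><head><meta name="robots" content="noindex" /></head><body></body></html>'
without_tag = "<html><head><title>Print Page</title></head><body></body></html>"

print(has_noindex(with_tag), has_noindex(without_tag))
```

Going by the quote, the tag only helps if the print URL is not also blocked in robots.txt, since a blocked page is never fetched and the tag is never read.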