Why is googlebot ignoring my robots.txt file?

Started by frostipuff, June 20, 2011, 06:57:54 AM


frostipuff

I'd been having issues with my forum content being indexed as Print Page (instead of display page), so I researched here and discovered I should modify my robots.txt file.

Here's what it says now:

User-agent: *
Disallow: /forum/Sources
Disallow: /forum/Smileys
Disallow: /forum/Packages
Disallow: /forum/avatars
Disallow: /forum/attachments
Disallow: /forum/Themes
Disallow: /forum/index.php?action=printpage
Disallow: /forum/index.php?action=stats
Disallow: /forum/index.php?action=help
Disallow: /forum/index.php?action=search
Disallow: /forum/index.php?action=mlist
Disallow: /forum/index.php?action=post
Disallow: /forum/index.php?action=profile;area=showposts;u=*
Disallow: /forum/index.php?action=profile;area=showposts;sa=attach;u=*
Disallow: /forum/index.php?wap2


My data resides at /public_html/forum and the robot.txt and robots.txt files are in public_html. Should I also have a robots.txt file at the /forum level?

I made these changes over a week ago, and I realize it will take time before I notice the Display Page coming up in searches instead of Print Page, but shouldn't these indexers be paying attention to my robots file?  Googlebot, in particular, is still "Printing blah blah topic" when I monitor who's online.

Is there anything else I can do? I'd prefer not to disable the Print Topic functionality, as I find it really useful for archiving long posts.

kat

Hmmm... I think you need to make a slight alteration to each line, adding a "/" to the end.

Although, I believe that the paths have to be absolute.

So, "Disallow: /forum/index.php?action=printpage" won't work, whereas:

"Disallow: /forum/index.php/" will.

More info:

http://www.robotstxt.org/robotstxt.html

frostipuff

Ah! Thanks, I will try that and report back. Don't want to close this topic yet, just in case.

kat

I'm not 100% sure, myself, to be honest.

So, yeah, try it out, before you mark this "Solved". ;)

frostipuff

So do I lop off everything after /forum/index.php?

In other words, do this:

/forum/index.php/

Not this:

/forum/index.php/?action=printpage

I'm guessing the former 1) because the second isn't a valid URL and 2) because adding the forward slash after php disallows all the actions on index.php. Therefore, I could shorten my robots.txt file to look like this:

Disallow: /forum/Sources/
Disallow: /forum/Smileys/
Disallow: /forum/Packages/
Disallow: /forum/avatars/
Disallow: /forum/attachments/
Disallow: /forum/Themes/
Disallow: /forum/index.php/

kat

Yeah, that looks fine, to me.

All that "action=" stuff is SMF stuff, not true paths.

frostipuff

I am more into databases than web stuffs, and I only know the basics of php (haven't had to learn too much because SMF does what I need), but this change looks like it might be too aggressive.

If I include this in my robots file:

Disallow: /forum/index.php/

Won't that prevent good spiders from indexing anything on my forum?

Here's a typical URL, where in fact, all topics come after index.php:

theeverydaybeauty.com/forum/index.php/board,3.0.html

I am changing the file to say this and see how it goes:

User-agent: *
Disallow: /forum/Sources/
Disallow: /forum/Smileys/
Disallow: /forum/Packages/
Disallow: /forum/avatars/
Disallow: /forum/attachments/
Disallow: /forum/Themes/
Disallow: /forum/index.php?action=printpage/
Disallow: /forum/index.php?action=stats/
Disallow: /forum/index.php?action=help/
Disallow: /forum/index.php?action=search/
Disallow: /forum/index.php?action=mlist/
Disallow: /forum/index.php?action=post/
Disallow: /forum/index.php?action=profile;area=showposts;u=*/
Disallow: /forum/index.php?action=profile;area=showposts;sa=attach;u=*/
Disallow: /forum/index.php?wap2/


If anyone sees a problem with those URLs, please let me know.
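One thing worth checking in the list above is the trailing "/" on the action lines. Here is a minimal sketch, assuming plain prefix matching as described on robotstxt.org (wildcard handling, which some crawlers add, is left out). The helper name `is_disallowed` and the topic number in the sample URL are made up for illustration.

```python
def is_disallowed(url_path, rules):
    """Return True if url_path starts with any Disallow prefix.

    Plain prefix matching only; wildcards are not interpreted.
    """
    return any(rule and url_path.startswith(rule) for rule in rules)


# Hypothetical print-page request; the topic number is invented.
print_url = "/forum/index.php?action=printpage;topic=42.0"

with_slash = ["/forum/index.php?action=printpage/"]
without_slash = ["/forum/index.php?action=printpage"]

print(is_disallowed(print_url, with_slash))     # False
print(is_disallowed(print_url, without_slash))  # True
```

Under that assumption the rule ending in "printpage/" never matches, because the real URL continues with ";topic=..." rather than "/". The version without the trailing slash does match.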

kat

OOPS! Yes, the index.php one will block your whole forum.

Sorry, for that!

I don't think you need all the ones with "action" in.
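To see why "Disallow: /forum/index.php/" is too broad, here is a rough sketch that compares the rule against two paths by simple prefix match. The board URL is the one quoted earlier in the thread; the print-page URL and the function name `blocked_by` are assumptions for illustration only.

```python
def blocked_by(rule, url_path):
    """Simple prefix test: a path is blocked if it begins with the rule."""
    return url_path.startswith(rule)


rule = "/forum/index.php/"

# Board URL taken from the thread.
board_url = "/forum/index.php/board,3.0.html"
# Hypothetical print-page URL; the topic number is invented.
print_url = "/forum/index.php?action=printpage;topic=42.0"

print(blocked_by(rule, board_url))  # True: normal board pages are blocked
print(blocked_by(rule, print_url))  # False: the print page is not
```

So with this rule the regular pages would be hidden from crawlers while the print page, which has "?" right after index.php, would still be reachable.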

frostipuff

Quote from: K@ - June 20, 2011, 08:11:32 AM
OOPS! Yes, the index.php one will block your whole forum.

Sorry, for that!

Heh, no problem. I am fully caffeinated and on my toes.

Quote
I don't think you need all the ones with "action" in.

OK, fair enough, but how will I keep the bots from accessing the Print Page utility if I don't include this:

Disallow: /forum/index.php?action=printpage/

Is there another way?

I'm trying to prevent this in search results:

kat

Why worry?

They won't be printing any thing, after all. ;)

frostipuff

I'm not worried about what these bots print. My concern is that they are indexing the Print Page results instead of the display page, so when the person issuing the search gets results, he or she cannot click on it and go directly to the post. They see a flat and static page.

kat

Ah.

I think that might be best dealt with using .htaccess.

I don't know how, though, to be honest.

I'll do some research. :)


frostipuff

Interesting.

It's very close to what I had before, so I wonder why the bots were ignoring the "don't print page" line.

Thanks for the link.

frostipuff

OK, I think I might see where I screwed up, but I am not 100% sure this will fix it.

I've been disallowing /forum/action-stuff

but the forum I'm talking about here is a subdomain; thus, content is one level deeper, in /blahblahblah.com/forum/  <-- fake domain, obviously

So here's the dumb question: should I add that domain before each instance of /forum in the disallow list? Like this:

User-agent: *
Disallow: /blah.com/forum/index.php?action=activate
Disallow: /blah.com/forum/index.php?action=admin
Disallow: /blah.com/forum/index.php?action=arcade
Disallow: /blah.com/forum/index.php?action=calendar
Disallow: /blah.com/forum/index.php?action=collapse
Disallow: /blah.com/forum/index.php?action=deletemsg
Disallow: /blah.com/forum/index.php?action=editpoll
Disallow: /blah.com/forum/index.php?action=help
Disallow: /blah.com/forum/index.php?action=helpadmin
Disallow: /blah.com/forum/index.php?action=lock
Disallow: /blah.com/forum/index.php?action=login
Disallow: /blah.com/forum/index.php?action=logout
Disallow: /blah.com/forum/index.php?action=markasread
Disallow: /blah.com/forum/index.php?action=mergetopics
Disallow: /blah.com/forum/index.php?action=mlist
Disallow: /blah.com/forum/index.php?action=modifykarma
Disallow: /blah.com/forum/index.php?action=movetopic
Disallow: /blah.com/forum/index.php?action=notify
Disallow: /blah.com/forum/index.php?action=notifyboard
Disallow: /blah.com/forum/index.php?action=pm
Disallow: /blah.com/forum/index.php?action=post
Disallow: /blah.com/forum/index.php?action=printpage
Disallow: /blah.com/forum/index.php?action=profile
Disallow: /blah.com/forum/index.php?action=profile;area=showposts;u=*
Disallow: /blah.com/forum/index.php?action=profile;area=showposts;sa=attach;u=*
Disallow: /blah.com/forum/index.php?action=register
Disallow: /blah.com/forum/index.php?action=removetopic2
Disallow: /blah.com/forum/index.php?action=reporttm
Disallow: /blah.com/forum/index.php?action=search
Disallow: /blah.com/forum/index.php?action=sendtopic
Disallow: /blah.com/forum/index.php?action=splittopics
Disallow: /blah.com/forum/index.php?action=stats
Disallow: /blah.com/forum/index.php?action=sticky
Disallow: /blah.com/forum/index.php?action=trackip
Disallow: /blah.com/forum/index.php?action=unread
Disallow: /blah.com/forum/index.php?action=unreadreplies
Disallow: /blah.com/forum/index.php?wap2
Disallow: /blah.com/forum/index.php?action=who
Disallow: /blah.com/forum/attachments/
Disallow: /blah.com/forum/avatars/
Disallow: /blah.com/forum/Packages/
Disallow: /blah.com/forum/Smileys/
Disallow: /blah.com/forum/Sources/
Disallow: /blah.com/forum/Themes/
Disallow: /blah.com/forum/*.msg


If someone needs the real domain to help me troubleshoot, just let me know.
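A quick way to reason about the domain question: crawlers request robots.txt from the root of each host, and the Disallow values are compared against the URL path on that host, not against the folder layout on the server. The sketch below assumes the forum is reached at http://blah.com/forum/ (the fake domain from this post). The helper names and the topic number are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt location for the host that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def match_target(page_url):
    """Return the path plus query string that Disallow prefixes are compared to."""
    parts = urlsplit(page_url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# Assumed public URL of a print page; domain and topic number are placeholders.
page = "http://blah.com/forum/index.php?action=printpage;topic=1.0"

print(robots_url(page))
print(match_target(page))
print(match_target(page).startswith("/blah.com/forum/"))
```

If the forum really is served from its own host name, the robots.txt for it has to be reachable at the root of that host, and a rule beginning with "/blah.com/forum/" would not match anything, since the host name is never part of the path being compared.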

kat

I would've thought that they needed the absolute paths, yes.

Looking at that page, that's how I read it, anyway.

I'm sorry that I led you astray, at first.

I was relying on experience, at the time.

It appears that my experiences, of this, weren't as wide-reaching as I'd thought. ;)

frostipuff

No problem. In my day job, I am used to questioning developers. ;)

I will keep an eye on the who's online page and hope I don't see any of the bots printing! If yes, I will resolve this bug issue.  <-- See? I'm in a bug bashing frame of mind, lol.

Edit.

kat

Sounds good.

Love the username, by the way.

"Different". :)

frostipuff

Modifying the robots.txt isn't stopping the bots from using the Print Page action to index content on my forum.

Right now I have Guest IP addresses that resolve to AT&T Internet Services and Comcast. I suppose those could be real lurkers actually printing topics, but I doubt it because one of the topics shows up as Print Page results in a Google search.

I really want to stop this.

Any other ideas?

kat

Yep! :)

http://www.robotstxt.org/meta.html

Quote
The most explicit way to disallow all references to the print pages is to use the robots noindex meta tag, but not also to use robots.txt. If you use robots.txt to block a print page, it is true that Google won't spider the content of the page, but that prevents Google from reading the robots meta nofollow on the page. In that event, Google will index the url of the print page if the link is publicly available, albeit it won't show any content from the page. This url may or may not end up appearing in a Google search.

By itself, the noindex attribute in the link to the page won't prevent the page from being indexed by Google, because it's possible for Google to follow another link to this page, or to spider it via urls it finds in publicly accessible log files.

And... if you had the noindex meta on the print page, but you had a noindex attribute in the link to the print page, and that was the only link to the print page, it's not clear how that would affect Google's indexing the url. ;) Google wouldn't spider the print page and therefore wouldn't see the noindex meta. Google would index the link anchor text as part of the source page content, but opinions and logic varies on whether Google would index that text as a link anchor.

http://www.webmasterworld.com/google/3546100.htm
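If the meta tag route is taken, it can help to confirm that the print page really carries the tag. Below is a minimal sketch using Python's standard html.parser. The sample HTML is invented for the example and is not SMF's actual print template, and the names `RobotsMetaFinder` and `has_noindex` are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Invented sample page, standing in for a print-page response.
sample = """<html><head>
<meta name="robots" content="noindex, follow" />
<title>Print Page - Some topic</title>
</head><body>...</body></html>"""

print(has_noindex(sample))
```

As the quote above points out, a crawler can only read this tag if robots.txt does not block the print page, so the two approaches should not be combined for the same URL.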
