Googlebot onclick crawling (was: "Googlebot issues")

Started by cortez, February 25, 2014, 10:21:22 AM


cortez

I am getting a lot of crawling "access denied" issues in stats. Something like this:

http://www.autobusi.org/forum/index.php?topic=150.%1$d
http://www.autobusi.org/forum/index.php?topic=8564.%1$d
http://www.autobusi.org/forum/index.php?topic=8419.msg%msg_id%
http://www.autobusi.org/forum/index.php?topic=1440.msg%msg_id%


If those %1$d and %msg_id% are removed, the links open just fine.

My robots.txt looks like this:

User-agent: Mediapartners-Google
Disallow:

User-agent: Googlebot-Mobile
Allow: /forum/*wap
Allow: /forum/*wap2
Disallow: /


User-agent: *
Disallow: /forum/*prev_next*
Disallow: /forum/*action*
Disallow: /forum/*sort*
Disallow: /forum/*msg*
Disallow: /forum/*prev_next*
Disallow: /forum/*;all
Disallow: /forum/*;imode
Disallow: /forum/*wap
Disallow: /forum/*wap2
Disallow: /forum/Themes/
Sitemap: http://www.autobusi.org/forum/sitemap.xml


What am I missing/doing wrong?

Relevant stuff:
- Optimus Brave installed
- SMF SEF setting disabled
- SMF 2.0.7

Thanks.

Kindred

Looks like something is creating bad links that the spiders find and choke on...

Each of those links has an end bit that is the PHP placeholder instead of the value that should have replaced it...
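For context, %1$d is a positional placeholder from PHP's sprintf() format syntax ("insert argument 1 as a decimal number"). A minimal illustration of what such a link looks like once the placeholder is actually filled in, using only the URL quoted above rather than any real SMF code:

<?php
// %1$d is a positional sprintf() placeholder: "argument 1, formatted as a decimal".
// A link that still shows the literal %1$d was never run through this substitution.
$base = 'http://www.autobusi.org/forum/index.php?topic=150.%1$d';

echo sprintf($base, 15);
// prints: http://www.autobusi.org/forum/index.php?topic=150.15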

cortez

I think I've figured it out...

At some point I was using that SEF thing (which you once told me in the Optimus Brave thread was stupid and useless), and there were some post links in that .html form on the forum.

I've googled all *.html links on my forum and edited all the posts containing internal links in that format. This really took time, but hopefully I won't have this issue anymore, since it's obvious those internal links were misleading the bots.

Any advice on the robots.txt?

Thanks for help.

Kindred

Not sure if that will fix it, since the bits that you quoted originally were fairly clearly pieces of code that got displayed as-is rather than being parsed as they should have been... but if it doesn't happen again, don't worry about it.

As for robots.txt... that should work just fine... but it's not really necessary either....  I don't have any of that in my robots.txt and my site gets crawled just fine. :)

dheeraj

Seems like your sitemap might be generating those bad URLs. I used to generate an XML sitemap with the xml-sitemaps site, and it gave me unwanted URLs and therefore a lot of errors in GWT.

cortez

Unfortunately, it isn't the sitemap that is generating those bad links. I've removed the Optimus Brave sitemap from Webmaster Tools, the site and robots.txt just to be sure, and gave it a couple of days to refresh.

It's still generating:

http://www.autobusi.org/forum/index.php?topic=121.%1$d
http://www.autobusi.org/forum/index.php?topic=33.%1$d
Etc...

Raw access logs confirm that it's fetching those bad URLs:
66.249.64.196 - - [02/Mar/2014:16:20:13 +0100] "GET /forum/index.php?topic=9334.%1$d HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


Also, I have one repeated "not found" URL:
URL:   http://www.autobusi.org/a%3E
which is reported as "linked from" these:
http://www.autobusi.org/forum/index.php?topic=543.170
http://www.autobusi.org/forum/index.php?topic=543.165
http://www.autobusi.org/forum/index.php?topic=1079.25
http://www.autobusi.org/forum/index.php?topic=6236.15

Strange stuff.


Oh, complete mod list:
1.    Simple Audio Video Embedder    2.4.1
2.    Join date and Location in Posts    1.3.1
3.    Optimus Brave    1.8.7
4.    Avatar Rounded Corners    1.0
5.    Add Facebook Like, Tweet, and Google +1    1.0.3a
6.    Users Online Today    2.0.2
7.    Ad Managment    3.1a
8.    The Rules    1.3
9.    SMF 2.0.7 Update    1.0
10.    Users mass actions    0.1.1
11.    Tapatalk SMF 2.0 Plugin    3.9.3

Kindred

Which is what I said in response to you several days ago...


I don't think it is any of those mods... I think you have a badly manually edited piece of code somewhere in your system.

cortez

Alright, it's time to uninstall all the mods and run the large upgrade package... Fingers crossed.

cortez

OK then...

1. Uninstalled all mods
2. Downloaded language packs just to be sure
3. Ran large upgrade package
4. Restored some of the necessary mods
5. Set search engine tracking level at very high

Spider log:
Google    Today at 20:46:44    a:3:{s:5:"topic";i:211;s:5:"board";i:19;s:10:"USER_AGENT";s:196:"Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot

Google    Today at 21:11:32    a:4:{s:5:"topic";i:4943;s:4:"wap2";s:0:"";s:5:"board";i:32;s:10:"USER_AGENT";s:169:"SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)";

Is this how it's supposed to look, or is it a bug? (I've never used spider tracking before.) Because there are normal entries there as well:
Google    Today at 20:56:18    Viewing the topic Paris.
Google (AdSense)    Today at 21:20:03    Viewing the topic Mercedes Benz Citaro II.
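Those long a:3:{...} entries appear to be nothing more than PHP serialize() output for the parameters of the request the spider made, so they look like normal high-tracking-level detail rather than a bug. A small sketch, using a shortened, hypothetical user-agent string modelled on the entry above, of what one decodes to:

<?php
// The logged value is a PHP-serialized array of the request details (assumption
// based on the format of the string shown above, not on SMF's source code).
$raw = 'a:3:{s:5:"topic";i:211;s:5:"board";i:19;s:10:"USER_AGENT";s:21:"Googlebot (truncated)";}';

print_r(unserialize($raw));
// Array ( [topic] => 211 [board] => 19 [USER_AGENT] => Googlebot (truncated) )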

Thanks.

cortez

No luck, same errors.

Could this be related to 2.0.7 and PHP 5.3? I'm running both (5.3.28 to be precise).

I'll roll back to 2.0.6.

cortez

To finally end this monologue... it's an error in SMF.

Got a reply from Google team:

Apparently my page source contains those invalid URLs, for example:
http://www.autobusi.org/forum/index.php?topic=4505.105
In source (twice): onclick="expandPages(this, 'http://www.autobusi.org/forum/index.php'+'?topic=4505.%1$d';, 15, 75, 15)


You can find the same thing here, for example:

http://www.simplemachines.org/community/index.php?debug;topic=517205.msg3670703;boardseen#new
In source (twice): onclick="expandPages(this, 'http://www.simplemachines.org/community/index.php'+'?topic=517205.%1$d';, 20, 100, 20)
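A rough sketch of the pattern at work (an illustration only, not SMF's actual template code): the server writes the base URL with the %1$d placeholder intact into the onclick attribute, because it is the client-side expandPages script that substitutes real page offsets when the "..." expander is clicked. A crawler that scrapes URLs straight out of the attribute without executing the script ends up requesting the literal ".%1$d" URLs seen in the logs.

<?php
// Illustration only: emit a pagination expander whose onclick carries the base
// URL with the raw %1$d placeholder. The JavaScript on the client fills in real
// page offsets only when the span is clicked.
$base_url   = 'http://www.autobusi.org/forum/index.php?topic=4505.%1$d';
$first_page = 15;  // assumed: offset of the first collapsed page
$last_page  = 75;  // assumed: offset of the last collapsed page
$per_page   = 15;  // assumed: messages per page

echo '<span onclick="expandPages(this, \'' . $base_url . '\', '
    . $first_page . ', ' . $last_page . ', ' . $per_page . ');">...</span>';
// Output: <span onclick="expandPages(this, 'http://...?topic=4505.%1$d', 15, 75, 15);">...</span>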

So much effort for nothing, heh.

Kindred

Ummmm.... but why would Googlebot be trying to read the URL out of a JavaScript onclick event? That doesn't make any sense... Google is smarter than that and doesn't catalog onclick events.

And that most certainly is **NOT** an error in SMF. That is a perfectly VALID onclick call.

cortez

I don't know, I just got a reply telling me to check my page source because those links are the cause of the errors.

You know more than me, that's for sure.

Kindred

Well, it's odd... because you are right. That code does match the errors that you are getting... On the other hand, I have used Google Webmaster Tools for ages and have never gotten those errors (despite, as you note, the code being present on every post page). Sorry for not recognizing it earlier, but I had never looked at how the "pages 1 2 ... 8 9" index gets built in HTML before...

that code is definitely there...

That code is definitely acceptable, though, and has never triggered any errors on any other SMF site that I have heard about.

cortez

No need to apologize, it's free software in the end. :)

Now the question is what to do to alleviate this issue so that it affects the site's ranking as little as possible...

cortez

Saga continues...

I've been using a redirect in .htaccess which immediately points the site root to /forum.

RewriteEngine on
RewriteCond %{REQUEST_URI} ^/$
RewriteRule (.*) http://www.autobusi.org/forum [R=301,L]

Could that cause this? I've been using it for ages and almost forgot it exists.

cortez

Nope, it wasn't that. Sigh, I officially give up now.

Apparently, Googlebot started crawling onclick events quite some time ago (2009). The question is, how do I form a rule to prevent it as a temporary measure?
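One possible stop-gap is a Disallow pattern for the literal placeholder; this is a sketch, not a guaranteed fix, since whether Google matches the literal % in those URLs exactly is an assumption. Two notes: Google treats a trailing $ as an end-of-URL anchor in robots.txt patterns, so it is safer to stop the pattern before the $ character, and the line should go inside the existing "User-agent: *" group (a separate Googlebot group would make Googlebot ignore the other Disallow lines, because Google only obeys the most specific matching group). The existing Disallow: /forum/*msg* line should already cover the .msg%msg_id% variants.

Disallow: /forum/*.%1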

Kindred

That makes no sense at all. Google should not be crawling onclick events... And as I said, if they did, then that same error would be present on every single SMF installation ever indexed by Google...

There is something odd about how your site got indexed.

cortez

Related to SMF or not, they're crawling onclick events... with the reasoning that webmasters often "unintentionally hide" links from crawlers.

http://www.searchenginepeople.com/blog/onclick.html

There are posts from 2010 about that (quite interesting discussion, though):
http://www.webmasterworld.com/google/4159807.htm

Even 2009:
http://searchengineland.com/google-io-new-advances-in-the-searchability-of-javascript-and-flash-but-is-it-enough-19881

I'll try to make a support request to them to see if they can avoid doing that on my site, although expecting any support from Google is pretty much a waste of time.




Gwenwyfar

Though this does seem to be an issue with Google, just leaving this here to say this is not an isolated case, as I get the same thing every once in a while. I got a bunch of those errors and had been dismissing them as the server's DOS protection blocking Google (no members ever have issues, so). This only happens on topics that have a lot of pages, which could be why some forums don't see it (or only rarely): on more serious forums discussions don't grow that big, or the posts-per-page setting keeps the page count down.

Maybe blocking the "%1$d" part in robots.txt would work as a solution?

Advertisement: