Cron.php

Started by gkawa, May 23, 2024, 07:36:22 AM


gkawa

I apologize if it's a silly question. I tried to find more information about Cron.php and nothing came up.

As I understand it, cron.php runs all the time-based tasks, and it's called from JavaScript as users browse the forum. It's safe to call it at any time. In fact, it is recommended to call it from a server cron job, to keep the tasks running even if no one is accessing the forum.
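For the server cron job idea, here is a minimal sketch of a script that cron could run every few minutes. It assumes cron.php can be requested over HTTP with a ts query parameter, as the requests discussed in this thread suggest. The example.com URL and the function names are placeholders, not part of SMF.

```python
import time
import urllib.request
from typing import Optional

# Placeholder address; replace with the real location of your forum's cron.php.
CRON_URL = "https://example.com/forum/cron.php"


def build_cron_url(base_url: str, now: Optional[float] = None) -> str:
    """Append a ts parameter (Unix time in whole seconds) to the cron URL."""
    ts = int(time.time() if now is None else now)
    return f"{base_url}?ts={ts}"


def ping_cron(base_url: str = CRON_URL, timeout: float = 10.0) -> int:
    """Request cron.php once and return the HTTP status code."""
    with urllib.request.urlopen(build_cron_url(base_url), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    print(ping_cron())
```

If the server allows running PHP from the command line, calling cron.php that way may be simpler; check the SMF documentation for the method it recommends.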

My question is whether anyone knows why Googlebot is constantly asking for cron.php.

It's not a big deal, not a safety concern. I'm just puzzled. It can't get that URL from the site pages. The source code doesn't have it. But the worst part is that it's calling a URL that never existed.

The forum was originally installed in the root directory, so the URL was /cron.php. More than 10 years ago it was moved to /forum. Googlebot is asking for /cron.php every single day, every minute or so. I thought it was checking URLs from way back, but the ts is current! It doesn't match the timestamp of the request, and the values aren't in any sequence. They are anything from an hour to a week old. It can't be coming from my pages: even if the cron.php URL were on them, the links would all point to /forum. Even worse, Googlebot is not asking for /forum/cron.php.

Where is it getting those URLs? And why?..

shawnb61

I believe that Googlebot emulates user activity in the way it crawls, even invoking JS.  So yes, it triggers cron.

I don't think many other crawlers do that.

That should be a good thing, as it helps ensure scheduled tasks & mailings will all happen timely, even with zero activity.

If only it weren't buggy.  Yes, if your forum is in a subfolder, it seems to not invoke cron properly - it tries to run it at root.  But it's not there...  ::)

That, I think, is a bug in googlebot.
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

gkawa

Yes, it could be a good thing... if it calls the right URL for a change! :D
Not that it makes a real difference. If your forum is very active, the same activity is triggering cron. If it's not, there's no need to.

But Googlebot doing that? Is it just me, or is it completely bonkers?
And, as far as I can see, no other bot is doing that. Maybe Google knows something I don't.

I was wondering if someone else noticed that behavior. Probably not if Googlebot is using the right cron URL. I found it because it's filling my error logs.

Kindred

Googlebot is not hitting the cron.php itself....  it's triggering it as any guest would

It's just acting differently from other bots
Слaва
Украинi

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."

gkawa

Quote from: Kindred on May 24, 2024, 12:55:24 PM
Googlebot is not hitting the cron.php itself....  it's triggering it as any guest would

It's just acting differently from other bots

It's not triggering it, in this case. I found this because Google is calling a cron.php that doesn't exist, with a current timestamp. To trigger it, the index.php page would have to be available there, and it's not.

I checked the logs

This is a normal triggering, a user browsing the forum and calling cron.php from there.
[nofollow]
This is Googlebot triggering, about the same except it doesn't call all the objects.
[nofollow]
This is Googlebot calling the wrong cron.php out of the blue. I thought it could be a delayed call, it's about 15 seconds off, compared with the previous call from a user. But I couldn't find a call to index.php around that time, the right one or the wrong one.
[nofollow]
This is Googlebot calling the wrong cron.php with a TS from almost a year ago. The forum was migrated from root more than 10 years ago.
[nofollow]
And here's what I think could be the root of the problem. Googlebot automatically created the wrong URL from the right one. Both calls with the same TS.
[nofollow]

My theory, Googlebot is hitting dynamic URLs that it (with its marvelous AI...) assumes may have a timestamp parameter to get the most current response. It's wrong, but... Google...
It has the cron.php stored from years ago and keeps it because it knows about the new one and thinks it may be a more canonical form. Does it make sense?
It doesn't explain why it's keeping the URLs it creates, even after a 404. That's craziness, according to Rita Mae Brown...
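The checks described above (which cron.php path was requested, and how far the ts value is from the time of the request) can be scripted. A minimal sketch, assuming each log record has already been reduced to a request time and a URL, and that ts is a Unix timestamp in seconds. The /forum/ prefix comes from this thread; the function name and return values are made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

FORUM_PREFIX = "/forum/"  # where the forum lives now, per the thread


def classify_cron_request(url: str, requested_at: datetime):
    """Return (location, age_seconds) for a cron.php request, or None otherwise.

    location is "forum" for the real URL and "wrong-path" for anything else.
    age_seconds is request time minus ts, or None when ts is missing or not numeric.
    """
    parts = urlsplit(url)
    if not parts.path.endswith("/cron.php"):
        return None
    location = "forum" if parts.path.startswith(FORUM_PREFIX) else "wrong-path"
    ts_values = parse_qs(parts.query).get("ts")
    age = None
    if ts_values and ts_values[0].isdigit():
        age = int(requested_at.timestamp()) - int(ts_values[0])
    return location, age
```

Running this over a day of Googlebot requests would show at a glance whether the wrong-path calls reuse ts values from earlier legitimate calls or carry values that appear nowhere else in the log.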

Kindred

Again... nothing should be hitting cron.php directly.

That's not a url that Google should ever hit.

Cron.php is triggered by activity in the forum.  So, Googlebot is browsing through the forum and that activity is triggering cron.php to run...

And not every page view triggers cron (I believe)

shawnb61

And yet it does.  Incorrectly.

gkawa

Quote from: Kindred on May 26, 2024, 08:24:49 AM
Again... nothing should be hitting cron.php directly.
That's not a url that Google should ever hit.

That's what I'm saying. Googlebot is doing exactly that, hitting cron.php directly.

By the way, and I don't think it's a surprise, Googlebot is completely ignoring robots.txt.
Quote
Disallow: /chron.php*
Disallow: /forum/chron.php*

From Google's article about robots.txt
Quote
robots.txt rules may not be supported by all search engines.
The instructions in robots.txt files cannot enforce crawler behavior to your site; it's up to the crawler to obey them. While Googlebot and other respectable web crawlers obey the instructions in a robots.txt file, other crawlers might not. Therefore, if you want to keep information secure from web crawlers, it's better to use other blocking methods, such as password-protecting private files on your server.

shawnb61

You have a typo there.  cron, not chron.

But it will still invoke it at root.
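To see why the misspelled rules never applied, here is a minimal sketch of robots.txt path matching. It is a simplified take on the wildcard rules Google documents (`*` matches any run of characters, a trailing `$` anchors the end, and everything else is a prefix match). It ignores Allow/Disallow precedence and percent-encoding, and the function name is made up for illustration.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the requested path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each '*' match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only anchors at the start, which gives prefix matching.
    return re.match(regex, path) is not None


print(rule_matches("/chron.php*", "/cron.php?ts=1716465600"))        # False: typo in the rule
print(rule_matches("/cron.php*", "/cron.php?ts=1716465600"))         # True
print(rule_matches("/cron.php*", "/forum/cron.php?ts=1716465600"))   # False: needs its own rule
```

The last line is why the root URL and the /forum/ URL each need a Disallow line of their own.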



gkawa

Quote from: shawnb61 on May 26, 2024, 08:45:02 AM
You have a typo there.  cron, not chron.

But it will still invoke it at root.
:-[ You're right!
I wrote what I was used to writing when I think about "chron".
Maybe it'll work now... even when Google keeps ignoring all the other disallows...  ;D

Thanks. Good catch!

gkawa

It looks like it's working. It doesn't explain why Google is trying to invoke cron.php with made-up timestamps.

Now I found out that Meta/Facebook is doing the same with the calendar!

Quote
173.252.107.5 /forum/index.php?action=calendar;viewweek;year=2026;month=2;day=15 6/16/24, 9:18 AM
173.252.83.11 /forum/index.php?action=calendar;viewmonth;year=2023;month=4;day=8 6/16/24, 9:17 AM
173.252.83.23 /forum/index.php?action=calendar;sa=ical;eventid=37;abcba8edd48=9864f96c9e8fd575b69015c4aafb5d9b 6/16/24, 9:17 AM
173.252.83.17 /forum/index.php?action=calendar;viewlist;year=2025;month=7;day=2 6/16/24, 9:17 AM
173.252.83.22 /forum/index.php?action=calendar;sa=ical;eventid=128;ed0f091f9=f1ebac138a9cb073d4c4d74a6d1677a2 6/16/24, 9:17 AM
173.252.70.4 /forum/index.php?action=calendar;viewmonth;year=2024;month=1;day=5 6/16/24, 9:17 AM
173.252.87.14 /forum/index.php?action=calendar;sa=ical;eventid=98;c54726f41f=931be3f1f36ee86efed7fda11b1889cc 6/16/24, 9:17 AM
173.252.83.5 /forum/index.php?action=calendar;viewlist;year=2025;month=10;day=5 6/16/24, 9:16 AM

Is this some kind of AI thing? I hope so... it would be a lot easier to lead the resistance against them!  ;D

shawnb61

The FB bot is a problem.  Ignores robots.txt, ignores sitemaps, crawls hyper aggressively.  Continuously and inefficiently.

I now block it.

More here:
https://www.simplemachines.org/community/index.php?topic=589057.0
https://www.simplemachines.org/community/index.php?msg=4175762

Plus they are lying about the purpose of the crawler.  They still claim it is only for thumbnails...

gkawa

Quote from: shawnb61 on June 16, 2024, 10:13:54 AM
The FB bot is a problem.  Ignores robots.txt, ignores sitemaps, crawls hyper aggressively.  Continuously and inefficiently.

I have FB blocked by IP. Those lines on my log are all 403. I wonder why they do it. It doesn't make any sense. At some point, they should take the hint and stop wasting resources. Not mine, I can understand if they don't care about that, but their own.

I also have a long list of user-agents blocked. All the Russians and those annoying/useless ones that no one knows/cares about. But FB is using "Python/3.10 aiohttp/3.9.3" today, and yesterday it was a different version. Even if I use a wildcard, they may change the format later. Fortunately, I don't care about thumbnails. Our forum is a small one and FB is not our scene.
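The two kinds of block mentioned here, by IP and by user-agent wildcard, can be sketched as below. This is only an illustration of the matching logic, not how any particular web server implements it. The network is a documentation-only placeholder range rather than a real crawler range, and the pattern is a guess built from the single user-agent string quoted above.

```python
from fnmatch import fnmatchcase
from ipaddress import ip_address, ip_network

# Placeholder network (reserved for documentation); use the ranges you actually block.
BLOCKED_NETWORKS = [ip_network("203.0.113.0/24")]

# Wildcard pattern covering version changes such as "Python/3.10 aiohttp/3.9.3".
BLOCKED_UA_PATTERNS = ["Python/* aiohttp/*"]


def is_blocked(remote_ip: str, user_agent: str) -> bool:
    """Return True if the request should get a 403 under the lists above."""
    addr = ip_address(remote_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    return any(fnmatchcase(user_agent, pat) for pat in BLOCKED_UA_PATTERNS)
```

Note that a pattern this broad blocks every client built on aiohttp, not just one crawler, and it stops matching as soon as the bot changes the format of its user-agent string.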
