
Plagued by FB bots

Started by Dave J, June 04, 2024, 04:41:17 AM


jasland

Quote from: shawnb61 on June 04, 2024, 02:53:59 PM
I think there's a typo in the .htaccess string above.  The caret ^ anchors the pattern to the beginning of the string, so it will only match if the user agent starts with facebookexternalhit/1.1.

I use a different syntax:
BrowserMatchNoCase facebookexternalhit bad_bot
BrowserMatchNoCase claudebot bad_bot
<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

And every time I find a new bad bot I just add another row to the list.
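
To illustrate the anchoring point, here is a minimal Python sketch (not Apache itself) comparing an anchored pattern with an unanchored, case-insensitive one, which is how BrowserMatchNoCase treats its pattern. The second sample user agent is hypothetical, made up only to show a string that contains the token without starting with it.

```python
import re

# Sample user-agent strings. The second one is invented for illustration.
SAMPLE_AGENTS = [
    "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
    "Mozilla/5.0 (compatible; facebookexternalhit/1.1)",
]

# Anchored: only matches when the user agent STARTS with the token.
ANCHORED = re.compile(r"^facebookexternalhit/1\.1", re.IGNORECASE)
# Unanchored: matches the token anywhere in the user agent, ignoring case.
UNANCHORED = re.compile(r"facebookexternalhit", re.IGNORECASE)


def matches(pattern, agent):
    """Return True if the compiled pattern is found in the user agent."""
    return pattern.search(agent) is not None


for agent in SAMPLE_AGENTS:
    print(matches(ANCHORED, agent), matches(UNANCHORED, agent), agent)
```

The anchored pattern misses any user agent where something precedes the token, which is why the unanchored form is the safer choice for blocking.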



This worked for me and I had the same problem.

But how do I see the user agent? That is, where do I get it from? In this case it is facebookexternalhit, but what would it be for another bot?
Thank you.

Dave J

Quote from: shawnb61 on June 05, 2024, 02:11:43 PM
Try deleting lines 28-33


OK that's done. I have to admit that there have only been 2 instances today where I have spotted FB bots on the site. There have been normal guests but that's fine.

Thanks for your help Shawn, I'll update this topic in the future if anything else happens.
If you want quizzes to add to the new SMF2.1 quiz mod go here . There are also walkthroughs in the forum to explain how to install them and other tips.

shawnb61

Quote from: Dave J on June 05, 2024, 05:19:18 PM
OK that's done. I have to admit that there have only been 2 instances today where I have spotted FB bots on the site. There have been normal guests but that's fine.

I just did a quick test on your site and facebookexternalhit was not blocked.  I would talk with your host about placement of the .htaccess file & any possible restrictions there may be. 
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

Dave J

Ok thanks again Shawn, I'll look at it again tomorrow

shawnb61

Quote from: jasland on June 05, 2024, 05:16:32 PM
But how do I see the user agent? That is, where do I get it from? In this case it is facebookexternalhit, but what would it be for another bot?
You need to look closely at your web access logs, which would be provided by your host. 

I pick a day that looks suspect, & load it into Excel, for easy filtering & analysis.  For the two hosts I've used, a space delimiter has been used, so Excel reads it as if it were a .csv quite nicely.  (Note that Excel expects double quotes to be escaped as "", not \", so I sometimes have to manipulate the file for a clean import.)
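
For anyone who would rather script this step than use Excel, here is a rough Python sketch that pulls the fields out of one log line. It assumes the common Apache "combined" log layout and backslash-escaped quotes inside quoted fields; your host's format may differ. The sample line is invented (documentation IP range, made-up path and size).

```python
import re

# Assumed layout: Apache "combined" log format. Adjust if your host differs.
# Quoted fields allow backslash-escaped characters such as \" inside them.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>(?:[^"\\]|\\.)*)" "(?P<useragent>(?:[^"\\]|\\.)*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The timestamp looks like 05/Jun/2024:14:11:43 +0000.
    # Everything before the first colon is the date; the next two digits are the hour.
    date, _, rest = fields["time"].partition(":")
    fields["date"] = date
    fields["hour"] = rest[:2]
    return fields


# Invented sample line, for illustration only.
SAMPLE = (
    '203.0.113.7 - - [05/Jun/2024:14:11:43 +0000] '
    '"GET /smf/index.php?topic=1.0 HTTP/1.1" 200 5120 "-" '
    '"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"'
)

row = parse_line(SAMPLE)
print(row["date"], row["hour"], row["useragent"])
```

The user agent is the last quoted field, which is where strings like facebookexternalhit show up.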

I have a couple Excel formulas that categorize the user agent & the request so I can see who is looking at what...

This is what the raw access log looks like in Excel.  The 4 columns to the right, Date, Hour, Request & Agent, are formulas I use.  You can see that the raw useragent column, J, actually has a bunch of stuff in it.  "facebookexternalhit" is a substring of the useragent:
You cannot view this attachment.

This is a bot-over-time analysis, by hour, to see who is flooding the site:
You cannot view this attachment.
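
The same kind of hour-by-hour breakdown can be sketched in Python. This is only an approximation of the pivot described above; the function names and sample rows are invented, standing in for the Hour and Agent columns.

```python
from collections import defaultdict


def pivot_by_hour(rows):
    """Count hits per agent category for each hour of the day (0-23).

    rows is an iterable of (hour, agent) pairs, e.g. ("14", "claudebot").
    """
    table = defaultdict(lambda: [0] * 24)
    for hour, agent in rows:
        table[agent][int(hour)] += 1
    return dict(table)


def peak_hour(counts):
    """Return (hour, hits) for the busiest hour in one agent's 24-slot list."""
    hour = max(range(24), key=lambda h: counts[h])
    return hour, counts[hour]


# Invented sample rows, for illustration only.
SAMPLE_ROWS = [
    ("14", "facebookexternalhit"),
    ("14", "facebookexternalhit"),
    ("14", "facebookexternalhit"),
    ("15", "facebookexternalhit"),
    ("14", "User"),
    ("03", "claudebot"),
]

table = pivot_by_hour(SAMPLE_ROWS)
for agent, counts in table.items():
    print(agent, peak_hour(counts))
```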

With the Request & Agent categorized, I can see who is looking at what.  This helps me ID bad guys.  It also helps me fine-tune robots.txt - if I see all the bots doing something they shouldn't (e.g., signons & registrations...), I add that to robots.txt.  Over time, suspicious activity becomes pretty plain to see.  Also, this drastically reduces errors on Google Search Console...  If Google is looking at things it shouldn't, tell it so by refining robots.txt.  Google Search Console looks much cleaner over time:
You cannot view this attachment.

Finally, I double-check my user activity.  If I think they're not a bot, they must be a user.  I will run the most active IPs in a SMF admin member search, to confirm that what my Excel formula has dubbed a "user" is in fact a user:
You cannot view this attachment.
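
Picking out the most active IPs among rows categorized as "User" could look like this in Python. It is a sketch only: the row layout and sample data are invented, and the resulting IPs would still need to be checked by hand in the SMF admin member search.

```python
from collections import Counter


def most_active_ips(rows, category="User", top=5):
    """Return the top (ip, hits) pairs among rows with the given agent category.

    rows is an iterable of (ip, agent_category) pairs.
    """
    counts = Counter(ip for ip, agent in rows if agent == category)
    return counts.most_common(top)


# Invented sample rows using documentation IP ranges.
SAMPLE_ROWS = (
    [("198.51.100.4", "User")] * 3
    + [("203.0.113.7", "User")]
    + [("192.0.2.9", "claudebot")] * 5
)

print(most_active_ips(SAMPLE_ROWS, top=2))
```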



shawnb61

For Excel junkies, here are my formulas...

To categorize requests:
=IF(ISNUMBER(SEARCH("area=alerts_popup",[@request])),"Alerts",
IF(ISNUMBER(SEARCH("type=rss",[@request])),"RSS",
IF(ISNUMBER(SEARCH("action=admin",[@request])),"admin",
IF(ISNUMBER(SEARCH("action=keepalive",[@request])),"Keepalive",
IF(ISNUMBER(SEARCH("action=printpage",[@request])),"Print",
IF(ISNUMBER(SEARCH("action=recent",[@request])),"Recent",
IF(ISNUMBER(SEARCH("action=unread",[@request])),"Unread",
IF(ISNUMBER(SEARCH("action=likes",[@request])),"Likes",
IF(ISNUMBER(SEARCH("action=dlattach",[@request])),"Attach",
IF(ISNUMBER(SEARCH("action=quotefast",[@request])),"Quote",
IF(ISNUMBER(SEARCH("action=markasread",[@request])),"MarkRead",
IF(ISNUMBER(SEARCH("action=quickmod2",[@request])),"Modify",
IF(ISNUMBER(SEARCH("action=profile",[@request])),"Profile",
IF(ISNUMBER(SEARCH("action=pm",[@request])),"PM",
IF(ISNUMBER(SEARCH("action=xml",[@request])),"xml",
IF(ISNUMBER(SEARCH("action=.xml",[@request])),"xml",
IF(ISNUMBER(SEARCH("action=attbr",[@request])),"Attachment Browser",
IF(ISNUMBER(SEARCH("action=search",[@request])),"Search",
IF(ISNUMBER(SEARCH("action=signup",[@request])),"Signup",
IF(ISNUMBER(SEARCH("action=register",[@request])),"Signup",
IF(ISNUMBER(SEARCH("action=join",[@request])),"Signup",
IF(ISNUMBER(SEARCH("action=login",[@request])),"Login",
IF(ISNUMBER(SEARCH("action=logout",[@request])),"Logout",
IF(ISNUMBER(SEARCH("action=verificationcode",[@request])),"Login",
IF(ISNUMBER(SEARCH(".msg",[@request])),"Message",
IF(ISNUMBER(SEARCH("msg=",[@request])),"Message",
IF(ISNUMBER(SEARCH("topic=",[@request])),"Topic",
IF(ISNUMBER(SEARCH("board=",[@request])),"Board",
IF(ISNUMBER(SEARCH(";wwwRedirect",[@request])),"Redirect",
IF(ISNUMBER(SEARCH("/smf/custom_avatar",[@request])),"Avatar",
IF(ISNUMBER(SEARCH("/smf/cron.php?ts=",[@request])),"Cron",
IF(ISNUMBER(SEARCH("/smf/index.php ",[@request])),"Board Index",
IF(ISNUMBER(SEARCH("/smf/proxy.php",[@request])),"Proxy",
IF(ISNUMBER(SEARCH("/smf/avatars",[@request])),"Avatar",
IF(ISNUMBER(SEARCH("/smf/Smileys",[@request])),"Smileys",
IF(ISNUMBER(SEARCH("/smf/Themes",[@request])),"Theme",
IF(ISNUMBER(SEARCH("/favicon.ico",[@request])),"Favicon",
IF(ISNUMBER(SEARCH("/robots.txt",[@request])),"robots.txt",
IF(ISNUMBER(SEARCH("/sitemap",[@request])),"Sitemap",
IF(ISNUMBER(SEARCH("/phpmyadmin",[@request])),"Admin",
IF(ISNUMBER(SEARCH("/admin",[@request])),"Admin",
IF(ISNUMBER(SEARCH("/pma",[@request])),"Admin",
"Other"))))))))))))))))))))))))))))))))))))))))))

To categorize user agents:
=IF(ISNUMBER(SEARCH("2ip bot",[@useragent])),"2ip bot",
IF(ISNUMBER(SEARCH("360Spider",[@useragent])),"360Spider",
IF(ISNUMBER(SEARCH("AdsBot-Google",[@useragent])),"AdsBot-Google",
IF(ISNUMBER(SEARCH("AhrefsBot",[@useragent])),"AhrefsBot",
IF(ISNUMBER(SEARCH("Awario",[@useragent])),"Awario",
IF(ISNUMBER(SEARCH("amazonbot",[@useragent])),"amazonbot",
IF(ISNUMBER(SEARCH("applebot",[@useragent])),"applebot",
IF(ISNUMBER(SEARCH("BaiduSpider",[@useragent])),"BaiduSpider",
IF(ISNUMBER(SEARCH("bingbot",[@useragent])),"bingbot",
IF(ISNUMBER(SEARCH("BLEXBot",[@useragent])),"BLEXBot",
IF(ISNUMBER(SEARCH("Bytespider",[@useragent])),"Bytespider",
IF(ISNUMBER(SEARCH("Cincraw",[@useragent])),"Cincraw",
IF(ISNUMBER(SEARCH("claudebot",[@useragent])),"claudebot",
IF(ISNUMBER(SEARCH("coccocbot",[@useragent])),"coccocbot",
IF(ISNUMBER(SEARCH("commoncrawl",[@useragent])),"commoncrawl",
IF(ISNUMBER(SEARCH("dataforseo-bot",[@useragent])),"dataforseo-bot",
IF(ISNUMBER(SEARCH("Discordbot",[@useragent])),"Discordbot",
IF(ISNUMBER(SEARCH("DomainStatsBot",[@useragent])),"DomainStatsBot",
IF(ISNUMBER(SEARCH("DotBot",[@useragent])),"DotBot",
IF(ISNUMBER(SEARCH("duckduckbot",[@useragent])),"duckduckbot",
IF(ISNUMBER(SEARCH("DuckDuckGo-Favicons-Bot",[@useragent])),"DuckDuckGo-Favicons-Bot",
IF(ISNUMBER(SEARCH("facebookexternalhit",[@useragent])),"facebookexternalhit",
IF(ISNUMBER(SEARCH("FAST-WebCrawler",[@useragent])),"FAST-WebCrawler",
IF(ISNUMBER(SEARCH("Gaisbot",[@useragent])),"Gaisbot",
IF(ISNUMBER(SEARCH("Googlebot",[@useragent])),"Googlebot",
IF(ISNUMBER(SEARCH("GoogleOther",[@useragent])),"GoogleOther",
IF(ISNUMBER(SEARCH("google.com/bot",[@useragent])),"google.com/bot",
IF(ISNUMBER(SEARCH("iAskBot",[@useragent])),"iAskBot",
IF(ISNUMBER(SEARCH("keys-so-bot",[@useragent])),"keys-so-bot",
IF(ISNUMBER(SEARCH("MixrankBot",[@useragent])),"MixrankBot",
IF(ISNUMBER(SEARCH("mj12bot",[@useragent])),"mj12bot",
IF(ISNUMBER(SEARCH("MojeekBot",[@useragent])),"MojeekBot",
IF(ISNUMBER(SEARCH("msnbot",[@useragent])),"msnbot",
IF(ISNUMBER(SEARCH("openai.com/bot",[@useragent])),"openai.com/bot",
IF(ISNUMBER(SEARCH("petalbot",[@useragent])),"petalbot",
IF(ISNUMBER(SEARCH("PiBot",[@useragent])),"PiBot",
IF(ISNUMBER(SEARCH("Pinterestbot",[@useragent])),"Pinterestbot",
IF(ISNUMBER(SEARCH("redditbot",[@useragent])),"redditbot",
IF(ISNUMBER(SEARCH("RU_Bot",[@useragent])),"RU_Bot",
IF(ISNUMBER(SEARCH("Screaming Frog SEO Spider",[@useragent])),"Screaming Frog SEO Spider",
IF(ISNUMBER(SEARCH("SeekportBot",[@useragent])),"SeekportBot",
IF(ISNUMBER(SEARCH("SemrushBot",[@useragent])),"SemrushBot",
IF(ISNUMBER(SEARCH("seznambot",[@useragent])),"seznambot",
IF(ISNUMBER(SEARCH("SiteLockSpider",[@useragent])),"SiteLockSpider",
IF(ISNUMBER(SEARCH("Slack-ImgProxy",[@useragent])),"Slack-ImgProxy",
IF(ISNUMBER(SEARCH("Sogou",[@useragent])),"Sogou",
IF(ISNUMBER(SEARCH("startmebot",[@useragent])),"startmebot",
IF(ISNUMBER(SEARCH("SuperBot",[@useragent])),"SuperBot",
IF(ISNUMBER(SEARCH("TelegramBot",[@useragent])),"TelegramBot",
IF(ISNUMBER(SEARCH("trendictionbot",[@useragent])),"trendictionbot",
IF(ISNUMBER(SEARCH("Twitterbot",[@useragent])),"Twitterbot",
IF(ISNUMBER(SEARCH("TurnitinBot",[@useragent])),"TurnitinBot",
IF(ISNUMBER(SEARCH("WellKnownBot",[@useragent])),"WellKnownBot",
IF(ISNUMBER(SEARCH("WireReaderBot",[@useragent])),"WireReaderBot",
IF(ISNUMBER(SEARCH("wpbot",[@useragent])),"wpbot",
IF(ISNUMBER(SEARCH("yacybot",[@useragent])),"yacybot",
IF(ISNUMBER(SEARCH("yandex",[@useragent])),"yandex",
IF(ISNUMBER(SEARCH("YisouSpider",[@useragent])),"YisouSpider",
IF(ISNUMBER(SEARCH("ZoomBot",[@useragent])),"ZoomBot",
IF(ISNUMBER(SEARCH("zoominfobot",[@useragent])),"zoominfobot",
IF(ISNUMBER(SEARCH("spider",[@useragent])),"Other bot",
IF(ISNUMBER(SEARCH("bot",[@useragent])),"Other bot",
IF(ISNUMBER(SEARCH("crawl",[@useragent])),"Other bot",
"User")))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))
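
Both formulas are first-match chains: the first substring found decides the category, so the order of the rows matters (".msg" is tested before "topic=", and named bots before the generic "bot" catch-all). A rough Python equivalent is sketched below with a small subset of the rules, kept in the same order. One difference to be aware of: Excel's SEARCH treats ? and * as wildcards, while this sketch does plain case-insensitive substring tests.

```python
def categorize(text, rules, default):
    """Return the label of the first rule whose needle occurs in text.

    rules is an ordered list of (needle, label) pairs. Matching ignores case,
    as Excel's SEARCH does. Earlier rules win over later ones.
    """
    lowered = text.lower()
    for needle, label in rules:
        if needle.lower() in lowered:
            return label
    return default


# A small subset of the request rules, in the same order as the formula.
REQUEST_RULES = [
    ("action=login", "Login"),
    (".msg", "Message"),
    ("topic=", "Topic"),
    ("board=", "Board"),
]

# A small subset of the agent rules, ending with the generic catch-alls.
AGENT_RULES = [
    ("claudebot", "claudebot"),
    ("facebookexternalhit", "facebookexternalhit"),
    ("Googlebot", "Googlebot"),
    ("spider", "Other bot"),
    ("bot", "Other bot"),
    ("crawl", "Other bot"),
]

print(categorize("GET /smf/index.php?topic=12.msg345 HTTP/1.1", REQUEST_RULES, "Other"))
print(categorize("Mozilla/5.0 (compatible; ClaudeBot/1.0)", AGENT_RULES, "User"))
```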

jasland

Thank you very much shawnb61, it has helped me a lot

greetings

a10

Dept. of anecdotes. This guy amassed a list of 4500 bots, "BrowserMatchNoCase" type for .htaccess
https://www.jeromeweb.net/seo/25518-htaccess-bloquer-mauvais-bots

I downloaded the .zip, added a few more known agents and put it in my .htaccess (= 179kb!) ...to see what would happen (break :O) in an overkill scenario. 1st impression, daily pageviews went drastically down, seemingly no adverse effect on forum \ members\ guests \ speed \ normal activity etc. Have fun.
2.0.19, php 8.0.23, MariaDB 10.5.15. Mods: Contact Page, Like Posts, Responsive Curve, Search Focus Dropdown, Add Join Date to Post.

shawnb61

I started with this list:
https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/_generator_lists/bad-user-agents.list

Added recent ones I know are bad (claudebot, googleother, facebookexternalhit, others...). 

I actually removed a few from that list, since I really do want actual search engines to be able to crawl (mj12bot, sogou, others...). 

I have 664 bots in my list currently.
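
Maintaining a list like this (start from a base list, add some names, remove others) can be scripted. The Python sketch below is one possible approach, not the method actually used here: the function names are invented, and the base entries are stand-ins rather than the real contents of the linked list. Note that Apache treats each name as a regular expression, so characters such as "." match any character.

```python
def build_blocklist(base, additions, removals):
    """Merge base and additions, drop removals, and de-duplicate (ignoring case)."""
    removed = {name.lower() for name in removals}
    seen = set()
    result = []
    for name in list(base) + list(additions):
        key = name.lower()
        if key in removed or key in seen:
            continue
        seen.add(key)
        result.append(name)
    return result


def to_htaccess_lines(names, env="bad_bot"):
    """Emit one BrowserMatchNoCase line per name, quoting names that contain spaces."""
    lines = []
    for name in names:
        pattern = f'"{name}"' if " " in name else name
        lines.append(f"BrowserMatchNoCase {pattern} {env}")
    return lines


# Stand-in entries for illustration only.
BASE = ["AhrefsBot", "mj12bot", "sogou", "Screaming Frog SEO Spider"]
ADDITIONS = ["claudebot", "googleother", "facebookexternalhit"]
REMOVALS = ["mj12bot", "sogou"]

for line in to_htaccess_lines(build_blocklist(BASE, ADDITIONS, REMOVALS)):
    print(line)
```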

Site CPU has been drastically reduced.  Performance is great.  Search rankings are excellent on Google & Bing.

Dave J

A thought has occurred to me.

Would having a sub forum on my site with a .htaccess file in it make any difference? My test site is in a folder within my main site.

I have removed the .htaccess file from the test site and will monitor things. I have also done what A10 did (thanks for the info): downloaded the zip file and added it to the main site's .htaccess file.

I might move the test site.

shawnb61

I don't think the test site has anything to do with it. 

I would consult your host on location of .htaccess, & placement of the directives.

Dave J

Hi Shawn,

Yes, the site is down. Earlier today I switched to Cloudflare to see if that would help, but then found I couldn't connect via FTP, so I've reverted to the original DNS and now have to wait for that to take effect.

I spoke to the host, who suggested Cloudflare but could not come up with any other options.

I have downloaded the zip file that A10 suggested and added it to the .htaccess. Once the site is back I'll see if it makes any difference.

shawnb61

We have learned that blocking "facebookexternalhit" via .htaccess causes previews on Facebook to show a "403 Forbidden" thumbnail under some circumstances.  Note that the links & thumbnails work fine, but the thumbnail is misleading. 

It's a bit of a quandary, because FB generates site traffic for us & leads new users over who want more depth than they can get on FB.

Some links show the bogus thumbnail, & some links don't.  Board & home page links work fine, but the thumbnail is very plain.  (Personally, I like that - a very clean text-based thumbnail...)  But topic & message links show a bogus 403 Forbidden thumbnail. 

I noticed that there are 3 flavors of "facebookexternalhit" user-agents.  One hits my site ~20K times each day (even though blocked, it still keeps trying...).  The other two are the "facebot twitterbot" user agents, that also include "facebookexternalhit", that only hit my site ~20 times each day.  All FB IPs.  So...

I changed my .htaccess restriction from "facebookexternalhit" to "facebook.com/externalhit", which is unique to the 20K/day user agent pounding my site. 
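
The effect of narrowing the match can be shown with a small Python sketch. The two sample strings are illustrative user agents of the kinds described above, not copied from the logs, and the labels are invented; which one generates thumbnails is not established in this thread.

```python
# Illustrative user-agent strings. The keys are invented labels.
SAMPLE_AGENTS = {
    "with_url": "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)",
    "facebot_twitterbot": "facebookexternalhit/1.1 Facebot Twitterbot/1.0",
}


def is_blocked(user_agent, needle):
    """Case-insensitive substring test, approximating a BrowserMatchNoCase rule."""
    return needle.lower() in user_agent.lower()


def blocked_names(needle):
    """Return the labels of the sample agents that the needle would block."""
    return [name for name, ua in SAMPLE_AGENTS.items() if is_blocked(ua, needle)]


print(blocked_names("facebookexternalhit"))       # broad rule
print(blocked_names("facebook.com/externalhit"))  # narrower rule
```

The broad rule catches both strings, while the narrower one only catches the string that carries the facebook.com/externalhit URL.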

There have been improvements: mainly, many topic links (not all...) work fine now.  Also, thumbnails have returned.  Message links over on FB still show 403 - unless you remove the thumbnail.

Note that when posting on FB, if you see the 403 thumbnail, you can simply remove the thumbnail.  That's what I've instructed folks to do for now. 

I wish there were a clean way to ID the FB crawler vs the thumbnail generator. 

Dave J

Hi Shawn,

Thanks for the info. I've now updated my list to show 'facebook.com/externalhit'. Even with the old listing, there haven't been any more hits in the last few days.

Dave J

I thought I'd update you all on the current status.

No hits from Facebook at all, or from any other nasties, so thank you to all who contributed, and special thanks to A10 for the link to the zip file. That's the list I've been using since earlier this week.

Only one issue with a member accessing the site from his mobile phone, but that's all.

I'm going to mark this solved now

a10

Quote from: Dave J on June 15, 2024, 08:02:35 AM
Only one issue with a member accessing the site from his mobile phone, but that's all.

Yes, as mentioned in my post above ("to see what would happen (break :O)"), a 403 happened periodically to 2 members of my forum, on phone/Android. Apart from that, with such a massive number of .htaccess entries it was miraculously error-free.

So, in that megalist there may be 1 line or so that should not be there. Good luck finding out which one :O) :O)
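
One way to hunt for the offending line is to take the user agent of a blocked request from the access log and test it against every entry in the list. The Python sketch below does that. The user agent and the pattern list are invented for illustration ("Pixel" is a hypothetical overly broad entry, not one known to be in the megalist), and Python's regex engine is only an approximation of Apache's.

```python
import re


def matching_rules(user_agent, patterns):
    """Return the patterns that match the user agent, ignoring case."""
    hits = []
    for pattern in patterns:
        try:
            found = re.search(pattern, user_agent, re.IGNORECASE) is not None
        except re.error:
            # Fall back to a plain substring test if the entry is not a valid Python regex.
            found = pattern.lower() in user_agent.lower()
        if found:
            hits.append(pattern)
    return hits


# Invented example of a phone browser's user agent.
BLOCKED_AGENT = (
    "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/125.0.0.0 Mobile Safari/537.36"
)
# Hypothetical blocklist entries.
PATTERNS = ["AhrefsBot", "claudebot", "Pixel", "facebookexternalhit"]

print(matching_rules(BLOCKED_AGENT, PATTERNS))
```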

From now on, I give up and will let those damn bots do whatever they want (host does not seem to care, speed is always good), only following up with occasional .htaccess entries when particular things seem to go overboard.

I'm guessing that, with AI taking over the world & humanity, those scrapers will increase endlessly and inundate everything until the end of time.



shawnb61

My CPU results before & after blocking "facebook.com/externalhit":
You cannot view this attachment.

Yes, there are still kinda random issues with thumbnails on FB when links are posted.  Lots look OK, lots show an erroneous 403 error.  The links do work, so I have instructed users to just remove the thumbnails whenever they look like that.

Two different board links on Facebook, both boards with guest permissions.  I often get nice little image thumbnails, but not at all today:
You cannot view this attachment.
You cannot view this attachment.

If anybody ever figures out the proper way to allow thumbnail generation & block the crawler, please share.

a10

Just a note to confirm my dwindling hope of ever controlling things: just as some of the usual bot hits diminished a little, along came a massive flood from Chinanet, China Unicom & more, with endless numbers of IPs/ranges.

