Search Engines

Started by mickjav, September 17, 2022, 07:36:59 AM


mickjav

Not sure where to put this

My Music and Charts site is getting over 1 million page views a month, but I would say more than 90% of them come from search engines. I did see a mod that might work, but after reading some of the latest topics I felt it still had problems with 2.1.2.

Is there anything out there that can keep selected search engines off my site? That could be per forum (I have 3, plus a test system) or for my site as a whole.

I've banned the ones I don't want, but the system seems to let them view a page, register the view, and only then display the login/message.

I do remember reading somewhere, when 2.1.xx was being developed, that a 404 would be given for bans, but I might have remembered wrong?

-Rock Lee-

I also remember reading something like that, but I don't know to what extent it is possible to filter that way; many times the visitors are not search engines but people. It has happened to me in several places. Apparently it comes down to poorly handled proxies, or to the way several companies resell connections, which concentrates many users behind the same IPs. If you block the IP ranges associated with the search engines via .htaccess (as I understand it, they are required by law to publish them), do they still get in? Or am I not fully understanding?


Regards!
Returning like a phoenix! ~ Bomber Code
Help - Contributions - Tutorials - And much more!!!

mickjav

Search engines are getting to be a real problem. I know I can use .htaccess, but for me that's beyond my understanding, and I have too much on my plate to take on learning what I would need there.

I know we have the following data available:

Search Engines
Ban list

I'm wondering if an option could be added to the search engines list so that all attempted page views by a tagged search engine would receive a 404, instead of letting them run scripts to the point of marking views and getting on the online list. The same could apply to the ban list.

I understand that it may block a few that, as you say @-Rock Lee-, are not search engines, but these may be web scrapers, which I don't want either.

Anybody else has the option to contact me via my admin email. The point is I would have the option to allow access, but at the moment the search engines can do what they like. Maybe I'll have to put a hold on my current project(s) so I can sort it out, as it's p**sing me off.
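
As a rough illustration of the 404 idea, here is a minimal sketch of how it might be done at the web-server level rather than inside SMF. It assumes Apache with mod_rewrite enabled and .htaccess overrides allowed; the three bot names are only examples and would need replacing with whichever user agents you want to turn away. It matches on the User-Agent header, which a scraper can fake.

```apache
# Sketch only: answer selected crawlers with a 404 before any PHP runs.
# Assumes mod_rewrite is available; the bot names below are examples.
<IfModule mod_rewrite.c>
    RewriteEngine On
    # [NC] makes the User-Agent match case-insensitive
    RewriteCond %{HTTP_USER_AGENT} (SemrushBot|MJ12bot|AhrefsBot) [NC]
    # "-" means no substitution; R=404 returns Not Found and stops rewriting
    RewriteRule ^ - [R=404,L]
</IfModule>
```

Because Apache answers the request itself, a matching bot should never reach the forum scripts, so it should not register a page view or appear in the online list.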

All the best mick




-Rock Lee-

Well, it's not a complete answer because it will take time; I had never thought of it that way. After investigating a little: Google alone has many IPs. It can be blocked, because the list of IPs declared for the bot is known, but if there is no API to check it (to automate the process), it has to be done manually. The simplest resource I found was "Google IP ranges bot and robot whitelist". It would only be a matter of testing, because certain ranges are dynamic, so it could block access to your site.


Regards!

vbgamer45

I use something like this on my sites:
<Location />
<Limit GET POST PUT>

# Begin Bad Bot Blocking
BrowserMatchNoCase OmniExplorer_Bot/6.11.1 bad_bot
BrowserMatchNoCase omniexplorer_bot bad_bot
BrowserMatchNoCase Baiduspider bad_bot
BrowserMatchNoCase Baiduspider/2.0 bad_bot
BrowserMatchNoCase yandex bad_bot
BrowserMatchNoCase yandeximages bad_bot
BrowserMatchNoCase Spinn3r bad_bot
BrowserMatchNoCase sogou bad_bot
BrowserMatchNoCase Sogouwebspider/3.0 bad_bot
BrowserMatchNoCase Sogouwebspider/4.0 bad_bot
BrowserMatchNoCase sosospider+ bad_bot
BrowserMatchNoCase jikespider bad_bot
BrowserMatchNoCase ia_archiver bad_bot
BrowserMatchNoCase PaperLiBot bad_bot
BrowserMatchNoCase ahrefsbot bad_bot
BrowserMatchNoCase ahrefsbot/1.0 bad_bot
BrowserMatchNoCase SiteBot/0.1 bad_bot
BrowserMatchNoCase DNS-Digger/1.0 bad_bot
BrowserMatchNoCase DNS-Digger-Explorer/1.0 bad_bot
BrowserMatchNoCase boardreader bad_bot
BrowserMatchNoCase radian6 bad_bot
BrowserMatchNoCase R6_FeedFetcher bad_bot
BrowserMatchNoCase R6_CommentReader bad_bot
BrowserMatchNoCase ScoutJet bad_bot
BrowserMatchNoCase ezooms bad_bot
BrowserMatchNoCase CC-rget/5.818 bad_bot
BrowserMatchNoCase libwww-perl/5.813 bad_bot
BrowserMatchNoCase magpie-crawler bad_bot
BrowserMatchNoCase jakarta bad_bot
BrowserMatchNoCase discobot/1.0 bad_bot
BrowserMatchNoCase MJ12bot bad_bot
BrowserMatchNoCase MJ12bot/v1.2.0 bad_bot
BrowserMatchNoCase MJ12bot/v1.2.5 bad_bot
BrowserMatchNoCase SemrushBot/0.9 bad_bot
BrowserMatchNoCase MLBot bad_bot
BrowserMatchNoCase butterfly bad_bot
BrowserMatchNoCase SeznamBot/3.0 bad_bot
BrowserMatchNoCase HuaweiSymantecSpider bad_bot
BrowserMatchNoCase Exabot/2.0 bad_bot
BrowserMatchNoCase netseer/0.1 bad_bot
BrowserMatchNoCase NetSeer bad_bot
BrowserMatchNoCase NetSeer/Nutch-0.9 bad_bot
BrowserMatchNoCase psbot/0.1 bad_bot
BrowserMatchNoCase moreoverbot/5.0 bad_bot
BrowserMatchNoCase "Jakarta Commons-HttpClient/3.0" bad_bot
BrowserMatchNoCase SocialSpider-Finder/0.2 bad_bot
BrowserMatchNoCase wordpress bad_bot
BrowserMatchNoCase istellabot bad_bot
BrowserMatchNoCase SeznamBot bad_bot
BrowserMatchNoCase Cliqzbot bad_bot
BrowserMatchNoCase SocialRankIOBot bad_bot
BrowserMatchNoCase Mail.RU_Bot bad_bot
BrowserMatchNoCase Clickag bad_bot
BrowserMatchNoCase Mediatoolkitbot bad_bot
BrowserMatchNoCase SemrushBot bad_bot
BrowserMatchNoCase DotBot/1.1 bad_bot
BrowserMatchNoCase BLEXBot/1.0 bad_bot
BrowserMatchNoCase trendictionbot0.5.0 bad_bot
BrowserMatchNoCase SearchmetricsBot bad_bot
BrowserMatchNoCase SurdotlyBot/1.0 bad_bot
BrowserMatchNoCase DnyzBot/1.0 bad_bot
BrowserMatchNoCase coccocbot-web/1.0 bad_bot
BrowserMatchNoCase DataForSeoBot bad_bot
BrowserMatchNoCase www.timpi.io bad_bot
BrowserMatchNoCase DotBot bad_bot

<RequireAll>
Require all granted

<RequireNone>
Require env bad_bot
</RequireNone>
</RequireAll>

</Limit>
</Location>

Community Suite for SMF - Take your forum to the next level built for SMF, Gallery,Store,Classifieds,Downloads,more!

SMFHacks.com -  Paid Modifications for SMF

Mods:
EzPortal - Portal System for SMF
SMF Gallery Pro
SMF Store SMF Classifieds Ad Seller Pro

mickjav

Quote from: vbgamer45 on September 19, 2022, 12:41:26 AM
I use something like this on my sites

Thanks @vbgamer45 Would that be added to my main .htaccess file?

All the best mick

vbgamer45

Yes, I think so, but I haven't tried it. I'm not sure the Location tags will work there.
I have mine in the Apache config.

mickjav

Been looking around for a list of bad bots I could use with my .htaccess file, as I'm not sure about updating it with the list above. Many thanks to @vbgamer45 for it. When I get more confident with it all, I might try it to see if it works with .htaccess.

I've found this very large list, which looks like it contains most of the bad bots on the planet lol:

https://gist.github.com/dvlop/fca36213ad6237891609e1e038a3bbc1

vbgamer45

Yeah, I did my own just since I wanted to hand-pick and not block any good bots.

mickjav

#9
I should be able to construct my list from yours:

# Start Bad Bot Prevention
<IfModule mod_setenvif.c>
SetEnvIfNoCase User-Agent "^OmniExplorer_Bot/6.11.1.*" bad_bot
SetEnvIfNoCase User-Agent "^omniexplorer_bot.*" bad_bot
# Not anchored with ^: Bingbot's User-Agent normally begins with "Mozilla/5.0"
SetEnvIfNoCase User-Agent "bingbot" bad_bot
<Limit GET POST PUT>
  # Apache 2.2-style access control; Apache 2.4 needs mod_access_compat for these
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>
</IfModule>

So I could convert your list into an .htaccess list?
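
A conversion along those lines might look like the sketch below. It assumes Apache 2.4 (mod_setenvif and mod_authz_core) and a host that allows FileInfo and AuthConfig overrides in .htaccess. Only three entries from the list are shown; the rest would follow the same pattern. The <Location> wrapper is left out because Apache does not accept <Location> sections inside .htaccess files, and <Limit> is dropped so the rule applies to every request method.

```apache
# Sketch only: .htaccess version of the bad_bot list for Apache 2.4.
<IfModule mod_setenvif.c>
    # Unanchored, case-insensitive matches anywhere in the User-Agent header
    SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
    SetEnvIfNoCase User-Agent "MJ12bot" bad_bot
    SetEnvIfNoCase User-Agent "SemrushBot" bad_bot
</IfModule>

<IfModule mod_authz_core.c>
    <RequireAll>
        # Everyone is allowed in, except requests flagged as bad_bot
        Require all granted
        Require not env bad_bot
    </RequireAll>
</IfModule>
```

Matching on the bare bot name rather than a full version string means one line covers every version of that crawler.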

mickjav

This is one of the worst offenders: https://www.abuseipdb.com/check/185.191.171.35?page=8

I've got its IP range banned at present, but I'm not sure what to use to identify it for blocking with .htaccess.

35.bl.bot ??
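
For blocking by address rather than by name, here is a minimal sketch, assuming Apache 2.4 and that AuthConfig overrides are allowed in .htaccess. The single address is the one from the AbuseIPDB link above; the commented-out /24 range is only a guess at what the wider range might be and would need checking before use.

```apache
# Sketch only: refuse one address (or a range) at the server level.
<RequireAll>
    Require all granted
    # Address taken from the AbuseIPDB link in this post
    Require not ip 185.191.171.35
    # Hypothetical wider range; verify it before enabling
    # Require not ip 185.191.171.0/24
</RequireAll>
```

If this is combined with a user-agent rule such as the bad_bot one earlier in this topic, the Require not lines should, as far as I understand, go inside one shared <RequireAll> block, since separate sibling blocks are evaluated as alternatives and a request only has to pass one of them.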

Kindred

Yeah, .htaccess is better, because it intercepts the connection at the server level, before scripts are loaded. The SMF ban list is good, but SMF still has to load the whole script in order to trigger the ban list check.
Glory to Ukraine

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."

mickjav

Have finally sorted my .htaccess bad bot list. I had problems with @vbgamer45's list and had to revert to my backup.

I tried the list I posted a link to in post 7. Even though it seems a bit OTT, it seems to work, and I can start editing the massive list.

Had to add Bingbot to it :)

All the best mick
