How to block Baidu spider?

Started by peps1, November 27, 2009, 12:59:37 AM


peps1

I know Baiduspider is a legitimate Chinese search engine bot, but I have no Chinese content and no desire for Chinese traffic... yet there are hundreds of the little buggers bleeding my limited bandwidth!

The only mod I can find to block them is for SMF 2.

Please Help!   

DavidCT

In my experience Baiduspider obeys robots.txt.  You can either deny it specifically or globally; it seems to respect the * wildcard.

robots.txt

#Baiduspider
User-agent: Baiduspider
Disallow: /

#Others
User-agent: *
Disallow: /


Some bots refuse to obey robots.txt, if they even bother to check it.  Those you can block through your .htaccess file.

peps1

Thanks DavidCT, I just slap this in the root, right?

Also will it only block Baiduspider, or every bot?

DavidCT

Yes, robots.txt goes in the root, so you'd see it if you went to yourdomain.com/robots.txt.

The * wildcard will block any bot that respects it and isn't given its own section.  If you haven't defined Googlebot, Slurp, MSNBOT, etc. and want those to crawl you, you need to allow them by removing the / after Disallow:


User-agent: Googlebot
Disallow:
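
For example, a combined robots.txt that keeps the major search engines crawling but turns everything else away might look something like this (just a sketch - swap in whichever bot names you actually want to allow):

# Allow the major search engines
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

# Block Baidu explicitly
User-agent: Baiduspider
Disallow: /

# Block everything else that obeys robots.txt
User-agent: *
Disallow: /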

peps1

The Baidu spiders are still crawling the forum... about 50 every couple of hours.

Is there a way to just block the IP range 220.181.7.*** ?

DavidCT

Quote from: peps1 on November 28, 2009, 11:33:40 AM
The Baidu spiders are still crawling the forum... about 50 every couple of hours.

Is there a way to just block the IP range 220.181.7.*** ?

220.x.x.x is NOT Baidu, it's a spammer (EDIT: harvester, more than likely) pretending to be.  Baidu uses 119.63.192.0 - 119.63.199.255.

220.181.0.0 - 220.181.255.255 = CHINANET Beijing province network

There are many ways to block them through .htaccess; one way is:

<Files *.*>
        order allow,deny
        allow from all
        deny from 220.181.
</Files>


Doing this will block the entire range, no matter who is using it.  You could instead use RewriteCond with rules - if this range and this user-agent, don't allow - but I don't want to quote you a rule without testing it first, as I'm no pro with that stuff.
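
As a rough, untested sketch (assuming mod_rewrite is available on your host), such a rule would look something like this - it only refuses requests that come from that range AND claim to be Baiduspider:

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^220\.181\.
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]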

Find out more at APNIC

peps1

Thank you for all your help!

but....umm....where do I put that code?  :-\

DavidCT

As I said, in .htaccess - assuming you can use it.  If you don't have one yet, make one using Notepad; it's just a text file.  Upload it to your root folder.  The "." in front of the name tells Linux it's a hidden file.

peps1

Done :)

They are still showing up as guests, but seem restricted to "Viewing the board index of [My Site]".

Guess there is no way to stop them from even getting that far, and stop them showing up as guests?

DavidCT

If you did it right they shouldn't get near your forum.

Like I said I'm no pro... I think I forgot a very important thing :(

Before that block, add this:

RewriteEngine On


I don't know if it's needed for that or just RewriteCond/Rule stuff.  Give it a try.

To test it, block yourself too.  Add another deny from line with your entire IP address.  You'll see your IP in WHO'S ONLINE.
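
For example, with a placeholder address (swap 203.0.113.45 for your own IP), the block would become:

<Files *.*>
        order allow,deny
        allow from all
        deny from 220.181.
        deny from 203.0.113.45
</Files>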

EDIT:
Actually, no, that line isn't needed for that.  Make sure it's spelled .htaccess and placed in your root folder.  You can place it in your forum folder instead, but then it'll only protect that folder and anything below it.

j_jindal1

Nice piece of information DavidCT .. Thanks. I was looking for something similar.. :-)
www.ShayarFamily.com Shayri forum of Friends

clyde4210

Blocking IPs is not a good idea unless you are being hit by a spammer. You should use rewrite rules for blocking spiders/bots. Below is what I put in my .htaccess:

RewriteCond %{HTTP_USER_AGENT} Baiduspider
RewriteRule ^.*$ http://127.0.0.1 [R,L]

OR
RewriteCond %{HTTP_USER_AGENT} Baidu
RewriteRule ^.*$ http://127.0.0.1 [R,L]



Arantor

That actually consumes (quite a bit) more processing power than a simple block does, meaning more resources are taken up dealing with Baidu than necessary.

Plus it doesn't show the forum as forbidden that way.
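
If you do go the rewrite route anyway, an untested tweak of that rule using the F flag (with RewriteEngine On, as above) would at least return a proper 403 Forbidden instead of a redirect:

RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]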

clyde4210

It will consume a lot less than constantly chasing down the bot's IPs. Not to mention that by blocking a range, as suggested earlier, you would be blocking legit users as well. Thirdly, when the bot switches ranges - they have more than one range of octets - you would be stuck blocking those too.

Block it once with rewrite rules and be done with it. I don't use this CMS; I was searching about Baidu. I swore I had seen it before, and found this post about it. I've seen posts on how to block it, and it seemed it would be time-consuming, not to mention it could take just as much processing power with all the IP blocking.

Sorry if you disagree, but in the end the logical action is to use rewrite rules.

Arantor

Oh, disagreement is fine. There are good reasons not to do it as an IP block - and you've named them.

But there are problems with rewrite rules too; it's not a single master solution. Plus some users don't have mod_rewrite enabled, especially on shared hosting, or don't even run on Apache.

clyde4210

Those are valid reasons, and I forgot about them. Now to the subject at hand: teaching people who have rewrite rules enabled (or can enable them) to block IPs is not such a good idea, in my opinion. They will just start blocking IPs left and right when it comes to spiders, thus blocking legit users and constantly editing the .htaccess.

Nowadays servers are dual or quad core, and processing is fairly fast compared to yesteryear. It would not consume much processing power, and it takes less than a tenth of a second to go through the .htaccess even with the rewrite rules in place. The site's load time would not be affected.

Arantor

Considering that every single page load of the forum goes through the .htaccess rules - which includes every image, every avatar (unless it's external or otherwise specially configured), every attachment, every JavaScript file and so on - we're not talking one pass through the .htaccess rules per page for the user. We're talking 20+ outside the thread display, and many more inside a typical thread display.

On shared hosts, especially, where CPU is constrained by hundreds or even thousands of other sites on the same physical server, it's vital to keep it as lean as possible.

DavidCT

Baidu - the real one - obeys the robots.txt * wildcard (unlike Yandex, which I had to ban), so there's no reason to ban them.  It's the fake ones that are the problem.

On "banning legit users" - it's always possible.  I personally look at all possible bot accesses and do a WHOIS, and if it's a hosting company, it gets banned - the whole range.  I also block Russia and China, as it isn't likely legit users from there would visit my site, and it stops a whole lot of abuse, as those countries seem to have a big problem with spamming and harvesting.

My .htaccess is getting huge - 200 KB now (including a lot of comments and blank lines) - but in my opinion it uses fewer resources to block the abusers than to let them crawl my site unchecked. I could be wrong, but I feel better blocking them.  I hate people making money off collecting my site's data :)

mbanusick2

Please, is there any harm in allowing those bots? Is it only about the information... will it harm my site? I don't know much about these things.

DavidCT

Quote from: mbanusick2 on December 20, 2009, 02:38:45 PM
Please, is there any harm in allowing those bots? Is it only about the information... will it harm my site? I don't know much about these things.

No worries, be happy.  Harvesters and scrapers just steal your content and bandwidth, no biggie.

The "nasty bots" you'll never be able to block as they use virus infected computers to do their dirty work ;)

mbanusick2

What do you mean by stealing... do they also count as my site's visitors?
So if I have a lot of bandwidth to spare, should I allow them?

Arantor

They also take up processing time which, on a shared host, means genuine users can't get to the forum properly.

mbanusick2

I'm sorry to bother you guys, but how do I know if I'm on a shared host? Does that also mean the bots are recorded as my guests, and what do they do with the info they collect?
Sorry to bother you...
Thanks a lot.

clyde4210

If your host does not allow rewrite rules, then you could use:

SetEnvIfNoCase User-Agent "^baidu" bad_bot=1
<FilesMatch "(.*)">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</FilesMatch>
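
Side note: the Order/Allow/Deny syntax used throughout this thread is the old Apache 2.2 style. On Apache 2.4 or later, the rough equivalent (untested here, but following the standard upgrade notes) would be:

SetEnvIfNoCase User-Agent "^baidu" bad_bot=1
<FilesMatch "(.*)">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</FilesMatch>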

While I don't agree with blocking hundreds of IPs to ban bots, it can be done that way. The time spent processing hundreds of IPs or a rewrite is about the same. I think this hxxp:www.webmasterworld.com/forum92/1145.htm [nonactive] post really sums it up. It all boils down to your server's processor and RAM.

Arantor mentions that the .htaccess has to be gone through for every image and script. Well, you can set caching headers on those files using .htaccess, so browsers don't re-request them as often, which will cut down the processing time.
<IfModule mod_expires.c>
ExpiresActive On
ExpiresDefault A0

# Set up caching on media files for 1 year (forever?)
<FilesMatch "\.(ico|flv|pdf|mov|mp3|wmv|ppt)$">
  ExpiresDefault A29030400
  Header append Cache-Control "public"
</FilesMatch>

# Set up caching on media files for 1 week
<FilesMatch "\.(gif|jpg|jpeg|png|swf|bmp)$">
  ExpiresDefault A604800
  Header append Cache-Control "public"
</FilesMatch>

# Set up 2 Hour caching on commonly updated files
<FilesMatch "\.(xml|txt|html|js|css)$">
  ExpiresDefault A7200
  Header append Cache-Control "private, proxy-revalidate, must-revalidate"
</FilesMatch>
</IfModule>


I have NukeSentinel(tm), so I don't need .htaccess as much for banning bots and/or people.

midlandshunnies

I have just blocked a Baidu spider - not sure if it was real or not - using this code with a different IP range. I was seeing about 90 visits every 15 minutes or so, so hopefully my site will see a load-time improvement and lower processing requirements!

<Files *.*>
        order allow,deny
        allow from all
        deny from 180.76.
</Files>

Martine M

I blocked 2 IP ranges and it was gone for a few days, then came back with another IP range.
I blocked that one too, and now I think it is gone.
Running SMF 2.09 - Diego Andrés Theme Elegant Mind - TP 1.0 - Main Forum language English - Browser Firefox 33


Quexinos

Hey guys, sorry to bump this, but I used the IP deny tool in cPanel to block Baidu and a couple of other search engines.  I just wasn't happy with how often they came by, and that seems to work.

Does using that take up a lot of resources?  It just writes it to .htaccess, right?  I want to make sure I'm using as few resources as possible, and I've only blocked like 5 IPs, so I should be okay, right?

a10

Quote: I've only blocked like 5 IPs, so I should be okay, right?
Yes. You can block a lot more if you need to. See this post.
2.0.19, php 8.0.23, MariaDB 10.5.15. Mods: Contact Page, Like Posts, Responsive Curve, Search Focus Dropdown, Add Join Date to Post.

Igal-Incapsula

Hi
Please note that Baiduspider can, and will, access your site from several different IP ranges - not only from 180.76...
Here are a few other IPs that Baidu will use:
125.39.78.0
123.125.66.15
220.181.7.13
119.63.193.0
and more.

You can verify Baidu spider IPs and/or user-agents via hxxp:botopedia.org [nonactive]

Martine M

Thanks for the URL, I'll bookmark it.
Right now I have had Baidu successfully blocked for a while.
Running SMF 2.09 - Diego Andrés Theme Elegant Mind - TP 1.0 - Main Forum language English - Browser Firefox 33

