How to block baidu spider?

Started by peps1, November 27, 2009, 12:59:37 AM


peps1

I know Baiduspider is a legitimate Chinese search engine bot, but I have no Chinese content and no desire for Chinese traffic... yet there are hundreds of the little buggers bleeding my limited bandwidth!

The only mod I can find to block them is for SMF2.

Please Help!   

DavidCT

In my experience Baiduspider obeys robots.txt.  You can either deny it specifically or deny everything globally; it seems to respect the * wildcard.

robots.txt

#Baiduspider
User-agent: Baiduspider
Disallow: /

#Others
User-agent: *
Disallow: /


Some bots refuse to obey robots.txt, if they even bother to check it.  Those you can block through your .htaccess file.
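For example - untested, and assuming Apache with mod_setenvif - something along these lines in .htaccess would turn away anything identifying itself as Baiduspider:

# untested example: deny any request whose User-Agent contains "Baiduspider"
SetEnvIfNoCase User-Agent "Baiduspider" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot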

peps1

Thanks DavidCT, I just slap this in the root, right?

Also will it only block Baiduspider, or every bot?

DavidCT

Yes, robots.txt goes in the root, so you'd see it if you went to yourdomain.com/robots.txt.

The * wildcard will block any bot that respects * and isn't specifically mentioned otherwise.  If you didn't define Googlebot, Slurp, MSNBOT, etc., and want those to crawl your site, you need to allow them by removing the /...


User-agent: Googlebot
Disallow:
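So, putting it together, a robots.txt along these lines (taking Googlebot, Slurp and MSNBOT as the ones you want - swap in whichever bots you actually care about) lets those three crawl and tells everyone else to stay out:

User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /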

peps1

The Baidu spiders are still crawling the forum... 50 every couple of hours.

Is there a way to just block the IP range 220.181.7.*** ?

DavidCT

Quote from: peps1 on November 28, 2009, 11:33:40 AM
The Baidu spiders are still crawling the forum... 50 every couple of hours.

Is there a way to just block the IP range 220.181.7.*** ?

220.x.x.x is NOT Baidu, it's a spammer (EDIT: harvester, more than likely) pretending to be.  Baidu uses 119.63.192.0 - 119.63.199.255.

220.181.0.0 - 220.181.255.255 = CHINANET Beijing province network

There are many ways to block them through .htaccess; one way is:

<Files *.*>
        order allow,deny
        allow from all
        deny from 220.181.
</Files>


Doing this will block the entire range, no matter who is using it.  You can use RewriteCond with rules - if this range and this user-agent, don't allow - but I don't want to promise you a working rule without testing it out first, as I'm no pro with that stuff.
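That said, untested and just to show the shape of it (requires mod_rewrite), the idea is to deny only when the request comes from that range and claims to be Baiduspider:

RewriteEngine On
# deny only when both conditions match: a 220.181.x.x source AND a Baiduspider user-agent
RewriteCond %{REMOTE_ADDR} ^220\.181\.
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]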

Find out more at APNIC

peps1

Thankyou for all your help!

but....umm....where do I put that code?  :-\

DavidCT

As I said, in .htaccess - assuming you can use it.  If you don't have one yet, make one using Notepad - it's just a text file.  Upload it to your root folder.  The "." period in front of the name tells Linux it's a hidden file.

peps1

Done :)

They are still showing up as guests, but seem restricted to "Viewing the board index of [My Site]".

I guess there is no way to stop them from even getting that far, and stop them showing up as guests?

DavidCT

If you did it right they shouldn't get near your forum.

Like I said I'm no pro... I think I forgot a very important thing :(

Before that block, add this:

RewriteEngine On


I don't know if it's needed for that or just for the RewriteCond/RewriteRule stuff.  Give it a try.

To test it, block yourself too.  Add another "deny from" line with your own full IP address.  You'll see your IP in WHO'S ONLINE.
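For example, with 203.0.113.45 standing in as a placeholder for your own address:

# test only - replace 203.0.113.45 with your own IP, then remove this line when done
deny from 203.0.113.45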

EDIT:
Actually, no, that line isn't needed for that.  Make sure the file is spelled .htaccess and placed in your root folder.  You can place it in your forum folder instead, but it'll only protect that folder and anything below it.

j_jindal1

Nice piece of information DavidCT .. Thanks. I was looking for something similar.. :-)
www.ShayarFamily.com Shayri forum of Friends

clyde4210

Blocking IPs is not a good idea unless you are being hit by a spammer. You should use rewrite rules for blocking spiders/bots. Below is what I put in my .htaccess.

RewriteCond %{HTTP_USER_AGENT} Baiduspider
RewriteRule ^.*$ http://127.0.0.1 [R,L]

OR
RewriteCond %{HTTP_USER_AGENT} Baidu
RewriteRule ^.*$ http://127.0.0.1 [R,L]



Arantor

That actually consumes (quite a bit) more processing power than a simple block does, meaning more resources are taken up dealing with Baidu than necessary.

Plus it doesn't show the forum as forbidden that way.
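If you do want to go the rewrite route, a version that answers with 403 Forbidden instead of bouncing the bot to 127.0.0.1 would look roughly like this (untested sketch; needs mod_rewrite):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
# [F] answers with 403 Forbidden rather than a redirect
RewriteRule .* - [F,L]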

clyde4210

It will consume a lot less than constantly chasing down the bot's IPs. Not to mention that by blocking a range, as stated previously, you would be blocking legit users as well. Thirdly, when the bot switches ranges, because they have more than one range of octets, you would be stuck blocking those too.

Block it once with rewrite rules and be done with it. I don't use this CMS; I was searching for information about Baidu. I swore I had seen it before, and found this post about it. I've seen posts on how to block it, and it seemed it would be time-consuming, not to mention it could take just as much processing power with all the IP blocking.

Sorry if you disagree, but in the end the logical approach is to use rewrite rules.

Arantor

Oh, disagreement is fine. There are good reasons not to do it as an IP block - and those are them.

But there are problems with rewrite rules too; it's not a single master solution. Plus some users don't have mod_rewrite enabled, especially on shared hosting, or aren't even running on Apache.

clyde4210

Those are valid reasons, and I forgot about those. Now to the subject at hand: teaching people who do have rewrite rules enabled, or can enable them, to block IPs is not such a good idea in my opinion. They will just start blocking IPs left and right when it comes to spiders, thus blocking legit users and constantly editing the .htaccess.

Nowadays servers are dual or quad core, and processing is fairly fast compared to yesteryear. It would not consume much processing power, and it takes less than a tenth of a second to go through the .htaccess even with the rewrite rules in use. The site's load time would not be affected.

Arantor

Considering that every single page load of the forum goes through the .htaccess rules - which includes every image, every avatar (unless it's external or otherwise specially configured), every attachment, every JavaScript file and so on - we're not talking about one pass through the .htaccess rules per page for the user. We're talking 20+ outside the thread display, and many more inside a typical thread display.

On shared hosts, especially, where CPU is constrained by hundreds or even thousands of other sites on the same physical server, it's vital to keep it as lean as possible.

DavidCT

Baidu, the real one, obeys robots.txt's * wildcard (unlike Yandex, which I had to ban), so there's no reason to ban them.  It's the fake ones that are the problem.

As for "banning legit users" - it's always possible.  I personally look at all likely bot accesses and do a WHOIS, and if it's a hosting company, it gets banned - the whole range.  I also block Russia and China, as it isn't likely that legit users from there would visit my site, and it stops a whole lot of abuse, since those countries seem to have a big problem with spamming and harvesting.

My .htaccess is getting huge - 200k now (including a lot of comments and blank lines) - but in my opinion it uses fewer resources blocking the abusers than allowing them to crawl my site unchecked. I could be wrong, but I feel better blocking them.  I hate people making money off collecting my site's data :)

mbanusick2

Please, is there any harm in allowing those bots? Is it only about the information... will it harm my site? I don't know much about these things.

DavidCT

Quote from: mbanusick2 on December 20, 2009, 02:38:45 PM
Please, is there any harm in allowing those bots? Is it only about the information... will it harm my site? I don't know much about these things.

No worries, be happy.  Harvesters and scrapers just steal your content and bandwidth, no biggie.

The "nasty bots" you'll never be able to block as they use virus infected computers to do their dirty work ;)
