Author Topic: How to block baidu spider?  (Read 102119 times)

Offline peps1

  • Jr. Member
  • **
  • Posts: 262
How to block baidu spider?
« on: November 27, 2009, 12:59:37 AM »
I know Baiduspider is a legitimate Chinese search engine bot, but I have no Chinese content and no desire for Chinese traffic... and there are hundreds of the little buggers bleeding my limited bandwidth!

The only mod I can find to block them is for SMF2.

Please Help!   

Offline DavidCT

  • Sophist Member
  • *****
  • Posts: 1,239
  • Gender: Male
  • $$$ This $pace For Rent $$$
    • Home Plate Network's Ballpark
Re: How to block baidu spider?
« Reply #1 on: November 27, 2009, 10:50:54 AM »
In my experience Baiduspider obeys robots.txt.  You can deny it either by name or globally; it seems to respect the * wildcard.

robots.txt
Code: [Select]
#Baiduspider
User-agent: Baiduspider
Disallow: /

#Others
User-agent: *
Disallow: /

Some bots refuse to obey robots.txt, if they even bother to check it.  Those you can block through your htaccess file.

Offline peps1

Re: How to block baidu spider?
« Reply #2 on: November 27, 2009, 11:57:22 PM »
Thanks DavidCT, I just slap this in the root right?

Also will it only block Baiduspider, or every bot?

Offline DavidCT

Re: How to block baidu spider?
« Reply #3 on: November 28, 2009, 09:41:16 AM »
Yes, robots.txt goes in the root, so you'd see it at yourdomain.com/robots.txt.

The * wildcard will block any bot that respects * and isn't specifically named elsewhere in the file.  If you haven't defined Googlebot, Slurp, MSNBOT, etc., and want those to crawl, you need to allow each one with its own entry and an empty Disallow (no /):

Code: [Select]
User-agent: Googlebot
Disallow:
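
If you want to sanity-check how these groups interact before uploading, here is a minimal sketch using Python's standard urllib.robotparser. The helper allowed(), the bot name SomeOtherBot and the path /index.php are made up for illustration, and a real crawler may read robots.txt differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Rules from this thread: block Baiduspider and every unnamed bot,
# but give Googlebot its own group with an empty Disallow.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, path: str = "/index.php") -> bool:
    """Return True if user_agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # SomeOtherBot is a hypothetical bot that has no group of its own.
    for bot in ("Baiduspider", "Googlebot", "SomeOtherBot"):
        print(bot, "allowed" if allowed(ROBOTS_TXT, bot) else "blocked")
```

If the "User-agent: *" group is left out, only Baiduspider is blocked and any bot not named in the file is allowed.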

Offline peps1

Re: How to block baidu spider?
« Reply #4 on: November 28, 2009, 11:33:40 AM »
The Baidu spiders are still crawling the forum... 50 every couple of hours.

Is there a way to just block the IP range 220.181.7.*** ?

Offline DavidCT

Re: How to block baidu spider?
« Reply #5 on: November 28, 2009, 11:43:45 AM »
220.x.x.x is NOT Baidu; it's a spammer (EDIT: a harvester, more than likely) pretending to be Baidu.  Baidu uses 119.63.192.0 - 119.63.199.255.

220.181.0.0 - 220.181.255.255 = CHINANET Beijing province network
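
A quick way to check which of these ranges an address from your logs falls into is Python's standard ipaddress module. This is only a sketch: the two ranges are the ones quoted in this post (not independently verified), and in_range(), to_cidr() and the sample address 220.181.7.10 are made up for illustration.

```python
import ipaddress
from typing import List

# Ranges as quoted in this post (assumption: taken at face value).
BAIDU_RANGE = ("119.63.192.0", "119.63.199.255")
CHINANET_RANGE = ("220.181.0.0", "220.181.255.255")


def in_range(ip: str, first: str, last: str) -> bool:
    """Return True if ip lies between first and last, inclusive."""
    addr = ipaddress.ip_address(ip)
    return ipaddress.ip_address(first) <= addr <= ipaddress.ip_address(last)


def to_cidr(first: str, last: str) -> List[str]:
    """Express a first-last range as the shortest list of CIDR blocks."""
    nets = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in nets]


if __name__ == "__main__":
    sample = "220.181.7.10"  # hypothetical address from the 220.181.7.* block
    print("in Baidu range:", in_range(sample, *BAIDU_RANGE))
    print("in CHINANET range:", in_range(sample, *CHINANET_RANGE))
    print("Baidu range as CIDR:", to_cidr(*BAIDU_RANGE))
```

The CIDR form is handy because most firewalls and WHOIS lookups describe ranges that way.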

There are many ways to block them through .htaccess; one way is:
Code: [Select]
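<Files *.*>
        # Allow rules are checked first, then Deny; a matching Deny wins.
        order allow,deny
        allow from all
        # Partial IP: matches every address starting with 220.181.
        deny from 220.181.
</Files>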

Doing this will block the entire range, no matter who is using it.  You can also use RewriteCond with rules (if this range and this user-agent, don't allow), but I don't want to quote you a rule without testing it first, as I'm no pro with that stuff.
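
The "this range and this user-agent" idea can at least be written out as plain logic. The sketch below is Python, not an .htaccess rule, and is only meant to show the condition; should_block(), the "baiduspider" substring and the sample values in the comments are assumptions made up for illustration.

```python
import ipaddress

# The range from this post, written as a network (220.181.0.0 - 220.181.255.255).
BLOCKED_NET = ipaddress.ip_network("220.181.0.0/16")
# Assumed substring to look for in the user-agent, compared case-insensitively.
BLOCKED_UA = "baiduspider"


def should_block(ip: str, user_agent: str) -> bool:
    """Block only when BOTH the IP is in the range AND the user-agent matches."""
    in_net = ipaddress.ip_address(ip) in BLOCKED_NET
    ua_hit = BLOCKED_UA in user_agent.lower()
    return in_net and ua_hit


if __name__ == "__main__":
    # Hypothetical requests: same range, different user-agents.
    print(should_block("220.181.7.10", "Baiduspider"))
    print(should_block("220.181.7.10", "Mozilla/5.0 (Windows NT 10.0)"))
```

Requiring both conditions means an ordinary visitor from that range is left alone, while anything in the range claiming to be Baiduspider is refused.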

Find out more at APNIC
« Last Edit: November 28, 2009, 11:48:43 AM by DavidCT »

Offline peps1

Re: How to block baidu spider?
« Reply #6 on: November 28, 2009, 11:52:34 AM »
Thank you for all your help!

but....umm....where do I put that code?  :-\

Offline DavidCT

Re: How to block baidu spider?
« Reply #7 on: November 28, 2009, 12:16:54 PM »
As I said, in .htaccess, assuming your host lets you use it.  If you don't have one yet, make one using Notepad - it's just a text file.  Upload it to your root folder.  The leading "." in the name tells Linux it's a hidden file.

Offline peps1

Re: How to block baidu spider?
« Reply #8 on: November 28, 2009, 12:23:07 PM »
Done :)

They are still showing up as guests, but seem restricted to "Viewing the board index of [My Site]".

I guess there's no way to stop them from even getting that far, so they don't show up as guests?

Offline DavidCT

Re: How to block baidu spider?
« Reply #9 on: November 28, 2009, 12:28:02 PM »
If you did it right they shouldn't get near your forum.

Like I said I'm no pro... I think I forgot a very important thing :(

Before that block, add this:
Code: [Select]
RewriteEngine On

I don't know if it's needed for that or just RewriteCond/Rule stuff.  Give it a try.

To test it, block yourself too.  Add another "deny from" line with your full IP address.  You'll see your IP in Who's Online.

EDIT:
Actually, no, that line isn't needed for that; RewriteEngine On only matters for RewriteCond/RewriteRule.  Make sure the file is spelled .htaccess and placed in your root folder.  You can place it in your forum folder instead, but then it'll only protect that folder and anything beneath it.
« Last Edit: November 28, 2009, 12:31:54 PM by DavidCT »

Offline j_jindal1

  • Semi-Newbie
  • *
  • Posts: 92
    • Shayri forum of Friends
Re: How to block baidu spider?
« Reply #10 on: December 02, 2009, 06:42:47 AM »
Nice piece of information DavidCT .. Thanks. I was looking for something similar.. :-)
www.ShayarFamily.com Shayri forum of Friends

Offline clyde4210

  • Newbie
  • *
  • Posts: 4
Re: How to block baidu spider?
« Reply #11 on: December 19, 2009, 10:24:59 AM »
Blocking IPs is not a good idea unless you are being hit by a spammer. You should use rewrite rules for blocking spiders/bots. Below is what I put in my .htaccess.

Code: [Select]
RewriteCond %{HTTP_USER_AGENT} Baiduspider
RewriteRule ^.*$ http://127.0.0.1 [R,L]
OR
Code: [Select]
RewriteCond %{HTTP_USER_AGENT} Baidu
RewriteRule ^.*$ http://127.0.0.1 [R,L]
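
The two conditions above differ in how much they catch. A RewriteCond pattern is a regular expression that can match anywhere in the string, and it is case-sensitive unless the [NC] flag is added. The Python sketch below mimics that with re.search so the difference is visible; matches() is a made-up helper, and TOOLBAR_USER is an invented user-agent string used only for illustration.

```python
import re

# A Baiduspider-style user-agent string, used here as an example.
SPIDER = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
# Invented user-agent of an ordinary visitor that merely mentions "Baidu".
TOOLBAR_USER = "ExampleBrowser/1.0 (Baidu Toolbar)"
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def matches(pattern: str, user_agent: str, ignore_case: bool = False) -> bool:
    """Unanchored regex match, case-sensitive unless ignore_case is set."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, user_agent, flags) is not None


if __name__ == "__main__":
    for name, ua in (("spider", SPIDER), ("toolbar user", TOOLBAR_USER), ("googlebot", GOOGLEBOT)):
        print(name, "| Baiduspider:", matches("Baiduspider", ua), "| Baidu:", matches("Baidu", ua))
```

The shorter "Baidu" pattern is the broader of the two, so it is more likely to catch something that is not the spider; a lowercase "baiduspider" pattern would miss the spider entirely unless matching is made case-insensitive.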


Offline Arantor

  • Resident Overthinker
  • SMF Friend
  • SMF Legend
  • *
  • Posts: 70,971
    • StoryBB/StoryBB on GitHub
Re: How to block baidu spider?
« Reply #12 on: December 19, 2009, 11:17:00 AM »
That actually consumes (quite a bit) more processing power than a simple block does, meaning more resources are taken up dealing with Baidu than necessary.

Plus it doesn't show the forum as forbidden that way.
Don’t try to tell me that some power can corrupt a person. You haven’t had enough to know what it’s like.

No good deed goes unpunished / No act of charity goes unresented.

Offline clyde4210

Re: How to block baidu spider?
« Reply #13 on: December 20, 2009, 08:21:06 AM »
It will consume a lot less than constantly chasing down the bots' IPs. Not to mention that by blocking a range as stated previously, you would be blocking legit users as well. Thirdly, when the bot switches ranges, because they have more than one range of addresses, you'd have to go and block those too.

Block it once with rewrite rules and be done with it. I don't use this CMS. I was searching for information about Baidu; I could swear I had seen it before, and I found this post about it. I've seen posts on how to block it by IP and saw that it would be time-consuming, not to mention that all the IP blocking could take just as much processing power.

Sorry if you disagree, but in the end the logical course of action is to use rewrite rules.

Offline Arantor

Re: How to block baidu spider?
« Reply #14 on: December 20, 2009, 08:22:55 AM »
Oh, disagreement is fine. There are good reasons to not do it as an IP block - and those are it.

But there are problems with rewrite rules too; it's not a single master solution. Plus some users don't have mod_rewrite enabled, especially on shared hosting, or don't even run on Apache.

Offline clyde4210

Re: How to block baidu spider?
« Reply #15 on: December 20, 2009, 09:21:33 AM »
Those are valid reasons, and I forgot about them. Now, to the subject at hand: teaching people who do have rewrite rules available, or can turn them on, to block IPs instead is not such a good idea in my opinion. They will just start blocking IPs left and right when it comes to spiders, thus blocking legit users and constantly editing the .htaccess.

Nowadays servers are dual or quad core, and processing is fairly fast compared to yesteryear. It would not consume much processing power, and it would take less than 1/10 of a second to go through the .htaccess even with the rewrite rules in use. The site's load time would not be affected.

Offline Arantor

Re: How to block baidu spider?
« Reply #16 on: December 20, 2009, 09:28:03 AM »
Considering that every single page load of the forum goes through .htaccess rules, which includes every image, every avatar (unless it's external or otherwise specially configured), every attachment, every JavaScript file and so on, we're not talking about one pass through the .htaccess rules per page for the user. We're talking 20+ outside the thread display, and many more inside a typical thread display.

On shared hosts, especially, where CPU is constrained by hundreds or even thousands of other sites on the same physical server, it's vital to keep it as lean as possible.

Offline DavidCT

Re: How to block baidu spider?
« Reply #17 on: December 20, 2009, 11:37:48 AM »
Baidu, the real one, obeys robots.txt's * wildcard (unlike Yandex, which I had to ban), so no reason to ban them.  It's the fake ones that are the problem.

As for "banning legit users": it's always possible.  I personally look at all possible bot accesses and do a WHOIS, and if it's a hosting company, it gets banned - the whole range.  I also block Russia and China, as it isn't likely that legit users from there would visit my site, and it stops a whole lot of abuse, since those countries seem to have a big problem with spamming and harvesting.

My .htaccess is getting huge - 200k now (including a lot of comments and blank lines).  In my opinion, blocking the abusers uses fewer resources than allowing them to crawl my site unchecked.  I could be wrong, but I feel better blocking them.  I hate people making money off collecting my site's data :)

Offline mbanusick2

  • Jr. Member
  • **
  • Posts: 107
  • Gender: Male
  • www.kokoarena.com
    • KokoArena Forum
Re: How to block baidu spider?
« Reply #18 on: December 20, 2009, 02:38:45 PM »
Please, is there any harm in allowing those bots?  Is it only about the information, or will it harm my site?  I don't know much about these things.

Offline DavidCT

Re: How to block baidu spider?
« Reply #19 on: December 20, 2009, 02:50:26 PM »
No worries, be happy.  Harvesters and scrapers just steal your content and bandwidth, no biggie.

The "nasty bots" you'll never be able to block as they use virus infected computers to do their dirty work ;)