How do I reduce bandwidth usage?

Started by ajac63, February 18, 2021, 10:26:28 PM


ajac63

Firstly, apologies if this is in the wrong section of SMF, and also for not having posted for a while, I've had problems...  ::)  Basically, a fair amount of bandwidth has been used up for the current month on my forum, over 86%, so I'm looking for ways to reduce further excessive bandwidth usage.  What things should I be doing?  I've discovered that spiders and crawlers can use up a lot of bandwidth, and that a robots.txt file in the root directory can help with this, but I'm not that familiar with the syntax.  I've also discovered that turning on HTTP compression can help, but I don't know if SMF powered forums have this by default.

To stop more bandwidth being used up, I put my forum offline for maintenance about a day ago, but the usage has gone up to 89.45%. How can more bandwidth be used up when it's offline?

Thanks if anyone can help :)
Another SMF believer

shadav

if your users post a lot of images
this mod can help a bit by making the file sizes smaller
https://custom.simplemachines.org/mods/index.php?mod=4082

using a program to optimize all the images already on your site can help too
we're not talking much, but in the end every little bit does help

you can also sign up for Cloudflare and use their services to help save a bit

Antechinus

Sounds like it's the spiders that are causing a lot of the problem.* Banning the bad ones via .htaccess is the way to go. That will nail the ones that ignore robots.txt. They'll still be able to ping your domain, but they won't get any bytes back (just a 403 error message).

The .htaccess for the root directory of my domain looks like this:

#BlockBotsByUserAgent
SetEnvIfNoCase User-Agent (Ahrefs|Baidu|BLEXBot|Brandwatch|DotBot|Garlik|Knowledge|libwww-perl|Linkdex|MJ12bot|omgili|PetalBot|Proximic|Semrush|Seznam|Sogou|Tweetmeme|Trendiction|Wordpress) bad_bot
<RequireAll>
Require all granted
Require not env bad_bot
</RequireAll>

<Files .htaccess>
order allow,deny
deny from all
</Files>

<Files 403.shtml>
order allow,deny
allow from all
</Files>


Adding more bots is easy. I only threw in the ones that were giving me trouble.

*Spiders will often index everything on your site, multiple times in succession, when they are on the rampage. This can chew masses of bandwidth, particularly if you have a lot of images and/or downloadable zips.
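
One caveat, since I'm assuming what your host runs: the Order/Allow/Deny lines are old Apache 2.2 syntax, and on Apache 2.4 they only keep working if mod_access_compat is loaded. If your server turns out to be pure 2.4, the equivalents of those two Files blocks would be:

<Files .htaccess>
Require all denied
</Files>

<Files 403.shtml>
Require all granted
</Files>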

ajac63

Quote from: shadav on February 18, 2021, 11:14:48 PM
if your users post a lot of images
this mod can help a bit by making the file sizes smaller
https://custom.simplemachines.org/mods/index.php?mod=4082

using a program to optimize all the images already on your site can help too
we're not talking much, but in the end every little bit does help

you can also sign up for Cloudflare and use their services to help save a bit
Really glad you mentioned images, as this is one of the main things my hosting provider suggested, so thanks for the mod link.  There are a lot of posts on my forum, so checking them manually would take too long...

Thank you. :)
Another SMF believer

ajac63

Quote from: Antechinus on February 18, 2021, 11:33:39 PM
Sounds like it's the spiders that are causing a lot of the problem.* Banning the bad ones via .htaccess is the way to go. That will nail the ones that ignore robots.txt. They'll still be able to ping your domain, but they won't get any bytes back (just a 403 error message).

The .htaccess for the root directory of my domain looks like this:

#BlockBotsByUserAgent
SetEnvIfNoCase User-Agent (Ahrefs|Baidu|BLEXBot|Brandwatch|DotBot|Garlik|Knowledge|libwww-perl|Linkdex|MJ12bot|omgili|PetalBot|Proximic|Semrush|Seznam|Sogou|Tweetmeme|Trendiction|Wordpress) bad_bot
<RequireAll>
Require all granted
Require not env bad_bot
</RequireAll>

<Files .htaccess>
order allow,deny
deny from all
</Files>

<Files 403.shtml>
order allow,deny
allow from all
</Files>


Adding more bots is easy. I only threw in the ones that were giving me trouble.

*Spiders will often index everything on your site, multiple times in succession, when they are on the rampage. This can chew masses of bandwidth, particularly if you have a lot of images and/or downloadable zips.
Thanks. :) Crawlers and spiders were the other thing my host mentioned, so this rings a bell.  Where exactly in my .htaccess file do I paste this code?  And should I also use a robots.txt file in the root directory?  Would I need both, or is it one or the other?
Another SMF believer

drewactual

Enable OPcache if you're not using it.  My processor usage plummeted the most, but the bandwidth followed.  Having the compiled scripts staged in memory keeps the server from recompiling everything on every request.
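
For reference, a minimal sketch of the relevant php.ini settings (the values here are illustrative, tune them for your server):

; enable the opcode cache (bundled with PHP since 5.5)
opcache.enable=1
; memory for compiled scripts, in megabytes
opcache.memory_consumption=128
; how many script files can be cached
opcache.max_accelerated_files=10000
; only re-check files for changes every 60 seconds
opcache.revalidate_freq=60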

Also... set up your expires headers and caching.  If you don't change the presentation of your forum much, you can cache the images, CSS, and other static items for as much as a year, so the user's browser doesn't have to make return trips for the same data... in this same vein, if you're using HTTP/2 you can push everything at once to a new visitor, and if they aren't new? You can check in one call whether they already have a file and NOT send it if they do. 
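
Something like this in .htaccess is a reasonable starting point (a sketch, assuming your host has mod_expires available; adjust the types and lifetimes to suit):

<IfModule mod_expires.c>
ExpiresActive On
# static assets rarely change, so let browsers keep them for a year
ExpiresByType image/png "access plus 1 year"
ExpiresByType image/jpeg "access plus 1 year"
ExpiresByType text/css "access plus 1 year"
ExpiresByType application/javascript "access plus 1 year"
# don't cache the HTML itself, forum pages change constantly
ExpiresByType text/html "access plus 0 seconds"
</IfModule>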

If you aren't using those, and you implement them? You're looking at likely 80% less processor usage and 50% less bandwidth consumed.

drewactual

And another thing.... lose the CSS and JS 'check for updated files' function IF you're done adjusting function and presentation... while you're at it, combine and minify all the CSS into one file, and the same for the JS... don't host any libraries you can get from a CDN... use their bandwidth instead of your own.  If you can host the images elsewhere, same thing.
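
For example, if your theme loads jQuery, pulling it from a public CDN instead of your own server is a one-line change in the template (the version here is just an example):

<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>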

ajac63

OK, thanks for helping, but this code that Antechinus mentions:

#BlockBotsByUserAgent
SetEnvIfNoCase User-Agent (Ahrefs|Baidu|BLEXBot|Brandwatch|DotBot|Garlik|Knowledge|libwww-perl|Linkdex|MJ12bot|omgili|PetalBot|Proximic|Semrush|Seznam|Sogou|Tweetmeme|Trendiction|Wordpress) bad_bot
<RequireAll>
   Require all granted
   Require not env bad_bot
</RequireAll>

<Files .htaccess>
order allow,deny
deny from all
</Files>

<Files 403.shtml>
order allow,deny
allow from all
</Files>

Where in the .htaccess file do I put it?  And if I use this, do I also need a robots.txt file as well?  If so, does it go in the root directory or under public_html?

My .htaccess file currently has this in it:

RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301,NE]
# php -- BEGIN cPanel-generated handler, do not edit
# This domain inherits the "PHP" package.
# php -- END cPanel-generated handler, do not edit

Also, how do I know which crawlers and spiders are using up b/width?
Another SMF believer

drewactual

drop it all in at the end.

robots.txt is a good thing, but... the bad bots couldn't care less whether it's there or not... they're going to do what they do regardless- only legitimate bots adhere to your request, which is basically what your robots.txt is- a request to robots.
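
since the robots.txt syntax came up earlier, a minimal file (it lives in the web root, next to your forum's index.php) looks something like this... note that Crawl-delay is a request too, and only some of the well-behaved bots honor it:

# ask all bots to pause between requests
User-agent: *
Crawl-delay: 10

# block one specific crawler outright (if it listens)
User-agent: MJ12bot
Disallow: /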

you can look at the visitor logs on your server- if you're using cPanel there are generally several tools to choose from- that show you the statistics of your bandwidth usage... and by IP address...
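
if you have shell access, a quick way to rank user agents by request count is something like this (a sketch- it assumes the standard combined log format, and the access log path varies by host):

# count requests per user agent, busiest first
awk -F'"' '{print $6}' /path/to/access_log | sort | uniq -c | sort -rn | head -20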

a10

Had days with 100,000+ pageviews from spiders/bots/China. This curbed it:

RewriteCond %{QUERY_STRING} .
RewriteCond %{HTTP_USER_AGENT} Ahrefs|Baiduspider|bingbot|BLEXBot|Grapeshot|heritrix|Kinza|LieBaoFast|Linguee|Mb2345Browser|MegaIndex|MicroMessenger|MJ12bot|PiplBot|Riddler|Seekport|SemanticScholarBot|SemrushBot|serpstatbot|Siteimprove.com|trendictionbot|UCBrowser|MQQBrowser|Vagabondo|AspiegelBot|zh_CN|OPPO\sA33|zh-CN|YandexBot [NC]
RewriteRule ^.* - [F,L]
2.0.19, php 8.0.23, MariaDB 10.5.15. Mods: Contact Page, Like Posts, Responsive Curve, Search Focus Dropdown, Add Join Date to Post.

ajac63

Quote from: drewactual on February 21, 2021, 12:28:38 PM
drop it all in at the end.

robots.txt is a good thing, but... the bad bots couldn't care less whether it's there or not... they're going to do what they do regardless- only legitimate bots adhere to your request, which is basically what your robots.txt is- a request to robots.

you can look at the visitor logs on your server- if you're using cPanel there are generally several tools to choose from- that show you the statistics of your bandwidth usage... and by IP address...
OK, thanks for clearing that up :).  This suggests that the .htaccess method is better than the robots.txt one, which doesn't stop bad bots.  But what about 'good bots' that also take up a lot of b/width, such as indexing bots?  I suppose I shouldn't exclude those, but then how do I minimize the amount of b/width they use up?  What would I put in the .htaccess file for this?  Lastly, do I need a robots.txt file at all?  As I understand it, people can find out what's in it just by entering mysitename.com/robots.txt, so it's not exactly secure...
Another SMF believer

ajac63

Quote from: a10 on February 21, 2021, 05:16:51 PM
Had days with 100,000+ pageviews from spiders/bots/China. This curbed it:

RewriteCond %{QUERY_STRING} .
RewriteCond %{HTTP_USER_AGENT} Ahrefs|Baiduspider|bingbot|BLEXBot|Grapeshot|heritrix|Kinza|LieBaoFast|Linguee|Mb2345Browser|MegaIndex|MicroMessenger|MJ12bot|PiplBot|Riddler|Seekport|SemanticScholarBot|SemrushBot|serpstatbot|Siteimprove.com|trendictionbot|UCBrowser|MQQBrowser|Vagabondo|AspiegelBot|zh_CN|OPPO\sA33|zh-CN|YandexBot [NC]
RewriteRule ^.* - [F,L]

Thank you, should I add this after the code by Antechinus, or instead of it?
Another SMF believer

a10

^^^ I'd guess it's best to use only one; maybe test each for a period and look for which gives the best results.  Was just sharing what worked very well on my forum.  Let's see if Antechinus can give an opinion about any practical differences between the two, apart from the different lists of bots.
2.0.19, php 8.0.23, MariaDB 10.5.15. Mods: Contact Page, Like Posts, Responsive Curve, Search Focus Dropdown, Add Join Date to Post.

ajac63

Quote from: a10 on February 22, 2021, 06:13:00 AM
^^^ I'd guess it's best to use only one; maybe test each for a period and look for which gives the best results.  Was just sharing what worked very well on my forum.  Let's see if Antechinus can give an opinion about any practical differences between the two, apart from the different lists of bots.
Yes, like that - makes sense.  Anyway, soon comes the moment of truth: if I use this .htaccess method plus image optimization, there should be a big reduction in the b/width consumption rate?
Another SMF believer

ajac63

Sorry to bump the thread, but...  Right, I put my forum back online today after it had been in maintenance mode, and edited my .htaccess file with the anti-bot code suggested by Antechinus, so that's all good. 
I've noticed, though, that the forum stats at the bottom are saying 'Most Online Today: 81', which is a bit odd seeing as the forum has only been online for about an hour at the time of posting.  Confused... ???
Another SMF believer

shadav

is your forum in the main public folder, or is it in a subfolder?

though you are bouncing bad bots off the site with .htaccess, SMF may still register that they technically were online for half a second....

if your forum is in a subfolder, try putting the code in the .htaccess of the main public folder

ajac63

Thanks for your help.  It's definitely in the main public folder; that's where index.php (and index.php~) live, along with .htaccess and Settings.php.  The theme files are in a subfolder. 
Another SMF believer

Aleksi "Lex" Kilpinen

The .htaccess should catch everything before it gets to SMF, but could be you just had a wave of crawlers that were not included in the .htaccess rules. Some "search crawlers" can really be a nuisance, and may appear in numbers reaching a hundred in a small timeframe.
Slava Ukraini!
"Before you allow people access to your forum, especially in an administrative position, you must be aware that that person can seriously damage your forum. Therefore, you should only allow people that you trust, implicitly, to have such access." -Douglas

How you can help SMF

ajac63

Quote from: Aleksi "Lex" Kilpinen on March 02, 2021, 12:07:40 AM
The .htaccess should catch everything before it gets to SMF, but could be you just had a wave of crawlers that were not included in the .htaccess rules. Some "search crawlers" can really be a nuisance, and may appear in numbers reaching a hundred in a small timeframe.
So some of those 81 visitors could have been crawlers, hmmm.  But then again, some could have been genuine...  I'll do a search for more bad crawlers and add those.  Thanks for the heads up!  I've already just gone through one army of bots; I don't want another.
Another SMF believer

ajac63

Just a mini update...  The bandwidth issue I had has definitely been cleared up and has been very stable for about two weeks now, so thanks once again to everyone that helped in this thread.   :)
Another SMF believer

drewactual

Quote from: Aleksi "Lex" Kilpinen on March 02, 2021, 12:07:40 AM
The .htaccess should catch everything before it gets to SMF, but could be you just had a wave of crawlers that were not included in the .htaccess rules. Some "search crawlers" can really be a nuisance, and may appear in numbers reaching a hundred in a small timeframe.

hundreds?

I've experienced the China crawlers literally in the tens of thousands... the .htaccess blocks took care of the majority of them, but still, in the late Jan to early Mar timeframe they return... every year... this year and last they were in the 4k range at peak, but before the .htaccess blocking (which someone provided here- and thank you for that, whoever you were) I would hit 32,000+ crawlers on top of my 300 or so users.  I was watching the metrics like a hawk at the time- half willing to kill them right there, and the winning half morbidly curious to see if the server could handle it... it did... but man, they came in FORCE.

They ALL originated from China- and not a one of them gives a damn what you request or suggest- they just bear down on you... if they crash your server? they still don't care.... they stack up to bum rush it again as soon as you're back up.... they crawl every. single. page. over and over... I despise them.  Data harvesting is what they're doing- and it's amazing what they can put together by doing so- an innocuous comment here, a mention of a job title/position there, a bit of information that means nothing by itself, but in aggregate with other comments, both from the same user over time and from other sources? Boom- they get a complete picture of whatever the subject matter is, be it technical or personal.  They are something else...... and.... we (the US) do it too... we do it as well as they do.  Nothing is 'private', and with AI it's easier to make sense of the pile of formless data.... and forums are gold mines, as rich or richer than social media.

Aleksi "Lex" Kilpinen

I do believe that happens, but must say in my years online I have never seen a single crawler come in with a force quite like that. :o
Slava Ukraini!
"Before you allow people access to your forum, especially in an administrative position, you must be aware that that person can seriously damage your forum. Therefore, you should only allow people that you trust, implicitly, to have such access." -Douglas

How you can help SMF

drewactual

They tidied up their annual run a couple of weeks ago... now I'm seeing somewhere between 400 and 1500 a day... that'll drop off by summer to around 500 tops... then, next January, they'll bum rush again. 

What I should do is make a copy of their IPs and adjust accordingly.  It 'should' be that simple.

2018 was 'the most', and I misspoke- it was 31k, not 32... 2019 was just a hundred or so short of that, and back then I was only blocking one range (before whoever it was left the post identifying the ranges they had encountered)... in 2020 I had the .htaccess blocks set up, and same this year... I want to say the peak this year was just over 4k. 

ajac63

Quote from: drewactual on March 17, 2021, 10:31:09 AM
Quote from: Aleksi "Lex" Kilpinen on March 02, 2021, 12:07:40 AM
The .htaccess should catch everything before it gets to SMF, but could be you just had a wave of crawlers that were not included in the .htaccess rules. Some "search crawlers" can really be a nuisance, and may appear in numbers reaching a hundred in a small timeframe.

hundreds?

I've experienced the China crawlers literally in the tens of thousands... the .htaccess blocks took care of the majority of them, but still, in the late Jan to early Mar timeframe they return... every year... this year and last they were in the 4k range at peak, but before the .htaccess blocking (which someone provided here- and thank you for that, whoever you were) I would hit 32,000+ crawlers on top of my 300 or so users.  I was watching the metrics like a hawk at the time- half willing to kill them right there, and the winning half morbidly curious to see if the server could handle it... it did... but man, they came in FORCE.

They ALL originated from China- and not a one of them gives a damn what you request or suggest- they just bear down on you... if they crash your server? they still don't care.... they stack up to bum rush it again as soon as you're back up.... they crawl every. single. page. over and over... I despise them.  Data harvesting is what they're doing- and it's amazing what they can put together by doing so- an innocuous comment here, a mention of a job title/position there, a bit of information that means nothing by itself, but in aggregate with other comments, both from the same user over time and from other sources? Boom- they get a complete picture of whatever the subject matter is, be it technical or personal.  They are something else...... and.... we (the US) do it too... we do it as well as they do.  Nothing is 'private', and with AI it's easier to make sense of the pile of formless data.... and forums are gold mines, as rich or richer than social media.
I can identify with the 'late Jan to early Mar timeframe...' as it was February when the sudden spike in b/width usage happened to me, but it's notable that in your case they were all from China, and for some years.  I wonder who they're harvesting data for?
Another SMF believer

drewactual

Quote from: ajac63 on March 17, 2021, 08:47:31 PM
Quote from: drewactual on March 17, 2021, 10:31:09 AM
Quote from: Aleksi "Lex" Kilpinen on March 02, 2021, 12:07:40 AM
The .htaccess should catch everything before it gets to SMF, but could be you just had a wave of crawlers that were not included in the .htaccess rules. Some "search crawlers" can really be a nuisance, and may appear in numbers reaching a hundred in a small timeframe.

hundreds?

I've experienced the China crawlers literally in the tens of thousands... the .htaccess blocks took care of the majority of them, but still, in the late Jan to early Mar timeframe they return... every year... this year and last they were in the 4k range at peak, but before the .htaccess blocking (which someone provided here- and thank you for that, whoever you were) I would hit 32,000+ crawlers on top of my 300 or so users.  I was watching the metrics like a hawk at the time- half willing to kill them right there, and the winning half morbidly curious to see if the server could handle it... it did... but man, they came in FORCE.

They ALL originated from China- and not a one of them gives a damn what you request or suggest- they just bear down on you... if they crash your server? they still don't care.... they stack up to bum rush it again as soon as you're back up.... they crawl every. single. page. over and over... I despise them.  Data harvesting is what they're doing- and it's amazing what they can put together by doing so- an innocuous comment here, a mention of a job title/position there, a bit of information that means nothing by itself, but in aggregate with other comments, both from the same user over time and from other sources? Boom- they get a complete picture of whatever the subject matter is, be it technical or personal.  They are something else...... and.... we (the US) do it too... we do it as well as they do.  Nothing is 'private', and with AI it's easier to make sense of the pile of formless data.... and forums are gold mines, as rich or richer than social media.
I can identify with the 'late Jan to early Mar timeframe...' as it was February when the sudden spike in b/width usage happened to me, but it's notable that in your case they were all from China, and for some years.  I wonder who they're harvesting data for?

I don't think they're targeting anything in particular, or working for anyone in particular... I think they have a toy, and by damn they're going to use it.

There was a time, pre-internet, when there were three entities you had to be concerned about collecting information... it was (and this is likely going to surprise you if you didn't know) the financial industry- particularly credit cards... they collected all kinds of information about your spending habits... then churches- The Vatican first, but a close second was the Mormons.... they didn't reach out into the world, but what they knew about their followers was startling and astounding... then the usual suspects- gov'ts... particularly the IRS, but it didn't stay there.  Then other gov'ts...

This interweb thing is a boon for data collection.  You're already well documented no matter how well you hide from it.  AI makes it even more complicated, as it collects a heaping pile of raw, formless data and creates relationships between those datums on the fly- and that is where Google and the social media companies come in... followed closely by the US and China, who are neck and neck as to who collects more and the applications it's put to.... they actually sell it, for one, after targeting known users of products... your, say, kayak makers and marketers don't have to take out Super Bowl ads at millions of dollars to target the perhaps 1.5% of watchers that may be interested in their product... now they just pay the big harvesters and get precision target marketing. 

.... The fear is wherever else it goes... one thing is for certain- it never goes away. 

Maybe five years ago now, I got a FB message asking me "Is this you in this picture?"- and there I sat, in my drunken stupor at a Vegas blackjack table, in the background of someone's vacation photo... I said "nope"... some months later I saw a picture in the 'photos of you' section and there I sat- 'tagged'- but it wasn't me who confirmed it.  I wonder if someone I know did, or if they said "yeah right- that's you, ya drunken clown"....

These robots/worms/crawlers aren't our friends... unless you're selling something- and then they are... somewhat... but not for long...

There ain't no hiding from it.  It's the way it is and the way it's going to be, and there is nothing you can do about it even if you unplug- because your friends have Facebook and Twitter, and you and they carry phones, and those phones keep up with where you are and who you're around..... unless you bounce completely off the grid they know... and then they know you've bounced...

wonderful, no? 

ajac63

Not wonderful at all >:(.  It seems cyberspace has become such a jungle of crawlers and spiders that it's getting more and more difficult to know what to expect next.  As for pre-internet info gatherers, I didn't know about The Vatican, but I did know about the practices of the Mormons; they wanted to know your life story as a prerequisite for joining.  I was literally 'this far' from being baptized when I luckily came to my senses and walked straight back out of their European headquarters opposite Exhibition Road, London. :-X
Another SMF believer
