How do I reduce bandwidth usage?

Started by ajac63, February 18, 2021, 10:26:28 PM

ajac63

Firstly, apologies if this is in the wrong section of SMF, and also for not having posted for a while; I've had problems...  ::)  Basically, a fair amount of bandwidth has been used up for the current month on my forum, over 86%, so I'm looking for ways to reduce further excessive bandwidth usage.  What should I be doing?  I've discovered that spiders and crawlers can use up a lot of bandwidth, and that a robots.txt file in the root directory can help with this, but I'm not that familiar with the syntax.  I've also discovered that turning on HTTP compression can help, but I don't know whether SMF-powered forums have this by default.

To stop more bandwidth being used up, I put my forum offline for maintenance about a day ago, but the usage has since risen to 89.45%. How can more bandwidth be used when the forum is offline?

Thanks if anyone can help :)
Another SMF believer

shadav

if your users post a lot of images
this mod can help a bit by making the filesizes smaller
https://custom.simplemachines.org/mods/index.php?mod=4082

using a program to optimize all the images on your site itself can help
we're not talking much but in the end every little bit does help

you can sign up for cloudflare and use their services also to help save a bit

Antechinus

Sounds like it's the spiders that are causing a lot of the problem.* Banning the bad ones via .htaccess is the way to go. That will nail the ones that ignore robots.txt. They'll still be able to ping your domain, but they won't get any bytes back (just a 403 error message).

The .htaccess for the root directory of my domain looks like this:

#BlockBotsByUserAgent
# Tag any request whose User-Agent matches the list with the bad_bot variable
SetEnvIfNoCase User-Agent (Ahrefs|Baidu|BLEXBot|Brandwatch|DotBot|Garlik|Knowledge|libwww-perl|Linkdex|MJ12bot|omgili|PetalBot|Proximic|Semrush|Seznam|Sogou|Tweetmeme|Trendiction|Wordpress) bad_bot
# Allow everyone except tagged bots
<RequireAll>
Require all granted
Require not env bad_bot
</RequireAll>

# Block direct web access to the .htaccess file itself
<Files .htaccess>
order allow,deny
deny from all
</Files>

# Keep the 403 page itself readable, so blocked bots still get the error message
<Files 403.shtml>
order allow,deny
allow from all
</Files>


Adding more bots is easy. I only threw in the ones that were giving me trouble.

*Spiders will often index everything on your site, multiple times in succession, when they are on the rampage. This can chew masses of bandwidth, particularly if you have a lot of images and/or downloadable zips.

ajac63

Quote from: shadav on February 18, 2021, 11:14:48 PM
if your users post a lot of images
this mod can help a bit by making the filesizes smaller
https://custom.simplemachines.org/mods/index.php?mod=4082

using a program to optimize all the images on your site itself can help
we're not talking much but in the end every little bit does help

you can sign up for cloudflare and use their services also to help save a bit
Really glad you mentioned images, as this is one of the main things my hosting provider suggested, so thanks for the mod link.  There are a lot of posts on my forum, so checking them manually would take too long...

Thank you. :)
Another SMF believer

ajac63

Quote from: Antechinus on February 18, 2021, 11:33:39 PM
Sounds like it's the spiders that are causing a lot of the problem.* Banning the bad ones via .htaccess is the way to go. That will nail the ones that ignore robots.txt. They'll still be able to ping your domain, but they won't get any bytes back (just a 403 error message).

The .htaccess for the root directory of my domain looks like this:

#BlockBotsByUserAgent
SetEnvIfNoCase User-Agent (Ahrefs|Baidu|BLEXBot|Brandwatch|DotBot|Garlik|Knowledge|libwww-perl|Linkdex|MJ12bot|omgili|PetalBot|Proximic|Semrush|Seznam|Sogou|Tweetmeme|Trendiction|Wordpress) bad_bot
<RequireAll>
Require all granted
Require not env bad_bot
</RequireAll>

<Files .htaccess>
order allow,deny
deny from all
</Files>

<Files 403.shtml>
order allow,deny
allow from all
</Files>


Adding more bots is easy. I only threw in the ones that were giving me trouble.

*Spiders will often index everything on your site, multiple times in succession, when they are on the rampage. This can chew masses of bandwidth, particularly if you have a lot of images and/or downloadable zips.
Thanks. :) Crawlers and spiders were the other thing my host mentioned, so this rings a bell.  Where exactly in my .htaccess file do I paste this code?  And should I also use a robots.txt file in the root directory - would I need both, or is it one or the other?
Another SMF believer

drewactual

Enable OPcache if you're not using it.  My processor usage plummeted the most, but the bandwidth followed.  Having the compiled scripts cached keeps the server from recompiling them on every request.
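
If you can edit php.ini (or a conf.d override), enabling it is only a few directives. A minimal sketch - these values are just common starting points, not SMF-specific tuning:

opcache.enable=1
opcache.memory_consumption=128
opcache.interned_strings_buffer=16
opcache.max_accelerated_files=10000
opcache.revalidate_freq=60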

Also... set up your expires headers and caches.  If you don't change the presentation of your forum much, you can cache the images, CSS, and other static items for as much as a year, so the user's browser doesn't have to make return trips for more data... In the same vein, if you're using HTTP/2 you can push everything at once if a visitor is new; and if they aren't, you can check in one call whether they have the file and NOT send it if they do.
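
For the expires side, mod_expires in .htaccess handles it. A rough sketch, assuming the module is enabled - the one-year lifetimes match the suggestion above:

<IfModule mod_expires.c>
# Static assets rarely change, so let browsers keep them for a year
ExpiresActive On
ExpiresByType image/png "access plus 1 year"
ExpiresByType image/jpeg "access plus 1 year"
ExpiresByType image/gif "access plus 1 year"
ExpiresByType text/css "access plus 1 year"
ExpiresByType application/javascript "access plus 1 year"
</IfModule>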

If you aren't using those and you then implement them, you're likely looking at 80% less processor usage and 50% less bandwidth consumed.

drewactual

And another thing... lose the CSS and JS check-for-updated-files function IF you're done adjusting function and presentation.  While you're at it, combine and minify all CSS into one file, and the same for JS.  Don't host any libraries you can get from a CDN - use their bandwidth instead of your own.  If you can host the images elsewhere, same thing.
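
For the CDN part, that just means pointing your templates at a public copy of the library instead of your own. For example (the jQuery version and the local path here are only illustrative):

<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
instead of
<script src="/Themes/default/scripts/jquery-3.6.0.min.js"></script>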

ajac63

OK, thanks for helping, but this code that Antechinus mentions:

#BlockBotsByUserAgent
SetEnvIfNoCase User-Agent (Ahrefs|Baidu|BLEXBot|Brandwatch|DotBot|Garlik|Knowledge|libwww-perl|Linkdex|MJ12bot|omgili|PetalBot|Proximic|Semrush|Seznam|Sogou|Tweetmeme|Trendiction|Wordpress) bad_bot
<RequireAll>
   Require all granted
   Require not env bad_bot
</RequireAll>

<Files .htaccess>
order allow,deny
deny from all
</Files>

<Files 403.shtml>
order allow,deny
allow from all
</Files>

Where in the .htaccess file do I put it?  And if I use this, do I also need a robots.txt file, and if so, does it go in the root directory or under public_html?

My htaccess file currently has this in it:

RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301,NE]
# php -- BEGIN cPanel-generated handler, do not edit
# This domain inherits the "PHP" package.
# php -- END cPanel-generated handler, do not edit

Also, how do I know which crawlers and spiders are using up bandwidth?
Another SMF believer

drewactual

drop it all in at the end.

robots.txt is a good thing, but... the bad bots couldn't care less whether it's there or not... they're going to do what they do regardless - only legitimate bots adhere to your request, which is basically what your robots.txt is - a request to robots.
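
for what it's worth, the syntax is trivial... a minimal robots.txt, sitting in the web root, is just a couple of lines (the path is only an example, and Crawl-delay is honored by some crawlers but not all):

User-agent: *
Crawl-delay: 10
Disallow: /attachments/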

you can look at the visitor logs on your server - if you're using cPanel there are generally several to choose from - which show you your bandwidth usage statistics, broken down by IP address...
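
if you'd rather work from a shell, a rough one-liner like this (assuming the standard Apache combined log format, and your own log path in place of access.log) totals the bytes served per user agent:

awk -F'"' '{split($3, s, " "); bytes[$6] += s[2]} END {for (ua in bytes) print bytes[ua], ua}' access.log | sort -rn | head -20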

a10

Had days with 100,000+ pageviews from spiders/bots/China. This curbed it:

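# Only requests carrying a query string are checked; any listed user agent then gets a 403 (F = forbidden)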
RewriteCond %{QUERY_STRING} .
RewriteCond %{HTTP_USER_AGENT} Ahrefs|Baiduspider|bingbot|BLEXBot|Grapeshot|heritrix|Kinza|LieBaoFast|Linguee|Mb2345Browser|MegaIndex|MicroMessenger|MJ12bot|PiplBot|Riddler|Seekport|SemanticScholarBot|SemrushBot|serpstatbot|Siteimprove.com|trendictionbot|UCBrowser|MQQBrowser|Vagabondo|AspiegelBot|zh_CN|OPPO\sA33|zh-CN|YandexBot [NC]
RewriteRule ^.* - [F,L]
2.0.19, php 8.0.23, MariaDB 10.5.15. Mods: Contact Page, Like Posts, Responsive Curve, Search Focus Dropdown, Add Join Date to Post.

ajac63

Quote from: drewactual on February 21, 2021, 12:28:38 PM
drop it all in at the end.

robots.txt is a good thing, but... the bad bots couldn't care less whether it's there or not... they're going to do what they do regardless - only legitimate bots adhere to your request, which is basically what your robots.txt is - a request to robots.

you can look at the visitor logs on your server - if you're using cPanel there are generally several to choose from - which show you your bandwidth usage statistics, broken down by IP address...
OK, thanks for clearing that up. :)  This suggests that the .htaccess method is better than the robots.txt one, which doesn't stop bad bots.  But what about 'good bots' that also take up a lot of bandwidth, such as indexing bots?  I suppose I shouldn't exclude those, but then how do I minimize the amount of bandwidth they use?  What would I put in the .htaccess file for this?  Lastly, do I need a robots.txt file at all?  As I understand it, people can find out what's in it just by entering mysitename.com/robots.txt, so it's not exactly secure...
Another SMF believer

ajac63

Quote from: a10 on February 21, 2021, 05:16:51 PM
Had days with 100,000+ pageviews from spiders/bots/China. This curbed it:

RewriteCond %{QUERY_STRING} .
RewriteCond %{HTTP_USER_AGENT} Ahrefs|Baiduspider|bingbot|BLEXBot|Grapeshot|heritrix|Kinza|LieBaoFast|Linguee|Mb2345Browser|MegaIndex|MicroMessenger|MJ12bot|PiplBot|Riddler|Seekport|SemanticScholarBot|SemrushBot|serpstatbot|Siteimprove.com|trendictionbot|UCBrowser|MQQBrowser|Vagabondo|AspiegelBot|zh_CN|OPPO\sA33|zh-CN|YandexBot [NC]
RewriteRule ^.* - [F,L]

Thank you - should I add this after the code by Antechinus, or instead of it?
Another SMF believer

a10

^^^ I'd guess it's best to use only one; maybe test them for a period and look for which gives the best results. Was just sharing what worked very well on my forum. Let's see if Antechinus can give an opinion about any practical differences between the two, apart from the different lists of bots.
2.0.19, php 8.0.23, MariaDB 10.5.15. Mods: Contact Page, Like Posts, Responsive Curve, Search Focus Dropdown, Add Join Date to Post.

ajac63

Quote from: a10 on February 22, 2021, 06:13:00 AM
^^^ I'd guess it's best to use only one; maybe test them for a period and look for which gives the best results. Was just sharing what worked very well on my forum. Let's see if Antechinus can give an opinion about any practical differences between the two, apart from the different lists of bots.
Yes, like that - makes sense.  Anyway, soon comes the moment of truth; if I use this .htaccess method plus image optimization, there should be a big reduction in the bandwidth consumption rate?
Another SMF believer

ajac63

Sorry to bump the thread, but...  Right, I put my forum back online today after it had been in maintenance mode, and edited my .htaccess file with the anti-bot code suggested by Antechinus, so that's all good.
I've noticed, though, that the forum stats at the bottom are saying 'Most Online Today: 81', which is a bit odd seeing as the forum has only been online for about an hour at the time of posting.  Confused... ???
Another SMF believer

shadav

is your forum in the main public folder, or is it in a subfolder?

though you are bouncing bad bots off the site with .htaccess, SMF may still register that they were technically online for half a second....

if your forum is in a subfolder, try putting the code in the .htaccess of the main public folder

ajac63

Thanks for your help.  It's definitely in the main public folder - that's where index.php and index.php~ are, as are .htaccess and the settings file.  The theme files are in a subfolder.
Another SMF believer

Aleksi "Lex" Kilpinen

The .htaccess should catch everything before it gets to SMF, but it could be that you just had a wave of crawlers that were not included in the .htaccess rules. Some "search crawlers" can really be a nuisance, and may appear in numbers reaching a hundred in a small timeframe.
Slava Ukraini!


"Before you allow people access to your forum, especially in an administrative position, you must be aware that that person can seriously damage your forum. Therefore, you should only allow people that you trust, implicitly, to have such access." -Douglas

How you can help SMF

ajac63

Quote from: Aleksi "Lex" Kilpinen on March 02, 2021, 12:07:40 AM
The .htaccess should catch everything before it gets to SMF, but it could be that you just had a wave of crawlers that were not included in the .htaccess rules. Some "search crawlers" can really be a nuisance, and may appear in numbers reaching a hundred in a small timeframe.
So some of those 81 visitors could have been crawlers, hmmm.  But then again, some could have been genuine...  I'll do a search for more bad crawlers and add those.  Thanks for the heads-up!  I've already just gone through one army of bots; I don't want another.
Another SMF believer

ajac63

Just a mini update...  The bandwidth issue I had has definitely cleared up, and usage has been very stable for about two weeks now, so thanks once again to everyone who helped in this thread.   :)
Another SMF believer
