SMF Database Error! emails started when 2.0.4 upgraded from 2.0.3

MrPhil · March 05, 2013, 01:32:30 PM

Quote from: Vince S on March 04, 2013, 01:26:23 AM
All the statements made simply CANNOT be true so something else is going on.

Chill out, @Arantor. I didn't take that statement as calling you a liar; I took it as "there is a logical contradiction here, so maybe we're still missing some information". Perhaps not stated in the most gracious or unambiguous manner, but not defaming you either.

I suggest that you brew yourself a nice cup of tea and sit by a window watching the world outside for a while, to decompress. Your technical knowledge is quite valuable; however, you're liable to fly off the handle at even innocent remarks. Most people seem to agree that often makes your input quite unpleasant.

Arantor · March 05, 2013, 01:43:34 PM

I might chill out if I didn't have the perpetual feeling that I was thoroughly wasting my time. When I have to explain something multiple times and it's still not being taken onboard, then I clearly have wasted my time (note there is a difference between not listening and not understanding and in almost every case the difference is rather evident)

I don't know about you, but my time is valuable to me. I only have so much time per day, and yet I choose to spend some of that time here, trying to help people. I don't like knowing that I'm wasting my time because people aren't spending time in listening to what I have to say, because it makes a mockery of trying to do a good deed, like in Wicked!, I guess it really is true that no good deed goes unpunished.

Though, I would ask you to reflect on what you're saying and what you think the effect will be. You're no doubt trying to suggest I moderate my behaviour. It's patronising and a little insulting. I don't tell you how to do what you do. I don't tell you how to spend your free time trying to help people.

Still, if my giving input is often quite unpleasant, I might as well stop providing it, no?

Kindred · March 05, 2013, 01:45:27 PM

ok folks... I think that's enough on that matter.

Yup, Arantor is a grumpy old man at 29.

Yes, Arantor knows his stuff.
Yes, people tend to ignore him, thinking they know better (and they are sometimes right and sometimes wrong)

Until there is more information and you have something to actually disprove Arantor's analysis (since he wrote the bloody code), I'd ask that this thread just be left alone (and we stop on the topic of how grumpy arantor can get or how stubborn some users are.)

Vince S · September 17, 2013, 09:39:20 AM

OK. Since this problem hasn't gone away, I am picking it up again. I have been collecting these error reports, and over 2,000 have come in over the 8 or so months. SMF 2.0.5 made no difference (and was not expected to). Attached is a graph I have produced from Excel. The frequency of database errors is around 10 most days and peaks at 50 sometimes, and a check of the data shows there were 27 days with zero errors. It will give Arantor great fodder to smirk knowingly into his coffee about shared hosts, but I suspect that this time something has been missed in his logic and I am asking for assistance to find it.

It is easy to take a cheap shot at shared hosts and label them all as one, but my experience has been SiteGround is way above the crowd and, whilst stuff may slip through intermittently for them they will be all over it when they realise it is an issue. To me it is looking more like there is something about how SMF is checking for the lost SQL connection that is triggering a false positive than it is an actual substantive lost connection. This is what SiteGround say and, rather than a dismissive style of response, could I please ask for straight shooting suggestions on what they might do to find where the disconnect (or whatever) is?

First and foremost I would like to assure you that we take such issues very seriously. We understand that our job is to provide stable service for all our clients and this is very important for you.

Regarding the issue in question. We cannot really help with a specific third party software if for some reason it fails to connect to the database. I understand this does not sound very helpful, but I have personally checked the server statistics for the last month and the MySQL server has 100% uptime. The average load for the server is below 1 (0.76), there are no signs of server overload and no service downtime. The suggestions from the SMF team members that you pasted in your previous reply might be true for another host but this is definitely not the case here.

Additionally I have personally checked that there are no users on this server that are in any way abusing the service. Not only the MySQL server.

I understand SMF is a pretty commonly used message board, but if you went through their answers, you will always see "talk to your host, this is not our problem".

I am now looking into the settings for your SMF forum. I can see you have enabled the sending database errors functionality for SMF but according to the configuration file, the last error was recorded on:

Code: UNIX time 1373001742 is 07/05/2013 5:22am GMT.
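For reference, the timestamp in that reply can be checked by hand. A minimal Python sketch follows; the helper name `unix_to_utc` is made up for illustration and has nothing to do with SMF or the host's tooling.

```python
from datetime import datetime, timezone


def unix_to_utc(timestamp: int) -> str:
    """Format a UNIX timestamp as an unambiguous UTC date and time."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


if __name__ == "__main__":
    # The value quoted from the configuration file in the host's reply.
    print(unix_to_utc(1373001742))  # 2013-07-05 05:22:22 UTC
```

So "07/05/2013" in the reply reads month first: 5 July 2013, not 7 May.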

Can you provide us with more information how the database connection is checked?

Of course we are more than willing to assist you regarding this matter. Currently the problem is that there is no obvious way to determine what is causing the database error. There are no records for MySQL errors. There are no records for your PHP or MySQL processes being killed so this is not the case either. This is why we will need a way to recreate the problem or at least see a definite log why the issue occurs.

Arantor · September 17, 2013, 10:11:22 AM

Firstly, I don't drink coffee, can't stand the stuff. I also don't like the insinuation that I shrug off things as being server related when it's an application problem; this is most definitely not an SMF level issue.

Secondly, this doesn't actually change anything at all. The MySQL protocol is very simple on this point: the connection is stored in a PHP variable of type Resource. (At least, it is on ext/mysql systems; it's type Object for ext/mysqli.)

All that happens is, as documented, the variable is checked as being a Resource before it is used. This is a variable maintained internally by PHP and MySQL (i.e. all SMF does is create it using the proper MySQL methods and the ext/mysql connector does the rest, be that via libmysqlclient or mysqlnd)

The code is quite clear on this point as shown in the 2.0.3 to 2.0.4 update:

Code:

	if (!is_resource($connection))
		db_fatal_error();

That's all it is. is_resource is a built-in PHP function for type checking.

The reality hasn't changed, it was failing just as much as it was before, only this time you know about it and you're seeing errors about it that you wouldn't have seen before (because SMF is now smarter about telling you there's a problem than it was before)

MySQL's 'uptime of 100%' is absolutely normal even in these situations because it's not the daemon that's failing, the connection is dropping, and there are multiple reasons why that might be, including but not limited to a bug in any of the dependency libraries, a bug in MySQL itself, or simply connection pooling gone wrong.

So, there really is nothing we can tell you. PHP is never told why the connection has dropped, merely that it has, as such SMF can't do any more than it is actually doing.

Kindred · September 17, 2013, 11:24:35 AM

and let me just say.... I have been running SMF sites on a shared host for 6+ years now.
I occasionally get the db error notification, but rarely... heck, I even get the notification on the one site which runs on a dedicated server

So, it's not necessarily "shared hosts", but rather something in the configuration of your shared host which is causing the connection to fail (as Arantor has already explained, above). We can't be more specific than that...

cookiesandcreamuk · September 17, 2013, 07:21:45 PM

We used to get quite a lot of these emails and couldn't understand them, so we set up a dummy SMF forum and invited ten people to it. In the end we found that if someone accidentally puts in the incorrect password, it sends one of these. There was nothing in the error logs, so we just unchecked the box in the administrator where it says send email on SMF error.

and they stopped.....

Arantor · September 17, 2013, 07:22:20 PM

-sigh- I love how people fail to read what's actually been written and just assume everything's all wonderful when it really isn't.

Vince S · September 17, 2013, 08:51:14 PM

I have had this issue elevated to the Technical Support Manager level and the guy is obviously taking it seriously. In response to his earlier (below listed) response I collated an accurate definition of the problem as provided by Arantor earlier in this thread where he comprehensively and very usefully (thank you) talks about the reasons the change was made in 2.0.4. This is the response to that explanation:

What I wanted to know is what functionality you are using in order to get the database down emails and if you are using the SMF functionality for it. You have now clarified for me that this is indeed the case.

Regarding the issue in question. As you understand each MySQL server would have its certain limitations which are set exactly in order to prevent server overload, single user from abusing the service etc. If the MySQL connection from SMF reaches such limit, while idling it will be simply closed. The MySQL server would assume it is time to close the connection, hence when later checking if it is up, the report error will occur.

Those limits are set pretty high on our servers though. The connect_timeout, wait_timeout and delayed_insert_timeout options are all set to 10 seconds which is way more than enough for each normal operation. Those are all related to sleeping processes and the MySQL server simply lets those connection drop. I believe this is what is causing the problem with the MySQL down reports, while in the same time, there is no actual problem with the message board usability and the MySQL server.
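The behaviour described above, where the server closes a connection that has sat idle past its limit and the client only finds out afterwards, can be modelled in a few lines. This is a toy Python simulation, not MySQL or SMF code: the `IdleConnection` class and its fields are invented for illustration, and only the 10 second figure comes from the reply above. Of the three settings named, `wait_timeout` is the one that governs how long an idle connection is kept.

```python
from dataclasses import dataclass


@dataclass
class IdleConnection:
    """Toy model of a server-side connection with an idle limit."""

    wait_timeout: float   # seconds of inactivity the server tolerates
    last_activity: float  # time of the most recent query, in seconds

    def is_open(self, now: float) -> bool:
        # The server drops the connection once it has idled too long.
        return (now - self.last_activity) <= self.wait_timeout

    def query(self, now: float) -> str:
        if not self.is_open(now):
            return "connection lost"
        self.last_activity = now  # activity resets the idle clock
        return "ok"


if __name__ == "__main__":
    conn = IdleConnection(wait_timeout=10.0, last_activity=0.0)
    print(conn.query(now=4.0))   # ok: only 4 s idle
    print(conn.query(now=13.0))  # ok: 9 s since the last query
    print(conn.query(now=30.0))  # connection lost: 17 s idle
```

The model only shows the idle case the host describes; it says nothing about whether that is what actually happens on this server.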

What I would generally suggest you is to disable the mail notifications as a whole instead of trying to find a solution for a non-existing problem.

I then provided Arantor's further explanation to them (the one immediately following my previous post) and that generated this response from SiteGround:

"Yes, this reply does not really tell anything else besides what we already know."

I read from the above that the way the test is done for a dropped connection may not be a complete / entirely valid way to check for the condition as it may be detecting "doesn't matter" disconnects, as opposed to those "in the moment of servicing the request" style of errors that would lead to the infinite error reporting loops. Is there a way to put say a 2 second time limit on reporting the error as this would still prevent the infinite loop error log issue without then generating this unwanted but possibly designed in "sleep" type of connection dropping error? Over to the brains trust on that one! Thank you.

Arantor · September 17, 2013, 09:10:37 PM

Interesting, and also slightly misleading (and more than a little disingenuous towards me and SMF, but we'll let that slide)

The situation is simple: the connection is dropping. A two second wait isn't going to magically fix anything, the connection will still be dropped at the end of it. And then you'll be screwing the server up even more because all those scripts will then be backing up everything else on the server (since it's all ultimately queued up everywhere) waiting for those two seconds.

Yes, in theory, you could reconnect if the resource is gone. The reason I specifically did not do that is because you end up causing even more troubles for the server.

Your host is also a little misleading. The timeouts are set high, sure, and that's a good thing. The fact you have enough queries jammed up to cause disconnects is a sign that there is something very wrong somewhere, and your host doesn't know (or care) enough to spend any time diagnosing what's actually causing it by, say, looking in the slow query log. At the typical pricing of SiteGround (which is less per month than even the lowest paid employee gets per hour), it seems hardly surprising that they're not going to investigate too thoroughly.

You're inferring that 'the dropped connection' test is not valid. I hate to break it to you but there really is no better way of doing it with ext/mysql which is what SMF 1.1 and SMF 2.0 use. Actually, there is one slightly more accurate way, and that's to actually hammer the server with queries to check it's still alive but as you can imagine that's not ideal for performance. (And thus, not better)

The check is made before a query is fired, if the connection is already *known* to be false, there is no point in trying to execute a query because it's also going to be false (and lead to the infinite loop scenario in the first place)
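The reasoning in the paragraph above (check the handle first, and never fire a query on a connection already known to be bad) can be sketched as follows. This is a hedged Python illustration of the pattern only. SMF itself is PHP, and the names `FakeConnection`, `FatalDatabaseError` and `run_query` are hypothetical, not SMF functions.

```python
class FatalDatabaseError(Exception):
    """Raised when the connection handle is already known to be unusable."""


class FakeConnection:
    """Hypothetical stand-in for a database driver's connection handle."""

    def __init__(self) -> None:
        self.executed = []

    def execute(self, sql: str) -> str:
        self.executed.append(sql)
        return "ok"


def run_query(connection, sql: str) -> str:
    # Guard: if the handle is not a live connection object, stop here.
    # Firing the query anyway would also fail, and any error handler that
    # itself needs the database would fail again in turn.
    if not isinstance(connection, FakeConnection):
        raise FatalDatabaseError("connection handle is not valid")
    return connection.execute(sql)


if __name__ == "__main__":
    print(run_query(FakeConnection(), "SELECT 1"))  # ok
    try:
        run_query(False, "SELECT 1")  # a failed connect left False behind
    except FatalDatabaseError as error:
        print("fatal:", error)  # fatal: connection handle is not valid
```

The guard fails once and stops, which is the point of checking before the query rather than after it.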

Notice what they're suggesting for you to do: stop the mail notifications. As I explained, at length, this doesn't prevent there being a problem. It just prevents you being told about them. The user who had that page up will still see the 'there has been a database error' screen. They're still getting shafted, the server is still acting up, but you won't have any emails to tell you there's a problem.

I like how the host is effectively patting you on the head and telling you not to worry about it. There is a problem, just you won't see the symptoms of it. If you do that, fine, you can sleep soundly at night knowing there are no emails. It won't change that users get bad pages, but you won't see it so it won't matter, right?

Vince S · September 17, 2013, 10:41:45 PM

Quote from: Arantor on September 17, 2013, 09:10:37 PM
Interesting, and also slightly misleading (and more than a little disingenuous towards me and SMF, but we'll let that slide)

Regardless of how you take it (which of course I can't control) I have NO INTENTION to be disingenuous (you do know this means deceitful / dishonest?) at any level or in any way to anybody or SMF in general. That might sound like a clever thing to write but it is totally unhelpful to anything that is going on here. You need to stop this kinda shyte, please - it so detracts from your otherwise excellent content and contributions. And it is so easy to reflect back on the sender, but we don't need to go there either....!

Quote from: Arantor on September 17, 2013, 09:10:37 PM
At the typical pricing of SiteGround ... it seems hardly surprising that they're not going to investigate too thoroughly.

Whilst this style of thinking can reflect the truth, it is not borne out by a reasoned appraisal of the situation. It looks more like cheap mud slinging that makes the slinger feel more comfortable. There are deep and real reasons that the specific host concerned here may actually care a lot about sorting issues of substance (as this issue is properly alleged to be) and they actually have the capacity to do so even within their seemingly low budget operations. These people have gone to extraordinary lengths in their business planning over the years and been very successful in driving up efficiency and market penetration, and frankly that is easy to see and is a pretty clear counter to the "oh they don't really give a stuff no matter what they say" more simplistic view.

Quote from: Arantor on September 17, 2013, 09:10:37 PM
... I like how the host is effectively patting you on the head and telling you not to worry about it.

I absolutely "get" how this kind of response is both predictable and an available host side cop out. But to stuff up this kind of thing so publicly for so long is simply not in their best interests. An unbiased non-technical helicopter view of the ACTUAL situation here would be more likely to conclude the core issue is related to SMF side issues than SiteGround.

I am not saying ANY OF the above to be controversial or make problems, simply pointing out we are missing something important and consequently nowhere near a viable answer (apart from the head-in-sand turn off error reporting which is looking more attractive by the minute!). The troubleshooting efforts, while very helpfully anchored to reality, are still happy to meander off into postulations of equally improbable scenarios as are decreed to be the bailiwick of the apparently-scum-like shared host. Both perspectives are as bad as each other, and sitting on one's digs and howling at the moon is not an answer for anyone.

Back on topic. Please note any of my suggestions are just that, they are not anchored in deep knowledge and only intended to encourage looking at the problem/s from a different perspective than the current Mexican stand-off we seem to have arrived at. I am not enjoying the undelectable role of "meat" in this indigestible sandwich, but it seems to be mine to wallow in. If I ask the host to go "looking in the slow query log" is that likely to get something useful? Is this suggestion predicated on an assumption that it is IMPOSSIBLE for the problem to be SMF-side, despite the host saying they think it is a time-out error and is driven by an SMF-side coding inadequacy? If so is that real? Or would we even know?

I don't know the answers to any of these q's of course. I am just trying to stimulate other-than-mono-syllabic responses, however eloquently they may be expressed, to actually take this topic in the direction of a real answer. But being the meat here I do so expect I will get further chewed on, regardless of good intentions. So let's have it then.....

Arantor · September 17, 2013, 11:13:44 PM

Quote:
Regardless of how you take it (which of course I can't control) I have NO INTENTION to be disingenuous (you do know this means deceitful / dishonest?) at any level or in any way to anybody or SMF in general. That might sound like a clever thing to write but it is totally unhelpful to anything that is going on here.

The host is blaming it on SMF, when it's not SMF's fault. There is absolutely nothing more SMF could do to fix this than has already been done. In fact, SMF goes out of its way on this point to avoid causing the host more trouble, and yet the host blames SMF (when it's funny how other hosts don't generate the same problem)

Quote:
There are deep and real reasons that the specific host concerned here may actually care a lot about sorting issues of substance

Which is why the host is not actually investigating and telling you to take the easy route rather than the proper route.

Quote:
These people have gone to extraordinary lengths in their business planning over the years and been very successful in driving up efficiency and market penetration

Efficiency is not one of the terms I would ascribe to any of the shared hosting providers. For example, how many other sites are running on the same server as yours? It's a testament to efficiency that you don't even notice how many tens or even hundreds of other sites share the same server as yours. It's efficiency, to the point where it's practically a miracle they don't screw each other up every day...

Quote:
oh they don't really give a stuff no matter what they say

Except I'm going off what you're telling me that they're telling you, and they're telling you not to worry about it because 'there isn't a problem'. Clearly there is otherwise you wouldn't be having failed database connections on any kind of regular basis. A couple per *year* might be understandable, more if your site were busy. But the rate you're seeing them is because the server is more loaded up than it should be.

Consider, they were telling you that they had timeouts of 10 seconds. That's 10 seconds to allow something to clear the queue... the fact you don't wait 10 seconds normally for each page means it's clearing the queue in that time. The fact that something is jamming the server up so badly should be an indication of something very wrong.

Quote:
But to stuff up this kind of thing so publicly for so long is simply not in their best interests. An unbiased non-technical helicopter view of the ACTUAL situation here would be more likely to conclude the core issue is related to SMF side issues than SiteGround.

And you'd be wrong. What more can I tell you? I've told you what the cause is, I've told you why it's happening, even SiteGround actually quietly admits the same thing as I do?

Oh, wait, I know what you want me to tell you. You want me to tell you that SMF's at fault, unfortunately neither I nor anyone in the team here can do that because *it's not at fault*. A non-technical view might conclude that, just as a non-technical view might conclude the sun goes around the earth because to all non-technical views, that is indeed how it appears.

Quote:
I am not saying ANY OF the above to be controversial or make problems, simply pointing out we are missing something important and consequently nowhere near a viable answer (apart from the head-in-sand turn off error reporting which is looking more attractive by the minute!).

There is a very simple, viable answer. Go to a host that doesn't run hundreds of websites to a server.

Here's the thing... there are only a handful of people who have reported this issue as occurring with any regularity over the last two years. All of them on heavily run shared hosting. As opposed to the *thousands* of active SMF installations of varying sizes (even the 52 million post site that runs on a single server), who don't have this issue.

You spoke of an unbiased non-technical helicopter view. How does a problem that affects a few people all who have one thing in common (heavily crammed hosts) point to the software having a fault when the software is running on many, many, many different hosting configurations without the same problem manifesting? Correlation is not causation, you're attributing the fault to SMF because SMF's flagging up there's a problem and the host is denying all knowledge of it - SiteGround certainly wouldn't be the first to do so.

Quote:
Back on topic. Please note any of my suggestions are just that, they are not anchored in deep knowledge and only intended to encourage looking at the problem/s from a different perspective than the current Mexican stand-off we seem to have arrived at.

There's no problem from my side. It's only too clear what the problem is. The only person standing off is you because you won't accept that your host might not be giving you the full story.

Quote:
If I ask the host to go "looking in the slow query log" is that likely to get something useful?

Not really.

Quote:
Is this suggestion predicated on an assumption that it is IMPOSSIBLE for the problem to be SMF-side, despite the host saying they think it is a time-out error and is driven by an SMF-side coding inadequacy? If so is that real? Or would we even know?

It is a timeout error - and the timeout causes MySQL to drop the connection, simple as that. And it times out because there is a vast number of queries all running from hundreds of sites on a single server. SiteGround aren't nearly the worst (GoDaddy has been known to host in excess of a couple of thousand sites on a single server) but they aren't running the sorts of contention rates that are ideal.

Just think for a minute... even 100 sites, running software issuing 10 queries a second (which is low, in the scheme of things), means 1000 queries per second hitting the database. It only needs a handful of those - from sites that aren't even yours - to gum up the works for everyone else.

Quote:
I am just trying to stimulate other-than-mono-syllabic responses, however eloquently they may be expressed, to actually take this topic in the direction of a real answer. But being the meat here I do so expect I will get further chewed on, regardless of good intentions. So let's have it then.....

I'd like to think I'm a little more eloquent than throwing out monosyllabic responses. It's not my fault that you're unwilling to accept what I've told you and what I've shown you. After all, I'm just some guy on the internet, versus a corporation who piles 'em high and sells 'em cheap, they're obviously more reputable.

You're already seeing too many correlations implying causations - you assumed off the bat that it was the upgrade to 2.0.4 that caused the issue (which it wasn't), and that it was suddenly a new issue in 2.0.4 (which it wasn't) and that pretending an error isn't happening is going to make it all better (it won't).

Here's the deal, then. You have three options:
1. Turn the emails off. You won't get emailed, the world will continue but some users of your site, and/or search engines will continue to see database error messages as pointing to something very wrong with your site. But you won't get emails.

2. Change hosts. You'll probably pay a bit more, odds are things will be a little faster and the errors won't keep happening.

3. Move software. Of course it won't happen then either. (It will, just most of the other systems won't bother to tell you there's been such a failure either. So, much the same as 1. really)

I'm done at this point. There really is nothing more I can say or do to move this along. I suggest one of the other SMF code wizards steps in at this point. Maybe they can explain it to you in a way that you'll accept.

Cyberhost · September 18, 2013, 03:56:06 PM

You can turn the emails off here: Admin > Configuration > Server Settings > General > Database

Vince S · September 20, 2013, 01:04:48 AM

Hmm, this could be tricky. A seasoned professional is providing a credible sounding rebuttal to Arantor's explanation, in essence suggesting the way the error checking routine has been changed is amateurish and needs to be done properly. Because I recognise that it is NOT good form to pass people's posts on without their direct permission I am removing ID info from it, but the content is highly valid and will hopefully lead to a better solution to this problem than anything that is currently on the slab. I also want to say that other (unpublished) aspects of how this particular gentleman has dealt with me leave me with the highest regard for how he has handled my queries, and I have nothing but total respect for this person. To put his response in context I will first provide my latest question:

G'day *****, if you don't mind I would like to try your patience a little more please. I have put your logical assessment of the situation to the SMF forum and rec'd a very reasoned detailed, if a little tedious, response which can be viewed here: http://www.simplemachines.org/community/index.php?topic=497039.msg3610086#msg3610086

The total thrust of their programmer's explanation is that indeed the query is being lost on time-out as the sql handler is not getting to the respective query due to other prior traffic from other sites and indeed the error report is correctly informing of a real issue. He notes other forum software drop these errors and allow the problem to exist, he believes SMF's approach is more correct. The basic statement is that this outcome is a standard reality of shared hosts everywhere and that you guys would know and expect it and have little incentive to rectify it as there is not truly a way to do so within the economic business case that allows the whole shared host phenomenon to exist in the first case.

If all that is (basically) true then acceptance is my best choice, turn off the error reporting and we are done! But I am the kind of guy that likes to take a proper look at things and try and lift the game for the benefit of my fellow man. So I am pestering you to ask if the real thing here is to look at more action that may identify something like a whole bunch of pressure needlessly being put on the sql server (or whatever) that can be filtered / changed host side? Or is it as it appears and needs to slip through to the keeper since any shared host level improvement would be impractical? Thank you.

The response:

This is not really a matter of trying out our patience but those SMF person claims are, to say the least annoying. Blaming literally every host out there (as you know this issue was reported by users on many different hosting companies) that their particular piece of software reports bogus error messages is cute but not really constructive in any way. I do not wish to argue with him and I also believe there is nothing much more that can be said regarding this matter.

Still, to address the major points made in the particular post:

The host is blaming it on SMF, when it's not SMF's fault. There is absolutely nothing more SMF could do to fix this than has already been done. In fact, SMF goes out of its way on this point to avoid causing the host more trouble, and yet the host blames SMF (when it's funny how other hosts don't generate the same problem)

Actually I believe there is something more that can be done. The SMF database connection checks can be rewritten in a way that they do not remain idle, thus not reaching the mysql timeout and not getting the error at all. An idling connection will be left by the MySQL server to time out as I previously explained. If this occurs ... well there is nothing we could do.

It is a timeout error - and the timeout causes MySQL to drop the connection, simple as that.

It is simple as that indeed. However, this:

And it times out because there is a vast number of queries all running from hundreds of sites on a single server.

is not the cause for the timeout. I really cannot stress this more. There is no problem with the MySQL service, there is an issue with the way those checks are ran. And the following statement:

Just think for a minute... even 100 sites, running software issuing 10 queries a second (which is low, in the scheme of things), means 1000 queries per second hitting the database. It only needs a handful of those - from sites that aren't even yours - to gum up the works for everyone else.

simply shows how little (and I am tempted to say none at all) this person knows about web hosting services. Stating 1000 queries per second will be problematic is ridiculous.

Granted you are not running a complex query there can be up to 100,000 queries. Of course this depends on the type of query being run, but we are closely monitoring slow queries on our servers in order to prevent users with many slow queries from affecting the performance of other users on the same server. I am unsure how familiar you are with our shared hosting solution (and frankly I am not sure if this was promoted with all technicalities somewhere by our marketing department) but on our shared servers the MySQL is hosted on a separate standalone SAS disc, additionally boosting its performance.

Anyway, we are not even close to those values on our shared servers as despite running a lot of servers, we always make sure to have more than enough system resources so that all websites can run smoothly and without any problems. To give you some statistics - for the last 24 hours in the most busy hour (approximately 3 times more database queries than normally) there were 1786584 queries to the databases on the server, that averages less than 500 per second. This was at the point when we were running our daily database backups for all databases on the server. No database connection errors were recorded during this period.
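The arithmetic behind that figure is easy to check. A small Python sketch, using only the numbers quoted in the reply (the function name is an assumption for illustration):

```python
def average_per_second(total_queries: int, window_seconds: int) -> float:
    """Mean query rate over a window; says nothing about short bursts."""
    return total_queries / window_seconds


if __name__ == "__main__":
    # 1786584 queries in the busiest hour, as quoted by the host.
    busiest_hour = average_per_second(1_786_584, 60 * 60)
    print(round(busiest_hour, 1))  # 496.3
```

That is an hourly mean, so it matches "less than 500 per second" but cannot show whether the load arrived evenly or in spikes.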

I would like to further explain something about the database error reported by the SMF software.

If a visitor to your website sees such an error (regardless of whether it is a regular visitor, search engine web crawler or a request made from another server) this error will generate an HTTP response code. The particular error for SMF (unable to connect to database) returns Error 503 - Service Temporarily Unavailable. If you check the statistics for your domain name via cPanel -> Awstats you will see this error code was returned exactly 4 times since the beginning of September. This does not even indicate necessarily an error for the forum. Those are all 503 response codes that were received for your hunterdogs.org domain name since the beginning of September.

Finally, I would like to summarize, including some information about the limits I provided in a previous reply. As I mentioned, we have set the timeout settings for MySQL to 10 seconds. This is obviously at times not enough for the checks that are run to go through. Note that when a connection is dropped, software that uses a MySQL database will simply re-establish it the next time you try to load a resource. The check that is run does not re-establish the connection. It simply checks whether the connection is active, and since it isn't, we get this error. We won't increase the timeout values for MySQL on our shared servers; they are high enough not to cause problems for software run on our servers and at the same time low enough to prevent an excessive number of sleeping or idle connections.

That is pretty much everything I can think of now that is worth mentioning.

One more thing I would like to mention. This ticket has become too long and complex, and it is not really related to a server-side issue that we can help with. Sending it back to our technical support might cause confusion, and it would take them too much time to get familiar with the issue and respond to you (especially considering that multiple, quite lengthy forum posts must be reviewed). Thus I am closing this ticket. If you wish to discuss anything further or require any input, please contact me directly via mail: ********@siteground.com. I will do my best to respond as fast as possible.

Kind Regards,

********
Technical Support Manager

I particularly like the bit where a check of the actual error logs says that in September our site experienced 4 actual real true errors (not necessarily on SMF-related pages), but the error reports received from SMF numbered a bit over 100 in the same period! That is prima facie evidence that things are not as simple as previously explained. Let me postulate another theory: what if every browsing session that is closed by the user leaving doesn't close the connection at the end of their last query, which effectively times out because they weren't there to look at it, and SQL reports to SMF an error of a dropped connection - because the user has pfaffed off in the meantime? This is just me guessing - I know I don't know - but I am looking for logic where everybody gets to be right, something additional is done and the real problem goes away. Unbelievably, the real problem actually always was and still is the wall of emails, despite what some may wish to believe!

Kindred · September 20, 2013, 07:41:33 AM

Quote from: Kindred on September 17, 2013, 11:24:35 AM
and let me just say.... I have been running SMF sites on a shared host for 6+ years now.
I occasionally get the db error notification, but rarely... heck, I even get the notification on the one site which runs on a dedicated server

So, it's not necessarily "shared hosts", but rather something in the configuration of your shared host which is causing the connection to fail (as Arantor has already explained, above). We can't be more specific than that...

Arantor · September 20, 2013, 10:35:32 AM

Quote
The SMF database connection checks can be rewritten in a way that they do not remain idle, thus not reaching the mysql timeout and not getting the error at all. An idling connection will be left by the MySQL server to time out as I previously explained. If this occurs ... well there is nothing we could do.

Ask the host what he would suggest, exactly. There are precisely two methods in ext/mysql for checking the connection. One is to check the connection resource, the other is to actually fire a query at the database server and hope for the best. And then reconnecting in the event of a failure? That's going to make it so much worse.

Your host is still misrepresenting the problem. If PHP sends a query to MySQL and the server doesn't respond due to a timeout, there is a specific error relating to that, the infamous "MySQL has gone away" error. That is not what is being triggered here.

Quote
simply shows how little (and I am tempted to say none at all) this person knows about web hosting services. Stating 1000 queries per second will be problematic is ridiculous.

Those figures weren't meant to be literal.

Quote
To give you some statistics - for the last 24 hours in the most busy hour (approximately 3 times more database queries than normally) there were 1786584 queries to the databases on the server, that averages less than 500 per second. This was at the point when we were running our daily database backups for all databases on the server. No database connection errors were recorded during this period.

So, let's get this straight: roughly 1.79m queries in one hour. That is 29776 queries per minute, or 496 per second.

And 1000 per second, about double that, couldn't be problematic? Of course it could be!
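The arithmetic behind those figures can be checked with a few lines of PHP. This is only a sketch; the single input is the 1786584 figure quoted from the host, and the variable names are made up for illustration.

```php
<?php
// Query count for the busiest hour, as quoted from the host's reply.
$queriesInBusiestHour = 1786584;

// Average rates over that hour.
$perMinute = $queriesInBusiestHour / 60;
$perSecond = $queriesInBusiestHour / 3600;

// How the "1000 per second" figure compares with that hourly average.
$ratio = 1000 / $perSecond;

printf("%d per minute, %d per second\n", floor($perMinute), floor($perSecond));
printf("1000 per second is about %.1f times the busiest-hour average\n", $ratio);
```

Note that this is an hourly average; it says nothing about short bursts within the hour, which could be well above or below it.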

Quote
Note that when a connection is dropped software that uses MySQL database will simply re-establish it the next time you try to load a resource.

No, it won't, that's the point. If a connection is dropped, the software specifically has to re-connect. Reconnecting in the event of a drop, however, is actually a bad idea (and as I said, specifically not done in that code) because if you do have a connection failure, proceeding to try to reconnect will actually generate even more load on the server than gracefully exiting.
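To make the load argument concrete, here is a toy simulation. It is not SMF's code: FakeDownServer, both handler functions, the 100 requests and the 3 retries are all hypothetical, chosen only to count connection attempts against a server that is refusing connections.

```php
<?php
// Hypothetical stand-in for a database server that is currently impaired.
// It does nothing except count the connection attempts it receives.
final class FakeDownServer
{
    public int $attempts = 0;

    public function connect(): bool
    {
        $this->attempts++;
        return false; // every attempt fails while the server is down
    }
}

// Fail fast: one attempt per page request, then exit gracefully.
function handleRequestFailFast(FakeDownServer $server): bool
{
    return $server->connect();
}

// Retry: one attempt plus up to $maxRetries further attempts per page request.
function handleRequestWithRetry(FakeDownServer $server, int $maxRetries): bool
{
    for ($i = 0; $i <= $maxRetries; $i++) {
        if ($server->connect()) {
            return true;
        }
    }
    return false;
}

$requests = 100; // assumed number of page requests arriving during the outage

$failFast = new FakeDownServer();
for ($r = 0; $r < $requests; $r++) {
    handleRequestFailFast($failFast);
}

$retrying = new FakeDownServer();
for ($r = 0; $r < $requests; $r++) {
    handleRequestWithRetry($retrying, 3);
}

echo "fail fast:  {$failFast->attempts} connection attempts\n";
echo "with retry: {$retrying->attempts} connection attempts\n";
```

With these assumed numbers the retrying variant makes four times as many connection attempts against a server that is already struggling, which is the extra load being described.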

Quote
Let me postulate another theory, what if every browsing session that is closed by the user leaving doesn't close the connection for the end of their last query which effectively times out because they weren't there to look at it and SQL reports to SMF an error of a dropped connection

It's possible, though unlikely. For that to be the case you'd have to have specifically requested persistent connections in the database area of the admin panel; using persistent connections is not the default because it drives up the load on shared hosts.

Your host is clearly coming at it from the perspective that all web apps behave the same way and that they use persistent connections - SMF has not used this by default in its 10 year history.

If you actually notice what I said, I wasn't blaming *all* shared hosting. Or indeed all hosting of any group. I have seen the same underlying condition occur in WordPress, though WordPress is less gracious about it failing (and it certainly doesn't bother to tell you that it's happened)

There are also holes in the story that haven't been touched on - you clearly get more errors than the 4 reported. Why is awstats not flagging those up, out of interest? Seems to me that something is actually more fishy than you're being told.

We've presented both our arguments, I like the implication that I'm not a seasoned professional in the meantime, and you don't have an answer. The reality is, I don't have any reason to pretend the situation is anything other than it is. I'm not a member of the SMF team, I don't owe them anything and if there is something awry, I have no problem calling them out on anything. You will see throughout this forum that I have been critical of various things over the years.

On the other hand, you have a technical manager (who, in my experience, is usually less technical than the employees under them, and is almost always a customer relations person) telling you that there are no problems with the service that you are paying him to provide.

I've never seen this occur on my VPS in the last 7 years (neither the underlying condition nor the surface condition, which is the only thing that changed). The only times I've ever had email notifications out of that were times when MySQL was down or otherwise impaired (like the time I accidentally let the disk get a bit too full).

That said, the situation as described could entirely conceivably occur on dedicated servers. It'll be much rarer but if you have a lot of queries, especially longer running ones, sure, it's conceivable that a confluence of bad luck could cause it all to hit together.

It's just much, much more common on shared servers because of the thousands of other queries happening at the same time.

Vince S · September 26, 2013, 06:17:39 PM

In the spirit of Arantor's ongoing informed engagement in this subject I have asked the senior manager at SiteGround if he can make a further contribution, which he has done. Clearly in the below he has now contributed all he is able to, so I am respectfully asking that the suggestions he has made be chased down as, whilst he cannot give the exact answer we seek, he certainly makes a good case that the way SMF checks for and generates these errors is itself flawed. Here 'tis:

As explained the SMF functionality is checking whether the database connection is actually an active resource using a PHP function. That is OK. If you run this check after the connection is established there won't be a problem, unless of course there is indeed a problem with the database.

Additionally, as explained: "If PHP sends a query to MySQL and the server doesn't respond due to a timeout, there is a specific error relating to that, the infamous "MySQL has gone away" error. That is not what is being triggered here." That is not what is being checked at all. That is the point. The is_resource function checks whether a variable is a resource, in this case the MySQL connection. It does not send a query to MySQL. As the previously opened MySQL connection is already closed, the function will check and, seeing the variable is no longer a resource, return the error.

Note that the use of is_resource isn't really necessary. The mysql_connect function itself returns false on connection failure anyway. What the is_resource function does is to check if an already established connection is active.

I am not an expert PHP developer. Not at all. I do use PHP code in my daily activities, but only for simple scripts that help me in my work. That is why I am not in a position to argue with someone who does it for a living, but the logic behind the explanations seems flawed.
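The behaviour of is_resource described in the reply above can be seen without any database at all. The sketch below uses a php://memory stream purely as a stand-in for a connection variable; it assumes nothing about what ext/mysql does to its link when the server times out, and it only shows what the function itself reports.

```php
<?php
// A memory stream stands in for the connection variable; no MySQL server is involved.
$handle = fopen('php://memory', 'r+');

// While the resource is open, is_resource() reports true.
$whileOpen = is_resource($handle);

// Once PHP itself has closed the resource, is_resource() reports false.
fclose($handle);
$afterClose = is_resource($handle);

// A failed connect call returns false, and false is not a resource either.
$failedConnect = is_resource(false);

var_dump($whileOpen, $afterClose, $failedConnect);
```

In every case the function only inspects the PHP variable it is given; nothing is sent over the network. This is also why the same check catches a failed connection attempt, since the connect call hands back false in that situation.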

For completeness, this is the question I asked:

I would like to take you up on your kind offer to assist further. Presumably the basic problem is generated SMF side and, if they can implement a fix for it, they have the appetite to do so. But they need help. Earlier you made the statement: "The SMF database connection checks can be rewritten in a way that they do not remain idle, thus not reaching the mysql timeout and not getting the error at all. An idling connection will be left by the MySQL server to time out". Could you please advise what you would suggest, exactly?

I was given the following further explanation of the situation, if it helps: "There are precisely two methods in ext/mysql for checking the connection. One is to check the connection resource, the other is to actually fire a query at the database server and hope for the best. And then reconnecting in the event of a failure? That's going to make it so much worse. The problem definition itself still has some ambiguity. If PHP sends a query to MySQL and the server doesn't respond due to a timeout, there is a specific error relating to that, the infamous "MySQL has gone away" error. That is not what is being triggered here."

From earlier in the ticket the following defines exactly what is being done now. How would you suggest this be changed?

The code is quite clear on this point as shown in the 2.0.3 to 2.0.4 update:
Code:

    if (!is_resource($connection))
        db_fatal_error();

That's all it is. is_resource is a built-in PHP function for type checking.

Thank you again for your assistance solving this irksome problem.

Vince S · October 04, 2013, 08:10:57 PM

Bump - @Arantor, does this info from SG suggest a better error handling routine could be coded into SMF?

Arantor · October 04, 2013, 08:13:19 PM

They're trying to suggest it should be done better, but I'd *love* to hear what they think we should do, because I really cannot see how it can be improved upon - and note that no-one else has suggested a better alternative in the last 2 1/2 years.

If they can suggest a better solution that doesn't cause other problems, I'll get it into SMF's trunk. But I'd be a little skeptical, personally.

Vince S · October 04, 2013, 09:06:45 PM

Hmmm, obviously the SG guy has pointed to the logical conundrum with the current methodology and equally said he can't provide a definitive answer. I also am just a peripheral player with the only virtue of being a stubborn old coot that wants to see it sorted if possible, and does appreciate the situation and participation. All I am saying is that the current solution, whilst it may be best effort / good intent, is plainly failing at some level as the evidence is pretty compelling that a substantial number of error reports are most probably false positives.

So count us out, we don't know what we're talking about.

That leaves Google, and in 2 mins I see this suggestion: http://stackoverflow.com/questions/13018227/reproduce-mysql-error-the-server-closed-the-connection-node-js Is that the go?

MrPhil