Hi,
running on LAMP server (Centos/MariaDb), SMF 2.0.15, db in utf8, in virtual as a testing machine. Cca 200000 picture attachments, mixed old 1.x with new 2.x naming/hashing.
The attachment maintenance
http://XXX/forum/index.php?action=admin;area=manageattachments;sa=maintenance;
stays in a forever loop.
The first loop never finishes. It seems to me, that as soon as the request on line 1009 takes more than 3 seconds, it results in a immediate pause call, but without actually incrementing the iterating variable.
SELECT thumb.id_attach, thumb.id_folder, thumb.filename, thumb.file_hash
FROM XXX_attachments AS thumb
LEFT JOIN XXX_attachments AS tparent ON (tparent.id_thumb = thumb.id_attach)
WHERE thumb.id_attach BETWEEN 162500 AND 162500 + 499
AND thumb.attachment_type = 3
AND tparent.id_attach IS NULL
v .../Sources/ManageAttachments.php line 1009, executed in 3.84993601 seconds, 0.12439704 requests.
Simply the requests over thumb_id >160 000 take 5+ to 10 seconds in my case.
I moved the incrementing of the control variable just before the call to pause (removing it from for loop, putting in front of pause call). This worked for me. It seems that this bug is repeated in the same code in multiple places, ie. if there would be the case the innards of the for loop would take more than 3 seconds, the call to pause would result in neverending loop, as the variable is never incremented in such case.
Can some dev confirm this, please?
PS: The completely different problem is why it takes more than 3 seconds for such a small request. Because I'm not an SQL expert and I know next to nothing about debugging the queries, execution plans and whatnot, I don't know how to find out why are requests so slow. MariaDB is in the default settings, 5.5.60 1.el7_5 on test, 5.5.44 2.el7.centos on production. Both systems running on Centos 7.x
PPS: As this takes a lot of time on my system, sometimes the pause screen re-prompts me for the administrator password, actually breaking the whole lengthy process until I notice that and enter the password again. This maybe should be reconsidered as well.
I believe that you are wrong on all counts.
Smf has pauses to avoid overloading shared servers. It depends on traffic and activity on the site to move the pause. On a private server. There will be no traffic to trigger the movement.
If that endless loop was really endless, we would have heard complaints about it ages ago.
No, he's actually correct - it allows 3 seconds to complete a step but if it doesn't complete the step in that time, it is entirely possible it would have that problem. The traffic is for triggering background/scheduled tasks but this isn't background.
Ehm. Check code in ManageAttachments.php, function RepairAttachments.
On line 994 there is a for loop, which goes over thumbnails with a step of 500 thumbs (control variable is $_GET['substep']). In the loop body, there is a request for the thumbs (line 1009), and on the last line of the loop, line 1042 is call for pauseAttachmentMaintenance.
Function pauseAttachmentMaintenance checks if there was 3 seconds of time limit done. If not, it just returns, otherwise it constructs new url with $_GET['substep'] and continues. It does not check if the server is busy or whatnot, it just breaks the process always, I can't see any code which would do what you describe.
From this it's obvious that if there is time EXACTLY for one request over 3 seconds, the variable $_GET['substep'] is never incremented and the process ends in the endless loop IF the db request ALWAYS takes over 3 seconds.
So, to repeat:
1) On both my systems, ManageAttachments does end in an endless loop.
2) I have debugging turned on, it is obvious from the output, that the function repeats the SELECT with the SAME PARAMETERS again and again every 3 seconds. The select is in my first message.
3) If I fix the function in a way I mentioned, the function finishes just fine.
This problem manifests itself only when the SELECT always finishes in over 3 seconds. I don't know why it does manifest on my system, but it really does. And that SELECT takes between 3-10 seconds on perfectly idle machine with many CPU cores.
So why does it take over 3 seconds? Can you take the query you have, add EXPLAIN in front of it and put that to your server? The output will help explain why it's going slowly.
Possible reason could be,
that on id_thumb is no index,
you could try to set an index on this table and check if the runtime of the query do change.
I don't know and I don't understand the output of EXPLAIN either ;-)
1 SIMPLE thumb range PRIMARY,attachment_type PRIMARY 4 NULL 398 Using index condition; Using where
1 SIMPLE tparent ALL NULL NULL NULL NULL 85449 Using where; Not exists; Using join buffer (flat, BNL join)
There are these indexes:
PRIMARY BTREE Yes No id_attach 85449 A No
ID_MEMBER BTREE Yes No id_member id_attach 85449 A No
ID_MSG BTREE No No id_msg 17089 A No
attachment_type BTREE No No attachment_type 1 A No
count(*)
85449
It seems I make a mistake in my first post between max(id_attach) with count(*).
I need to note that my DB is a result of many upgrades, way back to 2004, from 1.x or maybe even previous versions. There was never clean install of 2.x on it, so it's possible that somewhere some upgrade scripts just did not attach the needed indexes?
And yes, adding simple index on id_thumb makes all queries run in 0.2s or less. I checked the current install of 2.0.15 and it does not contain the index either. So if I understand correctly, for each select there is a need to scan the whole _attachments table for the join.
Pr for 2.1 got created: https://github.com/SimpleMachines/SMF2.1/pull/5307
Thx. What about the for loop problem? As soon as there is any reason to have request run longer than 3s, it will loop over and over. This was partially solved for my case now, by adding the index. But it could rear its ugly head in the future as well.
Fixing the loop problem is significantly more complex as it relies on code shared in other places too :(
Since the index is there now,
the time spended on this query is constant "fast",
so from my point of view the issue got fixed.
When you keeping the issue,
you hardware/database settings doesn't fit the size of you smf setup.
Quote from: Arantor on January 01, 2019, 10:04:21 AM
Fixing the loop problem is significantly more complex as it relies on code shared in other places too :(
I think that just the incrementing the control variable should be put after the request itself, before call to pause function. That way it's guaranteed that it always get incremented, no matter how much time does it spend inside the query. Ie. change all for loops to while loops and put the increments inside the loop (before call to pause). One just needs to be careful if there are any break/continue statements in the loop which could wreak havoc with this approach.
But basically, I don't consider it too complex to fix. And I understand that from one point of view, the problem is fixed by adding the index. But still, it should be at least documented that such a potential problem resides there, even if it's considered WONTFIX.
That only fixes that one instance; this potentially affects all the places the same setup is used (which is far more than just that one instance)
Fixed in 2.1