SMF 2.0.x uses environment default collation/charset

Started by shawnb61, July 05, 2017, 11:09:47 PM

shawnb61

... and this can cause data corruption. 

Steps to reproduce:
- Before installing SMF 2.0.x, set your environment's database default collation to utf8_general_ci
- Install 2.0.x, but do NOT select the UTF8 option (so it thinks it's not UTF8)
- Enter UTF8 text in a post
- Use the 2.0 Admin function to convert to UTF8
- Data is corrupted

This happens because converting data that is already UTF8 to UTF8 corrupts it.  Since SMF 2.0 didn't do the UTF8 conversion itself, $db_character_set & global_character_set do not indicate UTF8, so the conversion function is enabled in the admin menu.  Running it under these circumstances causes the double-encoded UTF8 corruption issue.
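
To make the double-encoding idea concrete, here is a minimal Python sketch. It is not SMF code; it only assumes the conversion reads bytes that are already UTF8 as if they were Latin-1 and writes them out as UTF8 again. The function names are made up for illustration.

```python
# Sketch of the "UTF8 encoded twice" corruption (not SMF's actual code).
# Assumption: the stored bytes are already UTF-8 but get read as Latin-1.

def double_encode(text: str) -> str:
    """Misread correct UTF-8 bytes as Latin-1, one character per byte."""
    utf8_bytes = text.encode("utf-8")    # what is really stored in the table
    return utf8_bytes.decode("latin-1")  # what a wrong conversion sees


def undo_double_encode(text: str) -> str:
    """Reverse the misreading; only valid if nothing else changed the data."""
    return text.encode("latin-1").decode("utf-8")


original = "Sîne klâwen"
corrupted = double_encode(original)

print(corrupted)                      # SÃ®ne klÃ¢wen
print(undo_double_encode(corrupted))  # Sîne klâwen
```

Each accented character turns into two characters, which is the typical look of text that went through the conversion a second time.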

This will only impact newer 2.0.x installations where the host has set the MySQL defaults to UTF8 (or an admin set it).  I only noticed it because I set up a brand new test environment. 

Example data - before:
From the Tagelied of Wolfram von Eschenbach (Middle High German):

Sîne klâwen durh die wolken sint geslagen,
er stîget ûf mit grôzer kraft,
ich sih in grâwen tägelîch als er wil tagen,
den tac, der im geselleschaft
erwenden wil, dem werden man,
den ich mit sorgen în verliez.
ich bringe in hinnen, ob ich kan.
sîn vil manegiu tugent michz leisten hiez.


Example data - after:
From the Tagelied of Wolfram von Eschenbach (Middle High German):

Sîne klâwen durh die wolken sint geslagen,
er stîget ûf mit grôzer kraft,
ich sih in grâwen tägelîch als er wil tagen,
den tac, der im geselleschaft
erwenden wil, dem werden man,
den ich mit sorgen în verliez.
ich bringe in hinnen, ob ich kan.
sîn vil manegiu tugent michz leisten hiez.

Suggested solution:
- A fix applied to SMF 2.1 may be appropriate here: "Do not re-encode data already utf8" (#3984)

Or disable the UTF8 conversion function in 2.0 & let the 2.1 upgrade do it. 
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

shawnb61

I've been experimenting with this, and I have learned something about the 2.0 UTF8 conversion that may be helpful.

When you run the UTF8 conversion, there is a dropdown called "Data character set".  If you set this to match the column definitions in your DB, data corruption will NOT occur.  If you do NOT set it to match the column definitions in your DB, data corruption occurs. 

So... 
If your columns/tables are latin1-xxxx-xxxx, and you select ISO-8859-1 in Data character set, your data will be OK
If your columns/tables are utf8-xxxx-xxxx, and you select UTF-8 in Data character set, your data will be OK

However...
If your columns/tables are latin1-xxxx-xxxx, and you select UTF-8 in Data character set, your data will be corrupted...
If your columns/tables are utf8-xxxx-xxxx, and you select ISO-8859-1 in Data character set, your data will be corrupted...
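
The four cases above can be modelled in a few lines of Python. This is only a simplified model, not what SMF does internally: it assumes the conversion interprets the stored bytes using whatever "Data character set" was selected. The sample word and all names are chosen for illustration.

```python
# Simplified model of the four cases (an assumption, not SMF's real logic):
# the stored bytes are decoded with the selected "Data character set".

TEXT = "tägelîch"

# Python codec names standing in for the MySQL column character sets.
COLUMN_CODEC = {"latin1": "latin-1", "utf8": "utf-8"}


def survives(column_charset: str, selected: str) -> bool:
    """Return True if the text comes through the simulated conversion intact."""
    stored = TEXT.encode(COLUMN_CODEC[column_charset])  # bytes as stored
    result = stored.decode(selected, errors="replace")  # bytes as interpreted
    return result == TEXT


for column_charset, selected in [
    ("latin1", "ISO-8859-1"),
    ("utf8", "UTF-8"),
    ("latin1", "UTF-8"),
    ("utf8", "ISO-8859-1"),
]:
    outcome = "OK" if survives(column_charset, selected) else "corrupted"
    print(f"{column_charset} columns, {selected} selected: {outcome}")
```

In this model the two matching selections print OK and the two mismatched ones print corrupted, which lines up with what I saw.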

Bottom line is to research your tables/columns carefully before running the UTF8 conversion.   You can do this in phpMyAdmin, or you can use a utility like this one to query your actual column definitions:
    https://github.com/sbulen/sjrbTools/blob/master/SMF_UTF8_Diag.php
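
If you would rather check by hand, here is a rough Python sketch of the idea. It is not taken from the utility linked above. The query uses the standard MySQL information_schema.COLUMNS view; the %s placeholder style depends on your database driver, and the mapping to dropdown values only covers the two cases described in this topic, so treat both as assumptions.

```python
# Query to list the character set of every text column in one schema.
# Run it with whatever MySQL client or driver you use (placeholder style varies).
COLUMN_QUERY = """
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = %s AND CHARACTER_SET_NAME IS NOT NULL
"""

# Assumed mapping from column character set to the dropdown value,
# based only on the latin1 and utf8 cases discussed in this topic.
DROPDOWN = {"latin1": "ISO-8859-1", "utf8": "UTF-8"}


def suggest_data_charset(rows):
    """Suggest a "Data character set" value, or None if it is not clear-cut.

    rows: iterable of (table, column, charset) tuples as returned by the query.
    """
    charsets = {charset for _table, _column, charset in rows}
    if len(charsets) != 1:
        return None  # no columns, or mixed character sets: look closer first
    return DROPDOWN.get(charsets.pop())


# Hypothetical sample rows; smf_ is just the usual default table prefix.
sample = [
    ("smf_messages", "body", "latin1"),
    ("smf_messages", "subject", "latin1"),
]
print(suggest_data_charset(sample))  # ISO-8859-1
```

Returning None for mixed or unknown character sets is deliberate: that is exactly the situation where you should not run the conversion without looking closer.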


I still consider this a bug, because your DB & data may already be UTF8 and SMF 2.0 doesn't know it...  But it's good to know that with a proper selection of Data character set, you can tidy things up.

shawnb61

I am going to close this one out.

I think in an ideal world the charset would be detected, not prompted.  But I don't think we are going to make that type of enhancement in the 2.0 codebase at this time.  (And it doesn't even apply to 2.1.)
