News:

Want to get involved in developing SMF, then why not lend a hand on our github!

Main Menu

utf-8 conversion error

Started by Krashsite, November 05, 2016, 05:54:20 PM

Previous topic - Next topic

Arantor

I'd personally get a hex editor out and verify what actually is in that dump because I'm suspect of that.

Sir Osis of Liver

Ashes and diamonds, foe and friend,
 we were all equal in the end.

                                     - R. Waters

Arantor

So the dump is UTF-8 encoded. Seems like whatever is importing the dump is saying the dump is thinking it is ISO which would explain the truncation because if the dump were imported under the guise of being ISO (though internally it isn't) MySQL would try to put it in and choke because it would be expecting bytes that have different meanings.

Sir Osis of Liver

But that returns to the fact that the post is not truncated in the body field, and loads complete into post editor, except for special characters, which are replaced with �.  If it's saved in post editor, missing text is restored to display template.  I've exported/imported the db on production forum's host, same phpmyadmin, same problem, and I've duplicated the glitch in my host account.

Ashes and diamonds, foe and friend,
 we were all equal in the end.

                                     - R. Waters

Illori

how are you attempting to export the database? phpmyadmin, hosts control panel or ?

Sir Osis of Liver

phpmyadmin, also tried MySQLDumper.  There is no host backup utility, and it's a basic package that doesn't include SSH.

Ashes and diamonds, foe and friend,
 we were all equal in the end.

                                     - R. Waters

Illori

are you able to run cron jobs? if so you could put in the commandline command there and try that backup.

shawnb61

I suspect a problem with the actual table definition.  I had a very hosed SMF installation & DB.

This may help diagnose:

Using phpMyAdmin, I created a full export, all tables, WITHOUT the data (structure only).  That enabled me to look at how each table was built in the entire DB - now easily readable in Notepad++.  Not only did I have several different collations throughout my SMF install, I even had some tables in different charsets somehow. 

This was the only way I could actually see the charset of each individual table to understand the issues I was having.  I couldn't find it displayed, at the table level, anywhere in cPanel or phpMyAdmin.  These tools all seem to assume your charsets are all consistent... 

To fix the charsets: 

I had to export the problem tables (with data this time, and with the DROP/CREATE syntax) and manually edit the CREATE command to change the DEFAULT CHARSET clause to DEFAULT CHARSET=utf8.  After re-importing the table, everything was fine. 


I had very little multi-char data, it was garbled, and when I was done, it remained garbled.  HOWEVER...   After I did this, all posts going forward were perfect, and I was finally able to enable and provide multi-language support. 

How did I get multiple charsets?  I don't know.  This site was maintained and patched by several people over years, starting with SMF 1.1.   None of whom were around anymore. 

If you're brave (/silly) enough to try this, do it on a local or test DB a few times first...   I created a local working copy of my SMF site specifically for this purpose using EasyPHP.  I also have a test copy on my host.  By the time I did this change in production, I had executed the scripts and tested the results 2 full times. 
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

Sir Osis of Liver

#28
I think the basic problem here is that any backup method I use will be based on same mysql function, and that's what's causing the problem.  Mysql seems to be having a problem with collation.  The db exports and imports exactly the same characters, but SMF does not display them correctly.  I can copy/paste text from body field in import db, and this is what you see here -

Saltwater Ich (Cryptocaryon irritans)<br />C. irritans (Cryptocaryon) is a fully ciliated, protozoan parasite present in all saltwater environments, which causes a disease known as marine "ich" or marine "white spot" disease in wild and cultured marine fishes at temperatures between 59 and 86°F.


Display template in this forum is able to display the text in it's entirety, including special characters.  The only discrepancy is "ich".  I was originally seeing in both databases, but now it's changed to "ich" in both, for reasons unknown.

I will check the table definitions.

Edit:  Checked the dump, all tables are imported with DEFAULT CHARSET=utf8.
Ashes and diamonds, foe and friend,
 we were all equal in the end.

                                     - R. Waters

shawnb61

I'm fairly certain collation doesn't affect display.   Only whether special characters are treated the same for searches, sorts, string compares, etc.    (E.g., whether they're case insensitive or not...)   You do want collation consistent across your DB, but more for consistency/cpu.

Good news that it's all utf8 already.   

First thought...  Are there particular languages impacted?   MySql has a weird partial implementation of utf8...  Some languages need utf8mb4...   

Second thought...   elsewhere in the stack isn't supporting utf8....   Webserver?     Browser?
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

Sir Osis of Liver

It's a bilingual forum (english/spanish), but the borked posts I'm working with are english.  Don't understand why the body fields from original and imported databases appear identical, but don't work the same, why post editor can show text that display template cannot, and why I can copy text directly from import db, post it here, and it's correct.

Actually, it's not entirely correct, why is it showing <br /> after the title?

Saltwater Ich (Cryptocaryon irritans)<br />C. irritans


Ashes and diamonds, foe and friend,
 we were all equal in the end.

                                     - R. Waters

shawnb61

I don't think I picked up...  Is the behavior is different across two websites?   

I would next audit php.ini in both environments to make sure they're consistent. 

And your Apache  httpd.conf as well... 

Both of those config files need utf8 specified...
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

Sir Osis of Liver

The original db works fine on production website, but when exported/imported to new db on owner's host or my host and connected to clean 2.0.12 install, posts do not display correctly, missing text and special characters.  Same problems on both hosts, so unlikely it's a server-side problem.  Don't think I can access php.ini on owner's host, and don't want to make any changes there until I know what's causing the problem.  There is currently no way to get an intact db backup, and we've been unsuccessful asking host to copy the db, so there's only the original.

I'm able to determine by trial&error which characters are causing the problems in the posts I've looked at, but it would be very difficult to find/replace all the troublesome characters in 30mb/5400 posts by editing the dump or running sql queries in phpmyadmin.

Arantor seems to think the dump is ok, so the export is good, but something's going wrong with the import, and it's not visible in the imported body field.



Ashes and diamonds, foe and friend,
 we were all equal in the end.

                                     - R. Waters

shawnb61

I thought smf had a settings dump tool somewhere around here...    I would definitely ensure environment alignment...
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

Sir Osis of Liver

Comparing db dump from one of my forums with the dump from problem forum, the latter contains following which are not in former -



SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET time_zone = "+00:00";


/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;



Ashes and diamonds, foe and friend,
 we were all equal in the end.

                                     - R. Waters

shawnb61

Short version:  If your tables were built correctly, I don't think that code is the problem.   

You should see code like this as bookends to any phpMyAdmin export, as it saves off some environment variables at the beginning and restores them at the end.  The idea is to faithfully import your data without screwing up your environment settings. 

My suspicion is still environment...

The utility that displays your environment info is phpinfo.php.  You need to get a copy & put it in your smf directory.  Run it in both environments.  I would first look at the mbstring section, looking for differences. 
    http://wiki.simplemachines.org/smf/What_is_a_phpinfo()_file

There are a few articles out there that explain configuring your environment for utf8, such as this, that will explain some of the settings:
    https://allseeing-i.com/How-to-setup-your-PHP-site-to-use-UTF8

Hope this helps...
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

Sir Osis of Liver

I think maybe we're looking at the wrong end of the horse.  The dump apparently is utf-8, and is correct as per quick sample in hex editor.  It's being imported into same host, same php settings, same phpmyadmin, same mysql, but it's not importing same as db it's exported from, even though they look identical.  I've duplicated the problem on different host.  The basic problem seems to be some incompatible character codes that were inserted by various word processors (possibly even Mac ), and for reasons unknown display correctly from original db, but are borked on import.  But if the dump is valid utf-8, and body fields are identical in original and imported db, why is SMF displaying the text differently?
Ashes and diamonds, foe and friend,
 we were all equal in the end.

                                     - R. Waters

shawnb61

But if you have confirmed the content matches, byte for byte, it must be how these new environments are configured.   
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

richardwbb

Quote from: shawnb61 on November 13, 2016, 01:38:41 AM
But if you have confirmed the content matches, byte for byte, it must be how these new environments are configured.   

I agree with this.

This I have written somewhere else, please read this;

http://www.simplemachines.org/community/index.php?topic=548245.msg3898297#msg3898297

Can you run with phpMyAdmin SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%'; on both setups?

Please have a look at this and decide if this is helpful to you or not;

http://www.simplemachines.org/community/index.php?topic=548245.msg3898964#msg3898964

I've noticed the encoding on 'ich' and I suspect that MySQL might be encoding this character. It does for ' This character is for begin and end with MySQL but people put that in their post too and that one is encoded differently. For " that character is preceded with \ to make it appear and for " Ã" so it is preceded with Ã

Note that """ is not '"'

For the encoding of 'ich' I've seen it somewhere, that is reversible encoding and you should consider trying a conversion script that is helpful.

For the phpMyAdmin export, avoid this if possible. phpMyAdmin exports are not the same as mysqldump exports. For my host, is running DirectAdmin, this has a menu option mysqldump, that dump is litterally MySQL talking. A phpMyAdmin dump is phpMyAdmin talking.

And for most uses phpMyAdmin is sufficient. But you explained that there are fourty users under the same account and those users must have had different keyboard settings, and I wasn't able myself to drill down on this but it appeared to me that different keyboard harware generates different byte sequences for the same character that is displayed and how and where this is handled is also there where you have to take the proper steps for repair, and for correct configuring. And if phpMyAdmin shows on both setups the same settings, maybe it is possible for you to keep using phpMyAdmin, though I have told you it exports differently and you could ask both providers to get the MySQL response to the charset and collation from there you can tell not what is wrong but you will be able to exclude what isn't wrong.

For the proposal of cron, I've never came around/ nor found the need for this but if you can mysqldump with that, I think you should. Atleast till you know for sure phpMyAdmin is exporting everything correctly. And that is important because it would be hard to learn keyboard setups with characters that are not in your native language. First hit with Google but there are better ones; http://www.comentum.com/mysqldump-cron.html
If my post in this topic looks ambiguous to you, then I'm with Murphy's law and General Stupidity. In other words, trial and error.

Sir Osis of Liver

Thanks for the info, will check through all that soon as I have time.  Meanwhile, forum owner did another export/import yesterday and new database works fine.  Waiting for him to tell me how he imported/exported, as we'd already done this several times, and db was damaged every time.
Ashes and diamonds, foe and friend,
 we were all equal in the end.

                                     - R. Waters

Advertisement: