News:

SMF 2.1.4 has been released! Take it for a spin! Read more.

Main Menu

utf-8 conversion error

Started by Krashsite, November 05, 2016, 05:54:20 PM

Previous topic - Next topic

Sir Osis of Liver

One of my guys is seeing this error when trying to do utf-8 conversion on 2.0.12 forum -



http://domain.org/index.php?action=admin;area=maintain;sa=database;activity=convertutf8

Database Error: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'key key varbinary(255) NOT NULL default '',
CHANGE COLUMN member_name mem' at line 2

Apply Filter: Only show the errors from this file

File: /homepages/139/n285406/htdocs/csa/Sources/ManageMaintenance.php
Line: 666



Error is in conversion script.  Database is already at utf-8.

When in Emor, do as the Snamors.
                              - D. Lister

Sir Osis of Liver

Also having a problem displaying special characters.  Forum was originally at 2.0.8, this character - ® - displayed correctly.  Now at 2.0.12, it displays this - �.  Other characters may be affected.  Collation is utf8_general_ci.
When in Emor, do as the Snamors.
                              - D. Lister

richardwbb

A database in utf-8 will show black diamonds if the character encoding has been wrong. It is possible the data inside the database itself has to be converted with a converter. On linux a quick way to see if a database is encoded properly is to use iconv.

If you have let MySQL import with INSERT INTO, then a black diamond means the original character encoding has been lost.

It is also possible that messages get truncated, then the specific black diamond isn't there, but at the first incorrect encoded character, the rest of that message has been dropped. From the error you posted this doesn't seem to be the case though. Just be aware that exporting and importing from MySQL even within PHP can result in database-losses.

I wonder if your forum has been set with ISO-8859-15 or UTF-8 encoded translations/ English files. SMF treats differently the database. Also, phpMyAdmin will tell you what encoding it sees with the MySQL. If you can get to the MySQL prompt it can tell what encoding is in use [this can differ from what phpMyAdmin is telling].

i would like to know for sure the original database, without conversion, I mean, will be inserted in MySQL without error. There is no point using SMF for conversion, a proper encoded file can be converted to UTF-8 without problems with iconv.

Also, if you have mangled characters in the database or users with different languages or codepages, you will end up with a UTF-8 file MySQL can or can not INSERT properly.

hth.
If my post in this topic looks ambiguous to you, then I'm with Murphy's law and General Stupidity. In other words, trial and error.

Sir Osis of Liver

Seeing these - ’ - for apostrophe, and these - ® - for ® in sql dump.  Both are replaced with � when db is imported.  The post is being truncated, apparently at an arbitrary point, most of the text is missing.  Happens when db is dumped as sql or gz zip.
When in Emor, do as the Snamors.
                              - D. Lister

Sir Osis of Liver

This is getting stranger.  We found a post on the clean install running on my host that has several paragraphs missing, just the titles of the first three are strung together in one line.  What's strange is that if we click 'Modify', the complete post loads into the post editor, and when saved, the missing paragraphs are restored.

Any ideas?
When in Emor, do as the Snamors.
                              - D. Lister

Chen Zhen

Did they use the SMF UTF-8 conversion tool/option or a 3rd party MySQL editor such as phpmyadmin?

Either way it sounds as though special characters (ie. apostrophe's) were converted to their literal characters as opposed to HTML character codes.
Similar issue regarding posts as to what is described here: SMF Post
and also possibly what is described here: Stackoverflow Post

There is also a collation tool located here: Webdev Tool ... if necessary.
(adjust smf settings table to utf-8 collation with phpmyadmin and then try using v1.0)

My SMF Mods & Plug-Ins

WebDev

"Either you repeat the same conventional doctrines everybody is saying, or else you say something true, and it will sound like it's from Neptune." - Noam Chomsky

Sir Osis of Liver

AFAIK, the forum was originally set up as utf-8, it's bilingual (english/spanish).  Took a quick look at those links, seems it may be possible to fix the backup.  Can't do anything to the production database, as it is currently the one and only.  The missing text and corrupt characters are all present in the dump, it's complete, just doesn't display correctly, but I can work on that.
When in Emor, do as the Snamors.
                              - D. Lister

lwiz

Might be a long shot but is there variable global_character_set in settings table with value of UTF-8?

-Reko

richardwbb

From what you have written before my post [this], I can tell you, it isn't arbitrary what is going on, and, somewhere is a setting [or settings], not proper. Also, it looks to me that any damage in the database already, has to be converted manually.

From here I wonder what your tools are, do you have linux or Windows. And, what kind of access do you have, can you get to the MySQL prompt or not? [I assume you are using phpMyAdmin and exported databases. I am not sure if phpMyAdmin will report what MySQL says after importing the database, but it can be something like TRUNCATED 'AC813F' <- and hundreds of those. There is not written what post this was. And this problem I had and this one isn't easy to overcome if you don't know what to type to get useable Google results.

I also learn from you that you still have a proper file [I mean by that, non-truncated, non <?> characters in there. I propose you take a text editor and open the dumped sql file. It can be a good idea, if you are able to reproduce [but I not advise this yet as the way to go from here, since I am not understanding what you have done exactly, and more important what happened before this, how late you noticed something is wrong with the database/ setting etc.] But, writing a personal message to yourself, with the characters you are aware of, those end up truncated or mangled, changing settings, reviewing your old message, and, posting the same characters again, for example but not bound to, 'à á ä ò ó ö è é ë ï', you might learn where things break down. And from this point I ask you, if you are able and willing to post a piece of dumped sql from your text editor here. A hint is that you can find the broken part in the dump file by searching a  unique pattern, or just 'é' [but that way you might end up with a part of a old message you have never seen before] and for debugging/ finding the cause of all this, it is better to have found a part in the sql dumped file that you have typed yourself.

And finally, please explain what you are doing now to prevent further damage. I learn from what you have written, that, you are aware of the truncation and I am not 100% sure yet but you will have to repair manually what already is damaged and this can be by a click of a script if you are lucky and if you are not so lucky, then you still need a text-editor and proper conversion tools and quite some time to spend probably.

Quote from: Sir Osis of Liver on November 06, 2016, 10:40:52 PM
Seeing these - ’ - for apostrophe, and these - ® - for ® in sql dump.  Both are replaced with � when db is imported.  The post is being truncated, apparently at an arbitrary point, most of the text is missing.  Happens when db is dumped as sql or gz zip.
If my post in this topic looks ambiguous to you, then I'm with Murphy's law and General Stupidity. In other words, trial and error.

Arantor

’ is a classic sign of a two byte UTF-8 character (i.e. data is physically UTF-8) being interpreted as ISO. When the DB is imported, you're taking that data that is now thought to be ISO and trying to put it into a UTF-8 table. Those bytes are not valid UTF-8 if it's gone through the process of being thought of as ISO because they end up decoded to high-bit characters which when they go back into MySQL look like the second byte of a UTF-8 character without a preceding first-byte of a UTF-8 character, hence the truncation.

Step 1: what is the actual encoding of the data as far as the database is concerned - not what SMF thinks it might be? (Because SMF can and does get this wrong depending on a variety of factors)
Holder of controversial views, all of which my own.


Sir Osis of Liver

#10
The production database is fine, everything displays correctly.  The database dump contains all missing text, with scrambled codes for special characters.  When imported, it does not display special characters, and blocks of text are missing in display template, but are included in post editor.  In phpmyadmin, the production database shows all tables as utf8_general_ci.  The bottom (Totals) line shows latin1_general_ci.  I don't see any way to get to MySQL prompt, this is a basic hosting package with few bells and whistles.  I can edit the dump, there's a script in one of the links posted above that replaces the borked codes with html character codes, but don't know if it will get all of them.  Will try that tonight (if I don't run out of beer).

Quote from: Sir Osis of Liver on November 09, 2016, 09:12:11 PM
Ran SHOW FULL COLUMNS FROM smf_members, all text fields are utf8_general_ci, all int fields are NULL.  Same in smf_messages.
When in Emor, do as the Snamors.
                              - D. Lister

Sir Osis of Liver

#11
This is what's in display template -



Saltwater IchSymptomsIdentification



This is what's in post editor -



[b]Saltwater Ich[/b] (Cryptocaryon irritans)
Cryptocaryon irritans (Cryptocaryon) is a fully ciliated, protozoan parasite present in all saltwater environments, which causes a disease known as marine �ich� or marine �white spot� disease in wild and cultured marine fishes at temperatures between 59 and 86�F. Cryptocaryon infections in fish are a significant disease problem for marine aquarists and commercial mariculture worldwide. Cryptocaryon is known to infect many different fish species, although there appear to be differences in susceptibility.



This is what phpmyadmin shows in body field -



[b]Saltwater Ich[/b] (Cryptocaryon irritans)<br />Cryptocaryon irritans (Cryptocaryon) is a fully ciliated, protozoan parasite present in all saltwater environments, which causes a disease known as marine "ich" or marine "white spot" disease in wild and cultured marine fishes at temperatures between 59 and 86°F. Cryptocaryon infections in fish are a significant disease problem for marine aquarists and commercial mariculture worldwide. Cryptocaryon is known to infect many different fish species, although there appear to be differences in susceptibility.



This is what's in the dump -



[b]Saltwater Ich[/b] (Cryptocaryon irritans)<br />Cryptocaryon irritans (Cryptocaryon) is a fully ciliated, protozoan parasite present in all saltwater environments, which causes a disease known as marine “ich” or marine “white spot” disease in wild and cultured marine fishes at temperatures between 59 and 86°F. Cryptocaryon infections in fish are a significant disease problem for marine aquarists and commercial mariculture worldwide. Cryptocaryon is known to infect many different fish species, although there appear to be differences in susceptibility.




The script in Chen Zhen's link does not work because it doesn't find the borked character codes.  They're not in the imported database message body, characters are correct there, but they're in the dump.  Basically, phpmyadmin is telling me the characters have imported correctly, display template has lost entire paragraph, post editor shows the paragraph with special characters missing.


When in Emor, do as the Snamors.
                              - D. Lister

Sir Osis of Liver

Quote from: lwiz on November 09, 2016, 12:15:30 PM
Might be a long shot but is there variable global_character_set in settings table with value of UTF-8?

Yes there is.
When in Emor, do as the Snamors.
                              - D. Lister

Sir Osis of Liver

#13
This is what breaks it - should be "ich", but in body field it's "ich", in dump it's “ichâ€.

In the original production database, it's "ich".  Nothing wrong with the dump or import, it's being exported/imported exactly as it is on production forum, except production forum displays it correctly as "ich", imported db breaks it.  Both running 2.0.12, on different hosts.

When in Emor, do as the Snamors.
                              - D. Lister

Arantor

Then the dump is broken somehow. How exactly are you checking the dump's correctness?
Holder of controversial views, all of which my own.


Illori

so http://www.simplemachines.org/community/index.php?topic=549809.0 and this topic are the same issue/topic? if so please select one topic and lock the other so we dont get confused as to what is going on and what has been done where.

Sir Osis of Liver

The dump is importing exactly the same characters from production db, as viewed in phpmyadmin.  Post displays correctly in production forum, but not in import forum.  Same characters in body field.  This happens on both hosts.

Quote from: Illori on November 10, 2016, 05:11:02 AM
so http://www.simplemachines.org/community/index.php?topic=549809.0 and this topic are the same issue/topic? if so please select one topic and lock the other so we dont get confused as to what is going on and what has been done where.

Locked the other one.
When in Emor, do as the Snamors.
                              - D. Lister

Sir Osis of Liver

Production db body field -

Saltwater Ich (Cryptocaryon irritans)<br />C. irritans (Cryptocaryon) is a fully ciliated, protozoan parasite present in all saltwater environments, which causes a disease known as marine "ich" or marine "white spot" disease in wild and cultured marine fishes at temperatures between 59 and 86°F.

Imported db body field -

Saltwater Ich (Cryptocaryon irritans)<br />C. irritans (Cryptocaryon) is a fully ciliated, protozoan parasite present in all saltwater environments, which causes a disease known as marine "ich" or marine "white spot" disease in wild and cultured marine fishes at temperatures between 59 and 86°F.

They are identical.  If I replace " and " with ", it breaks at °.
When in Emor, do as the Snamors.
                              - D. Lister

Arantor

Something isn't identical between the two, clearly.
Holder of controversial views, all of which my own.


Sir Osis of Liver

#19
Yes, but how to find it?

This is an unusual forum, 30mb content consists of 5400 technical articles posted in 5400 topics by approx. 40 contributors, all using one admin account (the only member), over 2 1/2 years.  Most, if not all, of the articles were probably written in various word processors.  It's also possible some of the posts were copied/pasted from several WP installs the owner also runs.  Everything works fine from production database, but it can't be backed up or moved because exports are somehow damaged.
When in Emor, do as the Snamors.
                              - D. Lister

Advertisement: