utf-8 conversion error

Started by Krashsite, November 05, 2016, 05:54:20 PM


Sir Osis of Liver

While waiting to hear from forum owner, I've again dumped the production db as sql and gz zip, and imported on both his host and mine, with same result, both are damaged.  I've confirmed that he did successfully import a dump of same db yesterday.  No idea how he did it, except that he thought I had fixed it.

Re: the quotes around ich, that's one of the character errors that's causing post text to go missing.  I was originally seeing this - - which won't display here.  When I copy it from Notepad, where it's displayed as in the image, you see it here as "ich", which is how it's displayed on production forum, and in the production database and all imported databases.  Sadly, I've been tinkering with this over so many variations that I can't remember where I saw it with curly quotes, and can no longer find it.  This is what's in the sql dump - â€œichâ€.
"The best laid schemes o' mice an' men / Gang aft a-gley." - Robert Burns

Chen Zhen

Why not just create a secondary query to replace all the special characters you are having issues with?
My previous post in this thread pointed to an example that will fix some of them where you just need to figure out your specifics to replace.
Create a multidimensional array with your specifics, then loop through it to replace what is needed in any relevant table.


My SMF Mods & Plug-Ins

WebDev


"Either you repeat the same conventional doctrines everybody is saying, or else you say something true, and it will sound like it's from Neptune." - Noam Chomsky

Sir Osis of Liver

The problem is finding all the borked characters across 5400+ bilingual topics.  It appears the forum owner is now able to export/import the original database successfully using default phpmyadmin settings; he's done it twice.  I have no idea why he can now do it and I still can't.  Working on it tonight.

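One way to get an inventory of the damaged sequences before deciding what to replace is to scan the dump text. The following is only a sketch, not something used in this thread. It assumes the damage is UTF-8 text that was read as Windows-1252, and the names `looks_like_mojibake` and `count_suspects` are made up for illustration.

```python
import re
from collections import Counter

# Any run of consecutive non-ASCII characters is a candidate.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def looks_like_mojibake(run: str) -> bool:
    """Return True if the run is probably UTF-8 that was read as Windows-1252."""
    try:
        # If the run turns back into valid UTF-8, it was very likely misread.
        run.encode("cp1252").decode("utf-8")
        return True
    except (UnicodeEncodeError, UnicodeDecodeError):
        # A sequence that lost its last byte cannot round-trip, so the
        # telltale prefix is flagged as well (an assumption of this sketch).
        return run.startswith("â€")


def count_suspects(text: str) -> Counter:
    """Count each suspicious run found in the text."""
    runs = NON_ASCII_RUN.findall(text)
    return Counter(run for run in runs if looks_like_mojibake(run))
```

Calling `count_suspects()` on the text of the dump returns a tally of each suspicious sequence, which can then be used to build a replacement list. Correctly encoded words such as "Grüße" are not flagged, because re-encoding them does not produce valid UTF-8.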
Sir Osis of Liver

Well, here's something new.  Am now able to export/import db successfully on production forum's host, doing it same way we've been doing it for days, but same dump imports with same problems on my host.  Since the goal here is to move forum to a different host, we're only halfway there.  AFAIK, nothing's been changed on production host.

Production host:  phpMyAdmin - 4.1.14.8    MySQL - 5.5.50-0+deb7u2-log
                         Import has been successfully finished, 864 queries executed.

My host:            phpMyAdmin - 2.8.0.1    MySQL - 5.6.32-78.1-log
                        Import has been successfully finished, 857 queries executed.

Chen Zhen


Do you have VPS or dedicated from your host?
Can you update phpmyadmin to something more recent?
From what you just described it sounds as though updated changes/fixes to the phpmyadmin code may have corrected some issues such as this.

Just for kicks

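// Sample of the damaged text as it appears in the sql dump
$string = 'â€œichâ€';
// Each key is a mangled sequence, each value the character it should be
$multiDim = array(
'â€™' => '’',
'â€˜' => '‘',
'â€”' => '—',
'â€“' => '–',
'â€¦' => '...',
'â€œ' => '“',
'â€' => '”'
);
// strtr() tries the longest keys first, so the bare 'â€' only catches leftovers
$string = strtr($string, $multiDim);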



Sir Osis of Liver

We're both running on basic host packages, so not much I can do.  I don't believe phpmyadmin or mysql versions on production host have changed since we started working on this.

Wondering why the import on my host is running 7 fewer queries from same dump.

richardwbb

From your previous two or three posts in this topic, I couldn't work out where exactly that mangled character showed up, or whether it was in the exported database sql file.

But I'll take this literally:

Quote: This is what's in the sql dump - â€œichâ€

So is that an exported file for testing purposes, or is it from the new host? Or are you stuck with this, and does it show the same on screen on (a) the old host and/or (b) the new host?

If you run ' â€œichâ€' it will convert this to '“ich' with this utf8-decoder I posted a link of.

Is there a 'œ' lost somewhere? If so, did you do that, or is that how it showed in the sql file you have there? And either way, is there a blank space after the second Euro sign?

I recognize all of the above, and while not every misencoding is reversible, I recommend Perl or Python for the repair. Without knowing more about which characters are mangled, I can't be sure where to start repairing, or how.

Your best bet [based on what I learned from you in this topic] is to establish whether or not this is a configuration problem. That's as far as I can get: I wasn't able to figure out on my own what is what, or how stuck you are with that exported sql you have spoken about.
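As a rough illustration of the kind of Python repair meant here, the sketch below undoes a single misreading step, assuming the text was UTF-8 that got decoded as Windows-1252. The function names are hypothetical, and it has not been run against the database in this thread.

```python
import re

# Work on each run of non-ASCII characters separately, so one
# unrepairable sequence does not block the rest of the text.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def _repair_run(match: re.Match) -> str:
    run = match.group(0)
    try:
        # Reverse the misreading: back to the raw bytes, then decode as UTF-8.
        return run.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible (or not damaged at all): leave the run untouched.
        return run


def repair(text: str) -> str:
    """Repair every reversible run and keep everything else as it is."""
    return NON_ASCII_RUN.sub(_repair_run, text)
```

With the string from the dump, `repair("â€œichâ€")` fixes the opening quote and leaves the truncated closing sequence as it was, which matches what the online decoder did. That leftover is the kind of case that still needs a manual replacement.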
If my post in this topic looks ambiguous to you, then I'm with Murphy's law and General Stupidity. In other words, trial and error.

Sir Osis of Liver

That's what's in the dump from the production database.  As per my previous post, we're both able to export/import successfully on the production host, but not on my host.  Nothing appears to have changed.  My host will not update phpmyadmin, but suggested I install the current version in my own account and run it from there.  No clue so far how to configure it, and not a lot of time to tinker with it.

Chen Zhen

Quote: If you run ' â€œichâ€' it will convert this to '“ich' with this utf8-decoder I posted a link of.

Btw there is supposed to be a hidden character for the curly closing quote which is why it did not show in his post properly.
â€œichâ€  should be “ich”
For testing purposes if you quote the post it may show but copy & paste from the page itself and it will not.




richardwbb

reasoning;

I've been thinking about what might have happened to the database you have there, and I tried to find an example website decoder/encoder, but I couldn't find any that worked. There is a reason for that, which I'll explain below.

I believe that what has happened is that a Latin-1 encoded database was imported into a UTF-8 aware SMF.

So, what you see is that '“' became 'â€œ', and that 'â€œ' is how '“' gets written in a Latin-1 database that is storing a UTF-8 character.

This means that, in a UTF-8 aware database, that 'â€œ' does not become '“' again. This is the reason I spoke about: there is no such thing as 'wrong' encoding in a UTF-8 database file [for this example, that is]. For Linux users, it means that iconv can not translate UTF-8 to anything but UTF-8 or up. That is because the computer can't tell that 'â€œ' [it stays 'â€œ' for this computer] must mean '“'.
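The byte-level mechanics behind this can be sketched in a few lines of Python. This is an illustration only: it assumes the "Latin-1" side behaves like Windows-1252 (which is how MySQL's latin1 treats these bytes), and `misread_as_cp1252` is a made-up helper name.

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those same bytes as Windows-1252."""
    # errors="replace" substitutes U+FFFD for bytes Windows-1252 leaves undefined.
    return text.encode("utf-8").decode("cp1252", errors="replace")


opening = "\u201c"  # left curly quote
closing = "\u201d"  # right curly quote

# One character becomes three bytes in UTF-8.
opening_bytes = opening.encode("utf-8")  # bytes e2 80 9c
closing_bytes = closing.encode("utf-8")  # bytes e2 80 9d

# Read one byte at a time, those three bytes turn into three characters.
opening_seen = misread_as_cp1252(opening)
closing_seen = misread_as_cp1252(closing)
```

Here `opening_seen` is 'â€œ'. For the closing quote the third byte is 0x9d, which Windows-1252 leaves undefined, so `closing_seen` is 'â€' followed by a replacement character. That is consistent with the closing sequence appearing to end in a hidden or missing character.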


to know beforehand;

What might give a good result [provided your database has no damage and this was a single-step misconversion]: you will need a MySQL database that is in Latin-1 encoding and an SMF forum that is in ISO-8859-1. [That means not ticking the UTF-8 checkbox, which is what I suspect has happened.]

Then, in the template, the meta charset has to be 'utf-8'. What? Yes, because if the browser expects UTF-8, bada-bing, 'â€œ' becomes '“' again.

Is there something wrong? Not in this example, which assumes a non-damaged database file that has received only a single misconversion. Also keep in mind that I didn't speak about win-1252 encoding. I do not know if there is another step to take here, but once you have imported that database again into an ISO-8859-1 SMF forum with a Latin-1 talking MySQL, you can make sure for yourself that this is over by checking *every* post that previously showed misencoding.

And if that database receives a Latin-1 encoded file, it can't tell that the data had been stored in a UTF-8 aware forum. Actually there is more I should tell you; for example, if that database you have there is still live, you might have a different situation. That depends on what you and your colleague [might] have done.


my advice;

For starters, try bigdump.php or use a PHP import to your target ISP and MySQL. You might also want to try putting a '#' before the '$db_character_set = 'utf8';' line in Settings.php of the source SMF forum, to see what happens with the 'â€œ'. BUT DO NOT DO THIS ON A LIVE FORUM IF YOU DON'T FULLY UNDERSTAND WHAT YOU ARE DOING, because SMF will start encoding differently, and then you will have mixed encoding and be stuck with UTF-8. That doesn't seem to be the case right now, so you are luckier than I was with all this. [And you still have to look into converting to UTF-8 properly, but that will be chapter II [or III, if something erroneous shows up from this point] for you.]


  • I hope this is clear to you before you do anything, and if not, feel free to ask me about what I wrote in this post.


Sir Osis of Liver

November 22, 2016, 03:53:02 PM #50 Last Edit: November 22, 2016, 04:16:37 PM by Sir Osis of Liver
What's odd is that export/import now works successfully on the production host; afaik nothing's been changed.  Last night I moved all of my stuff off my longtime host (20 yrs.) due to the apparent collapse of their support system (enough is enough :P).  The new host is running phpmyadmin 4.0.10.14.  Will try importing the sql dump there as soon as I clean up a few loose ends.

Illori

Quote from: Sir Osis of Liver on November 22, 2016, 03:53:02 PM
new host is running php 4.0.10.14. 

that version of PHP is no longer supported, i would not use a host that uses a php version without support.

Sir Osis of Liver

Sorry, that's phpmyadmin 4.0.10.14.  Didn't solve the problem, import is still borked.

shawnb61

Did you run phpinfo?  Do the mbstring settings match on your two environments?
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

Sir Osis of Liver

Will check that tonight.  I've now imported the same dump on three different hosts; it only imports correctly on the production forum's host, which it did not do previously (that was the original problem).

Sir Osis of Liver

   Matched php settings on my host with production forum's host by downgrading to php 5.5.  Both are configured as follows -

  PHP Version 5.5.38
  mbstring.http_input     pass        pass
  mbstring.http_output   pass        pass
  default_charset           no value   no value

Import still screwed up on my host.

Previous settings -

  PHP Version 5.6.28
  mbstring.http_input    no value   no value
  mbstring.http_output  no value   no value
  default_charset          UTF-8       UTF-8

Sir Osis of Liver

Quote from: richardwbb on November 22, 2016, 03:40:13 PM
For starters, try bigdump.php or use a PHP import to your target ISP and MySQL. You might also want to try putting a '#' before the '$db_character_set = 'utf8';' line in Settings.php of the source SMF forum, to see what happens with the 'â€œ'. BUT DO NOT DO THIS ON A LIVE FORUM IF YOU DON'T FULLY UNDERSTAND WHAT YOU ARE DOING, because SMF will start encoding differently, and then you will have mixed encoding and be stuck with UTF-8. That doesn't seem to be the case right now, so you are luckier than I was with all this. [And you still have to look into converting to UTF-8 properly, but that will be chapter II [or III, if something erroneous shows up from this point] for you.]

Ok, looks like I fixed it.  Settings.php in production forum contains this line -



$db_character_set = 'utf8';



Settings.php in my test forum does not.  If I add the line, posts from imported db display correctly.  Apparently that's been the problem all along.  Don't know why it's in the original forum's settings and not in my test forum settings.

  "Even a blind pig occasionally finds a mushroom."

Illori

Quote from: Sir Osis of Liver on November 23, 2016, 10:07:38 PM
Settings.php in my test forum does not.  If I add the line, posts from imported db display correctly.  Apparently that's been the problem all along.  Don't know why it's in the original forum's settings and not in my test forum settings.

  "Even a blind pig occasionally finds a mushroom."



then your test forum was never converted to UTF-8, so that line was never added to it.

richardwbb

Quote from: Sir Osis of Liver on November 23, 2016, 10:07:38 PM
Ok, looks like I fixed it.  Settings.php in production forum contains this line -



$db_character_set = 'utf8';



Settings.php in my test forum does not.  If I add the line, posts from imported db display correctly.  Apparently that's been the problem all along.  Don't know why it's in the original forum's settings and not in my test forum settings.

  "Even a blind pig occasionally finds a mushroom."



It looks like both MySQL configurations are in Latin-1, and it seems that the setting you mentioned on the production server never changed [or was never there]; otherwise there would be mixed encoding. And for the testing server, it is to be expected that any posts made there with incorrect settings will be overwritten anyway.

You might want to check the exported database anyhow for inconsistencies, but from your responses I sense you won't bother, haha.

Quote from: Illori on November 24, 2016, 06:03:02 AM
then your test forum was never converted to UTF-8, so that line was never added to it.

Illori, can you tell us how a fresh SMF install and a conversion to UTF-8 each affect this setting?

Illori

when you convert to UTF-8, that line is added to the Settings.php file; otherwise it does not exist. beyond that, i don't know what that line tells SMF, as i am not a coder.
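As far as I understand it (this is an assumption and has not been checked against the SMF source), SMF 2.0 uses this value to set the character set of the database connection. That would explain why the same imported data displays correctly with the line and incorrectly without it. For anyone comparing two installs the way it was done above, here is a small hypothetical check in Python; `db_character_set` is a made-up helper that only looks for an uncommented assignment in the text of Settings.php.

```python
import re
from typing import Optional

# Matches an uncommented line such as:  $db_character_set = 'utf8';
ASSIGNMENT = re.compile(
    r"^[ \t]*\$db_character_set[ \t]*=[ \t]*['\"]([^'\"]*)['\"][ \t]*;",
    re.MULTILINE,
)


def db_character_set(settings_php_source: str) -> Optional[str]:
    """Return the configured value, or None if the line is absent or commented out."""
    found = ASSIGNMENT.search(settings_php_source)
    return found.group(1) if found else None
```

Running it on the contents of both Settings.php files shows at a glance whether one install has the setting and the other does not. A line prefixed with '#' counts as absent.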
