My ISO-8859-1 / UTF-8 woes (for the last time, then I shut up about it)

richardwbb · February 27, 2013, 09:00:10 AM

Hello,

I've been working on a 1.1.16 to 1.1.18 upgrade in a test environment, because of ABRP I noticed that Register.php and Reminder.php in Sources dir sticked being 1.1.11 according to SMF version check in Admin.

Investigating further, I noticed in my test environment updating to 1.1.17 worked, to 1.1.18 from 1.1.16 or 1.1.16 > 1.1.17 > 1.1.18, al forum went to blank screen, no error output to be found.

I am not sure why so, but I decided to upgrade to 1.1.18 bij installing it from scratch and using notepad.exe for a step by step personal howto.

I noticed the installer in 1.1.18 is a little different.

Anyway, I post as less as possible in SMF support, karlbenson, Lex and the person with the cat with baseball cap continously show up in my searches

So I only have two mods in my 1.x forum, I kept it low, since I feel the need to go to 2.x too. That is not so hard but the template is. So am telling myself every possible excuse to do something else first

However, I again ran in to euro sign and other special character issue, only this time I also found how to fix this myself, understanding what to do, but I do not understand what is going on.

Assuming SMF is right and I am not, I still don't get it, that when I install a UTF-8 forum (option set in installer.php by me) why this setting in Settings.php : $db_character_set = 'utf8'; is there.

So I export the forum DB, import it to another DB, the export is UTF-8.

Then the euro sign gets malformed, just as the letter è, which means all special characters got malformed. The template is saying meta = utf-8.

But, it is not the database itself that is malformed, as far as I can understand, in the way I export, this is not even possible.

So I have to adjust Settings.php to: #$db_character_set = 'utf8';, so that setting isn't actually read by SMF any longer.

Now all characters show fine in the browser.

I hope everyone is able to follow what I have done. I know it works for me, but I wonder, since I am a perfectionist, am I doing something wrong here?

Can anyone explain to me why in Settings.php this utf8 option is default on, while it mangles my forum output?

Are you sure I am the only one running in to this? I Googled for other SMF fora and many of them seem to have mangled characters so it seems those admins aren't aware of this. I wasn't either, our users had to complain to make me solve this.

I hope my €0.02 is worthwhile and hoping to learn here, what is proper.

kat · February 27, 2013, 10:52:34 AM

Just curious (Said the k@ in the c@p)...

If you go to phpmyadmin and select your database, what does it say, under "Database server"?

Particularly, the "Server charset:" thing.

richardwbb · February 28, 2013, 09:47:43 PM

Ok, I post you the details, first I want to say I just figured out the 1.1.11 issue, I have written this in Dutch in the Dutch part of this forum, but I am happy to rewrite of necessary. I feel it does confuse my first post a bit. But it seemed to be a mod issue (I am not sure here) but I did notice smf_settings saying 1.1.11 database yesterday. Is fixed now. It does not change the character set issue thingy

Server: Localhost via UNIX socket
Software: MySQL
Software version: 5.5.30-log - MySQL Community Server (GPL)
Protocol version: 10
User: xxxxxxxx@localhost
Server charset: UTF-8 Unicode (utf8)

server connection collation= utf8_general_ci

my export says: Character set of the file: utf-8

Maybe the following also will help, I will post the characters the browsers shows with '$db_character_set = 'utf8';' in Settings.php

€ = â,¬
ë = Ã«

(I wrote é in first post but I meant ë)

MrPhil · March 01, 2013, 09:27:10 AM

If your database is 100% UTF-8, do characters like the Euro (€) display as multiple Latin-1 characters even in phpMyAdmin (browsing the tables)? [check that phpMyAdmin itself is displaying in UTF-8] If they do, then there was a screwup somewhere in converting or importing the database. The multiple bytes of a UTF-8 character got read in as multiple Latin-1 accented characters, each of which in turn were converted to multiple byte UTF-8 characters! If it's just a few characters here and there, manually editing the text might be the easiest thing. If it's massive corruption, it might take a custom utility to "condense" down a .sql backup from sequences of UTF-8 multibyte characters back to Latin-1 single byte characters, which would then be proper UTF-8. You might try a scratch Latin-1 database to import the UTF-8 backup as UTF-8 source -- maybe it will let you do that if there are no characters invalid for Latin-1. If you're lucky, the first level of UTF-8 characters have been condensed back down into single Latin-1 bytes. Then you back up that scratch database (as Latin-1) and import (restore) it into a UTF-8 database as a UTF-8 backup. Play with this on a couple of scratch databases and see if you can end up with decent UTF-8 in the end.

richardwbb · March 01, 2013, 10:08:05 AM

Ok, I took a step by step approach with you advice.

I looked up an Euro sign of a message I know wasn't mangled. Within phpmyadmin it is displayed as: â,¬

But there are indeed messages that stayed mangled, I am sure of wrong import, taking the DB for say two weeks in to production and noticing the issue way too late to do anything about it. I mean, I think it is not possible to fix that or/ and I don't know how. This is over a year ago now. I did import back and forth back then and have no good description of what I exactly did back then.

So I looked up a old message, this time it is the ' character:
PersonenautoÃƒÆ'Ã,Â¢ÃƒÂ¢Ã¢â,¬Å¡Ã,Â¬ÃƒÂ¢Ã¢â,¬Å¾Ã,Â¢s

With this long row of characters, I beleive I imported a little too happy so to say. Really because not understanding I mangled the database even more.

I am willing to try to attempt a manual repair as you describe, but before I start, because I assume that will take quite some time, I ask you first if you can tell if this would help me at all?

You wrote: If you're lucky, the first level of UTF-8 characters have been condensed back down into single Latin-1 bytes, my knowledge/ understanding is not high enough to be able to tell if a repair attempt would help. I can be a DB juggler though, let me know!

MrPhil · March 01, 2013, 12:04:27 PM

OK, so it sounds like you did manage to import a UTF-8 backup as "Latin-1 source"* into a UTF-8 database. I would create a scratch (sandbox) database to experiment with.

* IIRC, Latin-1 is the default encoding assumed on IMPORT. You may have simply overlooked that.

Export current database to .sql backup file.
Create a sandbox database, (default) Latin-1 encoding
Import the backup file, declaring it to be UTF-8. I can't promise that phpMyAdmin won't reject it, either because it won't try to convert UTF-8 to Latin-1 in general, or because it finds some character that's not legitimate UTF-8 (such as a real accented Latin-1 character). If it succeeds, browsing the database in phpMyAdmin should still show (shorter) sequences of accented characters.
Export (back up) this database to a new .sql file
Create a new sandbox database, this time UTF-8
Import the second .sql backup, declaring it to be UTF-8
If browsing in phpMyAdmin shows clean data, you succeeded.
Back up your current forum database, just in case
Create a new forum database, this time entirely UTF-8 encoding
Import the second .sql backup, declaring it to be UTF-8, and hopefully you'll be in business

It's very possible that phpMyAdmin will stop you when you try to import UTF-8 into a Latin-1 database. If that happens, are the problem areas small enough to be worth manually editing? You would need to find a translator tool that you could cut and paste the corrupt text into (as UTF-8) and translate the bytes to UTF-8 bytes, and then cut and paste that as raw bytes (Latin-1 encoded) and get UTF-8 output. If that produces the correct text, you would cut and paste it back in to the forum. I don't know about such tools, but if you search around, I wouldn't be surprised if someone has written one.

The last resort would be to write a utility to read an .sql backup of your forum, read it in (binary) as Latin-1 characters, find sequences of bytes that make up valid UTF-8 characters that are found in the Latin-1 character set, and hold that single Latin-1 byte in a buffer. Repeat for succeeding bytes until you end up with a valid multibyte UTF-8 character. Anything that's not a valid Latin-1 character you leave alone as something presumably entered as UTF-8 in the first place. Anything that's not a valid UTF-8 character is presumably a native Latin-1 character that needs to be translated to its UTF-8 multibyte equivalent. The idea is to end up with a purely UTF-8 file that can be imported to replace your mixed file. It's a lot of work, but could be done (I would have to charge a consulting fee to write such a thing).

richardwbb · March 09, 2013, 12:35:46 PM

Whoa!

I had to read your post several times to understand it. I think I can manange the db export/ import test/ repair/ attempt, but since I knew that I would need some time and dedication to do this properly and in one session, I had to wait to this weekend. That is why I did not reply yet.

I would not have replied today either if it where not so long ago already and I can tell it took some of your precious time.

Now my ISP is acting up again, it's MySQL DB export became mangled, I am receiving a 200 bytes file all the time while it should be 17MB, I have contacted their support.

So this message is only telling that I will get back to this. I am sure I am not able to write my own script, I can follow your thoughts but I am a PHP n00B

I am terribly sorry I am not able to verify your steps to see what this will do or where it will say error!

I am still wondering why the dbcharacter setting in Settings.php makes my forum show mangled instead of vice versa....

When my ISP has fixed things, probably another Direct Admin update, this 200bytes zip error I have had before with them, I will let you know my findings.

thanks!!!!

MrPhil · March 09, 2013, 05:35:11 PM

I appreciate your responding, even if you do not have news yet. So many people take advice and then you never hear from them again, success, failure, or just a simple "Thank you".

If you did what I think you did (import a backup in the wrong character set), you've got your work cut out for you. First thing is to get your database backup and restore methods nailed down, which is something you and your host will have to come to grips with.

richardwbb · March 17, 2013, 05:48:46 PM

Quote from: MrPhil on March 01, 2013, 12:04:27 PM
OK, so it sounds like you did manage to import a UTF-8 backup as "Latin-1 source"* into a UTF-8 database. I would create a scratch (sandbox) database to experiment with.

* IIRC, Latin-1 is the default encoding assumed on IMPORT. You may have simply overlooked that.

Export current database to .sql backup file.
Create a sandbox database, (default) Latin-1 encoding
Import the backup file, declaring it to be UTF-8. I can't promise that phpMyAdmin won't reject it, either because it won't try to convert UTF-8 to Latin-1 in general, or because it finds some character that's not legitimate UTF-8 (such as a real accented Latin-1 character). If it succeeds, browsing the database in phpMyAdmin should still show (shorter) sequences of accented characters.
Export (back up) this database to a new .sql file
Create a new sandbox database, this time UTF-8
Import the second .sql backup, declaring it to be UTF-8
If browsing in phpMyAdmin shows clean data, you succeeded.
Back up your current forum database, just in case
Create a new forum database, this time entirely UTF-8 encoding
Import the second .sql backup, declaring it to be UTF-8, and hopefully you'll be in business
It's very possible that phpMyAdmin will stop you when you try to import UTF-8 into a Latin-1 database. If that happens, are the problem areas small enough to be worth manually editing? You would need to find a translator tool that you could cut and paste the corrupt text into (as UTF-8) and translate the bytes to UTF-8 bytes, and then cut and paste that as raw bytes (Latin-1 encoded) and get UTF-8 output. If that produces the correct text, you would cut and paste it back in to the forum. I don't know about such tools, but if you search around, I wouldn't be surprised if someone has written one.

The last resort would be to write a utility to read an .sql backup of your forum, read it in (binary) as Latin-1 characters, find sequences of bytes that make up valid UTF-8 characters that are found in the Latin-1 character set, and hold that single Latin-1 byte in a buffer. Repeat for succeeding bytes until you end up with a valid multibyte UTF-8 character. Anything that's not a valid Latin-1 character you leave alone as something presumably entered as UTF-8 in the first place. Anything that's not a valid UTF-8 character is presumably a native Latin-1 character that needs to be translated to its UTF-8 multibyte equivalent. The idea is to end up with a purely UTF-8 file that can be imported to replace your mixed file. It's a lot of work, but could be done (I would have to charge a consulting fee to write such a thing).

Ok, step 1, done, using Direct Admin export, it is not a phpmyadmin export any longer but a MySQL (-- MySQL dump 10.13 Distrib 5.5.30, for Linux (i686)) one. Checked database, no errors, no differences because of being forced to use this different export, I am just checking everything since this special character boo boo of mine, it default drops tables if exist, which the phpmyadmin export does not. Also checked for special characters, showing properly (the new ones that is

)

Step 2 done, I had to use Direct Admin to create the database, no privileges or cPanel in phpmyadmin, then looked under operations in phpmyadmin, it says collation = latin1_swedish_ci.

Step 3, bummer, there is this flaw of my ISP again. Now I use bigdump to import but that could have been a culprit, nasty setting below other settings, the important one, on line 34, my laptop shows up to line 33, $db_connection_charset = 'utf8'; I assume you know bigdump. However, this incorrect usage of bigdump, default bigdump download so to say, says ISO-8859-something, probably was where I messed up my database. I also assume this way of importing is no different then what you are saying. It does show what you predict, those mangled characters still being there, but that is also what I am used to with bigdump. The ISP error:

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 69704559 bytes) in /var/www/html/phpMyAdmin-3.5.7-all-languages/libraries/zip_extension.lib.php on line 65

Readers might want to skip this paragraph, it is not about my problem, it is about daily practice and a admin beware! info:
Is basically the contents of the zip file that phpmyadmin exports, the gzip export compresses partly, then adds uncompressed (I read MySQL language, halfway the .gz file and after two weeks of e-mailing, ISP at the moment claims the .zip option does not work but why repair that for me, since the .gz compresses perfectly. (with a wrong bytecount showing in WinRAR), told them two times now, it does not decompress but guess what, ISP acting half arsed, even claiming I am using WinRAR and therefore stupid or something. Sorry for the confusion, database size is being below 70MB, 134217728 bytes = 128MB...... what it says with tried to allocate, probably is my database size. Direct Admin however exports.... But this export has wrong file extension, not a problem in this case, but extra work for me, thank you not ISP dummy! Learning moment, as being site admin of SMF forum, if my user complains, I use dummy login with his rights on the forum, not my admin rights, saves me a lot of stupid question I ask to my precious user....

Now I try to teach my ISP this. Lazy bums, learn to open 200 bytes zip files to read it's contents, same goes for strange .gz exports and decompress them before saying compressing went well. So here my ISP annoyance, I did not expect finding a way to put it in this topic

But: site admins beware, if you never exported/ backupped a database, it is good practice to learn to copy your forum, use altered database in Settings.php, check the special characters, learn a disaster recovery, since no one can trust his ISP... I am a shared hosting user though, not a virtual server one. You also might have to use bigdump which you can find with Google. Easy to mess things up! Especially with a dead forum and complaining/ unhappy users at hand.... or no db backup.

Anyway, step 3 using bigdump.php with setting on line 34: $db_connection_charset = 'utf8'; (if you start using bigdump too, make this setting proper or you WILL end up like me and you don't want that

routinley checked special characters, new ones show good, old ones stayed the same. (string found in SMF search, exactly the same, what is posted 4 posts ago in this thread:

PersonenautoÃƒÆ'Ã,Â¢ÃƒÂ¢Ã¢â,¬Å¡Ã,Â¬ÃƒÂ¢Ã¢â,¬Å¾Ã,Â¢s

Step 4, exported with Direct Admin again and uploaded it back to the www server, behaviour by design with exporting db's on shared hosting with phpmyadmin or in my case Direct Admin and no secure shell access.

Step 5, completely deleted testdatabase, recreated it with Direct Admin, in phpmyadmin under operations selected utf8_swedish_ci, clicked go.

Step 6, again a bigdump import with the utf8 setting.

Step 7, No change, string is still: PersonenautoÃƒÆ'Ã,Â¢ÃƒÂ¢Ã¢â,¬Å¡Ã,Â¬ÃƒÂ¢Ã¢â,¬Å¾Ã,Â¢s

Step 8, 9 and 10 skipped, atleast I get to skip something

Okay, this did not work. Your paragraph after your ten steps plan, is beyond me. The paragraph after that, is probably beyond our foundations budget, it is a forum for a car club.

However, sorry for the lengthy write up, I tried to be as clear as possible and possibly I did not execute step 1 to 7 properly.

I already thank you very much MrPhil, you have given me more insight in understanding what is going on in my db, learned something about phpmyadmin, gave me the opportunity to behindwoop my ISP (again), I also have written my own disaster recovery, I can set up this forum from scratch with forum directory backup and db backup under ten minutes now and maybe, just maybe you can give me another thing to try to rid those mangled characters, completetly removing them without restoring the original character is still better then leaving this as it is, so I also can boggle my mind on this, how to approach this, I can do a lot with computers but programming is not one of them, I tried, I can program, I always completely forget what I learned before, so it's all at Hello World! level

Laters!

MrPhil · March 17, 2013, 07:46:52 PM

Sorry you couldn't get it to work, but it sounds like you gave it the ol' college try. If the export/import list I gave didn't work, I don't know what else will, short of writing the utility or manually editing the database (or .sql file) if you can figure out/guess what the text is supposed to be.

I'm pretty busy with a more-than-full-time job, but if I should happen to try to write such a utility as I described, I'll let you know. Don't wait up for it

. People do from time to time end up with strangely mixed encodings in their database, so if I can think of a good algorithm to untangle such things, I might try it.

richardwbb · March 17, 2013, 07:57:45 PM

Ahh yes, I know the feeling, I also have the luxury of being able to pick from everything that could be done today

I guess that is a good thing.

However, I attempted to search and replace with the .sql in a text editor, I soon noticed that was tricky, many chances of damages the .sql myself by typing error and the mangled characterd did show different versions too, I mean, for a specific character, multiple combo's of mangled chars found.

I can guess the proper character though, it seems only to be the Euro sign, and umlaut and ' on chars, Dutch seems not to use anything else or if someone did use those characters, it would be marginal. Now some searches on the forum shows those characters here and there and users kept informing me about it, which means that makes them unhappy I suppose.

But, is there a way to rid all those mangled characters and not putting back the original special character?

I understand there is no way back, if there would become script available that solves it. I can be sure it won't be me going Eureka tonight.

or ever on this

MrPhil · March 18, 2013, 09:18:09 AM

Maybe you can find an encoding conversion/translation utility somewhere that would let you paste in 2 or 3 characters (in Latin-1 encoding) and it returns the equivalent UTF-8 character, which you would then paste back into the original source (replacing the 2 or 3 characters). Rinse, repeat

. You would have to be careful that the cut and paste actually transfers the characters as Latin-1 and not UTF-8 (the clipboard may contain both versions).

If you didn't get any error messages from MySQL that it couldn't/wouldn't import UTF-8 into a Latin-1 database, I have to wonder if you made a mistake somewhere in the process? Did it just override your encoding selection and import as UTF-8?

QuotePersonenautoÃƒÆ'Ã,Â¢ÃƒÂ¢Ã¢â,¬Å¡Ã,Â¬ÃƒÂ¢Ã¢â,¬Å¾Ã,Â¢s

Actually, I'm a bit surprised that a European language (Dutch?) would have such a long run of accented characters, without a few plain ASCII characters here and there... Do you have any idea what kind of text this was originally, or if it was supposed to be a lot of special characters (not necessarily accented Latin alphabet/real words)?

Add:
I found a byte-to-UTF-8 translation utility: http://ltg.ed.ac.uk/~richard/utf-8.html. Select "Hex UTF-8 bytes" for "Interpret as", and enter hex bytes with spaces between them: C3 83.

I suspect you've got some CP-1252 (Smart Quotes) mixed in (not pure Latin-1). Some of the characters are not found in Latin-1. I may have misread some of them, but I do get legal UTF-8 characters. It's amazing how a single character can expand to a mess!

Code Select


Your UTF-8 string:
  PersonenautoÃƒÆ'Ã,Â¢ÃƒÂ¢Ã¢â,¬Å¡Ã,Â¬ÃƒÂ¢Ã¢â,¬Å¾Ã,Â¢s 
Assuming UTF-8 byte encoding of actual CP1252 string (UTF-8 backup read in as CP1252/"Latin-1"):
  Personenauto C383 C692  C382 C2A2 C383 C2A2 C3A2 E282AC C5A1     C382
                         C2AC C383 C2A2 C3A2 E282AC C5BE     C382 C2A2 s
(I have grouped the CP1252  character bytes into valid UTF-8 characters)
Back into UTF-8:
   Personenauto Ã      f-hook Â       ¢       Ã        ¢       â       Euro      s-caron Â
                          ¬      Ã        ¢       â       Euro      z-caron Â        ¢       s 
As CP1252 codes, grouped as UTF-8 characters:
   Personenauto C383 C2A2 C3A2 E2809A         C2AC C3A2 E2809E          C2A2 s
As UTF-8:
   Personenauto Ã        ¢       â       low-9-'quote ¬       â        low-9-"quote ¢       s
CP1252 Byte codes grouped as UTF-8:
   Personenauto C3A2           E282AC                           E284A2                            s
UTF-8 characters:
   Personenauto â    (Euro sign)  (tm sign)  s
CP1252 Byte codes grouped as UTF-8:
   Personenauto E28099 s
UTF-8 characters:
   Personenauto's      (Smart Quote ')

That suggests that you actually had several iterations of importing UTF-8 backups as Latin-1/CP-1252, and your original source was CP-1252 (Word editing).

richardwbb · March 19, 2013, 11:39:27 AM

MrPhil, you are a master!

I will have to look in to this, again want to do this when I have the time at hand, but I do remember seeing 1252 coding, I have gotten this forum of someone, who gotten it from someone.

last someone, I don't even know who it was, the other someone was kinda like here you go. *click, beep beep beep*

Then, the ISP was such one with a single support person, it was also expensive ISP and blocking my phpmyadmin export juggling and denying things blatantly.

They said they gave me a UTF-8 export (couldn't do it myself), of course I threw away that file.....

However, also a former site admin, not knowing what FTP is and could not backup database, that is where I said, but I can! had been clicking the convert to UTF-8 within SMF without a backup. I did not even try to click that button, maybe I should do that on a testforum.

The origininal word is:

Personenauto's

This is a €:

ÃƒÆ'Ã,Â¢ÃƒÂ¢Ã¢â€šÂ¬Ã...Â¡Ãƒâ€šÃ,Â¬49,-

This is a ë:

BelgiÃƒÆ'Ã†â€™Ãƒâ€šÃ,Â«

I will play with the tool you posted and I should make another manual attempt, to estimate how much work it is and see what I find, from last time I remembered it was huge but also, I did not know about database structure at all.

MrPhil · March 19, 2013, 01:00:59 PM

If it's just a few patterns that happen over and over, you might be able to find them and replace them fairly quickly (possibly with cut and paste into a phpMyAdmin SQL query to UPDATE the 'body' text everywhere -- look for instructions on updating domain names in posts). Make sure that you do the longest patterns first, especially if they are supersets (extensions) of shorter patterns. Amazing how a single character can grow into 30 or 40 characters, isn't it? I know -- we can refer to such things as Tribbles!

The post text was edited in Word (or maybe Outlook), which substituted "typographically pleasing" Smart Quotes for ordinary apostrophes, etc. The Euro sign is also in that set. When it was pasted into a UTF-8 encoded forum, these single byte Smart Quotes were replaced by the same UTF-8 characters (multibyte). Ditto for accented Latin alphabet characters (real Latin-1). Export (back up) gave 2 or 3 characters (bytes) each. When the backup was read in as "Latin-1" or "CP-1252" rather than "UTF-8", each of the 2 or 3 bytes was interpreted as Latin-1/CP-1252 single byte accented characters, each of which needed to be translated into multibyte UTF-8 characters. Repeat until you go crazy.

richardwbb · March 19, 2013, 05:31:41 PM

Quote from: MrPhil on March 18, 2013, 09:18:09 AM
Add:
I found a byte-to-UTF-8 translation utility: http://ltg.ed.ac.uk/~richard/utf-8.html. Select "Hex UTF-8 bytes" for "Interpret as", and enter hex bytes with spaces between them: C3 83.

I wrote that I wanted to find some spare time to have a good look at this, but I'm already stuck, I found the translation utility and did what you wrote. Then I see a A TILDE if I remember correctly.

I do not understand what this would tell me besides this looks reversed to me. Or in other words, I have the feel I need to know HEX codings of all mangled characters?

Then your code part, I understand globally what you have been doing there, but I am not able to repeat. It looks to me you manually worked your way trough this lengthy example mangled character 'Personenauto's'?

I find it remarkable you where able to work all the way back to the original character, that is proving it is possible to repair everything.

But I would not know where to start with this utility, I've been inputting several characters and things that I could make up so to say, but nothing seemed to give a result that seemed like output. (that is how I battle PHP and CSS too, the lengthy + huah + retry multiple very often way)

Quote from: MrPhil on March 19, 2013, 01:00:59 PM
If it's just a few patterns that happen over and over, you might be able to find them and replace them fairly quickly (possibly with cut and paste into a phpMyAdmin SQL query to UPDATE the 'body' text everywhere -- look for instructions on updating domain names in posts). Make sure that you do the longest patterns first, especially if they are supersets (extensions) of shorter patterns. Amazing how a single character can grow into 30 or 40 characters, isn't it? I know -- we can refer to such things as Tribbles!

Tribbles? The Star Trek ones? LOL! I seem to be a Klingon in that perspective. Or a fruitfly (http://en.wikipedia.org/wiki/Tribble). I remember that episode

However, I believe I can update manually now via SQL query's my approach was Notepad++ search and replace. But I am confused on how I can be sure I do the longest patterns first and what if 2 different patterns have equal length?

Quote from: MrPhil on March 19, 2013, 01:00:59 PM
The post text was edited in Word (or maybe Outlook), which substituted "typographically pleasing" Smart Quotes for ordinary apostrophes, etc. The Euro sign is also in that set. When it was pasted into a UTF-8 encoded forum, these single byte Smart Quotes were replaced by the same UTF-8 characters (multibyte). Ditto for accented Latin alphabet characters (real Latin-1). Export (back up) gave 2 or 3 characters (bytes) each. When the backup was read in as "Latin-1" or "CP-1252" rather than "UTF-8", each of the 2 or 3 bytes was interpreted as Latin-1/CP-1252 single byte accented characters, each of which needed to be translated into multibyte UTF-8 characters. Repeat until you go crazy.

Can you imagine how I felt when I had to tell my users that I f'd the thing up? Only this one I could not repair

But I can agree to what you wrote, all the signs are there (no pun intended)

I still have the assumption characters have different mangles in the db and that I would not know how to ensure myself I started with the lengthiest. I assume making an error in this will end up with a proper character, but the wrong one?

Like Personenauto's then becomes Personenauto€s?

If I do that once, I fear another problem, how can I ensure myself db will not error the forum to strangeness or not being accesible, because I am not able to check all messages and places I changed, manually? It are 113 thousand posts and one thing I learned, mending the db is great, but next day the users happily post and if I then find something I did wrong, it is there or I have to throw away 24 hours of posts so to say.

I will continue though, but this time really with spare time on my hands

MrPhil · March 19, 2013, 07:56:54 PM

Quote from: richardwbb on March 19, 2013, 05:31:41 PM
I wrote that I wanted to find some spare time to have a good look at this, but I'm already stuck, I found the translation utility and did what you wrote. Then I see a A TILDE if I remember correctly.

I do not understand what this would tell me besides this looks reversed to me. Or in other words, I have the feel I need to know HEX codings of all mangled characters?

Yeah, more or less. I looked up the (single byte) hex codes for the funny characters in a CP-1252 chart and entered them into the utility. Try 2 bytes (separated by a space). If it tells you it needs more bytes, add the next one (all were 2 or 3 bytes). You'll end up with the UTF-8 character those two or three funny CP-1252 characters represented. Note that , in your example was really a "low 9 single quote", not a comma.

Quote
Then your code part, I understand globally what you have been doing there, but I am not able to repeat. It looks to me you manually worked your way trough this lengthy example mangled character 'Personenauto's'?

I find it remarkable you where able to work all the way back to the original character, that is proving it is possible to repair everything.

I was amazed too when it finally resolved to a single apostrophe (high 9 closing single quote).

Quote
But I am confused on how I can be sure I do the longest patterns first and what if 2 different patterns have equal length?

Lots of hard work, I'm afraid. The idea is to find patterns like GHABCDIJ and make sure you do those longer ones before you do the shorter ones that it's a superset of, like ABCD. If you changed ABCD first, you would end up with GHaIJ, which would then be "ruined" for you (you'd have to start over). Or, you could remember that ABCD->a and edit a->ABCD back into these sequences.

Quote
Like Personenauto's then becomes Personenauto€s?

Yeah, that could happen if you're not careful. That's why you take backups (and know how to restore them!) before fooling with the database.

Quote
mending the db is great, but next day the users happily post and if I then find something I did wrong, it is there or I have to throw away 24 hours of posts so to say.

Well, once you find what you think is the longest mangled streak, you put the forum in maintenance mode ($maintenance = 1), make one global change in phpMyAdmin, see that it seemed to work OK (one character rescued), and open it back up to members for a while. That's a possibility. Or shut down the forum for a day and do them all in one shot (taking frequent backups). Whatever floats your boat.

You could search for characters which you're sure shouldn't be in the forum (cent sign, small-f-hook, EBCIDIC "not" sign, etc.) to find many/most of the bad sequences, but I'm not sure how you could round up all the bad sequences in one search. The longer a bad sequence is, the lower the odds that it will be a subset of another bad sequence. Just make a reasonable guess after a reasonable amount of searching for a few target sequences that are the longest and none appear to be a superset of others. Bad case, you end up replacing the middle of some sequence with the wrong character. But of course you wrote down (or cut and pasted and saved) the sequence, so you can manually restore those botched sequences. Worst case, you do a database restore.

richardwbb · March 21, 2013, 07:44:32 PM

Ok, I have to reread your advice, I can tell there indeed is a better way to handle a live db, you actually are updating my way of code juggling which I say hoorah for!

I will get back with my findings. Especially if I am correct that special characters have different mangled character sequences or not.

MrPhil · March 27, 2013, 09:32:31 AM

Quote from: richardwbb on March 19, 2013, 11:39:27 AM
This is a €:

ÃƒÆ'Ã,Â¢ÃƒÂ¢Ã¢â€šÂ¬Ã...Â¡Ãƒâ€šÃ,Â¬49,-

Be careful. It appears to be €49,- but any original € is probably going to be expanded to ÃƒÆ'Ã,Â¢ÃƒÂ¢Ã¢â€šÂ¬Ã...Â¡Ãƒâ€šÃ,Â¬

Quote
This is a ë:

BelgiÃƒÆ'Ã†â€™Ãƒâ€šÃ,Â«

Of course, that's België, but ë should always expand to ÃƒÆ'Ã†â€™Ãƒâ€šÃ,Â«

richardwbb · April 07, 2013, 02:09:25 PM

I have been looking in my database with Notepad++

The former € occurs only 52 times which is not often enough.

This happens to be a Euro sign too? ÃƒÂ¢Ã¢â,¬Å¡Ã,Â¬

I have checked for the Ã character, I have no other way of trying to estimate how many mangles I have, it counted 14000 matches, so that would be a couple of thousands of mangled characters.

I feel I can search and replace safely, but can you tell me how to approach finding them all?

I also feel I can't do any better then search and replacing, more advanced approaches are beyond me.

MrPhil · April 07, 2013, 07:11:47 PM

Were all the backups/restores done together, or might have some posts been made on a live site between such operations? If the latter, one UTF-8 character like a Euro might have a different number of UTF-8 -> Latin-1 -> UTF-8 expansion cycles than another. Your example may have gone through one or two fewer cycles than the longer one. It's just going to complicate things for you, I'm afraid. One thing that you might do is simply eyeball the posts (wherever there are suspicious accented characters) and you might be able to figure out that this one is a Euro sign and that one is a high-9-single-quote (apostrophe) from their context. That could either give you a working set to match and replace, or just manually edit them back to what they should be.

News:

My ISO-8859-1 / UTF-8 woes (for the last time, then I shut up about it)

kat

MrPhil

MrPhil

MrPhil

MrPhil

MrPhil

MrPhil

MrPhil

MrPhil

MrPhil