UTF-8 questions

akc42 · February 24, 2013, 05:42:56 AM

I am converting a forum from 1.1.11 to 2.0.4. Before I do this for real, I have been undertaking a series of "dummy runs" to check I have every single step down correctly. This has shown up a number of issues, which I have dealt with.

One of the issues is with posts that have the £ sign in them. If I don't do anything as I transfer the database, I end up with strange characters where the £ sign should be.

I tried to do a database convert to UTF-8, and whilst it solved the problem for new posts, it appears to have truncated some posts at the point where the £ sign is.

One of the things the database convert did was add a setting, global_character_set, to the settings table with a value of UTF-8.

I did a dummy run, and added that setting into the table manually. Bingo - I think that solved the problems of the posts without having any truncation. However it is a manual step that I have to undertake whilst under great pressure doing the real migration and it would be great if I could avoid it.

I have just been looking at a separate forum I installed a while ago, and noticed that my Settings.php file has the variable $db_character_set = 'utf8' in it. Using grep to check around the source, it seems it is used primarily whilst creating new tables. My guess is that this particular Settings.php was created via the installer.

I can't find the documentation for the variables in Settings.php on this site so I don't know for sure what it does, or whether this will prevent me having to add the extra setting into the database.

Can anyone give me some pointers to what this variable does and whether I need to add this global setting (global_character_set), or whether there is a better way of achieving the same result?

MrPhil · February 24, 2013, 09:34:20 AM

Transferring/upgrading from 1.1.x to 2.0.x and conversion from Latin-1/ISO-8859-1 to UTF-8 are two completely separate issues. In theory, you should be able to do it in either order. On the assumption that the 2.0 code is maybe a bit more cleaned up than the 1.1 code, perhaps it would be better to convert to UTF-8 after upgrading to 2.0. In any case, you should not have to make any manual database or file changes.

It is possible that your 1.1 system is really using Windows-1252 text (cut and pasted from Word or Outlook) rather than true Latin-1. Windows-1252 is a superset of Latin-1, with the rarely used control codes in the x80 to x9F range replaced by "Smart Quotes" characters (more typographically pleasing punctuation, Euro sign, some accented letters). If your £ sign is really a legitimate Latin-1/Windows-1252 £ sign (they're the same code point), it should convert cleanly to UTF-8. Are you sure it's malfunctioning on £, and not some nearby "Smart Quote" character?
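To make the Latin-1 versus Windows-1252 distinction concrete, here is a minimal Python sketch (Python is used purely for illustration; SMF itself is PHP). It decodes the same bytes under both encodings: £ is byte 0xA3 in each, while bytes such as 0x93 and 0x94 are curly quotes only in Windows-1252.

```python
# Byte 0xA3 is the pound sign in both Latin-1 and Windows-1252.
pound_latin1 = b"\xa3".decode("latin-1")
pound_cp1252 = b"\xa3".decode("cp1252")

# Bytes 0x93 and 0x94 are curly double quotes in Windows-1252,
# but invisible C1 control characters in Latin-1.
quotes_cp1252 = b"\x93hi\x94".decode("cp1252")
quotes_latin1 = b"\x93hi\x94".decode("latin-1")

print(pound_latin1 == pound_cp1252)
print(repr(quotes_cp1252))
print(repr(quotes_latin1))
```

If a dump containing such bytes is declared as Latin-1, the curly quotes come through as control characters rather than punctuation, which is why a nearby "Smart Quote" can look like a problem with the £ sign.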

Keep in mind that MySQL defaults to Latin-1 encoding, so if you create a fresh database for 2.0 (rather than using upgrade.php), you might have to manually change it to UTF-8. You'll have to be more explicit about what kind of "strange characters where £ should be" you are seeing. Are you seeing two or even three funny accented Latin-1 characters in its place? That means that you have UTF-8 encoded characters being displayed on a Latin-1 page. Go to your browser's View > Character Encoding (or whatever it's called) and manually switch to UTF-8 and see if the page displays properly. Are you seeing a ? mark in a black diamond? That means your text is still encoded in Latin-1, and you're displaying the page in UTF-8, where the £ is improperly coded for UTF-8 and marked as undisplayable.
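The two symptoms described above can be reproduced in a few lines. This is an illustrative Python sketch, not SMF code: it shows what happens to a £ sign when the bytes and the declared page encoding disagree.

```python
pound = "\u00a3"  # the pound sign

utf8_bytes = pound.encode("utf-8")      # two bytes: 0xC2 0xA3
latin1_bytes = pound.encode("latin-1")  # one byte: 0xA3

# UTF-8 bytes shown on a Latin-1 page: an accented character appears
# in front of the pound sign.
utf8_shown_as_latin1 = utf8_bytes.decode("latin-1")

# Latin-1 bytes shown on a UTF-8 page: the lone 0xA3 byte is invalid,
# so it is replaced by U+FFFD, usually drawn as a "?" in a black diamond.
latin1_shown_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(utf8_shown_as_latin1)
print(latin1_shown_as_utf8)
```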

Bottom line: keep your 1.1.11 forum in Latin-1. Back up the database, then upgrade to 2.0.4 (still in Latin-1). Once everything is stable, back up again and convert to UTF-8. Note that if your server is already at PHP 5.4, you may have trouble at any point, as htmlspecialchars() and htmlentities() will no longer work in Latin-1 (until SMF is fixed). Also, some versions of PHP 5.4 have broken preg_replace() with the /e flag, used for converting HTML entities to UTF-8 characters. In that case, back off to PHP 5.3 or earlier to do the conversions. Once you're in UTF-8, PHP 5.4 will work.

akc42 · February 24, 2013, 10:02:11 AM

It's possible it's my text editor. In order to migrate, I am dumping a MySQL database with the standard latin1_swedish_ci collation, opening it in a text editor (KDE Kate) and editing the theme_dir entries to point to the right place on the new server.

On the new server, I have created a database with default utf8_general_ci collation. I then import the saved text file with the mysql command line program (doing source filename).

When I look at the database after it's imported, it looks like an ordinary pound sign in phpMyAdmin.

What then seems to happen is that during page load SMF checks $modSettings['global_character_set'] and, if it is set to 'UTF-8', seems to do the right thing. If it's not set I end up with A circumflex followed by the £ sign. (There could be other errors, but I have been concentrating on the £ sign as it seems to be indicative of whether my settings are correct or not.)

There is one particular post, that I know is easy to find as I have noted down its message id, and know its position on the forum. As an experiment, one time after importing the data and doing the upgrade, but without setting the global_character_set setting, I did the Forum Maintenance/Database convert to utf. It was during that run that this known post got truncated at the first £ sign. (all the text up to the £ sign was there, the rest of the post was missing - and I checked this on the database)
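One possible explanation for a post being cut off exactly at the £ sign, offered here as an assumption rather than something confirmed in this thread: if a Latin-1 byte such as 0xA3 reaches a step that expects UTF-8, everything from that byte onwards is invalid, and some software keeps only the valid prefix. The Python sketch below (the function name is hypothetical) mimics that behaviour.

```python
def valid_utf8_prefix(raw: bytes) -> str:
    """Return the longest leading part of raw that is valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return raw[:err.start].decode("utf-8")


# A post stored as Latin-1 but treated as UTF-8 loses everything from the
# pound sign onwards.
latin1_post = "Cost: \u00a35 each".encode("latin-1")
print(valid_utf8_prefix(latin1_post))
```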

I am trying to get the migration from one server to the other and the upgrade to 2.0.4 done in less than an hour, so I don't want to go through two backup cycles (my last attempt took 1hr 5 minutes).

$db_character_set doesn't appear to be the magic bullet I thought it was.

[I have now got myself in a twist on a copy of all of this on another spare machine where upgrade.php is not loading the language txt entries at all, and the screen is full of PHP errors - but that's another question which I want to solve to remove some of the risk before I do this upgrade. The big day is 8am Tuesday, so I am covering off as many angles as I can before then]

MrPhil · February 24, 2013, 11:01:31 AM

OK, if you went behind SMF's back in doing the conversion to UTF-8, you will have to manually set the UTF-8 flag in SMF. I believe there's something in the smf_settings table ("global_character_set" = "UTF-8") giving the encoding, but that row is present only if you let SMF do the conversion (you have to add it manually in phpMyAdmin otherwise).

I trust that when you import a Latin-1 encoded backup into a UTF-8 database, that you're telling it that the backup file is Latin-1, so it will be properly translated on the fly into UTF-8 characters. No promises about what it will do with "Smart Quotes" characters found only in Windows-1252 source cut-and-pasted text. If you have those in your database, select Windows-1252 as the backup file encoding if it's available.
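As an illustration of translating a backup on the way in, the sketch below re-encodes the bytes of a dump from a declared source encoding to UTF-8. It is a Python sketch with a hypothetical function name, not part of SMF or MySQL, and it assumes the dump really is in the encoding you name.

```python
def transcode_dump(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode dump bytes from source_encoding to UTF-8.

    Decoding is strict, so bytes that are undefined in the source encoding
    (for example 0x81 in cp1252) raise UnicodeDecodeError instead of being
    silently mangled.
    """
    text = data.decode(source_encoding)
    return text.encode("utf-8")


# The pound sign and curly quotes both survive when the source is
# declared as Windows-1252.
print(transcode_dump(b"Price: \xa35"))
print(transcode_dump(b"\x93ok\x94"))
```

Declaring Windows-1252 rather than Latin-1 is the safer choice when the text may contain "Smart Quotes", since the two encodings agree everywhere outside the x80 to x9F range.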

akc42 · February 24, 2013, 04:28:50 PM

Quote from: MrPhil on February 24, 2013, 09:34:20 AM
Note that if your server is already at PHP 5.4, you may have trouble at any point, as htmlspecialchars() and htmlentities() will no longer work in Latin-1 (until SMF is fixed). Also, some versions of PHP 5.4 have broken preg_replace() with the /e flag, used for converting HTML entities to UTF-8 characters.

Do you happen to know which versions? I've been wondering why these posts get eaten, and with a bit of trial and error and var_dump, I found that censorText is turning a perfectly healthy post into a null string at the preg_replace function.

I'm running Debian, (strange mixture of Sid and Testing as I slowly downgrade to Wheezy) and the PHP there is 5.4.4

akc42 · February 24, 2013, 05:06:45 PM

This might be because of the length of the post. There is a comment left on the PHP manual page for preg_replace saying that the limit is around the 4096 mark. The post in question is 4900+ characters long.

MrPhil · February 24, 2013, 06:19:35 PM

I see the claim of a 4096 limit from over a year ago, but has that been officially confirmed? Does anyone have access to an official bugs list (and fixed in what version)? I'm kind of surprised that limit hasn't been reported before in SMF. Could it be a specific compilation where the implementer arbitrarily chose some (smaller than usual) buffer size?

The preg_replace() with /e problem I saw someone mention a few days ago (5.4.7, IIRC). /e wasn't supposed to be deprecated until 5.5.0 but it looks like someone jumped the gun. I don't know anything beyond that. The default encoding change for htmlspecialchars() and htmlentities() is supposed to be with 5.4.0.

akc42 · February 24, 2013, 06:30:52 PM

I just found out that it is nothing to do with the length.

I am looking at another post which is only 272 characters long, and it appears to be related to the characters in the string. The database field seems to just have standard double quote " characters, but the echo of the string at censorText shows a strange symbol: a black diamond with a question mark in it.
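A black diamond with a question mark is how browsers usually draw U+FFFD, the replacement for bytes that are not valid UTF-8. PHP's preg_replace returns NULL on error, and a subject that is not valid UTF-8 is such an error when the pattern uses the /u modifier; whether that is what happens inside censorText here is an assumption. A quick way to test a suspect string for validity, sketched in Python with a hypothetical helper name:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Curly quotes as Windows-1252 bytes are not valid UTF-8...
print(is_valid_utf8(b"He said \x93hello\x94"))
# ...but the same text properly encoded as UTF-8 is.
print(is_valid_utf8("He said \u201chello\u201d".encode("utf-8")))
```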

akc42 · February 24, 2013, 06:36:48 PM

Interesting

I just added $db_character_set = 'utf8' into my Settings.php and the posts display perfectly.

Arantor · February 24, 2013, 06:48:08 PM

Quote from: MrPhil on February 24, 2013, 06:19:35 PM
I see the claim of a 4096 limit from over a year ago, but has that been officially confirmed?

Nah, there's no size limit in preg_* functions. At least no actual *size* limit. There is a limit, however, based on backtracking which is a side effect of how the PCRE engine functions and is intended to curtail possible so-called ReDoS vulnerabilities where you create content in such a way that a validating regexp gets into a bind.

(When a regexp works, it starts by going through the string trying to match, and it'll get part way through the string and backtrack to find alternative/more/better matches. The exact process is explained beautifully in the O'Reilly tome Mastering Regular Expressions, and it even goes into explaining how PHP's specific variation of the regular expression engine is implemented. Yup, regular expressions are so scary there is an entire tome, of hundreds of pages, dedicated to the subject)
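To illustrate the backtracking point, here is a toy Python matcher for the single pattern ^(a+)+b with a step counter that gives up past a limit. It is a deliberately naive sketch, not how PCRE is implemented; the function name, exception and limit value are hypothetical, chosen to echo the idea behind PHP's pcre.backtrack_limit setting.

```python
class BacktrackLimitExceeded(Exception):
    """Raised when the toy matcher has tried too many alternatives."""


def matches_nested_plus(text: str, limit: int = 100_000) -> bool:
    """Match ^(a+)+b against text using naive backtracking."""
    steps = 0

    def group(pos: int) -> bool:
        nonlocal steps
        # Count how many consecutive 'a' characters start at pos.
        run = 0
        while pos + run < len(text) and text[pos + run] == "a":
            run += 1
        # Try the longest run first, then backtrack to shorter ones.
        for taken in range(run, 0, -1):
            steps += 1
            if steps > limit:
                raise BacktrackLimitExceeded(f"gave up after {limit} steps")
            end = pos + taken
            if end < len(text) and text[end] == "b":
                return True
            if group(end):  # try another (a+) group after this one
                return True
        return False

    return group(0)


print(matches_nested_plus("aaab"))
try:
    matches_nested_plus("a" * 30 + "c")
except BacktrackLimitExceeded as err:
    print(err)
```

The failing input is only 31 characters long, which is the point: the cost comes from the number of ways the pattern can be retried, not from the size of the string.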
