Migrating to Full Unicode Support With MySQL and utf8mb4

Started by hushpiper, March 03, 2015, 05:51:46 PM

Previous topic - Next topic

hushpiper

Disclaimer: this is NOT a solution that SMF can support, so do this at your own risk!

I was just migrating a forum running 2.0.9 from ISO-8859-1 encoding to full unicode, and since I didn't see all of these steps listed out anywhere, I figured other people might find it useful if I compiled the steps I took to make it work:

First, a note: plain utf8 isn't going to cut it if you want full unicode support, as one of our sharper forum members figured out. This is because mysql's unicode support is kinda sucky: the utf8 collation in mysql only supports around 6% of the entirety of unicode. That's the most widely-used portion of unicode, as it happens, so if you just use utf8 for foreign characters you're probably fine. But if you want the whole unicode package, what you actually need is what's called utf8mb4, which was added in mysql 5.5.3. More info on that here: hxxp:mathiasbynens.be/notes/mysql-utf8mb4 [nonactive]. If your host doesn't have mysql 5.5.3 or higher installed, you can't have full unicode support--but you can still benefit from this post. Just substitute "utf8" for everywhere "utf8mb4" is mentioned.

1. Export your database. You can do this through phpmyadmin, which your host probably has installed, or through ssh with something like "mysqldump -p -u dbuser dbname > database.sql"

2. Look at the db dump--it's just plaintext--because sometimes they're different. Wherever it says "charset=something", you need to change that to equal "utf8mb4". Wherever it says "collate=something", change it to equal "utf8mb4_unicode_ci". If your database dump doesn't have "collate" in it at all--mine didn't--just add it after the "charset". I suggest you utilize the search and replace feature of your text editor for this step. For my part, I just opened up the file in vim and ran:

:%s/CHARSET=latin1/CHARSET=utf8mb4\ COLLATE=utf8mb4_unicode_ci/g

3. Find the _messages table (just search the text for _messages). You'll see a line saying something like this:

`poster_ip` varchar(255) NOT NULL DEFAULT '',

You need to change "255" to "191", or else it'll give an error on import. This has to do with different length-limits in utf8mb4, so if you're just doing plain utf8 you can ignore this step.

4. Import your database!

5. Log in to your admin panel. Install the utf-8 english language pack and enable it. You may have to fiddle a bit with your theme here--I had to copy the images/english folder in my theme to images/english-utf8 to get the "new post" images to show up, for example.

6. In your Settings.php, find:

$db_character_set = 'utf8';

And change it to:

$db_character_set = 'utf8mb4';

This is important, as certain unicode characters still won't be properly recorded in the database if you don't change this setting. This is also, of course, unnecessary for people doing plain utf8.


And you should be up and running at that point! Let me know if you have questions, criticisms or suggestions: I'm just a tinkerer and may have gotten things wrong, but this seemed to work well for our forum.

Kindred

Or use the "convert to UTF-8" function inside the smf admin tools
Слaва
Украинi

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."

Arantor

Firstly,  convert to UTF-8 in the admin panel does not provide full Unicode support, secondly doing what the OP advises is strongly unsupported.

The slightly sad part is that there has been a patch for 2.0 and 2.1 available since August last year that provides complete UTF-8 mb4 support without changing the schema or database but has yet to be integrated.

hushpiper

Yeah, I read the discussion about that patch, but since it isn't in the current versions I had to improvise. Of course, I wouldn't expect support with bugs that come up with such an in-depth modification! I'll note that the "convert to UTF-8" function was not present in the admin tools for me for some reason--why, I don't know, the forum had no mods at that time and had not been changed from stock in any way other than a theme--so I wasn't even able to get basic utf-8 support without changing the schema manually. This wasn't too difficult and doesn't appear to have resulted in any bugs (my forum users are pretty good bug testers, bless them), so switching to mb4 was not a huge issue.

I'm curious as to how the patch operates without changing the database schema. Is it available anywhere for me to view? If so, I'd love to have a look at it and see if I can get it working on my setup (the current database setup is not yet live) in the interest of forward-compatibility.

P.S. Was gonna edit my first post to put a disclaimer that anybody doing this shouldn't come to you guys for bug support, but I don't seem to have edit permissions. If you could edit that in, I would appreciate it.

margarett

Quote from: hushpiper on March 04, 2015, 11:58:32 AM
P.S. Was gonna edit my first post to put a disclaimer that anybody doing this shouldn't come to you guys for bug support, but I don't seem to have edit permissions. If you could edit that in, I would appreciate it.
Done, thanks ;)
Se forem conduzir, não bebam. Se forem beber... CHAMEM-ME!!!! :D

QuoteOver 90% of all computer problems can be traced back to the interface between the keyboard and the chair

Arantor

In depth modification? It was about 20 lines, in one specific file because the person that wrote it knew what they were doing.

The patch works by finding any 4 byte UTF-8 character (the ones that aren't supported by regular UTF-8) and converts those to entities. After that, it's stock UTF-8 as far as both SMF and MySQL are concerned and both unencoding and reencoding get handled completely since it modified $smcFunc directly to achieve this so it would also just work natively everywhere.

shawnb61

A couple of questions to clear up some confusion I have reading the thread above.  Like many others, I'm getting the 'illegal mix of collations' errors.  Our forum is several years old & has been patched & mod'd many times - we now have a mix of charsets & of collations.  I'm new to this & working on a plan to clean this up.   I want to pick one charset & one collation & move forward. 

Question 1:  When you say this process is strongly unsupported, is the issue that UTF8MB4 is not supported by SMF?   Or are you scared to death thinking about supporting hundreds of noobs such as myself stepping thru process outlined above?

Question 2:  Is there a documented list of supported charsets & collations for SMF?  Or are all fair game?   

Question 3:  Is UTF8MB4 is unnecessary in SMF as long as you use UTF8 & occasionally execute the "convert to UTF8" admin function?   (I'm reading between the lines here, let me know if I'm reading this correctly...  I'm a simple dude, I need things stated outright...)

Question 4:  Any sense when the held-up patch will be available?   Am I correct in figuring that this patch eliminates the need to run "convert to UTF8" by hand once in a while?   

Thanks folks, excellent & informative discussions and support on this forum. 
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

margarett

Howdy

QuoteQuestion 1:  When you say this process is strongly unsupported, is the issue that UTF8MB4 is not supported by SMF?   Or are you scared to death thinking about supporting hundreds of noobs such as myself stepping thru process outlined above?
utf8mb4 is not supported by SMF because our minimum requirements are waaaay before it was even around.

QuoteQuestion 2:  Is there a documented list of supported charsets & collations for SMF?  Or are all fair game?
Officially we support UTF8 and ISO 8859-1 (MySQL refers to it by "latin1")

QuoteQuestion 3:  Is UTF8MB4 is unnecessary in SMF as long as you use UTF8 & occasionally execute the "convert to UTF8" admin function?   (I'm reading between the lines here, let me know if I'm reading this correctly...  I'm a simple dude, I need things stated outright...)
utf8mb4 is not "unnecessary"... It is a problem (or a limitation) in MySQL itself. I would suggest that, if one has a strong heart :P actually go ahead with this tip.

QuoteQuestion 4:  Any sense when the held-up patch will be available?   Am I correct in figuring that this patch eliminates the need to run "convert to UTF8" by hand once in a while?
2.0.10 already adds support for these "special" characters.
Se forem conduzir, não bebam. Se forem beber... CHAMEM-ME!!!! :D

QuoteOver 90% of all computer problems can be traced back to the interface between the keyboard and the chair

shawnb61

Thanks for the response, very helpful!

Time to go test a few things...   
Address the process rather than the outcome.  Then, the outcome becomes more likely.   - Fripp

Advertisement: