Author Topic: charset=ISO-8859-1 and UTF-8  (Read 33891 times)

Offline pittu

  • Jr. Member
  • **
  • Posts: 279
charset=ISO-8859-1 and UTF-8
« on: January 02, 2010, 11:45:56 AM »
Right now my forum is working great with ISO-8859.

But in the future I may need to use Unicode characters outside ISO-8859.

How easy is it to convert from ISO-8859 to UTF-8 now?

I was searching the forums and it seems there is a whole heap of problems. But the earlier the better, I guess.

==================

Which of the two supports multi-language web sites better?

MrPhil

  • Guest
Re: charset=ISO-8859-1 and UTF-8
« Reply #1 on: January 02, 2010, 06:04:57 PM »
"ISO-8859" by itself means nothing. "ISO-8859-1" is the Latin-1 (western European languages) encoding. There are about fifteen different encodings in the ISO-8859-x family, so you have to be careful about which one you're talking about. What they have in common is that they are single byte, the first 128 characters are the standard ASCII set, and each has a different set of accented or non-Latin characters in the upper 128 slots. There are sets for eastern European languages, Baltic/Scandinavian languages, and other Latin-based alphabets, as well as Greek, Hebrew, Arabic, Cyrillic, and Thai.

Unicode is a character set that assigns a number (a code point, from U+0000 up to U+10FFFF) to every character in practically every writing system on Earth. It is not itself a fixed 16-bit encoding: UTF-16 stores the most common characters in two bytes and the rest in four. The first 128 code points (0000 - 007F) are the same as ASCII, and the next 128 (0080 - 00FF) are the same as Latin-1. Various other alphabet standards got dropped into Unicode, sometimes unaltered, sometimes rearranged a bit. UTF-8 is a variable-length encoding of those code points whereby the most common characters (the ASCII set 0000 - 007F) come through as single bytes 00 - 7F, a big chunk of (mostly) Western alphabets (0080 - 07FF) get two bytes, and the rest get 3 or 4 bytes.
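To make those byte counts concrete, here is a minimal Python sketch. The sample characters are just picks for illustration (one per UTF-8 length class), and the helper name utf8_len is made up for this example.

```python
# Sample characters chosen for illustration, written as escapes:
# A, e-acute, euro sign, a CJK character, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\u65e5", "\U0001F600"]

def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

for ch in samples:
    try:
        latin1 = str(len(ch.encode("latin-1")))
    except UnicodeEncodeError:
        latin1 = "n/a"  # the character does not exist in Latin-1
    print(f"U+{ord(ch):04X}  utf-8 bytes: {utf8_len(ch)}  latin-1 bytes: {latin1}")
```

Only the first two samples exist in Latin-1 at all, which is the practical argument for UTF-8 on a multi-language forum.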

If the alphabet(s) you want to use for all your languages together cover more than 96 accented-Latin/non-Latin-based/non-ASCII characters, you have no choice but to go to UTF-8. If they fit within a single ISO-8859-x set, UTF-8 is optional: it will consume a bit more space than the ISO-8859-x encoding, but you'll have future flexibility if a new member wants to type in some, say, Japanese text.

There are three places where you are concerned about character encoding.

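1) Database: As long as the data goes in and comes out over a connection that uses the same character set as the table, the database really doesn't care what encoding the character data is in, except when it comes to sorting (collating) and comparing text. If you feed UTF-8 data into a database table defined as Latin-1, it may sort a bit differently than you expected, and multi-byte characters will count as more than one character. But the bytes themselves are stored unchanged.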

2) Language support text files: Most non-English files (text for headings, titles, prompts, button labels, etc.) are in UTF-8. English is in ASCII, and so is compatible with Latin-1 and UTF-8 pages.

3) Browser: The browser is told what encoding text is being sent in (and what encoding to return input data in). The default is Latin-1 (ISO-8859-1), but the other usual choice is UTF-8.

Needless to say, items (2) and (3) really need to match up if you don't want gibberish on your page. It's not uncommon to have UTF-8 text (double byte accented characters) coming out of a database or language support file, and being displayed on a page declared to be Latin-1. This produces two odd-looking accented characters instead of the desired one. It's nice to have item (1) match up (database as UTF-8, rather than Latin-1), but it's not critical.
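A minimal Python sketch of that mismatch, assuming a single accented character stored as UTF-8. It mimics what a browser does when the page is declared as Latin-1; the variable names are only for illustration.

```python
# One accented character (e-acute), as it would be stored in a UTF-8 language file.
text = "\u00e9"
utf8_bytes = text.encode("utf-8")    # b'\xc3\xa9', two bytes

# A page declared as Latin-1 makes the browser read each byte as one character.
shown_as_latin1 = utf8_bytes.decode("latin-1")

# Declared correctly as UTF-8, the two bytes come back as the single character.
shown_as_utf8 = utf8_bytes.decode("utf-8")

print(len(shown_as_latin1), len(shown_as_utf8))
```

The Latin-1 reading yields A-tilde followed by the copyright sign, which is the familiar pair of odd-looking characters.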

It's not hard to get all three items consistent (all UTF-8), but you do have to take care that the data currently in your database is actually Latin-1 and not already UTF-8 (caused by manually setting the page display to be UTF-8, while leaving the database at Latin-1). When you use the tool to convert the database to UTF-8, it should translate accented characters (if any) in your tables to UTF-8, resulting in a small increase in size. You will now be using your <language>-UTF8 file for language support, and your page heading should include a Content-Type that tells the browser to render as UTF-8.
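To see why the starting encoding matters, here is a rough Python sketch of the byte-level effect of a Latin-1 to UTF-8 conversion. It is only an illustration of the principle, not what the SMF tool does internally; latin1_to_utf8 is a made-up helper.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes that are assumed to be Latin-1 as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")

original = "caf\u00e9".encode("latin-1")   # "cafe" with e-acute: 4 bytes in Latin-1
converted = latin1_to_utf8(original)        # 5 bytes: only the accented letter grew

# If the data was already UTF-8, converting it again mangles it further.
twice = latin1_to_utf8(converted)           # 7 bytes, now double-encoded

print(len(original), len(converted), len(twice))
```

The first conversion gives the small size increase mentioned above. The second one is the failure case: data that was already UTF-8 ends up double-encoded.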

Be sure to put the forum in maintenance mode ($maintenance = 1 in Settings.php) before starting a conversion to UTF-8, so you don't get any new incoming data in the wrong encoding into your database. And of course, back up your database first so that if something goes wrong, you can restore to your original state.

Offline pittu

  • Jr. Member
  • **
  • Posts: 279
Re: charset=ISO-8859-1 and UTF-8
« Reply #2 on: January 02, 2010, 06:39:31 PM »
Quote
It's not hard to get all three items consistent (all UTF-8), but you do have to take care that the data currently in your database is actually Latin-1 and not already UTF-8 (caused by manually setting the page display to be UTF-8, while leaving the database at Latin-1).

How do I confirm whether the data in the database is Latin-1 or UTF-8? I installed SMF without checking the checkbox for the UTF-8 option.

Quote
When you use the tool to convert the database to UTF-8, it should translate accented characters (if any) in your tables to UTF-8, resulting in a small increase in size.

Which tool do I have to use?

Quote
You will now be using your <language>-UTF8 file for language support, ...

Where is this file? Do I need to select any settings in the SMF admin panel?

Quote
... and your page heading should include a Content-Type that tells the browser to render as UTF-8.

It is simply a matter of changing the character set in index.template.php, right?


Is there a step-by-step guide to converting from ISO-8859-1 to UTF-8?

MrPhil

  • Guest
Re: charset=ISO-8859-1 and UTF-8
« Reply #3 on: January 02, 2010, 07:18:35 PM »
Quote
How do I confirm whether the data in the database is Latin-1 or UTF-8? I installed SMF without checking the checkbox for the UTF-8 option.
MySQL's default installation is Latin-1 (ISO-8859-1) with Swedish collation (MySQL AB is based in Sweden) -- just a few little oddities in how to sort the alphabet. Use your hosting service control panel and possibly the phpMyAdmin tool from it to examine the database and see what the encoding/collation is on the tables. Check just a few tables -- if they're all Latin-1, it's pretty safe to assume all the others will be, too.
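The table's declared encoding only tells you what the table claims to hold. If you want to peek at the actual bytes (say, a message copied out of a dump file), a rough Python heuristic like the following can help. The function name and the sample byte strings are made up for illustration, and a guess like this is never a proof.

```python
def classify(raw: bytes) -> str:
    """Rough guess at what a byte string holds; a heuristic, not a proof."""
    if all(b < 0x80 for b in raw):
        return "ascii"        # identical in Latin-1 and UTF-8, nothing to convert
    try:
        raw.decode("utf-8")
        return "utf-8"        # the non-ASCII bytes form valid UTF-8 sequences
    except UnicodeDecodeError:
        return "not utf-8"    # most likely a single-byte encoding such as Latin-1

for sample in (b"cafe", b"caf\xe9", b"caf\xc3\xa9"):
    print(sample, "->", classify(sample))
```

If posts with accented characters come back as "utf-8" while the tables say Latin-1, that is the already-UTF-8 situation to watch out for before converting.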

Quote
Which tool do I have to use?
In the Admin section of SMF there is an option to convert to UTF-8. Use that.

Quote
Where is this file? Do I need to select any settings in the SMF admin panel?
They appear to be in the Themes directory trees and maybe in the forum root (Settings.<language>.php). All you should have to do is install any language packs that you wish to support.

Quote
It is simply a matter of changing the character set in index.template.php, right?
If you use the Admin conversion tool, it should set everything up for you and you shouldn't have to do any manual changes.

Quote
Is there a step-by-step guide to converting from ISO-8859-1 to UTF-8?
Yes, there is somewhere (but offhand I don't know where). I remember seeing links to it recently, so if you do a search on UTF and convert (or something like that) I bet you'll find it pretty quickly.

Offline pittu

  • Jr. Member
  • **
  • Posts: 279
Re: charset=ISO-8859-1 and UTF-8
« Reply #4 on: January 02, 2010, 07:49:13 PM »
OK. I found out:
http://docs.simplemachines.org/index.php?topic=865

ARE THESE THE STEPS?

1. 'Forum Maintenance' -> 'Convert the database and data to UTF-8'
2. Language package (in my case it is english-utf-8.php)
3. Change language settings for users (WHERE IS THE PRESENT LANGUAGE FILE?)
4. Change the default language in your admin center - Admin -> Server Settings

Is that it?

My first question:

BEFORE I PROCEED WITH THE ABOVE STEPS, do I need to export the database, search for all occurrences of "CHARSET=latin1", replace them with "CHARSET=utf8", drop the existing database, re-create it, and then re-import the modified dump file to re-populate the database?

MrPhil

  • Guest
Re: charset=ISO-8859-1 and UTF-8
« Reply #5 on: January 03, 2010, 01:45:19 PM »
Quote
OK. I found out:
http://docs.simplemachines.org/index.php?topic=865
That seems to be the document others are pointing to. I'm a little concerned that it hasn't been updated since 1.1 RC3 -- does anyone know if it's still current?

Quote
ARE THESE THE STEPS?

1. 'Forum Maintenance' -> 'Convert the database and data to UTF-8'
2. Language package (in my case it is english-utf-8.php)
3. Change language settings for users (WHERE IS THE PRESENT LANGUAGE FILE?)
4. Change the default language in your admin center - Admin -> Server Settings

Is that it?
I would first put your forum in "maintenance mode", just to make sure no one is working on it while you're making changes. Don't forget to update the message to users of why you're offline. At this point I'd do the database backup, and just set it aside once you've looked inside it and made sure that all the table values were backed up. There are a couple of nasty bugs in Sources/DumpDatabase.php that won't lose you any data, but may make restoring this backup a pain, if you have certain characteristics to your data. You might want to consider updating your DumpDatabase.php before doing the backup (attached).

Your current encoding is Latin-1 (ISO-8859-1), and you will be converting to UTF-8. Your current language file(s) are "english" and possibly other languages? Look for *english* files all through Themes/ and likewise, any other languages you have installed (non-English may also be in /Settings.<language>*). You will need the corresponding english-utf8 files (which I think are identical to plain "english"), plus the -utf8 versions of any other languages, installed in place (in / and Themes/).

Changing the users' language settings is not done in any file. You go into phpMyAdmin and in the SQL tab run the given query to change everyone's <language> to <language>-utf8. Of course, first you will check to see if any (or all) are already -utf8 (which means the help page needs updating!). After running the SQL to update the entries, list them all again to make sure all were changed and that you don't have any unexpected language files that you need to add (e.g., you had "albanian", but you forgot to load "albanian-utf8"). Make sure that your default language is likewise updated. Make sure that Settings.php now has the entry for "UTF8" added to it, so that the pages will be displayed with UTF-8 encoding rather than Latin-1.

That should be all you need. Don't forget to restore the "out of service" message and take the forum out of maintenance mode ($maintenance = 0) once you're done and satisfied that it converted correctly.

Quote
My first question:

BEFORE I PROCEED WITH THE ABOVE STEPS, do I need to export the database, search for all occurrences of "CHARSET=latin1", replace them with "CHARSET=utf8", drop the existing database, re-create it, and then re-import the modified dump file to re-populate the database?
No! You should not have to do any such manual operation. You export the database (make a backup), but it won't be used unless you have problems and need to roll back your changes to your current Latin-1 encoding. The forum conversion step ALTERs each table, changing its encoding and (as I understand it) actually converting the data.

Note that I'm just reading the instructions -- I haven't gone through a conversion myself, so you may want to talk to someone who has, if you're still not sure of what you're doing. Once you understand the principles, and what file and database changes are going to be made, you can make the appropriate backups before starting, and be able to roll back the changes if things go terribly wrong. If things go only slightly wrong, you may be able to manually fix things without starting over.

Offline pittu

  • Jr. Member
  • **
  • Posts: 279
Re: charset=ISO-8859-1 and UTF-8
« Reply #6 on: January 03, 2010, 03:22:57 PM »
1. Put the forum in maintenance mode.

2. Back up the database using 'Forum maintenance > Backup database' after uploading your DumpDatabase.php
(Question here: do I need to check the 'Save the table structure.' checkbox during backup?)

3. 'Forum Maintenance' -> 'Convert the database and data to UTF-8' > click on it to convert.

4. Replace Settings.english.php that is in /Themes/Languages/ with Settings-english-utf-8.php from:
http://download.simplemachines.org/?languages;lang=english
(For now I am using only the English language)

5. Run the query for users to convert their language.
(Question here: the 'lngfile' column in table smf_members is empty for ALL users. So do I still need to run the query?)

6. Settings.php > add utf-8 to it.

7. Revert from maintenance mode.

MrPhil

  • Guest
Re: charset=ISO-8859-1 and UTF-8
« Reply #7 on: January 03, 2010, 07:43:35 PM »
Quote
2. Back up the database using 'Forum maintenance > Backup database' after uploading your DumpDatabase.php
(Question here: do I need to check the 'Save the table structure.' checkbox during backup?)
1) save your old copy of Sources/DumpDatabase.php just in case...
2) no harm in saving the table structure -- it's useful if you need to completely re-create the database anew. If you re-use an existing table structure, you just need to comment out the CREATE TABLE statements.
Remember, with luck you'll never need this backup (better to have it and not need it, than...).

Quote
4. Replace Settings.english.php that is in /Themes/Languages/ with Settings-english-utf-8.php from:
http://download.simplemachines.org/?languages;lang=english
(For now I am using only the English language)
I would not remove any of the current *english* files. They may be needed as a fallback in certain cases. Just add the new files in for all the languages you choose to support.

Quote
5. Run the query for users to convert their language.
(Question here: the 'lngfile' column in table smf_members is empty for ALL users. So do I still need to run the query?)
If it's empty, I think that means "english" (non-UTF8). So, presumably you'll want to change all the empty ones to "english-utf8".

Quote
6. Settings.php > add utf-8 to it.
I think one of the above steps will add a line to Settings.php -- I don't think you need to do that manually. Anyway, check if anything was added.

Quote
7. Revert from maintenance mode.
First, run a bit (as the admin ID) to see if everything looks like it converted correctly. It ought to look just the same as it did before. Check a post or two that uses accented characters to make sure the conversion went OK. Try entering a test post with accented characters, to make sure it's saved and displayed OK. If everything is fine, throw open the doors.
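If you would rather not eyeball every post, a rough Python check like this one can flag text that looks double-encoded. It assumes you have already pulled some post bodies out as ordinary strings; the sample posts and the function name are made up, and a hit means "worth a look", not "definitely broken".

```python
def looks_double_encoded(text: str) -> bool:
    """True if the text reads like UTF-8 bytes that were decoded as Latin-1."""
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either not representable as Latin-1 bytes, or those bytes are not UTF-8.
        return False
    return repaired != text

# Made-up sample posts: plain ASCII, a correct e-acute, and a mangled e-acute.
posts = ["plain text", "caf\u00e9", "caf\u00c3\u00a9"]
suspects = [p for p in posts if looks_double_encoded(p)]
print(len(suspects), "suspect post(s)")
```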

Offline pittu

  • Jr. Member
  • **
  • Posts: 279
Re: charset=ISO-8859-1 and UTF-8
« Reply #8 on: January 08, 2010, 09:56:55 AM »
Quote
Quote
5. Run the query for users to convert their language.
(Question here: the 'lngfile' column in table smf_members is empty for ALL users. So do I still need to run the query?)
If it's empty, I think that means "english" (non-UTF8). So, presumably you'll want to change all the empty ones to "english-utf8".

What exactly should the query look like?

I am just going to use english-utf-8.php for now.

Thank you for your time.

MrPhil

  • Guest
Re: charset=ISO-8859-1 and UTF-8
« Reply #9 on: January 08, 2010, 10:32:11 AM »
Quote
What exactly should the query look like?

For any member with a non-blank entry:
Code: [Select]
UPDATE smf_members
SET lngfile = CONCAT(lngfile, '-utf8')
WHERE lngfile != '';

Then, for any member with a blank entry:
Code: [Select]
UPDATE smf_members
SET lngfile = 'english-utf8'
WHERE lngfile = '';

Try that. Run them in that order.
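Why the order matters can be tried out safely on a throwaway copy. Below is a small Python sketch using an in-memory SQLite table as a stand-in for smf_members. The three sample rows are made up, and SQLite writes CONCAT(a, b) as a || b, so this only illustrates the logic and is not something to run against the live forum.

```python
import sqlite3

def convert(order):
    """Run the two updates in the given order on a throwaway table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE smf_members (id INTEGER PRIMARY KEY, lngfile TEXT NOT NULL)")
    # Made-up members: one blank entry, two with a language already set.
    db.executemany("INSERT INTO smf_members (lngfile) VALUES (?)",
                   [("",), ("german",), ("english",)])
    statements = {
        # SQLite spelling of: SET lngfile = CONCAT(lngfile, '-utf8')
        "non_blank": "UPDATE smf_members SET lngfile = lngfile || '-utf8' WHERE lngfile != ''",
        "blank": "UPDATE smf_members SET lngfile = 'english-utf8' WHERE lngfile = ''",
    }
    for name in order:
        db.execute(statements[name])
    rows = [r[0] for r in db.execute("SELECT lngfile FROM smf_members ORDER BY id")]
    db.close()
    return rows

print(convert(["non_blank", "blank"]))   # the order given above
print(convert(["blank", "non_blank"]))   # reversed: the blank entry gets the suffix twice
```

For the same reason, running the first query twice would append the suffix a second time, so list the lngfile values before and after.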

Offline pittu

  • Jr. Member
  • **
  • Posts: 279
Re: charset=ISO-8859-1 and UTF-8
« Reply #10 on: January 08, 2010, 11:20:52 AM »
OK I will be on this this weekend. thank you.

I will post here how it goes (or scream help!).

Offline pittu

  • Jr. Member
  • **
  • Posts: 279
Re: charset=ISO-8859-1 and UTF-8
« Reply #11 on: January 10, 2010, 11:49:53 PM »
Everything went smoothly.

Will update again after making some posts.

Thank you, Phil, for your time.

My question: are there any checks to confirm that everything went smoothly and converted completely?
« Last Edit: January 11, 2010, 12:24:37 AM by pittu »

Offline pittu

  • Jr. Member
  • **
  • Posts: 279
Re: charset=ISO-8859-1 and UTF-8
« Reply #12 on: January 11, 2010, 11:10:23 AM »
I can not type accented characters into the post message. Any setting I need to change ?

I pressed Ctrl and , together, released both, and then pressed c.

Nothing happens. (But holding Alt and typing 128 on the numeric keypad works.)

======================

One more question: do I need to uninstall and reinstall my mods? It seems they were installed into the english files. (I didn't uninstall the mods before converting, though.)
« Last Edit: January 11, 2010, 11:21:45 AM by pittu »

MrPhil

  • Guest
Re: charset=ISO-8859-1 and UTF-8
« Reply #13 on: January 11, 2010, 12:25:01 PM »
Quote
I can not type accented characters into the post message.
That's going to depend on how your PC and browser are set up. Note that the Alt-nnn code should produce accented characters depending on the "native" encoding specified for your PC (probably some MS variant of Latin-1), but I'm not sure how they get mapped (if at all) to UTF-8. For example, Alt-231 on my PC (Win XP, US English, Firefox 3.5) gives a lower case "tau" rather than a Latin-1 c+cedilla. Your OS and browser may or may not support alternate input methods such as Ctrl+accent mark dead keys.
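The tau-versus-cedilla difference can be reproduced in a few lines of Python. This assumes that Alt-nnn without a leading zero goes through the old DOS/OEM code page (CP437 on a US English Windows setup), which matches what I see but may differ on other systems.

```python
code = 231  # the number typed after Alt in the example above

# The same byte value names a different character in each code page.
as_cp437 = bytes([code]).decode("cp437")      # old DOS/OEM code page (assumed)
as_latin1 = bytes([code]).decode("latin-1")   # ISO-8859-1

print(f"cp437: U+{ord(as_cp437):04X}  latin-1: U+{ord(as_latin1):04X}")
```

U+03C4 is the Greek small letter tau and U+00E7 is c with cedilla, so the number you type only means something once you know which code page it is looked up in.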

When, oh when, will browsers support a built-in glass keyboard that lets you type in any character supported by the current page encoding? I'm going to have to try out a couple of Firefox plug-ins: abcTajpu and International Sideboard.

Quote
utf8 "\xA0" does not map to Unicode
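A lone \xA0 byte is not a legitimate UTF-8 sequence. It sounds like maybe you've got a Latin-1 encoded non-breaking space (the character behind "&nbsp;", which is the single byte A0 in Latin-1 but the two bytes C2 A0 in UTF-8) somewhere on your page, in a language file, or in a post. Can you give the link to this page, so someone can figure out where this character is lurking?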
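If the page isn't public, a short Python sketch like this can locate the offending byte in a file's contents. The template fragment is a made-up stand-in for whatever file is suspected, and first_bad_offset is a hypothetical helper.

```python
nbsp = "\u00a0"                       # the character behind &nbsp;

latin1_form = nbsp.encode("latin-1")  # b'\xa0', a lone byte
utf8_form = nbsp.encode("utf-8")      # b'\xc2\xa0', two bytes

def first_bad_offset(raw: bytes):
    """Offset of the first byte that breaks UTF-8 decoding, or None if all is well."""
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        return err.start

# Hypothetical template fragment with a raw Latin-1 non-breaking space in it.
fragment = b"<td>" + latin1_form + b"</td>"
print(first_bad_offset(fragment))
```

To check a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to the same function.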

Offline pittu

  • Jr. Member
  • **
  • Posts: 279
Re: charset=ISO-8859-1 and UTF-8
« Reply #14 on: January 11, 2010, 12:31:08 PM »
I sorted out that &nbsp; problem. index.template.php had it in it; I removed it and saved the file, and that fixed it.

One more question: do I need to uninstall and reinstall my mods? It seems they were installed into the english files. (I didn't uninstall the mods before converting, though.)

Thanks.

Everything converted fine, with no errors so far.  :D

MrPhil

  • Guest
Re: charset=ISO-8859-1 and UTF-8
« Reply #15 on: January 11, 2010, 02:33:18 PM »
If your mods added entries to the *english* files, you'll probably have to manually copy the changes to the corresponding *english-utf8* files.
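To find out what needs copying, something along these lines could list the entries that exist in the plain file but not in the -utf8 one. It is only a sketch: it assumes the language files assign strings as $txt['key'] = '...'; lines, the file contents below are made up, and the helper names are hypothetical.

```python
import re

# Matches the key in lines such as  $txt['some_key'] = '...';  or  $txt[123] = '...';
TXT_KEY = re.compile(r"""\$txt\[\s*['"]?([^'"\]]+?)['"]?\s*\]\s*=""")

def txt_keys(php_source: str) -> set:
    """Collect the $txt keys assigned in a language file's source text."""
    return set(TXT_KEY.findall(php_source))

def missing_keys(plain: str, utf8: str) -> list:
    """Keys defined in the plain file but absent from the -utf8 file."""
    return sorted(txt_keys(plain) - txt_keys(utf8))

# Made-up file contents standing in for the two language files.
english = "$txt['hello'] = 'Hello';\n$txt['mod_title'] = 'My Mod';\n$txt[42] = 'x';\n"
english_utf8 = "$txt['hello'] = 'Hello';\n$txt[42] = 'x';\n"
print(missing_keys(english, english_utf8))
```

Anything it reports would still have to be copied across by hand, and re-saved as UTF-8 if the text contains accented characters.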

Offline pittu

  • Jr. Member
  • **
  • Posts: 279
Re: charset=ISO-8859-1 and UTF-8
« Reply #16 on: January 11, 2010, 05:04:58 PM »
Quote
If your mods added entries to the *english* files, you'll probably have to manually copy the changes to the corresponding *english-utf8* files.
Can't I just uninstall and install them again? I am only using 3 mods.

MrPhil

  • Guest
Re: charset=ISO-8859-1 and UTF-8
« Reply #17 on: January 11, 2010, 06:36:42 PM »
You can try it, but there's no guarantee that it will install into your *english-utf8* rather than *english*.

Offline pittu

  • Jr. Member
  • **
  • Posts: 279
Re: charset=ISO-8859-1 and UTF-8
« Reply #18 on: January 11, 2010, 07:43:41 PM »
So does that mean I need to install any future mods manually?

In Settings I made english-utf8 the default language. So why are mods installed into english? ???

MrPhil

  • Guest
Re: charset=ISO-8859-1 and UTF-8
« Reply #19 on: January 11, 2010, 10:46:39 PM »
Not necessarily. It all depends on whether there are any language-relevant strings and how exactly the mod author decided to handle them. I'm just saying that you should be prepared to copy new text out of *english* and into *english-utf8*, if that's what's needed.