Problem with UTF-8

Started by Madzgo, December 18, 2010, 04:04:44 AM

Madzgo

When I had SMF 1.1.11, I ran "convert HTML entities to UTF-8 characters". Then I upgraded to SMF 2.0 RC4, and now the letters čćđ (Serbian Latin; of course I uploaded the language pack) don't work: a question mark (?) shows instead of them. I ran "convert HTML entities to UTF-8 characters" again, but nothing happened. Everybody on my forum is from Serbia or Montenegro, and now the forum is becoming unreadable.
Help, please?  :-\


MrPhil

The "letters that don't work" -- are they in posts or in category/board/topic titles (all stored in the database) or are they in fixed text from the language files (prompts, titles, labels, etc.)? First, check that your database is still using UTF-8. You can use phpMyAdmin to look at the table and field definitions, and can browse some post and title entries to see if they show up OK. Make sure that phpMyAdmin itself is displaying in UTF-8 (View > Page source, and check the <meta...charset= tag, or View > Character encoding). Check your language files to make sure they weren't corrupted at some point -- you should be able to use your host's file manager editor to look at them (make sure they're displayed in UTF-8 and not in Latin-x). Check your displayed pages to ensure that pages are actually being displayed in UTF-8 and not in some other encoding. One of those is going to spot where the problem is, and then we can figure out the cure.
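As a rough way to do the "is this really UTF-8?" check on a language file or an exported post, here is a minimal Python sketch. The helper name `looks_like_utf8` is made up for illustration, and ISO-8859-2 is only an assumed stand-in for whichever "Latin-x" encoding the old data used.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "Prikaži skorašnje nepročitane teme"
utf8_bytes = sample.encode("utf-8")
# Assumption: ISO-8859-2 stands in for the unknown old "Latin-x" encoding.
latin2_bytes = sample.encode("iso-8859-2")

print(looks_like_utf8(utf8_bytes))
print(looks_like_utf8(latin2_bytes))
```

Keep in mind that plain ASCII text passes this check in either encoding, so only text containing accented letters tells you anything.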

mannyRUA

We had this recently as we have a lot of Cyrillic text.

Our tech altered all the tables in the database to utf8_general_ci and that fixed it. We also had it on a blog recently with problems with some Polish letters and the solution was the same. Here is an excerpt from the email from our host:

Quote
The database is able to handle Russian and Polish text; we will just have to change the collation of the tables and the database to utf8_general_ci to allow these characters. However, it is not able to go back through and correct the existing characters; it is only able to handle this for new additions to the databases. To make these changes, log in to phpMyAdmin, go to the database you would like to change, and change the collation option under Operations for each table that needs to hold these characters.

Kays

Hi, check the Settings.php file and see if the following line is included. If it's not there, try adding it.


$db_character_set = 'utf8';


IceXaos

OMG I just had this issue a few days ago....

You need to .. as said .. change all database tables and columns to use UTF-8.  I'll see if I can find the lovely script I used quick and get back to ya, but no promises.  I just had a computer crash and reformat and lost all of my custom scripts.

IceXaos

Sorry, had to modify it a little as I found it online, but it should be okay now.  You may wanna add a wait timer if you have too much, but it should be okay if you have a decent server.

Note:  Your server may become unresponsive for a couple of minutes, and the page will continue to look as if it's loading, but let it go.  Don't interrupt it, and remember to back up first just in case.  When it's done, you will see a bunch of text on the PHP page sayin' each table was done.

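<?php

$host = 'localhost';
$user = 'username';
$pass = 'password';
$db   = 'database';

// your connection (mysqli is used here; the old mysql_* functions were removed in PHP 7)
$mysqli = new mysqli($host, $user, $pass, $db);
if ($mysqli->connect_error) {
    die('Connection failed: ' . $mysqli->connect_error);
}

// convert code: SHOW TABLES returns one table name per row
$res = $mysqli->query("SHOW TABLES");
while ($row = $res->fetch_row())
{
    $table = $row[0];
    // backticks guard against table names that need quoting
    $ok = $mysqli->query("ALTER TABLE `" . $table . "` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci");
    echo $table . ($ok ? " CONVERTED" : " FAILED: " . $mysqli->error) . "<br />";
}

// close connection
$mysqli->close();
?>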

Madzgo

Should I run that script as a query or..?

Quote from: Kays on December 27, 2010, 05:27:26 PM
Hi, check the Settings.php file and see if the following line is included. If it's not there, try adding it.


$db_character_set = 'utf8';

Checked. It's there. I also tried adding a hyphen between "utf" and "8", but nothing changed -.-

Quote from: mannyRUA on December 27, 2010, 05:09:54 PM
Our tech altered all the tables in the database to utf8_general_ci and that fixed it.
I think the code IceXaos gave me is for this.

Quote from: MrPhil on December 27, 2010, 01:17:14 PM
The "letters that don't work" -- are they in posts or in category/board/topic titles (all stored in the database) or are they in fixed text from the language files (prompts, titles, labels, etc.)?
Only in posts. Actually, every Serbian letter that I type doesn't work. The ones used in the translation do; for example, "Show unread posts since last visit" (Prikaži skorašnje nepročitane teme) works.

Quote from: MrPhil on December 27, 2010, 01:17:14 PM
First, check that your database is still using UTF-8.
I'm not sure what you are pointing at, but I think it's the collation. Well, I can't choose utf-8 because that option doesn't exist. I think all my tables are utf8_general_ci.

Quote from: MrPhil on December 27, 2010, 01:17:14 PM
You can use phpMyAdmin to look at the table and field definitions, and can browse some post and title entries to see if they show up OK.
Nope, posts have ? instead of ČĆĐ even in the database.

Quote from: MrPhil on December 27, 2010, 01:17:14 PM
Make sure that phpMyAdmin itself is displaying in UTF-8 (View > Page source, and check the <meta...charset= tag, or View > Character encoding).
It does.

Quote from: MrPhil on December 27, 2010, 01:17:14 PM
Check your language files to make sure they weren't corrupted at some point -- you should be able to use your host's file manager editor to look at them (make sure they're displayed in UTF-8 and not in Latin-x). Check your displayed pages to ensure that pages are actually being displayed in UTF-8 and not in some other encoding. One of those is going to spot where the problem is, and then we can figure out the cure.
This part is not clear to me.
Where do I check how my language files are displayed? And how do I check the displayed pages?


Thank you all for trying to help, because this is a major problem for many people here.

MrPhil

Quote from: madzgo on December 28, 2010, 07:44:50 AM
Should I run that script as a query or..?
You're referring to IceXaos's script? It's a complete PHP program that converts each table to UTF-8. Back up your database (just in case something goes wrong), upload the script to your site, and run it by entering its URL in the browser address bar. Then rename or get rid of it, as you don't want it lying around for hackers to play with. I assume that any table already UTF-8 will be left alone, but that's why you make a backup first...

Quote
Quote from: MrPhil on December 27, 2010, 01:17:14 PM
The "letters that don't work" -- are they in posts or in category/board/topic titles (all stored in the database) or are they in fixed text from the language files (prompts, titles, labels, etc.)?
Only in posts. Actually, every Serbian letter that I type doesn't work. The ones used in the translation do; for example, "Show unread posts since last visit" (Prikaži skorašnje nepročitane teme) works.
That means the language files are OK. The posts are stored in the database. Now, did you/your tech actually convert the database text from Latin-x (or whatever the old encoding was) to UTF-8, or just change the "label" on the tables/fields to indicate that it's now UTF-8? AFAIK, MySQL has both operations, and it sounds like maybe the existing data wasn't actually changed (converted). The bytes in the data should have actually been changed (from the old encoding to UTF-8), and it sounds like they weren't.
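To make the difference between "changing the label" and "converting the data" concrete, here is a small Python sketch. It only imitates what happens to the bytes and is not what MySQL runs internally; ISO-8859-2 is an assumed old encoding.

```python
text = "čćđ"
# Bytes as they would have been written under the assumed old encoding.
stored = text.encode("iso-8859-2")

# "Relabel": the same bytes are now simply read as UTF-8.
# They are not valid UTF-8, so the letters cannot be recovered this way.
relabelled = stored.decode("utf-8", errors="replace")

# "Convert": decode with the old encoding, then re-encode as UTF-8.
converted = stored.decode("iso-8859-2").encode("utf-8")

print(repr(relabelled))
print(converted.decode("utf-8"))
```

Only the second path changes the stored bytes, which is the operation that should have happened to the existing posts.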

Quote
Quote from: MrPhil on December 27, 2010, 01:17:14 PM
You can use phpMyAdmin to look at the table and field definitions, and can browse some post and title entries to see if they show up OK.
Nope, posts have ? instead of ČĆĐ even in the database.
If phpMyAdmin is itself displaying in UTF-8, and its tables are defined UTF-8, that means that the data itself must still be in the old encoding. It wasn't converted over.
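One case worth ruling out before converting anything: if the stored bytes are literal question marks, the letters were already lost when they were written, and no conversion can bring them back. The Python sketch below shows how a lossy conversion into a character set without these letters produces exactly that; whether this is what happened on your forum is an assumption you would have to check against the actual stored bytes.

```python
text = "ČĆĐ čćđ"

# Latin-1 has no code points for these letters, so a lossy
# conversion substitutes a literal "?" (byte 0x3F) for each one.
lossy = text.encode("latin-1", errors="replace")
print(lossy)

# Decoding again cannot restore the originals: only "?" is left.
print(lossy.decode("latin-1"))
```

If the bytes are something other than 0x3F and only display as "?", the data is still there and a proper conversion should work.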

Quote
Quote from: MrPhil on December 27, 2010, 01:17:14 PM
Check your language files to make sure they weren't corrupted at some point -- you should be able to use your host's file manager editor to look at them (make sure they're displayed in UTF-8 and not in Latin-x). Check your displayed pages to ensure that pages are actually being displayed in UTF-8 and not in some other encoding. One of those is going to spot where the problem is, and then we can figure out the cure.
This part is not clear to me.
Where do I check how my language files are displayed? And how do I check the displayed pages?
You said earlier that the "canned" text from the files was properly displayed, so it sounds like the language files are OK.

From what you've described, it sounds most likely that only the labels on the tables and fields were changed, but the text data itself was never converted from the old encoding to UTF-8. You may have to change the labels back to the previous encoding, and then do the conversion to UTF-8. I don't think that simply "converting HTML entities to UTF-8" is going to do the job -- you have actual accented characters in the data that are still in the old encoding. That is, an HTML entity like &raquo; gets turned into the actual UTF-8 bytes for that character, while a raw accented character like Č isn't affected.
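Here is a rough Python illustration of why the entity conversion leaves raw accented characters alone. It uses the standard `html.unescape` function as a stand-in; SMF's own "convert HTML entities" routine is separate code and may differ in detail.

```python
import html

# Entities are replaced by the characters they name.
entity_text = "Prika&#382;i &raquo; nepro&#269;itane"
print(html.unescape(entity_text))

# Raw bytes in the assumed old encoding (ISO-8859-2), read with the wrong
# charset, contain no entities, so the same step leaves them untouched.
raw_misread = "čćđ".encode("iso-8859-2").decode("latin-1")
print(html.unescape(raw_misread) == raw_misread)
```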

Does all of this apply only to old posts (and titles), while new ones have their non-ASCII characters properly in UTF-8? If so, that's bad news, because you have a mixture of encodings in the data. Doing an overall conversion from <old encoding> to UTF-8 is going to corrupt the text that is already in UTF-8. You will need to split out the two groups so that you convert only the old encoding to UTF-8. This might be done as follows: shut down the forum; export (back up) the database to an .sql file; manually edit a copy of the .sql file to remove the "good" UTF-8 entries; empty out the database; make sure the database tables and fields are configured to be UTF-8; then import the edited file, giving the correct old encoding for it (MySQL should convert to UTF-8 during the import). Now, if all went well, you should have your old posts back in as UTF-8. Go back to the original backup, remove all the old posts (leaving only the new, correct UTF-8 text), and import that as a UTF-8 encoded file (it should be brought in unmodified). Re-enable your forum, and with luck you should have all your database text correctly encoded. See if phpMyAdmin gives you an opportunity to back up only data older than/newer than the date you changed over -- that would save you a lot of effort, but I don't know if it can be done.
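If splitting by date turns out to be impractical, the same idea can be sketched per value: leave anything that is already valid UTF-8 alone and convert only the rest. This is a heuristic sketch in Python, not a tested migration tool; `to_utf8` is a hypothetical helper and ISO-8859-2 is an assumed old encoding.

```python
def to_utf8(raw: bytes, old_encoding: str = "iso-8859-2") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else convert it from old_encoding."""
    try:
        raw.decode("utf-8")
        return raw  # already UTF-8: must not be converted a second time
    except UnicodeDecodeError:
        return raw.decode(old_encoding).encode("utf-8")


rows = [
    "stari post: čćđ".encode("iso-8859-2"),  # written before the changeover
    "novi post: čćđ".encode("utf-8"),        # written after the changeover
]
fixed = [to_utf8(r).decode("utf-8") for r in rows]
print(fixed)
```

The check can be fooled, since a few old-encoding byte sequences happen to be valid UTF-8, so spot-check the results against a backup before trusting them.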

Converting mixed encoding text is tricky business, and if you're not absolutely confident in your skills, you should have someone who knows what they're doing do it for you (preferably on site). And of course, back up (and check) your database before doing any of this, in case your helper is overconfident and doesn't do a backup themselves! At least, then, you can get back to your present state. And before doing anything, confirm that your old posts are still in the old encoding and your new ones are in UTF-8. If not (everything is still in one encoding or the other), it should be a much simpler process.

Madzgo

Topic not solved, because I find that too difficult and it's not clear to me :)
Anyway, thanks for your help..

tycms

Madzgo, please see this topic:
http://www.simplemachines.org/community/index.php?topic=416112.0

I think this is a bug with the update process of smf_2-0-rc4_update package.

Madzgo

Could be.

"
Does all of this apply only to old posts (and titles), and new ones have their non-ASCII characters properly in UTF-8? If so, that's bad news, because you have a mixture of encodings in the data. Doing an overall conversion from <old encoding> to UTF-8 is going to corrupt your text that is already in UTF-8. You will need to split out the two groups so that you convert only the old encoding to UTF-8. This might be done by shutting down the forum, exporting (backing up) the database to an .sql file, manually editing the .sql file to remove the "good" UTF-8 entries, emptying out the database, make sure the database tables and fields are configured to be UTF-8, importing the backup giving the correct old encoding for it (MySQL should convert to UTF-8 during the import). Now, if all went well, you should have your old posts back in as UTF-8. Go back to the backup, remove all the old posts (leaving only the new, correct UTF-8 text), and import that as a UTF-8 encoded file (it should be brought in unmodified). Re-enable your forum, and with luck you should have all your database text correctly encoded. See if phpMyAdmin gives you an opportunity to back up only data older than/newer than the date you changed over -- that would save you a lot of effort, but I don't know if it can be done."
Is there any tutorial for this?
