News:

Bored?  Looking to kill some time?  Want to chat with other SMF users?  Join us in IRC chat or Discord

Main Menu

Another UTF-8 question

Started by Square, June 28, 2012, 06:46:37 AM

Previous topic - Next topic

Square

Hi,

SMF 2.0.2
My forum is in UTF-8.
But I think that my database is in UTF-8 BOM. (Is that possible?)
So I checked http://wiki.simplemachines.org/smf/UTF-8_Readme and it says
"Go to Forum Maintenance > Convert the database and data to UTF-8"
but I simply can't find it in the admin menu.
The only thing I found is "Convert HTML-entities to UTF-8 characters" which is not what I need right now.

Thanks in advance. :)


MrPhil

"UTF-8 BOM" does not make any sense for a database. Where do you see this? BOM is Byte Order Mark, which is a three byte cow patty dropped at the very beginning of a UTF-8 file by overly "helpful" editors. BOMs should be ruthlessly exterminated wherever they're found in files, but I can't think of any case where you'd find them in the database. In a file, when editing in UTF-8 mode your editor should offer you the option to "save without BOM" -- always use it, and if you can, configure the editor to use that as the default. Or, edit in non-UTF-8 mode and remove the three strange characters at the very beginning of the file.

If your database is already UTF-8, you shouldn't see the option to convert it to UTF-8, as it's already done. Converting HTML entities to characters is a separate, and optional, step (replacing all &name; and &#nnn; characters by their native UTF-8 codes). It's not mandatory, but will slightly reduce the size of your database (native codes are 2 to 4 bytes, while entities can be 7 or 8 bytes, sometimes more).

Arantor

Ok, better question... What problems are you having abd what needs to be changed about how your forum works, exactly?

Square

Thanks.

I had SMF 1.1.18 and everything was going well.
I've installed 1.1.18 on a new place, restored my database on that new place, upgraded to 2.0.2. Then installed my own theme.
Everything is ok but now the non-latin characters have problems (question mark instead of some characters).
This is true for both old and new posts.

MrPhil

When you say "non-latin" do you mean accented Latin alphabet, or truly non-Latin (Greek, Cyrillic, Hebrew, etc.)? Is this just with the posts, or also with text from language support files (prompts, labels, etc.)? When you installed 1.1.18 (there is no 18; do you mean 1.1.8 or 1.1.16?) and created the database for it, did you specifiy that it's UTF-8 (I'm not sure the encoding always comes over in the backup). Was the backup created by phpMyAdmin, or by SMF (bad!). Anyway, check your database to confirm it's actually UTF-8. Check your forum pages to confirm they're actually UTF-8 encoding. When you say "question mark", is this a ? within a black diamond, or just a regular question mark? Most browsers use ?-in-diamond to indicate an invalid character encoding, which usually means your database is serving up Latin-1 while you're trying to display the page in UTF-8.

Square

Thank you.

Quote from: MrPhil on June 28, 2012, 05:47:39 PM
When you say "non-latin" do you mean accented Latin alphabet, or truly non-Latin (Greek, Cyrillic, Hebrew, etc.)?
truly non-Latin (Greek, Cyrillic, Hebrew, etc.


Quote from: MrPhil on June 28, 2012, 05:47:39 PM
Is this just with the posts, or also with text from language support files (prompts, labels, etc.)?
Only posts. Forum language is English only (even if some posts are other languages)


Quote from: MrPhil on June 28, 2012, 05:47:39 PM
When you installed 1.1.18 (there is no 18; do you mean 1.1.8 or 1.1.16?) and created the database for it, did you specifiy that it's UTF-8
1.1.16, sorry. Yes, I did specify UTF-8.


Quote from: MrPhil on June 28, 2012, 05:47:39 PM
Was the backup created by phpMyAdmin, or by SMF (bad!).
SMF
(I also have a daily backup from my host but the INSERT in the SQL file are incredibly long!)


Quote from: MrPhil on June 28, 2012, 05:47:39 PM
Anyway, check your database to confirm it's actually UTF-8.
Don't really know how to do that. I'm not using phpMyAdmin, I only have a SQL client (Navicat).


Quote from: MrPhil on June 28, 2012, 05:47:39 PM
Check your forum pages to confirm they're actually UTF-8 encoding.
In which page should I check please?


Quote from: MrPhil on June 28, 2012, 05:47:39 PM
When you say "question mark", is this a ? within a black diamond, or just a regular question mark? Most browsers use ?-in-diamond to indicate an invalid character encoding, which usually means your database is serving up Latin-1 while you're trying to display the page in UTF-8.
It's a regular question mark.
(There is only one post with a ? within a black diamond but I think it was already like that in my old forum)



Square

I forgot to say that I don't really mind about the old posts. But I would like to have it working for the new ones.

MrPhil

SMF's backup is very bad -- it frequently cuts off early (without warning), and has improper syntax that produces errors on restore. Try to find an alternative such as phpMyAdmin, or MySQL command line "export" of tables (BACKUP TABLE, etc.). Be sure to save the table structure (CREATE TABLE command so that you can rebuild the tables), such as by the command line SHOW CREATE TABLE. Work with your host on this -- they owe it to you to tell you how to fully back up your database on their system.

On a few random forum pages, in your browser, click on View > Character Encoding (or Character Set) and see what it says the current encoding is. It should say UTF-8 (assuming your forum is supposed to be UTF-8). In your database, if you don't have phpMyAdmin, maybe SHOW CREATE TABLE on the MySQL command line will tell you the text field encodings used. It may say something about "collation" which includes the text 'utf8' or 'latin1'.

If you are feeding Latin-1 characters (or Latin-x for Greek, Cyrillic, etc.) to a UTF-8 page, I would expect lots of ?-in-diamond characters. That you're mostly getting plain ? sounds like you're not displaying in UTF-8, but in Latin-1. If it's just some characters, it might be Latin-1 for both, and just a certain range are invalid.

Square

#8
Quote from: MrPhil on June 28, 2012, 07:34:06 PM
SMF's backup is very bad -- it frequently cuts off early (without warning), and has improper syntax that produces errors on restore. Try to find an alternative such as phpMyAdmin, or MySQL command line "export" of tables (BACKUP TABLE, etc.). Be sure to save the table structure (CREATE TABLE command so that you can rebuild the tables), such as by the command line SHOW CREATE TABLE. Work with your host on this -- they owe it to you to tell you how to fully back up your database on their system.
I have daily SQL backup from my host. I can also export SQL to my HDD from the phpMyAdmin page of my host.
I didn't use these SQL files because they have VERY long INSERT lines and then I can't restore.
I could edit the SQL files with notepad++ but it will take me hours and I don't know if it will solve the problem.
If you're 100% sure it will be ok to post text in Polish, Hebrew, Russian, etc after, I can try that.


Quote from: MrPhil on June 28, 2012, 07:34:06 PM
On a few random forum pages, in your browser, click on View > Character Encoding (or Character Set) and see what it says the current encoding is. It should say UTF-8 (assuming your forum is supposed to be UTF-8). In your database, if you don't have phpMyAdmin, maybe SHOW CREATE TABLE on the MySQL command line will tell you the text field encodings used. It may say something about "collation" which includes the text 'utf8' or 'latin1'.

If you are feeding Latin-1 characters (or Latin-x for Greek, Cyrillic, etc.) to a UTF-8 page, I would expect lots of ?-in-diamond characters. That you're mostly getting plain ? sounds like you're not displaying in UTF-8, but in Latin-1. If it's just some characters, it might be Latin-1 for both, and just a certain range are invalid.
My browser shows "Unicode (UTF-8)" on all the pages I checked.

MrPhil

So you're showing pages in UTF-8, but seeing a lot of plain question marks (?) and not ?-in-diamond invalid characters? I don't understand that. It sounds then like your database is UTF-8, but it would be good to check. Since you now say you have phpMyAdmin, just look at the table structures and see if all the text type fields show *utf8* rather than latin1* or something else. Is this happening on newer posts or just old ones from before UTF-8 conversion?

As for the INSERTs being too long to use, when you use phpMyAdmin to EXPORT tables, what is maximal length of created query? The default is 50000 bytes. Perhaps someone changed it to a larger number that can't be processed by the IMPORTing phpMyAdmin (MySQL), or has changed the packet size on the IMPORTing MySQL to be smaller. Use a consistent value (try decreasing the EXPORTing maximal length of created query until it works, if you can't find what the limit is on the IMPORTing side.

If you need to edit an .sql file on your PC, be careful about your character encoding so that you don't end up corrupting the files, or adding a BOM to the file. If you have UTF-8 data, and edit in Latin-1, bytes in the range of x80 through x9F might be treated as control characters and corrupt the data.

Square

Thanks for your reply.



Quote from: MrPhil on June 29, 2012, 10:31:51 AM
So you're showing pages in UTF-8, but seeing a lot of plain question marks (?) and not ?-in-diamond invalid characters? I don't understand that. It sounds then like your database is UTF-8, but it would be good to check. Since you now say you have phpMyAdmin, just look at the table structures and see if all the text type fields show *utf8* rather than latin1* or something else.
This is what I see:
MySQL connection collation: utf8_general_ci
MySQL charset: UTF-8 Unicode (utf8)


Quote from: MrPhil on June 29, 2012, 10:31:51 AM
Is this happening on newer posts or just old ones from before UTF-8 conversion?
Both.
I didn't do any conversion. Old forum was UTF-8.
New forum too: I checked UTF-8 when installing 1.1.16 and upgrading to 2.0.2.


Quote from: MrPhil on June 29, 2012, 10:31:51 AM
As for the INSERTs being too long to use, when you use phpMyAdmin to EXPORT tables, what is maximal length of created query? The default is 50000 bytes. Perhaps someone changed it to a larger number that can't be processed by the IMPORTing phpMyAdmin (MySQL), or has changed the packet size on the IMPORTing MySQL to be smaller. Use a consistent value (try decreasing the EXPORTing maximal length of created query until it works, if you can't find what the limit is on the IMPORTing side.
I'll have to check that.


Quote from: MrPhil on June 29, 2012, 10:31:51 AM
If you need to edit an .sql file on your PC, be careful about your character encoding so that you don't end up corrupting the files, or adding a BOM to the file. If you have UTF-8 data, and edit in Latin-1, bytes in the range of x80 through x9F might be treated as control characters and corrupt the data.
I always use UTF-8 in Notepad++.

MrPhil

I'm stumped at this point. Do you see any rhyme or reason as to which characters are getting changed to plain "?" question marks? Is this only from certain browsers, or certain PCs? Can you look at a post that shows this, in the database, and see if the data is correct there (or is already corrupted)? These are old posts from 1.1.16 days, as well as new 2.0.2 posts? If everything is UTF-8, I don't know why certain characters would be changed. Did you test 1.1.16 that it was working OK on the new server, before converting to 2.0.2, or did you migrate and convert at the same time? It's generally best to do one step at a time and make sure everything is OK, before going on to the next step.

Square

Quote from: MrPhil on June 29, 2012, 10:31:51 AM
As for the INSERTs being too long to use, when you use phpMyAdmin to EXPORT tables, what is maximal length of created query? The default is 50000 bytes. Perhaps someone changed it to a larger number that can't be processed by the IMPORTing phpMyAdmin (MySQL), or has changed the packet size on the IMPORTing MySQL to be smaller. Use a consistent value (try decreasing the EXPORTing maximal length of created query until it works, if you can't find what the limit is on the IMPORTing side.

Maximal length of created query is 50000 bytes. No indication about import maximal size.
However, the maximal import size of the SQL file is 8Mb. It's ridiculous. My database is small but bigger than that (even compressed).
Host answer: cut your SQL file in several...

That's why I did restore with Navicat instead of myphpadmin host page.

What do you mean by "consistent value"?


Square

Old posts are already corrupted in the new database.

Anyway, I think I'll restart all from zero and double check each step.
Of course, since I have to import a corrupted database, the old posts will be bad but I don't care.
The important thing is that the new ones work properly.
Then, I can edit the old posts if needed.

Many thanks again for your help, really appreciated!  :)

MrPhil

Quote from: Square on June 29, 2012, 11:27:47 AM
What do you mean by "consistent value"?

If EXPORT allows 50000 bytes and IMPORT allows only 10000 bytes, you would not be able to import. The values would not be consistent. At least, you would want IMPORT to allow at least as large as EXPORT. It wouldn't hurt to have IMPORT allowing larger than EXPORT.

Square

#15
Hi MrPhil,

I maybe found something...
When I look at my database I see
MySQL connection collation: utf8_general_ci
MySQL charset: UTF-8 Unicode (utf8)
However, when I look at the list of the 63 tables, collation, 2 are in utf8_general_ci, 61 are in latin1_general_ci!
I have no idea why, how and at which step it has been changed that way.

So, now I have 3 forums, each with his own database:
1. my old 1.1.16 where all tables are utf8_general_ci
2. my 1.1.16 to 2.0.2 where 2 tables are in utf8_general_ci, 61 tables are in latin1_general_ci
3. a completely newly installed 2.0.2 where all tables are utf8_general_ci and nothing was imported yet.

Knowing that I have to export my 2.0.2 database from forum 2 to forum 3, what do you recommend me to do to have all tables in utf-8?
I want to try from the phymyadmin host page. Could you please tell me each step?
As I wrote I don't mind about the old posts, I can edit them when the forum will work in UTF-8.

Another solution could be to somehow "repair" the database from forum 2.
But then again I don't know how to do that.


MrPhil

OK, I have seen plain ? in UTF-8 pages where someone cut and pasted in a Word document with "Smart Quotes". By any chance is this non-Latin text cut and pasted from other documents or Web pages? If so, I'll bet they aren't UTF-8. Some browsers will do their best to convert text in other encodings (including CP-1252) to UTF-8, but that's not guaranteed. Since Smart Quotes are so commonly abused in this way, it's fairly common to handle them, but all bets are off for other text.

If the "bad" characters have been stored in the database as plain ?, they're not recoverable. The original character byte(s) are gone, replaced by a ?. The text will have to be re-entered or edited to get the non-Latin text. Try to find the circumstances that produce this effect (e.g., cut and paste from ____ encoding, whether a Web page or a PC application) and either find a workaround or warn members not to do that.

Square

#17
Thanks.

It was good on my old forum.
I don't care about the old posts, I can edit them.
But therefore it must first work for new posts, which is not the case.

So, my question is: is there a way to "repair" the 61 tables which are not UTF-8 now on my forum 2?
I mean convert them from latin1 to UTF-8.



MrPhil

How did you end up with some tables UTF-8 and most of them Latin-1? If SMF Admin does not offer a "convert to UTF-8" (because it thinks your forum is already in UTF-8), you'll have to do each text field individually in phpMyAdmin. Go to the "structure" and look at each table's fields. If they are "latin1", click on "Change" and change the collation to one of the UTF-8 (probably "utf8_general_ci"). Click on Save. I think you will need to click on table Operations and under Table Options change Collation to "utf8_general_ci" and click Go. This should complete this table and you can go on to the next table. Maybe someone else knows a quicker way to get there?

Once you're fully converted to UTF-8, it is your choice whether or not to convert HTML entities to UTF-8 characters (in SMF Admin). Doing so will save you a few bytes for each entity, but make it harder to switch back to Latin-1, if you ever wish to do so.

Square

Quote from: MrPhil on July 01, 2012, 07:29:35 PM
How did you end up with some tables UTF-8 and most of them Latin-1?
If only I knew! :D

Square

Thanks.
I changed all the tables to UTF-8 in Operation.
Then I posted and it was not working.
Then I changed all the fields. Table by table, field by field, in structure.
And then it was working!

It was a lot of clicks! There is probably a shorter way to do it but... oh well. Important is that it works now. :)

I'll mark this thread as solved. Thanks a lot for your help!  :)

MrPhil

Yea! Check to make sure you're consistently UTF-8 all through the database, and that your forum output is UTF-8, and your language support files are all UTF-8 versions.

I wish SMF would just get on with it and make the whole thing UTF-8, period. No more conversions.

Square

Quote from: MrPhil on July 02, 2012, 09:25:56 AM
Yea! Check to make sure you're consistently UTF-8 all through the database, and that your forum output is UTF-8, and your language support files are all UTF-8 versions.

I wish SMF would just get on with it and make the whole thing UTF-8, period. No more conversions.

Actually I changed all fields from the 63 tables yesterday. Should be ok now.  ;)

Advertisement: