Wrong parsing of UTF characters in board names, categories and posts in 1.1 RC2

Started by spiros, December 31, 2005, 07:03:12 PM


agridoc

spiros, I have to learn quite a few things about UTF.

I think I can test UTF for multilingual without the Greek UTF translation as the codepage is the same in all languages.

I would really appreciate it if you could email me the Greek UTF 1.1 RC2 translation, or PM me a link. I have seen your message saying that you sent the official SMF 1.1 RC2 translation to SMF, which will now have two versions, Greek and UTF-8. However, it might take a few days to appear in the download section.

Thank you for your work for SMF in Greek and for your help with multilingual so far.
  For Greek aeromodellers and our friends around the world  - Greek Button sets for SMF - Greeklish to Greek mod
I don't spend time on messages written in Greeklish.

spiros

In fact, you do not even need the Greek files.

Do this:

1) Save the English language files as UTF-8 (especially index.english.php) using Notepad
2) Change the encoding in index.english.php to UTF-8
3) Upload them

You are ready to test RC2 with UTF-8.
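
The three steps above can also be scripted rather than done by hand in Notepad. The following is a minimal Python sketch, not part of SMF. It assumes the English files are currently ISO-8859-1, that they sit in a local directory and match the hypothetical pattern `*.english.php`, and that the charset is declared in the file as the literal string `ISO-8859-1` (the thread does not name the exact setting).

```python
from pathlib import Path


def reencode_to_utf8(path: Path, source_encoding: str = "iso-8859-1") -> None:
    """Rewrite one language file in place as UTF-8, without a byte-order mark."""
    text = path.read_bytes().decode(source_encoding)
    # Assumption: the declared charset appears as this literal string.
    # Every occurrence is replaced, so review the result before uploading.
    text = text.replace("ISO-8859-1", "UTF-8")
    path.write_bytes(text.encode("utf-8"))


def reencode_language_dir(directory: Path, pattern: str = "*.english.php") -> list:
    """Re-encode every matching file and return the names that were processed."""
    done = []
    for path in sorted(directory.glob(pattern)):
        reencode_to_utf8(path)
        done.append(path.name)
    return done
```

Writing the bytes directly avoids the byte-order mark that some editors (Notepad among them) prepend when saving as UTF-8, which can otherwise end up in PHP output.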

Compuart

The problem with collations is that MySQL didn't use them before 4.0; between 4.0 and 4.1, MySQL had an intertwined version of character sets and collations; and from 4.1 on, it cannot work properly without the collation specified. If the database has been converted from an earlier MySQL version, the columns will likely be configured with a latin_general collation, while columns changed afterwards will likely get the default collation of the latin character set, which surprisingly is latin_swedish (I guess because MySQL is a Swedish company).

I'll see if I can write a tool to fix the column collations.
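
As a rough illustration of what such a tool might generate (this is not the tool Compuart is referring to), here is a small Python sketch that only builds SQL strings and executes nothing. It relies on the `ALTER TABLE ... CONVERT TO CHARACTER SET ... COLLATE ...` syntax available from MySQL 4.1; the table names are hypothetical and assume the default `smf_` prefix.

```python
def collation_fix_statements(tables, charset="utf8", collation="utf8_general_ci"):
    """Build one ALTER TABLE statement per table name (nothing is executed)."""
    statements = []
    for table in tables:
        # Backticks quote the identifier; reject names that would break out of them.
        if "`" in table:
            raise ValueError(f"unsupported table name: {table!r}")
        statements.append(
            f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical table names, assuming the default "smf_" prefix.
    for sql in collation_fix_statements(["smf_pm_recipients", "smf_boards"]):
        print(sql)
```

One caveat: `CONVERT TO` re-encodes the stored data on the assumption that the current column charset is correct. If UTF-8 bytes were already stored in latin columns, running it blindly would likely double-encode them, so back up first.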
Hendrik Jan Visser
Former Lead Developer & Co-founder www.simplemachines.org
Personal Signature:
Realitynet.nl -> ExpeditieRobinson.net / PekingExpress.org / WieIsDeMol.Com

spiros

Compuart,

I had changed column collations to utf. This is not the point. RC1 had no such problems with unicode.

In fact, if you read the previous posts, I have installed in the same db, with latin columns, RC1 and it works perfectly fine with UTF!!!

Also, today I finished the Greek translation of minibb and it works perfectly with UTF-8 on the same database with latin columns!!!!

So I think it is mostly trying to find what has been broken in the changes from RC1 to RC2.

Compuart

Have you tried a clean install of RC2? I assume it'll work just as well as RC1. The problem is most likely the upgrader changing some of the text columns, causing a mismatch of collations.

spiros

Compuart,

If you read the thread carefully you will see that I tried

1) upgrade (failed)
2) clean install (failed)

I also tried 2 clean installs of RC1, both with the same successful results.

It would be nice if I had some feedback from other people testing it with UTF-8.

agridoc

I have tried two clean installs of 1.1 RC2 (RC2 charter and RC2 public) with UTF-8 on my PC, and there is a problem, although not as severe as in spiros' case, mainly with the capital "Π" (at first I thought the rectangles in MSIE were UTF errors, then Spiros revealed that it was polytonic Greek). It occurs not only in Greek but in other languages too (Russian for sure). I don't know Russian; it was just obvious in a test page I use. :D

That's for spiros (in Greek)

ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΦnbsp;ΡΣΤΥΦΧΨΩ
αβγδεζηθικλμνξοπρστυφχψω
άέήίόύώ
ΆΈΉΊΌΎΏ


Русский (Russian) RC2 - ISO-8859-1 (or 7)
Цnbsp;усский (Russian) RC2 UTF-8
(It's the title of the Russian SMF's support board)

I think many languages may have problems with UTF-8. Because the corruption is restricted to only some characters, it is hard to diagnose without the necessary language knowledge.

I didn't change the collation in MySQL; it's latin_swedish.

I also have to report a problem with special characters and UTF-8 in RC2: a few characters do not display correctly.

P.S. By the way, Compuart, please also take a look at "Multilingual search in SMF's site?", which Grudge wanted to inform you about. Gri's messages there are a bit off topic.  ;D

spiros

Another clean install, new db, new everything. Again, same configuration, same problems, see sample post here:

http://www.nonsmokersclub.com/forums/index.php?topic=2.0

From this last test SMF RC2 displays a strong aversion to the Greek capital Π!!!!

Grudge

What's wrong with that sample post? It looks like Greek to me?

EDIT: Oh, I assume the numbers are not supposed to be there?!

EDIT2: Attached a screenshot; it's odd that one character is an accent on one subject line, but a square on the next.
I'm only a half geek really...

Compuart

Hmm, there seem to be two topics merged, with two different problems. Are both problems still occurring (and specifically in a clean 1.1 RC2 install)?

Quote from: spiros on January 01, 2006, 10:37:50 AM
This is what I get when I click on my messages:

Illegal mix of collations (latin1_swedish_ci,COERCIBLE) and (utf8_general_ci,IMPLICIT) for operation 'find_in_set'
File: /home/free/public_html/forum/Sources/PersonalMessage.php
Line: 380


FROM {$db_prefix}pm_recipients AS pmr
WHERE pmr.ID_MEMBER = $ID_MEMBER
AND pmr.deleted = 0$labelQuery", __FILE__, __LINE__);
list ($max_messages) = mysql_fetch_row($request);
mysql_free_result($request);

and
Quote from: spiros on December 31, 2005, 07:03:12 PM
It seems that non-Latin characters are corrupted in board names, categories and posts, although in a quite inconsistent manner! For example:

Χιο�?μο�? > Γενικά

http://www.nonsmokersclub.com/forum/index.php/topic,4.0.html

spiros

Hello Compuart,

With the new install I do not get the illegal collation thingy.

However, the real problem is not the square in the title (that can be fixed with a hack). If you look at the bottom right of your screenshot, there is a Φnbsp which was actually the letter Π.

In fact, in Mozilla it does not display it as Φnbsp but as �.
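
One hypothesis that would produce exactly these two renderings, offered here as an assumption and not as a confirmed diagnosis of RC2: "Π" is the bytes `CE A0` in UTF-8, and `A0` on its own is the Latin-1 non-breaking space. If some code replaces the byte `A0` with the text `&nbsp;`, a strict UTF-8 decoder sees an invalid sequence, while a decoder that does not validate continuation bytes combines `CE` with the `&` and yields "Φ" followed by `nbsp;`. The Python sketch below simulates this; `corrupt_nbsp` and `lenient_decode` are illustrative names and handle two-byte sequences only.

```python
def corrupt_nbsp(raw: bytes) -> bytes:
    """Hypothetical bug: treat every 0xA0 byte as a Latin-1 non-breaking space."""
    return raw.replace(b"\xa0", b"&nbsp;")


def strict_decode(raw: bytes) -> str:
    """Standards-conforming decode; invalid bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def lenient_decode(raw: bytes) -> str:
    """Decode two-byte sequences WITHOUT checking the continuation byte."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if 0xC0 <= b <= 0xDF and i + 1 < len(raw):
            # Take the low 6 bits of whatever byte follows, valid or not.
            out.append(chr(((b & 0x1F) << 6) | (raw[i + 1] & 0x3F)))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for word in ("Π", "Русский"):
        damaged = corrupt_nbsp(word.encode("utf-8"))
        print(word, "->", lenient_decode(damaged), "|", strict_decode(damaged))
```

Under these assumptions, the Russian board title quoted earlier in the thread comes out as "Цnbsp;усский" in the lenient decoder, which matches what agridoc posted.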

Here is another test with polytonic Greek.

This is what it looks like in RC2:
http://www.nonsmokersclub.com/forums/index.php?topic=3.0

And this is what it looks like in RC1 (No problem here)
http://www.nonsmokersclub.com/rc1/index.php?topic=3.0

And here is the original text
http://www.mikrosapoplous.gr/syntipas.html

(Use both Mozilla and IE to see the differences. Most of the boxes in IE would be eliminated if a Unicode font were used in the style sheet, though not all! Edit: I changed the theme so these were sorted, and you only get the Φnbsp; thingy replacing "Π". As I said, the main problematic character appears to be the capital Π!!!)

agridoc

Oops, I made a typo in the alphabet; I have corrected the Greek alphabet display in my previous message.

With spiros's new clean installs, the problems are now the same as mine, with the capital "Π". He uses a UTF-8 collation; I use latin1.

I noticed that "Π" was displaying in a quoted word "Παρνασός". I tried this in my installation without success.

The Greek "Π" might be related to the Russian "Р", which also seems to have a problem with 1.1 RC2 UTF-8.

Spiros, I also noticed that Greek search doesn't give results in your test forum http://www.nonsmokersclub.com/forums/index.php?topic=2.0, try it yourself.

So far, the special character "à" is also not displaying in 1.1 RC2 UTF-8.
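
The three characters reported so far do share one property that can be checked directly: their UTF-8 encodings all end in the byte `A0` ("Π" is `CE A0`, "Р" is `D0 A0`, "à" is `C3 A0`). The Python sketch below scans the basic Greek and Russian alphabets for that byte. Whether this byte is what RC2 actually mishandles is an assumption; the thread does not confirm it.

```python
def has_a0_byte(ch: str) -> bool:
    """True if the UTF-8 encoding of ch contains the byte 0xA0."""
    return 0xA0 in ch.encode("utf-8")


def affected(chars: str) -> list:
    """Return the characters whose UTF-8 form contains 0xA0."""
    return [ch for ch in chars if has_a0_byte(ch)]


# Monotonic Greek block U+0391..U+03C9 (U+03A2 is unassigned).
GREEK = "".join(chr(cp) for cp in range(0x0391, 0x03CA) if cp != 0x03A2)
# Basic Russian letters U+0410..U+044F.
CYRILLIC = "".join(chr(cp) for cp in range(0x0410, 0x0450))

if __name__ == "__main__":
    print(affected(GREEK))     # ['Π']
    print(affected(CYRILLIC))  # ['Р']
    print(has_a0_byte("à"))    # True
```

This would be consistent with only a single letter per alphabet breaking, which is what makes the problem hard to spot without knowing the language.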

spiros

Hmmm, you are correct: no search results for monotonic (plain) Greek, but in that post the characters were converted to entities for some reason!

Try this one:
http://www.nonsmokersclub.com/forums/index.php?topic=3.0
(polytonic works)

and this new monotonic one
http://www.nonsmokersclub.com/forums/index.php?topic=4.0

In fact, this is why you cannot search Greek in this forum. Since it uses latin charset, all characters are converted to entities, and this is why UTF is so bloody important!
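
To illustrate the point about entities, here is a short Python sketch. It assumes that text which cannot be represented in the Latin charset ends up stored as numeric character references, and that search compares the raw query against the stored text; the function names are illustrative and not SMF's.

```python
def store_as_entities(text: str, charset: str = "iso-8859-1") -> str:
    """Replace every character the charset cannot represent with &#NNN;."""
    return text.encode(charset, errors="xmlcharrefreplace").decode(charset)


def naive_search(query: str, stored: str) -> bool:
    """Plain substring match against the stored text."""
    return query in stored


if __name__ == "__main__":
    stored = store_as_entities("Γενικά")
    print(stored)                                           # &#915;&#949;&#957;&#953;&#954;&#940;
    print(naive_search("Γενικά", stored))                   # False
    print(naive_search(store_as_entities("Γενικά"), stored))  # True
```

The query only matches if it is converted to entities in exactly the same way as the stored text, which is one reason storing real UTF-8 is the more robust option.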

spiros

I ran another test and changed the collation of all tables from latin to unicode, and I had the same serious problems as with my previous 2 RC2 installations!!! When I rolled the changes back to latin, the extra errors remained, but after overwriting them with the same clean text, only the Π error remained.

The real problem though is this:

Unicode text looks like this in both RC1 and RC2 databases:

μή Ï,,οι χλιδῇ δοκεῖÏ,,ε μηδ᾽ αá½?θαδίᾳ

In the former case it displays OK in the browser; in the latter there are problems with the Π character.
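
The "Ï"-style text above is what UTF-8 bytes look like when a client reads them as a single-byte Latin charset. A minimal Python sketch of that effect follows, assuming the viewing tool uses Windows-1252 (the thread does not say which charset it used). As long as no bytes are altered, the view is reversible.

```python
def as_seen_in_latin_client(text: str) -> str:
    """Show UTF-8 bytes the way a Windows-1252 client would display them."""
    return text.encode("utf-8").decode("cp1252")


def recover(mojibake: str) -> str:
    """Reverse the misreading, provided no bytes were changed in between."""
    return mojibake.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    garbled = as_seen_in_latin_client("τοι")
    print(garbled)           # Ï„Î¿Î¹
    print(recover(garbled))  # τοι
```

A few byte values are undefined in Python's `cp1252` codec, so some characters raise an error here; that is a limitation of the sketch. The database content itself can be identical in RC1 and RC2, as spiros observes, with the difference arising only when the bytes are processed for display.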

taka

Hi.

I've just posted a patch on the SMF Coding Discussion forum, and thought you guys might be interested in it. You can find the link to the zip in this thread:

http://www.simplemachines.org/community/index.php?topic=63778.0

Note that the patch contains two fixes. You probably aren't interested in the e-mail fix. Because it's disabled by default, it's harmless anyway.

Taka

spiros

Hello Taka,

The serious problems appeared in RC2, but you mention RC1. Have you read this thread?

Quote
http://hiko-ki.com/patch_20060102.zip

NOTE: The zip file contains modified 1.1 RC1 files.

agridoc

Spiros, I changed my message; I had already noticed the polytonic search and the differences in search.

As far as "Multilingual search in SMF's site?" is concerned, I don't think this is the cause. Please read my messages and let's discuss this topic in the proper place, where you can help with your experience too. There is more than one way to success.

Grudge

spiros,

Please don't get too annoyed over this. We *are* seriously looking into this and are attempting to fix it. Some of your posts are coming off as a little too aggressive. Please bear with us, as I'm sure you appreciate these things are quite difficult to emulate, and hence fix. In addition, it's a combination of many different factors complicating it further. I can assure you that you're not being ignored :)

Grudge

taka

Quote from: spiros on January 02, 2006, 06:13:42 PM
The serious problems appeared in RC2, but you mention RC1. Have you read this thread?
There might be a new problem introduced by RC2, but RC1 has a couple of UTF-8 related problems.  As Grudge mentioned, I think you are probably seeing a combination of different problems.

Speaking of RC2 with non-Latin languages, you can take a look at Logue's site:

http://forum.logue.tk/index.php

He upgraded to RC2 yesterday. As you can see, he's also using lots of non-Latin boards and topics. I'm still seeing some character corruption, though. I believe it's due to the same bug that existed in RC1.

In any case, I've sent my patch to Grudge and I expect him to look at what I've done in it. It should at least fix the rendering and posting problems that we've seen in RC1.

There might be a problem in the database migration code (this is 100% guesswork on my part). That needs to be fixed by the dev team or some other volunteer, since I haven't upgraded to RC2 yet.

Grudge

taka,

We're looking at your code, thanks. We are considering that the upgrade may have affected things; we'll keep ploughing on ;)

Thanks,

Grudge
