Wrong parsing of UTF characters in board names, categories and posts in 1.1 RC2

Started by spiros, December 31, 2005, 07:03:12 PM

taka

Grudge,

No problem :-)

I have a suggestion for the core development team.  Would you pass it on?

I think it's time for the SMF project to switch the default charset from ISO-8859-1 to UTF-8.  If you look at major web sites such as Google, Yahoo, and MSN, they are all using UTF-8 now.  UTF-8 makes things a lot easier than ISO-8859-1 once your presence reaches a certain point and more and more people around the world are using your product.

You may think it's too late to make such a change, but every day you delay, there will certainly be more data in the database to convert.

Please consider using UTF-8 for the 1.1 GM release.

Taka

spiros

Taka,

I wholeheartedly second your suggestion: I think Unicode is the only way ahead for globalization. Joomla developers were smart enough to have core UTF support for 1.1. I guess people in SMF must have realized the importance of Unicode to a certain extent.

We are here to test and report (at least myself - you can do more).

Grudge,

I am sorry if some of my posts came across as aggressive. They were not meant to be  :)

spiros

Compuart,

I do not know if this helps, but I checked with phpMyAdmin on the same db where I have installed Joomla, and the Unicode entries there look perfect, i.e.:

Πόσες φορές άραγε κάποιοι Έλληνες πολίτες (γύρω ...    Έχοντας ζήσει στο εξωτερικό και έχοντας

Whereas SMF entries (both RC1 and RC2) look like this:

μή Ï,,οι χλιδῇ δοκεῖÏ,,ε μηδ᾽...
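In the SMF sample, the "Ï,," stands where a τ would be expected, while the other Greek letters survive. A minimal sketch of how a τ can turn into two unrelated characters, assuming PHP with the mbstring extension and assuming the bytes were treated as Windows-1252 at some point (the thread does not say which single-byte charset was actually involved):

```php
<?php
// Assumption: the text is stored as UTF-8 but some step reads it byte by byte
// as a single-byte Western charset (Windows-1252 is used here as an example).
$tau = "\xCF\x84"; // Greek small letter tau (U+03C4) encoded as UTF-8

// Interpret each byte as one Windows-1252 character and re-encode as UTF-8.
$misread = mb_convert_encoding($tau, 'UTF-8', 'Windows-1252');

echo bin2hex($tau), "\n";                 // cf84
echo mb_strlen($tau, 'UTF-8'), "\n";      // 1
echo mb_strlen($misread, 'UTF-8'), "\n";  // 2
echo $misread, "\n";                      // two characters instead of one tau
```

The two bytes of the single letter are shown as two separate characters, which is the kind of corruption visible in the sample above.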

spiros

Sorry to be a nuisance, but do we have any updates on the UTF front?

Compuart

I'm still working on it. All changes have to be thoroughly tested, especially when it comes to character sets. Although we'll have better support for UTF-8, we will also keep supporting other character sets. ISO-8859-1 will remain the default character set for most languages, but it will be possible to override the character set independently of the chosen language.
Hendrik Jan Visser
Former Lead Developer & Co-founder www.simplemachines.org
Personal Signature:
Realitynet.nl -> ExpeditieRobinson.net / PekingExpress.org / WieIsDeMol.Com

spiros

Ok Compuart, thanks for your feedback. If there is anything I can do to help in terms of testing please do not hesitate to contact me.

Compuart

Thanks, I'll certainly contact you as soon as I'm no longer able to find any bugs when using UTF-8.

agridoc

I started a new topic, "Multilingual in SMF 1.1RC2 without UTF for Greek and other languages?", and I would like your opinion and support there too.

Quote
I don't want to use this to argue against UTF-8.

I want to report that a working multilingual solution exists in a previous SMF version without using UTF-8.

Quote
Please Compuart, as you are focused on multilingual support, take a look at this solution too and at [Unknown]'s patch. Spiros's help would be appreciated in this topic too.
  For Greek aeromodellers and our friends around the world  - Greek Button sets for SMF - Greeklish to Greek mod
I do not spend time on messages written in Greeklish.

spiros

Agridoc,

This is hardly a solution. In fact, this forum works on this principle: it converts all non-Latin scripts to HTML entities (use view source on a page here, say the Greek board, and compare it with the source of a UTF page; what do you see?). The only real solution is Unicode. It is not by chance that Google and other major internet companies use Unicode instead of converting all non-Latin output to... entities! ;)

As far as working solutions go, RC1 was quite near full UTF support. A few hacks here and there, and all the minor character corruption problems in titles and Last Posts / Latest posts were solved.

agridoc

This is not a quarrel between a UTF and a non-UTF solution. I think both should be developed.
Quote from: agridoc on January 04, 2006, 12:19:49 PM
SMF's team should consider all the trends and possibilities. Compuart's message shows they take care of all this.

We can discuss the limitations and other things about the non-UTF solution in http://www.simplemachines.org/community/index.php?topic=64142.0

I will also try to help, if I can, in the development of the UTF solution here.

taka

Quote from: spiros on January 04, 2006, 05:20:49 PM
This is hardly a solution. In fact, this forum works on this principle: it converts all non-Latin scripts to HTML entities (use view source on a page here, say the Greek board, and compare it with the source of a UTF page; what do you see?). The only real solution is Unicode. It is not by chance that Google and other major internet companies use Unicode instead of converting all non-Latin output to... entities! ;)
Agreed.  A solution based on entities is too Europe-centric.  It won't work for Asian languages.  Technically, you can replace every character with its Unicode code point entity, but entities shouldn't be used that way.

BTW, here is another UTF-8 related bug.  In News.php, substr is used in a way that only works with single-byte charsets such as ISO-8859-X.

$row['body'] = strtr(substr(str_replace('<br />', "\n", $row['body']), 0, $modSettings['xmlnews_maxlen'] - 3), array("\n" => '<br />')) . '...';


Because of this code, the RSS feed doesn't show the body properly.  The text is chopped off in the middle of a UTF-8 sequence.  This substr should be replaced by mb_substr if it's available.  That solved my problem.
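The substr-to-mb_substr change described above can be sketched as follows. This is only an illustration, not SMF's actual fix: utf8_safe_truncate is a hypothetical helper name, the text is assumed to be UTF-8, and the regular-expression fallback assumes PCRE was built with UTF-8 support. The demonstration lines at the end use mb_check_encoding, so they do need mbstring.

```php
<?php
// Hypothetical helper (not part of SMF): keep at most $maxLen characters
// of a UTF-8 string without cutting through a multi-byte sequence.
function utf8_safe_truncate(string $text, int $maxLen): string
{
    if ($maxLen <= 0) {
        return '';
    }

    // Preferred path: mbstring counts characters, not bytes.
    if (function_exists('mb_substr')) {
        return mb_substr($text, 0, $maxLen, 'UTF-8');
    }

    // Fallback (assumes PCRE has UTF-8 support): match up to $maxLen characters.
    if (preg_match('/^.{0,' . $maxLen . '}/us', $text, $match) === 1) {
        return $match[0];
    }

    // Last resort for input that is not valid UTF-8: plain byte cut.
    return substr($text, 0, $maxLen);
}

$body = 'Ελληνικά';

// Byte-based cut: 5 bytes is two and a half Greek letters.
var_dump(mb_check_encoding(substr($body, 0, 5), 'UTF-8')); // bool(false)

// Character-based cut: five whole letters.
echo utf8_safe_truncate($body, 5), "\n"; // Ελλην
```

Note that mb_substr counts characters where the original code counted bytes, so a limit such as $modSettings['xmlnews_maxlen'] would then be measured in characters.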

spiros

Taka,

Thanks for another excellent solution!
I hope Compuart has noted this.

mcgrelio


spiros

mcgrelio,

Yes, this is the standard procedure for converting a non-UTF site to a UTF one. However, the point here is how SMF handles posting in a UTF environment.

Different kettle of fish  ;)

spiros

Just another (strange) reason why UTF-8 is recommended. In my case, I use a Greek encoding and Google indexes the text in my forum incorrectly. That is to say, it is indexed as if the encoding were Latin rather than Windows-1253. So in search results one sees extended ASCII characters (gobbledygook).

I contacted Google support and they recommended switching to UTF-8. So I guess that, after all, the more topics in other languages are indexed correctly, the more publicity the forum gets, and SMF as a consequence.

Click for a detailed report on this problem and Google's reply.

agridoc

I use windows-1253 as the code page in all the SMF language files I have installed (a cheat).

I did a search in Google for my site and I got 10,300 results (not all from the forum), with correct Greek titles and text in the display.

Test it here.

This is a test search for a Greek word on my site. It seems to work again.

I don't argue against UTF-8; however, I prefer another solution that also works with SMF 1.0x and that I believe can be made to work with 1.1 too, with help from SMF's support team. See "Multilingual in SMF 1.1RC2 without UTF for Greek and other languages?"

spiros

Yes, I know, in some cases it does work, in others not!

On my site, only about 5% of Greek posts are indexed correctly; the rest are not!

This is very strange because it is not a uniform behaviour and hence cannot be easily ascribed to a specific reason.

NTA

I have this problem too. When I used the upgrade package to upgrade from SMF 1.0.5 to 1.1 RC2, all the messages, posts, board names, and member names in UTF-8 were gone after the conversion :D (thank God I only tested on localhost). I found that it is not a database problem; I think the conversion tool has a problem when working with UTF-8 encoding...

I'm looking forward to your solution :D

bubux

I found that void preparsecode(string &message, boolean previewing = false) in Subs-Post.php is not UTF-8 safe.

When I comment out preparsecode() in Post.php, I can post my message safely, but with preparsecode(..) I lose the whole message body.

I tested with SMF 1.1 RC2, MySQL 5.0.16, Apache 2.0.55, and PHP 5.1.1 on NetBSD 3.0 macppc.

MySQL was compiled with utf8 as the default character set, and the database was created in utf8.

I also set $txt['lang_character_set'] = 'UTF-8'; in index.english.php (db: utf-8, php: utf-8 environment).

Here is my test message.
Quote
[쿠키 스포츠] ○...오른무릎 근육 부상으로 최대 보름간의 재활에 들어간 박지성(25·맨체스터 유나이티드)의 공백에 대해 영국 현지 언론이 깊은 우려를 나타냈다.

맨체스터 지역신문 '맨체스터 이브닝 뉴스'는 10일(한국시간) "박지성이 버튼 앨비언과의 FA컵 3라운드 경기를 앞두고 가진 워밍업 도중 입은 무릎 부상으로 당분간 경기에 나설 수 없게 됐다"며 "알렉스 퍼거슨 감독은 박지성의 공백이 길어지는 것을 결코 원하지 않고 있다"고 전했다.

이 신문은 "박지성이 지난 해 여름 400만파운드의 몸값을 기록하며 PSV 에인트호벤에서 이적해왔지만 아직까지는 팀에 정착해나가고 있는 단계"라며 "박지성이 출장한 29경기에서 14차례 교체멤버로 나왔고 자신이 유용한 대체 요원임을 입증해왔다"고 평가했다.

신문은 이어 "박지성이 버밍엄 시티와의 칼링컵 8강전에서 맨유 입단 후 첫 골을 기록했고,그런 이유로 퍼거슨 감독은 박지성 없는 '빡빡한 1월'을 원하지 않을 것"이라고 덧붙였다.

축구 전문 사이트 ESPN 사커넷도 맨유 코너에서 박지성이 부상으로 칼링컵 4강 1차전에 출장할 수 없다는 '맨체스터 이브닝 뉴스'의 보도를 비중있게 다뤘다.

살인적인 일정을 소화해야 하는 맨유 입장에선 쉴틈 없이 그라운드를 누비는 박지성의 공백이 크게 느껴질 수 밖에 없다. 최소 10일,최대 보름의 재활 진단을 받은 박지성이 어느 정도 컨디션을 회복한다면 예상보다 빨리 팀에 복귀할 가능성을 읽을 수 있는 대목이다. 국민일보 쿠키뉴스 조상운 기자

scripter

Hello,

I got the same thing with UTF-8 encoding, but I used Vietnamese.

It always attempts to change the character "à" to "ænbsp;"
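The thread does not identify the cause, so the following is only a guess at one possible mechanism. In UTF-8, "à" is the two bytes 0xC3 0xA0, and 0xA0 on its own is a non-breaking space in ISO-8859-1. A byte-wise replacement written for ISO-8859-1 would therefore split the letter. The sketch assumes PHP with mbstring and uses a made-up replacement, not SMF's actual code; it shows the kind of damage, not the exact "ænbsp;" string reported above.

```php
<?php
$a_grave = "\xC3\xA0"; // "à" (U+00E0) encoded as UTF-8

// Hypothetical byte-wise replacement written with ISO-8859-1 in mind,
// where the single byte 0xA0 is a non-breaking space.
$broken = str_replace("\xA0", '&nbsp;', $a_grave);

// UTF-8-aware variant: a non-breaking space is the two bytes 0xC2 0xA0,
// so the letter "à" is left alone.
$intact = str_replace("\xC2\xA0", '&nbsp;', $a_grave);

var_dump($broken === "\xC3&nbsp;");            // bool(true): lead byte orphaned
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump($intact === $a_grave);                // bool(true)
```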

I commented out the call to preparsecode(), but that didn't help.

See it here http://www.simplemachines.org/community/index.php?topic=65438.msg451703#new
