[Change] I hope SMF all developer and other language provide change to UTF8 defa

explorer1979 · October 27, 2005, 09:52:22 PM

Hi all,

I am Chinese, and I noticed something about SMF. The default language is of course English, which is an international language, but there are many non-English-speaking places, such as Hong Kong SAR, China, Taiwan and other countries all over the world ... and in the near future all business will be connected through the internet ...

UTF-8 is the new standard that solves all of these language problems ... correct me if I am wrong.

So I am using 1.0.5 with the T.Chinese UTF-8 language package. (Actually, I wanted to use the S.Chinese language package, but it is only available for version 1.0.2, and installing it on a 1.0.5 forum caused problems, so I chose the T.Chinese UTF-8 version.)

I just discovered that if a user switches to the English language interface, it does not use UTF-8 by default, so all the Chinese text turns into garbage like #@$#RRETFA%R#$R etc ....
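
A minimal sketch of the likely cause, written in Python rather than SMF's PHP: the text is stored as UTF-8 bytes, but the English interface declares a single-byte charset, so each byte is shown as its own character. The sample string and the choice of ISO-8859-1 as the declared charset are assumptions for illustration.

```python
# Hypothetical sample text; any non-ASCII string shows the same effect.
text = "中文"

# What a UTF-8 language pack stores and sends: the UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# What a reader sees if the page is declared as ISO-8859-1:
# every byte is interpreted as a separate Latin-1 character.
garbled = utf8_bytes.decode("iso-8859-1")

# The damage is reversible as long as the bytes themselves were not altered.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(len(text), len(utf8_bytes), len(garbled))
print(recovered == text)
```

The stored data is intact in this scenario; only the declared charset is wrong, which is why serving the page as UTF-8 makes the text readable again.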

So I am wondering ... why don't the SMF developers make the English version use UTF-8 by default? It would solve problems for non-English users who use UTF-8 language packages, like me ...

And if the maintainers of all the other language packages also helped by making UTF-8 versions of their packages ... that would be wonderful and might fix many problems ...

Just my hope and suggestion...

And can someone teach me how to change my English forum entirely to UTF-8?

CrayZ · November 02, 2005, 08:10:38 AM

I think I need an explanation of this issue too. I'm having the same problem (I think).

Elmacik · November 03, 2005, 10:43:08 AM

Correct me if I am wrong, but UTF-8 is not a new standard and it's the oldest

AzaToth · November 03, 2005, 01:04:21 PM

Quote from: Elmacik on November 03, 2005, 10:43:08 AM
Correct me if I am wrong, but UTF-8 is not a new standard and it's the oldest

UTF-8 is the optimal encoding to use for the web.
This is because of these properties (from man utf-8):

UCS characters 0x00000000 to 0x0000007f (the classic US-ASCII characters) are encoded simply as bytes 0x00 to 0x7f (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters > 0x7f are encoded as a multi-byte sequence consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte can appear as part of another character and there are no problems with e.g. '\0' or '/'.
The lexicographic sorting order of UCS-4 strings is preserved.
All possible 2^31 UCS codes can be encoded using UTF-8.
The bytes 0xfe and 0xff are never used in the UTF-8 encoding.
The first byte of a multi-byte sequence which represents a single non-ASCII UCS character is always in the range 0xc0 to 0xfd and indicates how long this multi-byte sequence is. All further bytes in a multi-byte sequence are in the range 0x80 to 0xbf. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
UTF-8 encoded UCS characters may be up to six bytes long, however the Unicode standard specifies no characters above 0x10ffff, so Unicode characters can only be up to four bytes long in UTF-8.
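
Several of the properties above can be checked directly. The following is a small Python sketch, not part of the man page; the sample characters, the word list and the helper names are chosen here purely for illustration.

```python
def utf8_lengths(chars):
    """Map each character to the number of bytes in its UTF-8 encoding."""
    return {ch: len(ch.encode("utf-8")) for ch in chars}


def is_well_formed_sequence(ch):
    """Check the lead-byte and continuation-byte ranges quoted above."""
    data = ch.encode("utf-8")
    if len(data) == 1:
        return data[0] <= 0x7F  # plain ASCII byte
    lead, rest = data[0], data[1:]
    return 0xC0 <= lead <= 0xFD and all(0x80 <= b <= 0xBF for b in rest)


# U+0041 (A), U+00E9 (e acute), U+4E2D (CJK), U+1F600 (emoji)
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
print(utf8_lengths(samples))

# ASCII compatibility: pure ASCII text has identical bytes in both encodings.
ascii_compatible = "forum".encode("ascii") == "forum".encode("utf-8")

# Sorting by code point and sorting by UTF-8 bytes give the same order.
words = ["zebra", "\u4e2d\u6587", "\u00e9clair", "apple", "\U0001F600"]
same_order = sorted(words) == sorted(words, key=lambda w: w.encode("utf-8"))

# The bytes 0xfe and 0xff never appear.
no_fe_ff = all(b < 0xFE for w in words for b in w.encode("utf-8"))

print(ascii_compatible, same_order, no_fe_ff)
```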

Elmacik · November 03, 2005, 01:12:18 PM

AzaToth, I didn't object to its being optimal. I know it is.

I just said it's not new.

AzaToth · November 03, 2005, 04:39:23 PM

Quote from: Elmacik on November 03, 2005, 01:12:18 PM
AzaToth, I didn't object to its being optimal. I know it is.
I just said it's not new.

hehe, everything is relative, in relation to ASCII it's new.

CrayZ · November 19, 2005, 06:05:00 AM

How is this possible... since I changed my server I see some strange characters

MrPhil · December 31, 2010, 01:01:49 PM

Yes, SMF should switch to UTF-8 only. The advantages are that there would be no more confusion over a missing English-UTF8 language pack, or over incompletely switched-over database definitions when changing to UTF-8 after starting out in ISO-8859-1. Maintenance and distribution would be simplified with only UTF-8 languages offered (and "utf8" could be dropped from the file names). All forums would be able to handle non-ASCII and non-Western characters entered by users. The downsides are that language files currently in ISO-8859-n will become a bit larger in UTF-8, though that shouldn't be a serious impact, and that current users will have to convert over whether they like it or not. SMF will also need to provide an optional filter to (at the forum admin's discretion) allow only ASCII or only Latin-1 characters in user names and posts. Conversion to UTF-8 only won't be painless (witness all the problems people have now when trying to convert ISO-8859-1 to UTF-8), but in the long run SMF will be better off that way. Other PHP applications, such as Drupal and osCommerce, are going (or have already gone) this way.
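
As a rough illustration of two points in this post (the size increase and the optional character filter), here is a Python sketch. It is not SMF code: the function names and the sample string are hypothetical, and a real conversion would also have to deal with the database definitions mentioned above.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def is_ascii_only(text: str) -> bool:
    """True if every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)


def is_latin1_only(text: str) -> bool:
    """True if every character fits in ISO-8859-1 (code points below 256)."""
    return all(ord(ch) < 256 for ch in text)


# Hypothetical language-file string containing two accented characters.
original = "Réponse envoyée".encode("iso-8859-1")
converted = latin1_to_utf8(original)

# Each non-ASCII Latin-1 character grows from one byte to two.
print(len(original), len(converted))

# A filter of the kind suggested for user names and posts.
print(is_ascii_only("MrPhil"), is_latin1_only("Réponse"), is_latin1_only("中文"))
```

Text that is mostly ASCII grows only slightly, which fits the expectation that larger language files would not be a serious impact.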

Are there any hosts (deserving of the name) who cannot support UTF-8 (e.g., not offering "wide" character functions)? There are apparently some servers that override page charset selections and force Latin-1 for all pages. Is it worth keeping SMF in the Stone Age for the benefit of users on such hosts?

Since this is a fairly major change for SMF, I would suggest deferring it to 2.1 (and 1.2, if there is ever going to be such a stream). On the other hand, 2.0 might be as good a time as any, if its final release isn't just around the corner.

P.S. If it isn't already being done, theme and mod code needs to be checked for non-ASCII hardcoded characters before SMF endorses it by putting it in the download area of this site. Code files (as opposed to language files) should always use HTML entities for non-ASCII characters, but many authors seem to overlook this.
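
One way such a check could work, sketched in Python rather than as an SMF tool: scan the source for non-ASCII characters and replace them with numeric character references. The helper names and the sample line are hypothetical.

```python
import html


def find_non_ascii(source: str):
    """Return (line number, character) pairs for every non-ASCII character."""
    return [
        (line_no, ch)
        for line_no, line in enumerate(source.splitlines(), start=1)
        for ch in line
        if ord(ch) > 127
    ]


def to_html_entities(text: str) -> str:
    """Replace each non-ASCII character with a numeric reference such as &#233;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical line from a theme file with hardcoded non-ASCII characters.
sample = "echo 'Café © 2010';"

print(find_non_ascii(sample))
print(to_html_entities(sample))

# The references decode back to the original characters in a browser.
print(html.unescape(to_html_entities(sample)) == sample)
```

The output is pure ASCII, so it displays the same way whether the page is served as ISO-8859-1 or UTF-8.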

Antechinus · December 31, 2010, 04:29:22 PM

I agree with this and it has been talked about. Makes a lot of sense IMO. Can't really do it for 2.0 at this stage but sounds good for versions after that.

live627 · December 31, 2010, 07:00:30 PM

Wonderful! Is it tracked in Mantis so no one forgets it?

Arantor · December 31, 2010, 07:36:29 PM

Quote
P.S. If it isn't already being done, theme and mod code needs to be checked for non-ASCII hardcoded characters before SMF endorses it by putting it in the download area of this site. Code files (as opposed to language files) should always use HTML entities for non-ASCII characters, but many authors seem to overlook this.

Until earlier this year, Subs.php explicitly had some special characters in it; Subs-Charset.php still does, for emulation of case folding where the mb_* functions are not available.

It's possible to strip out all non-UTF-8 handling with approximately 500 edits vs RC4's codebase, but it really isn't for the faint of heart.

Norv · January 14, 2011, 08:30:04 PM

Quote from: live627 on December 31, 2010, 07:00:30 PM
Wonderful! Is it tracked in Mantis so no one forgets it?

There are many things (ideas and features) not tracked in Mantis and not forgotten. That's why we have todo lists as well.

Back on topic, this is under discussion, as are many things concerning upcoming versions. As I already mentioned a couple of times, I believe we will at least switch to UTF-8 by default.
