Simple Machines Community Forum

SMF Development => Feature Requests => Next SMF Discussion => Topic started by: Suki on September 19, 2011, 11:18:43 AM

Title: [3.0] Full UTF8 support
Post by: Suki on September 19, 2011, 11:18:43 AM
Hi all, the Devs are considering going with full UTF8 support instead of the current ANSI/UTF8.


We would like to hear all opinions about this. Please try to consider all angles that could influence this, such as server requirements, database sizes, hosting restrictions, etc.


Please share your thoughts on this :)
Title: Re: Full UTF8 support
Post by: Robert. on September 19, 2011, 11:32:35 AM
I always use the Latin set, but going full UTF-8 would be a good idea, though.
Title: Re: Full UTF8 support
Post by: live627 on September 20, 2011, 06:36:38 PM
It's absolutely a no-brainer. Charset support problems would mostly vanish, large mods wouldn't have to ****** around with including ANSI/UTF languages, and translators would only need to translate into UTF-8 encoding.
Title: Re: Full UTF8 support
Post by: 青山 素子 on September 30, 2011, 11:39:30 AM
I have to agree with this. If it is technically possible, moving to full Unicode support would be beneficial. Most hosting providers have upgraded SMF's dependencies to versions that support this.
Title: Re: Full UTF8 support
Post by: Xarcell on October 05, 2011, 01:09:04 PM
+1

As it was said, a no brainer.
Title: Re: Full UTF8 support
Post by: 青山 素子 on October 06, 2011, 12:58:08 AM
Quote from: Xarcell on October 05, 2011, 01:09:04 PM
+1

As it was said, a no brainer.

It certainly wasn't when SMF 2.0 was being designed. Back then, PHP4 was still widely used (still is in some areas...) and proper Unicode support was difficult to come by without a ton of effort. With the huge move to PHP5 and cleaned-up support, it would be silly not to fully support Unicode in widely used software.
Title: Re: Full UTF8 support
Post by: Dzonny on October 09, 2011, 07:35:49 PM
+1
It would be easier if new members didn't have to deal with charset problems. Full UTF-8 support would fix many possible issues.
Title: Re: Full UTF8 support
Post by: Daniel Hofverberg on October 11, 2011, 03:54:45 AM
Of course it's preferable with full UTF-8 support for those that do want it. However, I do not want to be forced to use UTF-8, as I prefer good old ISO-8859-1.
Title: Re: Full UTF8 support
Post by: Nightwish on October 11, 2011, 04:50:47 AM
Quote from: Daniel Hofverberg on October 11, 2011, 03:54:45 AM
Of course it's preferable with full UTF-8 support for those that do want it. However, I do not want to be forced to use UTF-8, as I prefer good old ISO-8859-1.

Bad idea.

Seriously, it's 2011. The web should be UTF-8 only. Period. 7/8-bit character sets are a relic of ancient days and should die. They are among the most annoying things a developer has to deal with - Unicode makes things so much easier. Time to move on and forget old habits.
Title: Re: Full UTF8 support
Post by: Fustrate on October 11, 2011, 05:03:57 AM
Quote from: Daniel Hofverberg on October 11, 2011, 03:54:45 AM
Of course it's preferable with full UTF-8 support for those that do want it. However, I do not want to be forced to use UTF-8, as I prefer good old ISO-8859-1.

Do you have a reason for preferring it, or do you just not want to change? Honest question.
Title: Re: Full UTF8 support
Post by: Daniel Hofverberg on October 11, 2011, 05:10:21 AM
As the rest of my web site is using ISO-8859-1, using UTF-8 for just the forum would cause problems with the integration. That would make SSI.php and other aspects a pain to deal with, unless I change the character set on the entire site. I also don't see any specific need for UTF-8 on my site, as all characters I need to use for the Swedish language are present in Latin1.

As my site consists of close to 1000 pages, moving the entire site over to UTF-8 with no real benefit doesn't really sound too pleasing...
Title: Re: Full UTF8 support
Post by: Fustrate on October 11, 2011, 05:12:42 AM
I believe there are PHP functions for converting between character sets, though I don't use them often enough to remember their names.

Found it, though: http://www.php.net/manual/en/function.iconv.php

You'd just use that on anything from the forum, to convert from UTF8 to ISO-8859-1
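
A minimal sketch of that conversion idea, written in Python rather than PHP purely for illustration (in PHP the equivalent step would be a call to iconv() on the forum output). The function name and the sample string are made up for this example, and replacing unmappable characters with "?" is an assumption about what a Latin-1 site would want:

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-1.

    Characters that do not exist in ISO-8859-1 are replaced with '?'
    (an assumed policy for this sketch, not something the forum dictates).
    """
    text = data.decode("utf-8")
    return text.encode("iso-8859-1", errors="replace")


# Hypothetical forum output: "V\u00e4lkommen" is "Valkommen" with a-umlaut.
forum_output = "V\u00e4lkommen till forumet".encode("utf-8")
converted = utf8_to_latin1(forum_output)

# The a-umlaut takes two bytes in UTF-8 but one byte in ISO-8859-1.
print(len(forum_output), len(converted))
```

Anything outside Latin-1 (Japanese user names, for instance) is lost in this direction, which is the trade-off of keeping the rest of the site on ISO-8859-1.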
Title: Re: Full UTF8 support
Post by: 青山 素子 on October 11, 2011, 06:15:22 PM
Quote from: Daniel Hofverberg on October 11, 2011, 03:54:45 AM
Of course it's preferable with full UTF-8 support for those that do want it. However, I do not want to be forced to use UTF-8, as I prefer good old ISO-8859-1.

Could be worse, it could be windows-1252 (ugh).


Quote from: Daniel Hofverberg on October 11, 2011, 05:10:21 AM
I also don't see any specific need for UTF-8 on my site, as all characters I need to use for the Swedish language are present in Latin1.

Then you shouldn't see a difference, actually. The only possible issues would be with characters outside Latin-1 like the "smart quote" and such. Heck, you could probably send your Latin1 pages as UTF8 without any changes as HTML entities would still work the same way for characters.
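
To make that point concrete, here is a small Python sketch (Python only for illustration; the sample strings are invented). It shows that a page written with plain ASCII plus HTML entities is byte-for-byte identical in both encodings, while a page containing raw Latin-1 characters is not, so the "no changes needed" case assumes the pages rely on entities:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Page text that uses HTML entities: pure ASCII, so both encodings agree.
with_entities = "Sm&ouml;rg&aring;sbord &amp; caf&eacute;"
same_bytes = with_entities.encode("iso-8859-1") == with_entities.encode("utf-8")

# Page text with raw characters (o-umlaut, a-ring): the encodings differ.
raw = "Sm\u00f6rg\u00e5sbord"
latin1_bytes = raw.encode("iso-8859-1")
utf8_bytes = raw.encode("utf-8")

print(same_bytes)                   # entities: identical bytes
print(latin1_bytes == utf8_bytes)   # raw characters: different bytes
print(is_valid_utf8(latin1_bytes))  # Latin-1 bytes are not valid UTF-8 here
```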
Title: Re: Full UTF8 support
Post by: spiros on November 05, 2011, 08:37:13 AM
I could not agree more. There are so many mods which do not support UTF-8 and one has to hack them in order to work in a UTF-8 forum. This should NOT be happening.
Title: Re: Full UTF8 support
Post by: Kindred on November 05, 2011, 09:59:41 AM
well, that would be a MOD problem, not an SMF issue....
Title: Re: Full UTF8 support
Post by: spiros on November 05, 2011, 10:20:49 AM
Indeed, but if SMF does not enforce strict guidelines (i.e. mod compatibility with UTF-8), many mod developers are inclined to ignore it.
Title: Re: Full UTF8 support
Post by: Nightwish on November 06, 2011, 02:02:20 PM
Quote from: spiros on November 05, 2011, 08:37:13 AM
I could not agree more. There are so many mods which do not support UTF-8 and one has to hack them in order to work in a UTF-8 forum. This should NOT be happening.

Then trash these mods, period. No mod can be important enough to stand in the way of a modern design and getting rid of ancient character set support *is* an important part of a modern design.

Seriously, it's 2011, people who still insist on ancient 8bit character set support must have been sleeping under a rock for the past 10 years. Most modern web applications are UTF-8 only, because it is, by far, the easiest way to support multiple languages.
Title: Re: Full UTF8 support
Post by: spiros on November 06, 2011, 03:06:14 PM
I thought that PHP 6 (what happened to it, by the way?) was meant to be the version to fully support UTF-8 in its core. If that had happened, there should not have been many excuses left.
Title: Re: Full UTF8 support
Post by: Fustrate on November 06, 2011, 06:26:53 PM
iirc, they turned PHP 6 into PHP 5.4
Title: Re: Full UTF8 support
Post by: Angelina Belle on November 08, 2011, 08:04:56 PM
I understand that converting code to work with UTF8 will take some work -- strlen() won't be dependable (depends on the language), etc. How difficult will it be for mod writers to convert their code?

If someone is trying to integrate an ISO-8859-1 website with a UTF8 SMF forum, and was using the iconv functions to convert EVERYTHING, what kind of performance hit would that take?

What are the implications of switching to using all UTF8 in MySQL tables? Any effect on performance?

I have never made the switch to UTF-8 myself, since I don't feel I really understand the implications.
Title: Re: Full UTF8 support
Post by: Fustrate on November 08, 2011, 08:57:46 PM
As long as PHP is compiled with multibyte support (which it is by default, if my memory serves correctly) then you just use the multibyte functions instead, such as mb_strlen().

http://www.php.net/manual/en/ref.mbstring.php
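
The difference between the two kinds of length can be sketched like this. It is Python, used only to illustrate the concept: the byte count corresponds to what PHP's strlen() reports for a UTF-8 string, and the character count to what mb_strlen() reports when told the string is UTF-8. The helper names and sample words are made up:

```python
def utf8_byte_length(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def character_count(text: str) -> int:
    """Number of Unicode code points in the text."""
    return len(text)


# ASCII, Cyrillic ("forum") and two CJK characters.
samples = ["forum", "\u0444\u043e\u0440\u0443\u043c", "\u9752\u5c71"]
report = [(s, utf8_byte_length(s), character_count(s)) for s in samples]

for text, n_bytes, n_chars in report:
    print(f"{text!r}: {n_bytes} bytes, {n_chars} characters")
```

For plain ASCII the two numbers agree, which is why byte-based length checks often go unnoticed until a non-English post hits a length limit.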
Title: Re: Full UTF8 support
Post by: spiros on November 09, 2011, 07:42:08 AM
Quote from: AngelinaBelle on November 08, 2011, 08:04:56 PM
I understand that converting code to work with UTF8 will take some work -- strlen() won't be dependable (depends on the language), etc. How difficult will it be for mod writers to convert their code?

Sometimes it is very easy, e.g.:
http://www.simplemachines.org/community/index.php?topic=49410.msg2412108;topicseen#msg2412108
Title: Re: Full UTF8 support
Post by: Oldiesmann on November 17, 2011, 11:48:10 AM
Someone posted a simple solution to the strlen issue in the PHP manual - strlen(utf8_decode($string)). We could probably use that as a last resort if mb_strlen and iconv_strlen aren't available.
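
A rough Python illustration of why that trick gives a character count, assuming the input is well-formed UTF-8: converting to Latin-1 turns every character into exactly one byte (a "?" when it has no Latin-1 equivalent), so the byte length of the result equals the number of characters. Counting the bytes that are not UTF-8 continuation bytes gives the same number. Both function names are invented for this sketch:

```python
def count_via_latin1(data: bytes) -> int:
    """Mimic the strlen(utf8_decode($string)) idea: one output byte per character."""
    return len(data.decode("utf-8").encode("iso-8859-1", errors="replace"))


def count_utf8_characters(data: bytes) -> int:
    """Count bytes that start a character, skipping continuation bytes (10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# Cyrillic "forum", a space, and two CJK characters: 8 characters in total.
sample = "\u0444\u043e\u0440\u0443\u043c \u9752\u5c71".encode("utf-8")

print(len(sample))                    # byte length of the UTF-8 data
print(count_via_latin1(sample))       # character count via the Latin-1 trick
print(count_utf8_characters(sample))  # character count via lead-byte counting
```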

I don't know anything about iconv so I can't say whether it would be a performance hit or not.

As far as I know, converting the tables to UTF-8 doesn't have any effect on performance.
Title: Re: [3.0] Full UTF8 support
Post by: spiros on March 08, 2012, 09:13:52 AM
By the way, joomla has gone UTF-8 only since version 1.5.
Title: Re: [3.0] Full UTF8 support
Post by: agridoc on March 30, 2012, 07:24:11 AM
Going UTF-8 as default: YES
Abandoning ISO/ANSI: Not so sure.

It needs thought and much work. Fantastico caused quite a few problems with pseudo-UTF8 installations in the past. This is a sticky topic in the Greek support board (http://translate.google.com/translate?langpair=auto%7Cen&u=http%3A%2F%2Fwww.simplemachines.org%2Fcommunity%2Findex.php%3Ftopic%3D285256.0) (Google translation, not perfect but the problem can be understood).

As spiros noticed, many mods don't work well with UTF-8, and many mod writers have no experience with UTF-8 and table collations.

No problem for a new installation. The difficult task would be database conversion during upgrades: it's a procedure that must be done with care, prone to problems and possible data loss. Unless convinced otherwise in the future, I believe that it should be a separate process.

Voices such as Daniel Hofverberg's should not be neglected, although ISO-8859-1 SMF installations are the easiest case, with almost no difference from UTF-8.

Heroic declarations such as "throw these mods out" are something that many admins wouldn't like to hear.
Title: Re: [3.0] Full UTF8 support
Post by: spiros on March 30, 2012, 07:39:53 AM
Let me put it this way:

ISO-8859-1 is not compatible with everything and is bound to cause internationalization problems.
UTF-8 is compatible with everything (including the scripts one traditionally uses ISO-8859-1 for).

Joomla is a paradigm of a major open source project gone full and only UTF-8. It is perhaps the most popular open source project to date.

My question is not whether one should use UTF-8 or not; rather, why it has been so long and it has not been implemented yet, and, when does one plan to do so.

Let's focus on the future, because the future is already here.
Title: Re: [3.0] Full UTF8 support
Post by: Angelina Belle on March 30, 2012, 07:45:33 AM
Everything I have heard tells me SMF 3.0 will have full UTF-8 support.

So part of the future includes how to properly convert existing forums to UTF-8, with as few errors as possible.
Title: Re: [3.0] Full UTF8 support
Post by: spiros on March 30, 2012, 07:51:23 AM
I converted mine 4-5 years ago with minor issues (without the SMF converter). I think the current built-in converter is quite good anyway.
Title: Re: [3.0] Full UTF8 support
Post by: Angelina Belle on March 30, 2012, 08:03:54 AM
That is excellent to hear.

What tips should we offer to mod writers who will have to rewrite to handle UTF8 properly?
Title: Re: [3.0] Full UTF8 support
Post by: spiros on March 30, 2012, 08:11:27 AM
Future-proofness (as close as they can get to immortality) instead of oblivion. One mod I had to "rewrite" myself: I just had to add a single line in the SQL code.
Title: Re: [3.0] Full UTF8 support
Post by: Angelina Belle on March 30, 2012, 09:27:51 AM
That was easy.
Title: Re: [3.0] Full UTF8 support
Post by: Dzonny on March 30, 2012, 10:00:57 AM
Mods can easily adjust to UTF-8 forums by changing the collation of their tables and making some minimal changes in files if needed. So I guess that mods will not be our main problem.
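
As a rough idea of what "changing the collation of tables" could look like, here is a Python sketch that only builds the MySQL statements as text. The table names are hypothetical, and the utf8 / utf8_general_ci pair is an assumption about what the forum's own tables use, so check your installation first. It also assumes the data in those tables is correctly labelled with its current character set; take a backup before running anything like this:

```python
def conversion_statements(tables, charset="utf8", collation="utf8_general_ci"):
    """Build one ALTER TABLE ... CONVERT TO CHARACTER SET statement per table."""
    return [
        f"ALTER TABLE `{name}` CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for name in tables
    ]


# Hypothetical tables created by a mod with the server's default latin1 collation.
mod_tables = ["smf_example_mod_entries", "smf_example_mod_settings"]
statements = conversion_statements(mod_tables)

for statement in statements:
    print(statement)
```

The printed statements would then be reviewed and run by hand (for example in phpMyAdmin) rather than executed automatically.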
Title: Re: [3.0] Full UTF8 support
Post by: Angelina Belle on March 30, 2012, 10:04:09 AM
Do you think it would be helpful to provide a "guide to making your mod UTF-8 ready", or do you think that would just be silly?
Title: Re: [3.0] Full UTF8 support
Post by: Dzonny on March 30, 2012, 10:11:22 AM
That may be useful, of course. Many mod authors don't use UTF-8, so they don't know how to make a mod UTF-8 compatible. However, if full UTF-8 support is provided from version 3.0, I think there must be some notice about this (required UTF-8 compatibility) in the Customization approval guidelines (http://wiki.simplemachines.org/smf/Customization_approval_guidelines) when 3.0 goes public.
Title: Re: [3.0] Full UTF8 support
Post by: Angelina Belle on March 30, 2012, 10:40:12 AM
Yep.  As a doc writer, I am thinking a little bit ahead. If I can persuade savvy mod-writers to share some secrets NOW, then I won't be scrambling later.
Title: Re: [3.0] Full UTF8 support
Post by: agridoc on March 30, 2012, 01:11:56 PM
The Customization approval guidelines should state that a mod must work properly at least with UTF-8, and preferably with both UTF-8 and ANSI, as we are still on 2.0.x and there is quite some time, and quite a few changes in the roadmap, before 3.0 final.

There are many otherwise excellent coders who have little knowledge of database collations and proper UTF-8 implementation. So we see new tables, or new fields in existing tables, created with latin1_swedish_ci collation regardless of whether the SMF installation is UTF-8 or not.

SMF matured for proper UTF-8 with version 1.1 RC3; Compuart did an excellent job then.
Foreign languages will benefit from UTF-8. Although SMF has some multilingual possibilities with ISO/ANSI and a little cheating, UTF-8 has clear advantages in proper sorting and better search. Also, as much other software has matured for UTF-8 or gone UTF-8 only, bridging will usually be without problems.

One question:
As far as I know, SMF from version 1.1 RC3 up is fully compatible with UTF-8. So the main question is whether or not to drop ISO/ANSI. Am I missing anything?
Title: Re: [3.0] Full UTF8 support
Post by: Fustrate on March 30, 2012, 01:20:53 PM
If people don't understand how to make their modifications UTF-8 compatible, we can certainly help them learn.

In order to reduce development and compatibility headaches, though, it's best to use UTF-8 only, instead of allowing a mix of encodings and hoping that most users choose UTF-8.
Title: Re: [3.0] Full UTF8 support
Post by: Angelina Belle on March 30, 2012, 01:46:10 PM
OK. 
I'm just an ignorant doc writer.  So if you write down the most important points, I can try to turn your gibberish into something the newer coders might understand.
Title: Re: [3.0] Full UTF8 support
Post by: agridoc on March 30, 2012, 02:34:40 PM
Quote from: Fustrate on March 30, 2012, 01:20:53 PM
In order to reduce development and compatibility headaches, though, it's best to use UTF-8 only, instead of allowing a mix of encodings and hoping that most users choose UTF-8.

Most users will use the default in installation. With the addition of a warning for ISO/ANSI, there would be no problem with new installations.


@ AngelinaBelle: Some integrated (not Wiki) guidelines for proper language set selection depending on SMF installation would be useful for screens in
Administration Center » Languages
I find it also related to proper Localization.


Is it necessary to wait for 3.0 to make UTF-8 the default preselection? Removing ISO/ANSI is something different. Making UTF-8 the default in an upcoming 2.0.x release, after some additions, would reduce the number of cases described below before SMF 3.0 is ready.

What needs much thought and care are existing forum installations.

If I started my SMF today I would choose UTF-8. However, my SMF started 7 years ago, when SMF was not UTF-8 ready, and neither were hosts and other software. Although I have helped others to convert, I haven't found the time to do mine; it's not only SMF that is involved. The converter works well for all databases with SMF_prefix but needs careful planning and time, which is sometimes not so easy to find. I have seen other software in the past that went UTF-8 only without proper development and a proper converter.

These cases are different. Also, from what I have read, many admins would not like to be forced to UTF-8, although most understand that sooner or later they will have to make the big jump.
Title: Re: [3.0] Full UTF8 support
Post by: spiros on March 30, 2012, 06:41:00 PM
Quote from: agridoc on March 30, 2012, 02:34:40 PM
These cases are different. Also, from what I have read, many admins would not like to be forced to UTF-8, although most understand that sooner or later they will have to make the big jump.

Well, this is the case exactly with Joomla. People had to go along with UTF-8 default (and only option) when upgrading to 1.5 whether they liked it or not. I did not see anybody getting any harm out of it.
Title: Re: [3.0] Full UTF8 support
Post by: agridoc on March 31, 2012, 01:14:20 AM
I have no information for SMF 3.0 and differences from 2.0 but Joomla 1.5 needed a migration, not upgrade, from 1.0 (http://docs.joomla.org/Migrating_from_1.0.x_to_1.5_Stable).
Also, SMF has been UTF-8 ready since 1.1 RC3 and has a built-in converter.

So, I find SMF more user friendly than Joomla. There are different approaches. Joomla is successful software, but that doesn't mean that every decision in its development was the only or the best one.
Title: Re: [3.0] Full UTF8 support
Post by: agridoc on April 01, 2012, 04:32:14 AM
This topic reminded me of something I had in mind. So I did some work to finish an old plan: How to convert to UTF-8 (http://www.simplemachines.org/community/index.php?topic=472777.0)