utf8 not properly detected

Started by alkisg, February 19, 2011, 10:34:08 AM

alkisg

When $smcFunc['substr'] is used on non-ASCII UTF-8 strings, where one character is usually 2 bytes, it returns the first N bytes instead of the first N characters.
So instead of limiting subjects to, say, 100 characters, you're limiting them to 50 international characters, which is too few.

Using SMF 2.0 RC5 with greek/utf8 locale.
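
For illustration only, here is a minimal sketch of the byte-versus-character difference described above. It uses plain PHP functions rather than $smcFunc, the sample string is made up for the example, and it assumes the file itself is saved as UTF-8.

```php
<?php
// Made-up sample: five Greek letters, each encoded as 2 bytes in UTF-8.
$subject = 'αβγδε';

// strlen() and substr() work on bytes, not characters.
$byteLength = strlen($subject);
$byteCut = substr($subject, 0, 4); // 4 bytes, i.e. only the first 2 letters

// With the PCRE 'u' modifier, '.' matches one whole UTF-8 character.
$charLength = preg_match_all('/./su', $subject);

echo $byteLength, ' bytes, ', $charLength, ' characters, cut: ', $byteCut, "\n";
```

By the same arithmetic, a limit of 100 applied to bytes leaves room for only 50 two-byte characters.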

Arantor

Funny, since it passes the u parameter to preg_split to tell it to use UTF-8 matching, and in all my tests it's worked correctly every time, both in ISO and UTF-8. That assumes $utf8 is set, of course.

Also, note that the subject limit is 80, not 100, and that it uses VARCHAR, so assuming you have the database collation correct the DB won't be limited either.
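
To show what the u parameter changes, here is a rough sketch of a split-and-slice substring. This is not SMF's actual code: sketch_substr and its $utf8 argument are hypothetical names, and the pattern is deliberately simpler than whatever the real function uses.

```php
<?php
// Hypothetical helper, NOT the real $smcFunc['substr'] implementation.
function sketch_substr($string, $start, $length, $utf8)
{
    // With the 'u' modifier PCRE reads the subject as UTF-8, so the empty
    // pattern splits between characters. Without it, it splits between bytes.
    $parts = preg_split('//' . ($utf8 ? 'u' : ''), $string, -1, PREG_SPLIT_NO_EMPTY);

    return implode('', array_slice($parts, $start, $length));
}

$subject = 'αβγδε'; // made-up sample, 2 bytes per letter

$asUtf8 = sketch_substr($subject, 0, 3, true);   // first 3 characters
$asBytes = sketch_substr($subject, 0, 3, false); // first 3 bytes only
```

With the flag off, the result is the same as a plain byte-based substr(), which would match the behaviour reported in the first post.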

alkisg

No, not in the DB. The 100-character limit (which works out to 50 international characters) is hardcoded in the source:



$ grep -rw 100 * | grep substr
MoveTopic.php: $_POST['custom_subject'] = $smcFunc['substr']($_POST['custom_subject'], 0, 100);
Post.php: $form_subject = $smcFunc['substr']($form_subject, 0, 100);
Post.php: $_POST['subject'] = $smcFunc['substr']($_POST['subject'], 0, 100);
Post.php: $_POST['subject'] = $smcFunc['substr']($_POST['subject'], 0, 100);
SplitTopics.php: $new_subject = $smcFunc['substr']($new_subject, 0, 100);
SplitTopics.php: $target_subject = $smcFunc['substr']($target_subject, 0, 100);
Subs-Package.php: 'preview' => substr($current['data'], 0, 100),
Subs-Package.php: 'preview' => substr($file_info['data'], 0, 100),


When I changed those to "200" I could again use 100 international chars in subjects.

Here's my phpinfo(), if relevant.

Arantor

Firstly, the ones in Subs-Package should be left alone.

But $smcFunc['substr'] DOES actually handle Unicode characters correctly, unless either your configuration is screwed up (and the forum thinks it's not in UTF-8 mode) or your version of PHP is old enough that its bundled PCRE can't handle Unicode correctly.

alkisg

You're right, the problem isn't in $smcFunc['substr']; it's in the $utf8 variable, which isn't set to "true" as it should be.

I restored all the "200" substr parameters to "100" as they were.

Then I modified "Load.php" to get some more info:

// UTF-8 in regular expressions is unsupported on PHP(win) versions < 4.2.3.
$utf8 = (empty($modSettings['global_character_set']) ? $txt['lang_character_set'] : $modSettings['global_character_set']) === 'UTF-8' && (strpos(strtolower(PHP_OS), 'win') === false || @version_compare(PHP_VERSION, '4.2.3') != -1);

// I put the following lines:
print "\nglobal_character_set=" . $modSettings['global_character_set'];
print "\nlang_character_set=" . $txt['lang_character_set'];
print "\nPHP_OS=" . PHP_OS;
print "\nPHP_VERSION=" . PHP_VERSION;
print "\nutf8=$utf8\n";
$utf8 = true; // force UTF-8 mode, as a test


And I got the following output:

Notice: Undefined index: global_character_set in /alkisg/tosteki/Sources/Load.php on line 181

global_character_set=
lang_character_set=
PHP_OS=SunOS
PHP_VERSION=5.2.12
utf8=


My index.greek-utf8.php correctly contains:

$txt['lang_character_set'] = 'UTF-8';

so I've yet to pinpoint where the problem lies.


But forcing $utf8 = true did solve all my other problems, so the original topic title was completely wrong; I'm changing it to "utf8 not properly detected".
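
For anyone following along, here is a small sketch of why the quoted expression comes out false with the values printed above. sketch_detect_utf8 is a hypothetical helper that mirrors the Load.php line; the isset() guards are an addition of the sketch so it runs without notices.

```php
<?php
// Hypothetical helper mirroring the $utf8 expression quoted from Load.php.
function sketch_detect_utf8(array $modSettings, array $txt)
{
    // isset() guards are added here only to avoid undefined-index notices.
    $global = isset($modSettings['global_character_set']) ? $modSettings['global_character_set'] : '';
    $lang = isset($txt['lang_character_set']) ? $txt['lang_character_set'] : '';

    // Fall back to the language string when the global setting is empty.
    $charset = empty($global) ? $lang : $global;

    $onWindows = strpos(strtolower(PHP_OS), 'win') !== false;

    return $charset === 'UTF-8' && (!$onWindows || version_compare(PHP_VERSION, '4.2.3') != -1);
}

// Both values empty, as in the debug output.
$observed = sketch_detect_utf8(array(), array('lang_character_set' => ''));

// The language string available at the time of the check.
$expected = sketch_detect_utf8(array(), array('lang_character_set' => 'UTF-8'));

var_dump($observed, $expected);
```

Since both character set values are empty when the check runs, the comparison with 'UTF-8' fails no matter what PHP_OS or PHP_VERSION are. The blank "utf8=" in the output is just PHP printing boolean false as an empty string.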

Arantor

Was it installed as UTF-8 or was it converted to UTF-8 after installation?

It's not a detection issue. Your language file includes the right definition, but the line isn't being added to Settings.php to set it for everything else (it needs to be declared there, nice and early).

alkisg

I converted it to UTF-8 in SMF 2.0 RC1, if I remember correctly (I've been upgrading since tripodyabb).
I will add it to Settings.php manually; right now it only contains:
$language = 'greek-utf8';      # The default language file set for the forum.

Thank you very much for your help. :)

Norv

There was a bug in SMF 2.0 before RC4: on some international forums, the Settings.php file would, in particular cases, lose the correct setting:

$db_character_set='utf8'; // the respective setting for the forum

This should be solved in SMF now; however, if your file doesn't have it, please do add it.

alkisg

I already had it there:

$db_character_set = 'UTF8';


I tried to change it to lowercase, but it didn't make any difference.

So I just added this line to Settings.php, which made everything run smoothly:

$txt['lang_character_set'] = 'UTF-8';

bcaza

This seems to have solved my issue in 2.0.17 - thanks a lot!  Had over 32000 logged errors!
