Author Topic: utf8 not properly detected  (Read 2126 times)

Offline alkisg

  • Semi-Newbie
  • *
  • Posts: 28
utf8 not properly detected
« on: February 19, 2011, 10:34:08 AM »
When $smcFunc['substr'] is used on non-ASCII UTF-8 strings, where 1 character is usually 2 bytes, it returns the first N bytes instead of the first N characters.
So instead of limiting subjects to, e.g., 100 characters, you're limiting them to 50 international characters, which is too few.

Using SMF 2.0 RC5 with greek/utf8 locale.
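
To make the arithmetic concrete, here is a minimal sketch of what a byte-based cut does to a Greek subject. It uses PHP's plain substr() as a stand-in for the byte-counting behaviour described above; it is not SMF code, and the variable names are only illustrative.

```php
<?php
// "\xCE\xB1" is the UTF-8 encoding of the Greek letter alpha (2 bytes).
$subject = str_repeat("\xCE\xB1", 80);   // 80 characters, 160 bytes

// Plain substr() counts bytes, so a limit of 100 keeps only 100 bytes.
$cut = substr($subject, 0, 100);

$bytes = strlen($cut);
// With the /u modifier, "." matches one UTF-8 character at a time,
// so the number of matches is the number of characters.
$chars = preg_match_all('/./us', $cut, $m);

echo "bytes=$bytes chars=$chars\n";   // prints "bytes=100 chars=50"
```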
« Last Edit: February 19, 2011, 11:49:54 AM by alkisg »

Offline Arantor

  • Resident Overthinker
  • SMF Friend
  • SMF Legend
  • *
  • Posts: 73,189
Re: smcFunc['substr'] counts bytes not utf8 chars
« Reply #1 on: February 19, 2011, 10:38:04 AM »
Funny, since it passes the u parameter to preg_split to tell it to use UTF-8 matching, and in all my tests it's worked correctly every time, both in ISO and UTF-8. That assumes $utf8 is set, of course.
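
As a rough illustration of why the u modifier matters, here is a simplified sketch of a character-aware substr built on preg_split. This is not SMF's actual implementation; the function name and the $utf8 parameter are assumptions made for the example.

```php
<?php
// Hypothetical helper: split the string into units, then slice the array.
// With the "u" modifier the empty pattern matches between UTF-8 characters;
// without it, the pattern matches between single bytes.
function utf8_substr_sketch($string, $start, $length, $utf8 = true)
{
    $pattern = $utf8 ? '~~u' : '~~';
    $units = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

    return implode('', array_slice($units, $start, $length));
}

// Greek "alpha beta gamma": 3 characters, 6 bytes.
$s = "\xCE\xB1\xCE\xB2\xCE\xB3";

$asChars = utf8_substr_sketch($s, 0, 2, true);   // first 2 characters (4 bytes)
$asBytes = utf8_substr_sketch($s, 0, 2, false);  // first 2 bytes (1 character)
```

If the flag that selects the pattern is false, the same call silently falls back to counting bytes, which matches the symptom reported above.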

Also, note that the subject limit is 80, not 100, and that it uses VARCHAR, so assuming you have the database collation correct the DB won't be limited either.
No good deed goes unpunished
All helpful urges should be circumvented

Offline alkisg

Re: smcFunc['substr'] counts bytes not utf8 chars
« Reply #2 on: February 19, 2011, 10:49:48 AM »
No, not in the DB. The 100-character limit (which results in 50 international chars) is hardcoded in the source:

Code: [Select]

$ grep -rw 100 * | grep substr
MoveTopic.php: $_POST['custom_subject'] = $smcFunc['substr']($_POST['custom_subject'], 0, 100);
Post.php: $form_subject = $smcFunc['substr']($form_subject, 0, 100);
Post.php: $_POST['subject'] = $smcFunc['substr']($_POST['subject'], 0, 100);
Post.php: $_POST['subject'] = $smcFunc['substr']($_POST['subject'], 0, 100);
SplitTopics.php: $new_subject = $smcFunc['substr']($new_subject, 0, 100);
SplitTopics.php: $target_subject = $smcFunc['substr']($target_subject, 0, 100);
Subs-Package.php: 'preview' => substr($current['data'], 0, 100),
Subs-Package.php: 'preview' => substr($file_info['data'], 0, 100),

When I changed those to "200" I could again use 100 international chars in subjects.

Here's my phpinfo(), if relevant.

Offline Arantor

Re: smcFunc['substr'] counts bytes not utf8 chars
« Reply #3 on: February 19, 2011, 10:52:52 AM »
Firstly, the ones in Subs-Package should be left alone.

But $smcFunc['substr'] DOES actually handle Unicode characters correctly, unless either your configuration is screwed up (and the forum thinks it's not in UTF-8 mode) or your version of PHP is old enough that its bundled PCRE can't handle Unicode correctly.

Offline alkisg

Re: utf8 not properly detected
« Reply #4 on: February 19, 2011, 11:49:24 AM »
You're right, the problem isn't in smcFunc['substr']; it's in the $utf8 variable, which isn't set to "true" as it should be.

I restored all the "200" substr parameters to "100" as they were.

Then I modified "Load.php" to get some more info:
Code: [Select]
// UTF-8 in regular expressions is unsupported on PHP(win) versions < 4.2.3.
$utf8 = (empty($modSettings['global_character_set']) ? $txt['lang_character_set'] : $modSettings['global_character_set']) === 'UTF-8' && (strpos(strtolower(PHP_OS), 'win') === false || @version_compare(PHP_VERSION, '4.2.3') != -1);

// I put the following lines:
print "\nglobal_character_set=" . $modSettings['global_character_set'];
print "\nlang_character_set=" . $txt['lang_character_set'];
print "\nPHP_OS=" . PHP_OS;
print "\nPHP_VERSION=" . PHP_VERSION;
print "\nutf8=$utf8\n";
$utf8 = true;

And I got the following output:
Code: [Select]
<b>Notice</b>:  Undefined index:  global_character_set in <b>/alkisg/tosteki/Sources/Load.php</b> on line <b>181</b><br />
 
global_character_set=
lang_character_set=
PHP_OS=SunOS
PHP_VERSION=5.2.12
utf8=

My index.greek-utf8.php correctly contains:
Code: [Select]
$txt['lang_character_set'] = 'UTF-8';
so I've yet to pinpoint where the problem lies.
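
The empty utf8= in the output follows directly from the expression quoted above: with both settings empty, the comparison against 'UTF-8' fails regardless of the OS/version check. Here is a minimal sketch of that logic. The function name is hypothetical, and the OS and PHP version are passed in as parameters only so the check can be run in isolation.

```php
<?php
// Hypothetical helper mirroring the $utf8 expression from Load.php.
// The function name and parameter list are illustrative, not part of SMF.
function detect_utf8_sketch(array $modSettings, array $txt, $os, $phpVersion)
{
    // Prefer the global setting; fall back to the language file's value.
    if (!empty($modSettings['global_character_set'])) {
        $charset = $modSettings['global_character_set'];
    } elseif (isset($txt['lang_character_set'])) {
        $charset = $txt['lang_character_set'];
    } else {
        $charset = '';
    }

    // Same OS/version condition as the original line.
    $regexOk = strpos(strtolower($os), 'win') === false
        || version_compare($phpVersion, '4.2.3') != -1;

    return $charset === 'UTF-8' && $regexOk;
}

// Both settings empty, as in the debug output: detection fails.
var_dump(detect_utf8_sketch(array(), array(), 'SunOS', '5.2.12'));   // bool(false)

// Language string available at this point: detection succeeds.
var_dump(detect_utf8_sketch(array(), array('lang_character_set' => 'UTF-8'), 'SunOS', '5.2.12'));   // bool(true)
```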


But forcing $utf8=true did solve all my other problems, so the topic title is completely wrong; I'm changing it to "utf8 not properly detected".
« Last Edit: February 19, 2011, 11:54:09 AM by alkisg »

Offline Arantor

Re: utf8 not properly detected
« Reply #5 on: February 19, 2011, 11:53:03 AM »
Was it installed as UTF-8 or was it converted to UTF-8 after installation?

It's not a detection issue. While your language file includes the right definition, the line isn't being added to Settings.php to set it for everything else (it needs to be declared there, nice and early).

Offline alkisg

Re: utf8 not properly detected
« Reply #6 on: February 19, 2011, 11:57:10 AM »
I converted it to UTF-8 in SMF 2.0 RC1, if I remember correctly (I've been upgrading since tripodyabb).
I will add it to Settings.php manually; right now it only contains:
$language = 'greek-utf8';      # The default language file set for the forum.

Thank you very much for your help. :)

Offline Norv

  • SMF Friend
  • SMF Super Hero
  • *
  • Posts: 18,313
  • Blue Wolf
Re: utf8 not properly detected
« Reply #7 on: February 19, 2011, 12:15:14 PM »
There was a bug in SMF 2.0 before RC4: on some international forums, the Settings.php file was, in particular cases, losing the correct setting:
Code: [Select]
$db_character_set='utf8'; // the respective setting for the forum
This should be fixed in SMF by now. However, if your file doesn't have it, please do add it.
To-do lists are for deferral. The more things you write down the later they're done… until you have 100s of lists of things you don't do.

Offline alkisg

Re: utf8 not properly detected
« Reply #8 on: February 19, 2011, 01:00:22 PM »
I already had it there:
Code: [Select]
$db_character_set = 'UTF8';

I tried to change it to lowercase, but it didn't make any difference.

So I just added this line to Settings.php, which made everything run smoothly:
Code: [Select]
$txt['lang_character_set'] = 'UTF-8';

Offline bcaza

  • Newbie
  • *
  • Posts: 5
Re: utf8 not properly detected
« Reply #9 on: October 13, 2020, 10:13:19 AM »
This seems to have solved my issue in 2.0.17 - thanks a lot!  Had over 32000 logged errors!