[g330] SMF, PHP 5.4, htmlspecialchars and non-UTF-8 languages.

Started by digger, February 10, 2013, 03:49:54 PM


digger

Since PHP 5.4, htmlspecialchars() has a new default encoding: it now uses UTF-8 if the third parameter is not defined. But
SMF has many hardcoded htmlspecialchars() calls that bypass $smcFunc and do not specify an encoding, so non-English, non-UTF-8 forums run into trouble.
All htmlspecialchars() calls should be replaced with the proper $smcFunc equivalent.

emanuele

Exactly what instances of htmlspecialchars are creating issues?



digger

Quote from: emanuele on February 10, 2013, 04:10:52 PM
Exactly what instances of htmlspecialchars are creating issues?

For example, I don't see Cyrillic filenames in the "Attachments and Avatars - Browse Files - Attachments" admin area. I see something like "548x730 54.64КБ".

In the ManageAttachments.php file
find
$link .= sprintf(\'>%1$s</a>\', preg_replace(\'~&amp;#(\\\\d{1,7}|x[0-9a-fA-F]{1,6});~\', \'&#\\\\1;\', htmlspecialchars($rowData[\'filename\'])));
replace with
$link .= sprintf(\'>%1$s</a>\', preg_replace(\'~&amp;#(\\\\d{1,7}|x[0-9a-fA-F]{1,6});~\', \'&#\\\\1;\', htmlspecialchars($rowData[\'filename\'], ENT_COMPAT, \'cp1251\')));
or
$link .= sprintf(\'>%1$s</a>\', preg_replace(\'~&amp;#(\\\\d{1,7}|x[0-9a-fA-F]{1,6});~\', \'&#\\\\1;\', htmlspecialchars($rowData[\'filename\'], ENT_COMPAT, \'\')));

and now I see Cyrillic filenames like "_душ.JPG 548x730   54.64КБ"

There are many similar instances in the sources. htmlspecialchars() returns an empty string if it is called without the encoding parameter and the input string is not valid UTF-8.
As a result, users can't change some Cyrillic values in text fields, such as the forum title in the admin area, can't use Cyrillic smiley codes, and don't see Cyrillic filenames of attachments or avatars.
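
A minimal sketch of the behaviour described above, assuming a Windows-1251 (cp1251) byte string. The sample bytes and variable names are illustrative and not taken from SMF; 'UTF-8' is passed explicitly to mirror what a bare call does on PHP 5.4.

```php
<?php
// "душ.JPG" with the Cyrillic part encoded as cp1251 bytes (illustrative sample).
$filename = "\xE4\xF3\xF8.JPG";

// Mirrors the PHP 5.4 default: these bytes are not valid UTF-8,
// so htmlspecialchars() gives back an empty string.
$broken = htmlspecialchars($filename, ENT_COMPAT, 'UTF-8');

// Naming the real encoding lets the bytes pass through unchanged.
$fixed = htmlspecialchars($filename, ENT_COMPAT, 'cp1251');

echo 'UTF-8 call is empty: ', ($broken === '' ? 'yes' : 'no'), "\n";
echo 'cp1251 call is intact: ', ($fixed === $filename ? 'yes' : 'no'), "\n";
```

The same pattern applies to any single-byte encoding: the call only fails when the declared encoding does not match the bytes actually passed in.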

Arantor

In fact it's pretty much every instance of htmlspecialchars that doesn't refer back to $smcFunc.

There is even a bug on Mantis about this, from years back. While there are some that should not be changed to smcFunc instances (I'm thinking primarily strlen here, where you need bytes not characters), I cannot envisage a case where bare htmlspecialchars() should be called without awareness of encoding.
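
To illustrate the bytes-versus-characters distinction mentioned for strlen, here is a small sketch. The sample string is an assumption for illustration, and the regex-based character count is just one way to count characters without extra extensions, not something taken from SMF.

```php
<?php
// "Zoë" as UTF-8 bytes: the "ë" occupies two bytes (0xC3 0xAB).
$name = "Zo\xC3\xAB";

// strlen() counts bytes, which is what size limits on raw data need.
$bytes = strlen($name);

// Matching "." in UTF-8 mode (the "u" modifier) counts characters instead.
$chars = preg_match_all('~.~su', $name, $matches);

echo "bytes: $bytes, characters: $chars\n";
```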

digger

If SMF is meant to be a multilingual forum, this is a critical bug.
Non-UTF-8 forums can't use many functions with the current version of PHP. Either this should be fixed, or the developers should drop support for non-UTF-8 encodings and clearly say so.

Arantor

Dropping non UTF-8 support is a massive undertaking, but entirely doable.


Arantor

I do, but I'm not really in a position to do anything about it here. Elsewhere, that's another story entirely.

IchBin™

Quote from: digger on February 12, 2013, 07:34:50 PM
Nobody cares

One can only care so much when what they do here is volunteer free time out of their own personal life. Nobody gets paid to cater to your every issue or suggestion. If you feel that strongly about it and have a fix to apply, go to GitHub and propose a pull request to fix it in the next version.

https://github.com/SimpleMachines/SMF2.1#readme

theymos

This is a particular problem for forums using ISO-8859-1 because htmlspecialchars will throw away all of its input if it contains a non-breaking space character (because a string containing 0xA0 alone is not valid UTF-8). The non-breaking space character is very common, so this causes many problems: email notifications for PMs will frequently be blank; the file editor "randomly" removes lines; etc.
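
A short sketch of how a single 0xA0 byte wipes out a whole string, assuming ISO-8859-1 content. The line-by-line loop is hypothetical and only meant to show why lines could appear to vanish "randomly"; it is not how SMF's file editor is known to work.

```php
<?php
// Hypothetical ISO-8859-1 content; the second line contains a
// non-breaking space (byte 0xA0).
$lines = array(
    "<b>first line</b>",
    "second\xA0line",
    "third line",
);

// Escape each line the way a bare call behaves on PHP 5.4 (UTF-8 assumed).
$escaped = array();
foreach ($lines as $line) {
    $escaped[] = htmlspecialchars($line, ENT_COMPAT, 'UTF-8');
}

// Only the line containing 0xA0 comes back empty; ASCII-only lines survive.
foreach ($escaped as $i => $line) {
    echo $i, ': ', ($line === '' ? '(empty)' : $line), "\n";
}

// Passing the forum's real encoding keeps that line intact.
$kept = htmlspecialchars($lines[1], ENT_COMPAT, 'ISO-8859-1');
```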

By the way, SMF should IMO not convert multiple spaces to non-breaking spaces, especially in [code] segments. Doing so produces different characters than the poster intended, which can cause problems. I'd keep the multiple spaces but put them in <span style="white-space: pre"> so they aren't collapsed.

Oldiesmann

The reason why SMF has not gone "UTF8-only" is because we still support older versions of MySQL which do not have character set / collation support.

Even with SMF 2.1, we will still be supporting MySQL versions as old as 4.0.18 (though I have no idea why).

At this point we will definitely try to fix as many issues like this as possible, but we can't rewrite half of SMF to support only UTF8 and still expect to get 2.1 final out by the end of the year.

Arantor

MySQL 5.0 stable came out around the same time as SMF 1.1. You really have no reason to have < 5.0 compatibility. There are already issues from even-earlier MySQL support (TYPE vs ENGINE) so going 5.0+ only will fix some of that.

Quote
but we can't rewrite half of SMF to support only UTF8 and still expect to get 2.1 final out by the end of the year.

It's 3 days work to gut the innards and replace it with UTF-8 only, a week tops. (And before anyone tells me otherwise... I already did this. It took 2 days and sporadic bug fixes thereafter, totalling no more than 3 days work for me.)

Your biggest problem there is the upgrader, not the core of SMF.

Oldiesmann

Quote from: Arantor on March 23, 2013, 01:43:37 PM
MySQL 5.0 stable came out around the same time as SMF 1.1. You really have no reason to have < 5.0 compatibility. There are already issues from even-earlier MySQL support (TYPE vs ENGINE) so going 5.0+ only will fix some of that.

Quote
but we can't rewrite half of SMF to support only UTF8 and still expect to get 2.1 final out by the end of the year.

It's 3 days work to gut the innards and replace it with UTF-8 only, a week tops. (And before anyone tells me otherwise... I already did this. It took 2 days and sporadic bug fixes thereafter, totalling no more than 3 days work for me.)

Your biggest problem there is the upgrader, not the core of SMF.

Given the flak I received from emanuele about wanting to push a major feature improvement into 2.1, I sincerely doubt we'll see any changes to support newer versions of MySQL and/or drop support for non-UTF8 languages for that version. One can dream though :)

redone

Wouldn't it make sense for Arantor to share his fix? I would not consider this a "new feature", though versions typically do get feature-frozen for obvious reasons.

Seems fairly common sense to me. Maybe I am crazy! ;)

~redone

Biology Forums

Hi Digger, a bit off topic, but how did you get ulogin to work on your smf website?

Arantor

The fix involves hundreds and hundreds of changes to SMF, to make it UTF-8 only. It is not a simple fix, and providing even the diff would be useless as a great many changes had already occurred by that time.

^HeRaCLeS^

In my modifications I solved this as follows:

htmlspecialchars($string, ENT_QUOTES, $context['character_set']);

I don't know if it's the best way, but it works.

digger

Quote from: Liam_michael on March 23, 2013, 02:08:58 PM
Hi Digger, a bit off topic, but how did you get ulogin to work on your smf website?

I just installed the ulogin mod, without any problems.

Matthew K.

Quote from: ^HeRaCLeS^ on April 12, 2013, 04:42:21 PM
In my modifications I solved this as follows:

htmlspecialchars($string, ENT_QUOTES, $context['character_set']);

I don't know if it's the best way, but it works.

And why not just use $smcFunc['htmlspecialchars'](), which takes the character set into account automatically? As was already stated, that is the reason for using $smcFunc['htmlspecialchars']() over "plain" htmlspecialchars().
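
For readers unfamiliar with the idea, a charset-aware wrapper can be sketched roughly as below. This is a hypothetical helper, not SMF's actual $smcFunc['htmlspecialchars'] implementation; the $context array here is a stand-in, and both the 'ISO-8859-1' value and the fallback are assumptions.

```php
<?php
// Stand-in for the forum's configured character set (assumed value).
$context = array('character_set' => 'ISO-8859-1');

// Hypothetical wrapper: always pass the configured character set to
// htmlspecialchars() instead of relying on PHP's version-dependent default.
$escape = function ($string, $flags = ENT_COMPAT) use ($context) {
    $charset = empty($context['character_set']) ? 'ISO-8859-1' : $context['character_set'];
    return htmlspecialchars($string, $flags, $charset);
};

// "Café" in ISO-8859-1 (0xE9 is "é") plus some markup and quotes.
$result = $escape("Caf\xE9 <b>\"menu\"</b>", ENT_QUOTES);
echo $result, "\n";
```

Routing every call through one wrapper means the encoding is decided in a single place, which is the point made above about preferring the $smcFunc version over bare htmlspecialchars().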

^HeRaCLeS^

Labradoodle-360: I have a question.
If using that code is better, why is it not used throughout SMF?
