Post not complete after hitting POST Button
Johnny B:
Sweet, I like that. There's always room for improvement, my father always said. Fingers crossed.
MrPhil:
I did some experimenting, and at least some browsers on some PCs will translate cut-and-pasted Smart Quotes into their UTF-8 equivalents for forums set up as UTF-8. I don't know what will happen if the database is converted to UTF-8, but I suspect that any Smart Quotes already in the database won't be translated. Since this translation of Smart Quotes on the way in (or out) is probably done at the whim of the browser author, and I haven't heard of any official standard for handling it, I hate to recommend something as major as converting your database and forum to UTF-8. There's no telling when that will work and when it won't.
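For illustration, here is a minimal sketch of what that translation amounts to at the byte level, using the right single quote (apostrophe) as the example. The helper name codepoint_to_utf8 is hypothetical (it is neither an SMF function nor a PHP built-in), and nothing here has been run against SMF itself.

```php
<?php
// Hypothetical helper (not part of SMF or PHP): encode one Unicode
// codepoint as its UTF-8 byte sequence.
function codepoint_to_utf8($cp) {
  if ($cp < 0x80) {
    return chr($cp); // plain ASCII, one byte
  }
  if ($cp < 0x800) {
    return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
  }
  if ($cp < 0x10000) {
    return chr(0xE0 | ($cp >> 12))
      . chr(0x80 | (($cp >> 6) & 0x3F))
      . chr(0x80 | ($cp & 0x3F));
  }
  return chr(0xF0 | ($cp >> 18))
    . chr(0x80 | (($cp >> 12) & 0x3F))
    . chr(0x80 | (($cp >> 6) & 0x3F))
    . chr(0x80 | ($cp & 0x3F));
}

// In CP-1252 the right single quote is the single byte 0x92 ...
$cp1252_quote = "\x92";
// ... while in UTF-8 the same character (U+2019) takes three bytes.
$utf8_quote = codepoint_to_utf8(0x2019);

echo bin2hex($cp1252_quote), "\n"; // 92
echo bin2hex($utf8_quote), "\n";   // e28099
```

A browser that "translates" a pasted Smart Quote is, in effect, sending the three-byte form instead of the one-byte form. A raw 0x92 that reaches a UTF-8 page untranslated is not a valid UTF-8 sequence on its own.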
Try this. It translates Smart Quotes "on the way out" to HTML entities. It should work for any single-byte encoding (as well as UTF-8), but it has only been tested on a UTF-8 SMF 2.0.2 forum, so use it at your own risk and be ready to back it out. On UTF-8, it tries to determine whether a byte sequence is already a legitimate UTF-8 character, and if so, doesn't touch it. Otherwise (and on single-byte encodings such as Latin-1), it changes any byte in the range 0x80 through 0x9F to an HTML entity. UTF-8 is the only multibyte encoding I've tried -- it will probably fail on UTF-16 or other multibyte encodings, but very few people use those (and they should not install this patch).
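The "is it already legitimate UTF-8?" test comes down to a couple of bit masks. A rough sketch follows; the two function names are made up for this illustration and are not the names used in the patch.

```php
<?php
// Hypothetical helper: if $byte is a UTF-8 lead byte, return how many
// 10xx xxxx continuation bytes should follow it; otherwise return 0.
function utf8_follow_count($byte) {
  if (($byte & 0xE0) == 0xC0) return 1; // 110x xxxx
  if (($byte & 0xF0) == 0xE0) return 2; // 1110 xxxx
  if (($byte & 0xF8) == 0xF0) return 3; // 1111 0xxx
  return 0;
}

// Hypothetical helper: true when $byte looks like 10xx xxxx.
function is_utf8_continuation($byte) {
  return ($byte & 0xC0) == 0x80;
}

// U+201C (left double quote) in UTF-8 is E2 80 9C: one lead byte plus
// two continuation bytes.
$seq = "\xE2\x80\x9C";
echo utf8_follow_count(ord($seq[0])), "\n";   // 2
var_dump(is_utf8_continuation(ord($seq[1]))); // bool(true)
var_dump(is_utf8_continuation(ord($seq[2]))); // bool(true)

// A bare 0x93 (the CP-1252 left double quote) is not a lead byte, so it
// would be handled as a Smart Quote.
echo utf8_follow_count(0x93), "\n";           // 0
```

Note that both continuation bytes in this example (0x80 and 0x9C) fall inside the 0x80 through 0x9F range, which is why a UTF-8 forum can't simply replace every such byte: valid sequences have to be recognized and skipped first.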
It is only applied to text that is potentially subject to BBCode expansion (it is called from parse_bbc). Text that does not undergo BBCode processing (e.g., topic subjects) will not be fixed up. Perhaps someone with the time and inclination can fix that (the fix doesn't depend on BBCode; that was just a convenient place to call it from). This patch is just a Quick'n'Dirty fix; others are welcome to fix it up or redo it in more elegant code.
All changes are in Sources/Subs.php. Applying this to SMF 1.x will probably work, but you may have to hunt for the right place in parse_bbc() to apply the changes.
Find
--- Code: --- // Never show smileys for wireless clients. More bytes, can't see it anyway :P.
--- End code ---
and change to
--- Code: --- // Just in case it wasn't determined yet whether UTF-8 is enabled.
if (!isset($context['utf8']))
  $context['utf8'] = (empty($modSettings['global_character_set']) ? $txt['lang_character_set'] : $modSettings['global_character_set']) === 'UTF-8';
// always fix up "Smart Quotes" going out to the browser, no matter what the encoding
$message = fix_SmartQuotes($message);
// Never show smileys for wireless clients. More bytes, can't see it anyway :P.
--- End code ---
Find
--- Code: --- // Just in case it wasn't determined yet whether UTF-8 is enabled.
if (!isset($context['utf8']))
  $context['utf8'] = (empty($modSettings['global_character_set']) ? $txt['lang_character_set'] : $modSettings['global_character_set']) === 'UTF-8';
--- End code ---
and change to (comment out):
--- Code: ---// // Just in case it wasn't determined yet whether UTF-8 is enabled.
// if (!isset($context['utf8']))
//   $context['utf8'] = (empty($modSettings['global_character_set']) ? $txt['lang_character_set'] : $modSettings['global_character_set']) === 'UTF-8';
--- End code ---
At the very end of the file, find
--- Code: ---?>
--- End code ---
and insert before it:
--- Code: ---// Translate MS Smart Quotes into HTML entities, so browsers don't choke
// on them as control codes. Some browsers treat Latin-1 as CP-1252, and
// some pages may already be in CP-1252; in either case this should be safe.
// This is done upon text output, and only for text which can have BBCodes.
// Database functions which might be searching or sorting text on UTF-8, and
// are run against the text before this routine is called, might still fail.
// author: SMF's MrPhil (see catskilltech.com)
function fix_SmartQuotes($message) {
  global $context;
  // sometimes a numeric entity is used, because the named entity
  // may not yet be widely supported
  // NOTE: the five numeric entities are shown here as '& #nnn;'; remove the
  // space after the '&' before using this code
  $SQ_list = array(
    // comment columns: Unicode codepoint, Smart Quote (CP-1252) usage, Unicode C1 control usage
    0x80 => '&euro;',   // 20AC Euro, reserved control
    0x81 => '?',        // unused, reserved control
    0x82 => '&sbquo;',  // 201A Low-"9" opening quotation mark, Break Permitted Here
    0x83 => '& #402;',  // 0192 &fnof Florin/script f/folder, No Break Here
    0x84 => '&bdquo;',  // 201E Low-"99" opening quotation mark, Index
    0x85 => '&hellip;', // 2026 Ellipsis, Next Line
    0x86 => '&dagger;', // 2020 Single dagger, Start of Selected Area
    0x87 => '&Dagger;', // 2021 Double dagger, End of Selected Area
    0x88 => '&circ;',   // 02C6 Circumflex ^ accent (combining?), Character Tabulation Set
    0x89 => '&permil;', // 2030 o/oo per mille, Character Tabulation with Justification
    0x8A => '& #352;',  // 0160 &Scaron S + caron accent, Line Tabulation Set
    0x8B => '&lsaquo;', // 2039 Single left angle quote < (guillemet), Partial Line Down
    0x8C => '&OElig;',  // 0152 OE ligature, Partial Line Up
    0x8D => '?',        // unused, Reverse Line Feed
    0x8E => '& #381;',  // 017D &Zcaron Z + caron accent, Single Shift Two
    0x8F => '?',        // unused, Single Shift Three
    0x90 => '?',        // unused, Device Control String
    0x91 => '&lsquo;',  // 2018 "6" opening quotation mark, Private Use One
    0x92 => '&rsquo;',  // 2019 "9" closing quotation mark/apostrophe, Private Use Two
    0x93 => '&ldquo;',  // 201C "66" opening quotation mark, Set Transmit State
    0x94 => '&rdquo;',  // 201D "99" closing quotation mark, Cancel Character
    0x95 => '&bull;',   // 2022 Solid bullet, Message Waiting
    0x96 => '&ndash;',  // 2013 En-dash, Start of Guarded Area
    0x97 => '&mdash;',  // 2014 Em-dash, End of Guarded Area
    0x98 => '&tilde;',  // 02DC Tilde ~ accent (combining?), Start of String
    0x99 => '&trade;',  // 2122 Trademark TM, reserved control
    0x9A => '& #353;',  // 0161 &scaron s + caron accent, Single Character Introducer
    0x9B => '&rsaquo;', // 203A Single right angle quote > (guillemet), Control Sequence Introducer
    0x9C => '&oelig;',  // 0153 oe ligature, String Terminator
    0x9D => '?',        // unused, Operating System Command
    0x9E => '& #382;',  // 017E &zcaron z + caron accent, Privacy Message
    0x9F => '&Yuml;',   // 0178 Y + diaeresis/umlaut accent, Application Program Command
  );
  $new_message = '';
  // $message itself is never modified, so its length can be taken once
  $len = strlen($message);
  if (!empty($context['utf8'])) {
    // we are in multibyte UTF-8 mode, so need to skip legitimate UTF-8
    // sequences that may contain x80-9F bytes inside them
    for ($i = 0; $i < $len; $i++) {
      $c = ord($message[$i]);
      // lead byte 110x xxxx, followed by one 10xx xxxx byte, or
      //           1110 xxxx, followed by two, or
      //           1111 0xxx, followed by three
      // if so, is legitimate UTF-8 sequence, don't modify
      $utf8_seq = 0; // not UTF-8 (zero 10xx xxxx bytes to follow)
      if (($c & 0xE0) == 0xC0) {
        $utf8_seq = 1;
      } elseif (($c & 0xF0) == 0xE0) {
        $utf8_seq = 2;
      } elseif (($c & 0xF8) == 0xF0) {
        $utf8_seq = 3;
      }
      for ($j = 0; $j < $utf8_seq; $j++) {
        // j+1st following byte should be 10xx xxxx
        // but first, are we running off the end of $message?
        // shouldn't happen with well-formed UTF-8 characters...
        if ($i + $j + 1 >= $len) {
          $utf8_seq = 0;
          break;
        }
        if ((ord($message[$i + $j + 1]) & 0xC0) != 0x80) {
          $utf8_seq = 0;
          break;
        }
      }
      // skip over next $utf8_seq bytes as a legitimate UTF-8 sequence
      // or process single byte as possible Smart Quote
      if ($utf8_seq == 0) {
        if ($c >= 0x80 && $c <= 0x9F) {
          $new_message .= $SQ_list[$c]; // replace by HTML entity
        } else {
          $new_message .= chr($c); // use original character
          // note that originally malformed UTF-8 won't be fixed
        }
      } else {
        $new_message .= substr($message, $i, $utf8_seq + 1); // use original bytes
        $i += $utf8_seq; // end of loop adds another 1
      }
    } // end of for loop through $message
  } else {
    // we are in a single byte mode, so go ahead and fix any
    // x80-9F bytes
    for ($i = 0; $i < $len; $i++) {
      $c = ord($message[$i]);
      if ($c >= 0x80 && $c <= 0x9F) {
        $new_message .= $SQ_list[$c];
      } else {
        $new_message .= chr($c);
      }
    }
  }
  return $new_message;
}
--- End code ---
There are five (5) entries in the $SQ_list table where you need to close up & and # (remove the space) to form proper HTML entities. This is because numeric entities get processed in the forum display into the actual characters.
Added: That last chunk of code is available as the attachment SQ_fix.txt. Just cut and paste it into place (just before the closing ?> ). The attachment needs none of the fixups described above.
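For anyone who wants to redo this more compactly, here is one possible sketch of the same idea using a regular expression. It is not the patch above and has not been tried on a live forum. The function name fix_smart_quotes_sketch and the explicit $is_utf8 parameter are assumptions made for the illustration, the entity table is cut down to the four curly quotes, and the anonymous function assumes PHP 5.3 or later.

```php
<?php
// Sketch only: a regex-based take on the Smart Quote fix-up.
// The name, the $is_utf8 parameter and the shortened table are
// illustrative assumptions, not part of SMF or of the patch above.
function fix_smart_quotes_sketch($message, $is_utf8) {
  // Abbreviated table; a full version would cover all of 0x80-0x9F.
  $entities = array(
    0x91 => '&lsquo;',
    0x92 => '&rsquo;',
    0x93 => '&ldquo;',
    0x94 => '&rdquo;',
  );
  // In UTF-8 mode, structurally valid multibyte sequences are matched
  // first so their continuation bytes are consumed whole; only a byte
  // left over in 0x80-0x9F is matched on its own. No /u flag is used,
  // so the pattern works on raw bytes.
  $pattern = $is_utf8
    ? '/[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3}|[\x80-\x9F]/'
    : '/[\x80-\x9F]/';
  return preg_replace_callback($pattern, function ($m) use ($entities) {
    if (strlen($m[0]) > 1) {
      return $m[0]; // multibyte sequence: leave untouched
    }
    $c = ord($m[0]);
    // Bytes missing from the abbreviated table are passed through as-is.
    return isset($entities[$c]) ? $entities[$c] : $m[0];
  }, $message);
}

// Single-byte forum: both CP-1252 quotes become entities.
echo fix_smart_quotes_sketch("\x93Hi\x94", false), "\n";
// UTF-8 forum: the real UTF-8 opening quote (E2 80 9C) is kept and
// only the stray 0x94 byte is replaced.
echo fix_smart_quotes_sketch("\xE2\x80\x9CHi\x94", true), "\n";
```

Like the patch, this only checks the lead-byte/continuation-byte structure; it does not reject overlong or otherwise invalid UTF-8, and malformed input is passed through unchanged.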
MrPhil:
Have you had a chance to try this code? Please let me know if it works for you, or if you experience problems. If it works well, it might get packaged up into a mod.
phlexx:
Please, has this been tested by anyone using SMF 2.0.2? Does it work on 2.0.2?
MrPhil:
It was developed and tested (briefly) on SMF 2.0.2. Just be sure to back up the file before editing it, in case something goes wrong.