Post not complete after hitting POST Button

Johnny B:
Sweet, I like that. "There's always room for improvement," my father always said. Fingers crossed.

MrPhil:
I did some experimenting, and at least some browsers on some PCs will translate cut and pasted Smart Quotes into UTF-8 equivalents for forums set up as UTF-8. I don't know what will happen if the database is converted to UTF-8, but I suspect that any Smart Quotes already in the database won't be translated to UTF-8. Since this translation of Smart Quotes on the way in (or out) is probably done on the whim of the browser author, and I haven't heard of any official standard to handle it, I hate to recommend doing something major like converting your database and forum to UTF-8. There's no telling when that will work and when it won't.
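To make "UTF-8 equivalents" concrete, here is a small sketch of what such a translation amounts to for one character. It assumes the pasted text arrived as CP-1252 (Windows-1252) and that PHP's iconv extension is available; the variable names are only for illustration, and this is not what any particular browser is known to do internally.

```php
<?php
// Byte 0x93 is the CP-1252 "smart" left double quotation mark.
$cp1252 = "\x93";

// Re-encode it as UTF-8: the same character, U+201C, now takes three bytes.
$utf8 = iconv('CP1252', 'UTF-8', $cp1252);

echo bin2hex($cp1252), "\n"; // 93
echo bin2hex($utf8), "\n";   // e2809c
```

Note that the UTF-8 form itself contains the bytes 0x80 and 0x9C, so on a UTF-8 forum a fix cannot simply replace every byte in the 0x80 through 0x9F range.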

Try this. It translates Smart Quotes "on the way out" to HTML entities. It should work for any single byte encoding (as well as UTF-8), but has only been tested on a UTF-8 SMF 2.0.2 forum, so use at your own risk and be ready to back it out. On UTF-8, it tries to determine if a byte sequence is already a legitimate UTF-8 character, and if so, doesn't touch it. Otherwise (and on single byte encodings such as Latin-1), it changes anything in the range x80 through x9F to HTML entities. UTF-8 is the only multibyte encoding I've tried -- it will probably fail on UTF-16 or other multibyte encodings, but very few people use those (and they should not install this patch).

It is applied only to text that is potentially subject to BBCode expansion (it is called from parse_bbc). Text that does not undergo BBCode processing (e.g., topic subjects) will not be fixed up. Perhaps someone with the time and inclination can fix that (the function doesn't depend on BBCode; that was just a convenient place to call it from). This patch is just a Quick'n'Dirty; others are welcome to fix it up or redo it in more elegant code.

All changes are in Sources/Subs.php. Applying this to SMF 1.x will probably work, but you may have to work to find where in parse_bbc() to apply the changes.

Find

--- Code: --- // Never show smileys for wireless clients.  More bytes, can't see it anyway :P.

--- End code ---
and change to

--- Code: --- // Just in case it wasn't determined yet whether UTF-8 is enabled.
if (!isset($context['utf8']))
    $context['utf8'] = (empty($modSettings['global_character_set']) ? $txt['lang_character_set'] : $modSettings['global_character_set']) === 'UTF-8';
// always fix up "Smart Quotes" going out to the browser, no matter what the encoding
$message = fix_SmartQuotes($message);
// Never show smileys for wireless clients.  More bytes, can't see it anyway :P.

--- End code ---

Find

--- Code: --- // Just in case it wasn't determined yet whether UTF-8 is enabled.
if (!isset($context['utf8']))
$context['utf8'] = (empty($modSettings['global_character_set']) ? $txt['lang_character_set'] : $modSettings['global_character_set']) === 'UTF-8';

--- End code ---
and change to (comment out):

--- Code: ---// // Just in case it wasn't determined yet whether UTF-8 is enabled.
// if (!isset($context['utf8']))
// $context['utf8'] = (empty($modSettings['global_character_set']) ? $txt['lang_character_set'] : $modSettings['global_character_set']) === 'UTF-8';

--- End code ---

At the very end of the file, find

--- Code: ---?>
--- End code ---
and insert before it:

--- Code: ---// Translate MS Smart Quotes into HTML entities, so browsers don't choke
// on them as control codes. Some browsers treat Latin-1 as CP-1252, and
// some pages may already be in CP-1252; in either case this should be safe.
// This is done upon text output, and only for text which can have BBCodes.
// Database functions which might be searching or sorting text on UTF-8, and
// are run against the text before this routine is called, might still fail.
// author: SMF's MrPhil (see catskilltech.com)
function fix_SmartQuotes($message) {
  global $context;
  // sometimes a numeric entity is used, because the named entity
  // may not yet be widely supported
  $SQ_list = array(
                        // UTF  SQ usage, UTF usage
                        // codepoint
    0x80 => '&euro;'  , // 20AC Euro, reserved control
    0x81 => '?',        //      unused, reserved control
    0x82 => '&sbquo;' , // 201A Low-"9" opening quotation mark, Break Permitted Here
    0x83 => '& #402;'  , // 0192 &fnof; Florin/script f/folder, No Break Here
    0x84 => '&bdquo;' , // 201E Low-"99" opening quotation mark, Index
    0x85 => '&hellip;', // 2026 Ellipsis, Next Line
    0x86 => '&dagger;', // 2020 Single dagger, Start of Selected Area
    0x87 => '&Dagger;', // 2021 Double dagger, End of Selected Area
    0x88 => '&circ;'  , // 02C6 Circumflex ^ accent (combining?), Character Tabulation Set
    0x89 => '&permil;', // 2030 o/oo per mille, Character Tabulation with Justification
    0x8A => '& #352;'  , // 0160 &Scaron S + caron accent, Line Tabulation Set
    0x8B => '&lsaquo;', // 2039 Single left angle quote < (guillemet), Partial Line Down
    0x8C => '&OElig;' , // 0152 OE ligature, Partial Line Up
    0x8D => '?',        //      unused, Reverse Line Feed
    0x8E => '& #381;'  , // 017D &Zcaron Z + caron accent, Single Shift Two
    0x8F => '?',        //      unused, Single Shift Three
    0x90 => '?',        //      unused, Device Control String
    0x91 => '&lsquo;' , // 2018 "6" opening quotation mark, Private Use One
    0x92 => '&rsquo;' , // 2019 "9" closing quotation mark/apostrophe, Private Use Two
    0x93 => '&ldquo;' , // 201C "66" opening quotation mark, Set Transmit State
    0x94 => '&rdquo;' , // 201D "99" closing quotation mark, Cancel Character
    0x95 => '&bull;'  , // 2022 Solid bullet, Message Waiting
    0x96 => '&ndash;' , // 2013 En-dash, Start of Guarded Area
    0x97 => '&mdash;' , // 2014 Em-dash, End of Guarded Area
    0x98 => '&tilde;' , // 02DC Tilde ~ accent (combining?), Start of String
    0x99 => '&trade;' , // 2122 Trademark TM, reserved control
    0x9A => '& #353;'  , // 0161 &scaron s + caron accent, Single Character Introducer
    0x9B => '&rsaquo;', // 203A Single right angle quote > (guillemet), Control Sequence Introducer
    0x9C => '&oelig;' , // 0153 oe ligature, String Terminator
    0x9D => '?',        //      unused, Operating System Command
    0x9E => '& #382;'  , // 017E &zcaron z + caron accent, Privacy Message
    0x9F => '&Yuml;'  , // 0178 Y + diaeresis/umlaut accent, Application Program Command
  );
  $new_message = '';
  if ($context['utf8']) {
    // we are in multibyte UTF-8 mode, so need to skip legitimate UTF-8
    // sequences that may contain x80-9F bytes inside them
    // note that $message itself is never modified; the output is built
    // up in $new_message, so strlen($message) stays constant
    for ($i = 0; $i < strlen($message); $i++) {
      $c = ord($message[$i]);
      // lead byte 110x xxxx, followed by one 10xx xxxx, or
      //           1110 xxxx              two
      //           1111 0xxx              three
      // if so, is legitimate UTF-8 sequence, don't modify
      $utf8_seq = 0;  // not UTF-8 (zero 10xx xxxx bytes to follow)
      $cm = $c & 0xE0;
      if ($cm == 0xC0) {
        $utf8_seq = 1;
      } else {
        $cm = $c & 0xF0;
        if ($cm == 0xE0) {
          $utf8_seq = 2;
        } else {
          $cm = $c & 0xF8;
          if ($cm == 0xF0) {
            $utf8_seq = 3;
          }
        }
      }

      for ($j = 0; $j < $utf8_seq; $j++) {
        // j+1st following byte should be 10xx xxxx
        // but first, are we running off the end of $message?
        // shouldn't happen with well-formed UTF-8 characters...
        if ($i+$j+1 >= strlen($message)) {
          $utf8_seq = 0;
          break;
        }
        $cm = ord($message[$i+$j+1]) & 0xC0;
        if ($cm != 0x80) {
          $utf8_seq = 0;
          break;
        }
      }

      // skip over next $utf8_seq bytes as a legitimate UTF-8 sequence
      // or process single byte as possible Smart Quote
      if ($utf8_seq == 0) {
        if ($c >= 0x80 && $c <= 0x9F) {
          $new_message .= $SQ_list[$c]; // replace by HTML entity
        } else {
          $new_message .= chr($c);      // use original character
          // note that originally malformed UTF-8 won't be fixed
        }
      } else {
        $new_message .= substr($message, $i, $utf8_seq+1); // use original bytes
        $i += $utf8_seq;  // end of loop adds another 1
      }
    } // end of for loop through $message
  } else {
    // we are in a single byte mode, so go ahead and fix any
    // x80-9F bytes
    for ($i = 0; $i < strlen($message); $i++) {
      $c = ord($message[$i]);
      if ($c >= 0x80 && $c <= 0x9F) {
        $new_message .= $SQ_list[$c];
      } else {
        $new_message .= chr($c);
      }
    }
  }
  return $new_message;
}

--- End code ---
There are five (5) entries in the $SQ_list table (0x83, 0x8A, 0x8E, 0x9A and 0x9E) where you need to close up the & and # (remove the space between them) to form proper HTML entities. I had to post them with the space because the forum display converts numeric entities into the actual characters.

Added: That last chunk of code is also available as the attachment SQ_fix.txt. Just copy and paste it into place (just before the closing ?> ). The attached copy needs no fixup of the kind described above.
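For anyone who wants to experiment outside the forum first, the rule the patch follows (leave well-formed UTF-8 sequences alone, turn any other 0x80 through 0x9F byte into an entity) can also be sketched with a byte-oriented regular expression. This is not the patch itself, only an illustration of the same idea: sketch_fix_c1_bytes and its shortened table are made-up names, it assumes PHP 7 or later, and it has not been tried inside SMF.

```php
<?php
// Hypothetical helper for illustration; not part of SMF or of the patch.
function sketch_fix_c1_bytes(string $message, bool $utf8): string
{
    // Shortened table; bytes in 0x80-0x9F that are not listed become '?'.
    $map = array(
        0x85 => '&hellip;',
        0x91 => '&lsquo;',
        0x92 => '&rsquo;',
        0x93 => '&ldquo;',
        0x94 => '&rdquo;',
        0x96 => '&ndash;',
        0x97 => '&mdash;',
    );

    // No /u modifier, so PCRE works on raw bytes.
    // In UTF-8 mode the first three alternatives consume a lead byte plus
    // the right number of 10xx xxxx continuation bytes; only a byte that
    // is not part of such a sequence can reach the capturing group.
    $c1 = '([\x80-\x9F])';
    $pattern = $utf8
        ? '/[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3}|' . $c1 . '/'
        : '/' . $c1 . '/';

    $result = preg_replace_callback($pattern, function ($m) use ($map) {
        if (!isset($m[1]) || $m[1] === '') {
            return $m[0]; // well-formed multibyte sequence: keep as is
        }
        $byte = ord($m[1]);
        return isset($map[$byte]) ? $map[$byte] : '?';
    }, $message);

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $result === null ? $message : $result;
}

echo sketch_fix_c1_bytes("\x93Hi\x94", false), "\n";           // &ldquo;Hi&rdquo;
echo bin2hex(sketch_fix_c1_bytes("\xE2\x80\x9C", true)), "\n"; // e2809c (left untouched)
```

Like the patch, this leaves malformed UTF-8 lead bytes in place and only replaces the stray 0x80 through 0x9F bytes themselves.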

MrPhil:
Have you had a chance to try this code? Please let me know if it works for you, or if you experience problems. If it works well, it might get packaged up into a mod.

phlexx:
Please, has this been tested by anyone using SMF 2.0.2? Does it work on 2.0.2?

MrPhil:
It was developed and tested (briefly) on SMF 2.0.2. Just be sure to back up the file before editing it, in case something goes wrong.
