[4981] [1.x, 2.0] handling MS Smart Quotes

Started by MrPhil, April 25, 2012, 01:16:42 PM


Angelina Belle

Here's an order of operations for dealing with MOST OF these code page problems:

1) Allow the user to optionally apply one or more CP -> UTF-8 mappings, in order, EXCEPT do not replace "unassigned" bytes with some default character. Is it possible to have a post in Korean with some content from one of the Cyrillic code pages? Or can step 1 be limited to one code page? One of the choices here is Latin 1; Western European (CP-1252).
-- This step should take care of most of the "weird characters" users have complained about, by properly translating them to the "best fit" UTF-8 based on the code page they originated from.

2) In all cases, finish by applying the 0x80 through 0x9F section of the CP-1252 -> UTF-8 mapping.
-- This should take care of all of the "post is cut off" or "page is ruined" problems caused by these control characters (0x80 - 0x9F) appearing in a UTF-8 XHTML document.

3) If there are any remaining "post is cut off" problems, replace all "self-mapped" control characters in this range with a placeholder such as "?". These are 0x81, 0x8D, 0x8F, 0x90, 0x9D.
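
Here is a rough sketch of those three steps in Python (SMF itself is PHP, so treat this as an illustration only, not as SMF code or as MrPhil's patch). The function names are made up for this post, and it assumes single-byte code pages only; a multi-byte code page such as Korean would need a real decoder in step 1.

```python
# Bytes that CP-1252 leaves unassigned (the "self-mapped" controls of step 3).
SELF_MAPPED = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def decode_keep_unassigned(raw, codepage):
    """Step 1: decode byte by byte with the chosen single-byte code page.

    Bytes the code page leaves unassigned are passed through as the same
    code point instead of being replaced by a default character.
    """
    out = []
    for byte in raw:
        try:
            out.append(bytes([byte]).decode(codepage))
        except UnicodeDecodeError:
            out.append(chr(byte))
    return "".join(out)


def build_cp1252_c1_table():
    """Step 2 table: U+0080..U+009F -> the character CP-1252 assigns there."""
    table = {}
    for code in range(0x80, 0xA0):
        if code not in SELF_MAPPED:
            table[code] = bytes([code]).decode("cp1252")
    return table


CP1252_C1 = build_cp1252_c1_table()
QUESTION_MARKS = {code: "?" for code in SELF_MAPPED}


def fix_post(raw, codepage=None):
    """Apply steps 1-3 to the raw bytes of a post and return a str."""
    # With no code page chosen, latin-1 passes every byte through unchanged.
    text = decode_keep_unassigned(raw, codepage or "latin-1")
    # Step 2: always finish with the CP-1252 mapping for 0x80-0x9F.
    text = text.translate(CP1252_C1)
    # Step 3: whatever is still a self-mapped control becomes "?".
    return text.translate(QUESTION_MARKS)
```

For example, fix_post(b"\x93hi\x94") turns the two smart-quote bytes into real curly quotes, while fix_post(b"\x88", "cp1251") gives the Euro sign because the Cyrillic mapping ran first.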




http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/

Code page mappings which DO NOT ENTIRELY AGREE with the CP-1252 -> UTF-8 mappings:
cp-1251 -- Cyrillic, which uses some of these codes differently, but some the same. The Euro sign is moved to 0x88, for example.
cp-1256 -- Arabic -- many differences.
cp-1257 -- Baltic -- several differences.

This means it is ALWAYS best to apply a language-specific code page mapping to take care of "weird characters" (MS code page characters that were not correctly mapped to the best-fit UTF-8 code) BEFORE trying to fix the 0x80 - 0x9F "smart-quotes" characters that cause the "post cut off" problems.
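
To see where a code page disagrees with CP-1252 in the 0x80 - 0x9F range, a small Python check along these lines can help. It relies on Python's built-in codecs rather than the WindowsBestFit tables linked above, so the two may differ in how they treat unassigned bytes; the helper names are invented for this example.

```python
def decoded(byte, codepage):
    """Decode one byte; unassigned bytes come back as U+FFFD."""
    return bytes([byte]).decode(codepage, errors="replace")


def disagreements(codepage, reference="cp1252"):
    """Map each byte in 0x80-0x9F where the two code pages differ
    to a (reference character, other character) pair."""
    result = {}
    for byte in range(0x80, 0xA0):
        ref_char = decoded(byte, reference)
        other_char = decoded(byte, codepage)
        if ref_char != other_char:
            result[byte] = (ref_char, other_char)
    return result


if __name__ == "__main__":
    for byte, (ref_char, other_char) in sorted(disagreements("cp1251").items()):
        print(f"0x{byte:02X}: cp1252={ref_char!r} cp1251={other_char!r}")
```

For cp1251 the result includes 0x88 (Euro sign there, a circumflex accent in CP-1252) but not 0x93 or 0x94, since both code pages put the curly double quotes at those positions.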


More information is available in the readme file http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt

Never attribute to malice that which is adequately explained by stupidity. -- Hanlon's Razor

MrPhil

Well, given that people will cut and paste from Word documents until the End of Time, and some browsers will interpret 0x80 through 0x9F as control codes (if they haven't fixed up the pasted bytes already), something's got to be done. I don't care if you use my patch or do something else, but something needs to be done about it. Enough cases of breakage are reported, even with UTF-8 forums, to lead me to suspect that not all browsers are changing Smart Quotes characters to good UTF-8 characters upon input. In addition, there are older posts where Smart Quotes may or may not have been properly converted to UTF-8, and forums not in UTF-8. My patch is better than nothing, but there may be better ways to deal with this. Have at it!


Angelina Belle

Phil,

The use of XHTML entities is nice for non-UTF-8 forums.

So there are a couple of options

1) for dealing with current "SmartQuote" problems in non-UTF-8 or in UTF-8 forums -- your patch will work great.  This will deal properly with most of the Microsoft Code Page character problems.

2) for dealing with the UTF-8 conversion for forums where use of other Microsoft code pages will cause "strange characters" after the UTF-8 conversion, a different thing would be needed. This would be important for posts in several non-Latin character sets, including Greek (cp-1253). SMF is currently doing something about the 1252 and 1253 character sets, but I'm not sure WHAT it does, exactly.
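
For option 1 on a non-UTF-8 forum, the entity approach could look roughly like this in Python. I have not checked it against the actual patch, so the function name and the choice of "?" for unassigned bytes are my own assumptions. It expects text where the offending bytes have already ended up as code points U+0080 through U+009F.

```python
def c1_to_entities(text):
    """Replace U+0080..U+009F with numeric character references for the
    characters CP-1252 assigns to those byte values."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                mapped = bytes([code]).decode("cp1252")
                out.append(f"&#{ord(mapped)};")
            except UnicodeDecodeError:
                # Unassigned in CP-1252 (0x81, 0x8D, 0x8F, 0x90, 0x9D).
                out.append("?")
        else:
            out.append(ch)
    return "".join(out)
```

Because the output is plain ASCII, it survives whatever character set the forum happens to be using.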

Arantor

After discussing this with Oldiesmann, we're not going to fix this in 1.1 or 2.0. Partly that's because 1.1 is about to be EOL'd, and partly because 2.0 is only receiving security updates and minor bug fixes that aren't a problem to implement in a patch. For both branches there is also a viable workaround in converting to UTF-8 (2.1 is UTF-8 by default in any case, though it also has support for this), and the code is not the easiest to backport either.
