[4981] [1.x, 2.0] handling MS Smart Quotes

MrPhil:
One solution frequently given for this problem is to convert your forum to UTF-8. However, I would be concerned that this relies on possibly non-standard browser behavior: converting Smart Quotes bytes into the equivalent multibyte UTF-8 sequences on input. Does anyone have any documentation guaranteeing this behavior for all reasonably recent browsers? I've seen enough reports of failures with UTF-8 to suspect that this is not universally implemented. And what if UTF-8 is not the page display encoding (e.g., Latin-1 is)? Some browsers actually use CP-1252 when Latin-1 is requested, but again, that's non-standard and risky. Some browsers may simply treat x80 through x9F as Smart Quotes (upon output), regardless of the encoding, which again is non-standard.

I think that SMF should take care of converting this range of single bytes to HTML entities, which is what my patch does, regardless of the encoding used. x80 through x9F should be interpreted as control codes in any encoding except CP-xxxx (Microsoft-specific), where it should do no harm to convert them to entities, even though they would be properly displayed anyway.
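To make the idea concrete, here is a minimal sketch in Python (not the actual patch, which is not shown in this thread). The function name `smart_quotes_to_entities` is hypothetical, and it assumes the text is in a single-byte encoding: in UTF-8 data the bytes x80 through x9F occur as continuation bytes, so a byte-level replacement like this one would have to skip UTF-8 text.

```python
def smart_quotes_to_entities(raw: bytes) -> bytes:
    """Replace bytes x80-x9F with numeric HTML entities, read as CP-1252.

    Assumption: `raw` is single-byte encoded text (e.g. Latin-1), not UTF-8.
    """
    out = bytearray()
    for b in raw:
        if 0x80 <= b <= 0x9F:
            try:
                ch = bytes([b]).decode("cp1252")
            except UnicodeDecodeError:
                # Byte has no CP-1252 meaning (x81, x8D, x8F, x90, x9D): keep it.
                out.append(b)
                continue
            out += b"&#%d;" % ord(ch)
        else:
            out.append(b)
    return bytes(out)


print(smart_quotes_to_entities(b"\x93Hi\x94"))
```

Bytes outside the x80 through x9F range, such as an accented Latin-1 letter, pass through unchanged.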

I just spent a great deal of time and effort with a member who was having trouble with Smart Quotes on a Latin-1 system. From the limited debugging I could get them to do, it sounds like the Smart Quotes were being cut off upon input, which means they didn't even survive to get to the database, and my patch would have no effect. That's yet another behavioral mode, but I'm not sure SMF could do anything about that, especially if it's the browser cutting off the input at the first control code it encounters. It's still possible that it's the database doing the dirty work, in which case the Smart Quotes might be translated upon input by SMF (my fix_SmartQuotes routine could probably be used on input). This still would not handle Smart Quotes already in the database, but that could be taken care of on output by my patch.

There remains the question of what the database itself will do with Smart Quotes when converting text to UTF-8. Will it assume that they are control codes (and leave them alone) or Smart Quotes (to be changed)? Do we need to give a hint by changing Latin-1 fields to CP-1252, etc.?
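The difference between the two interpretations can be sketched in Python. This only shows what Python's own codecs do with the same bytes under each label; it says nothing about how any particular database behaves.

```python
raw = b"\x93quoted\x94"  # Smart Quotes as single bytes

# Read as Latin-1: x93/x94 are C1 control codes and stay control codes in UTF-8.
as_latin1 = raw.decode("latin-1").encode("utf-8")

# Read as CP-1252: x93/x94 are curly quotes and become 3-byte UTF-8 sequences.
as_cp1252 = raw.decode("cp1252").encode("utf-8")

print(as_latin1)
print(as_cp1252)
```

Which of the two results a conversion produces depends entirely on which source encoding the converter is told to assume.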

MrPhil:

--- Quote from: AngelinaBelle on May 29, 2012, 02:12:10 PM ---SMF 2.0 does seem to try to do windows-1252 and windows-1253 to UTF-8 in function ConvertUtf8.
But I am not sure it uses the correct mappings. I wonder about 0x80 -- the euro sign, for example.

For a list of microsoft code page to UTF-8 mappings, please see http://en.wikipedia.org/wiki/Windows_code_page#List and follow links to information at unicode.org

I have used a little regex to use 1252 and 1253 information for characters 0x80 through 0x9F to convert these to php.
This could easily be repeated for the entirety of every one of these code pages.

The result could be made available as a separate file, so that all those conversion tables would only need to be loaded when UTF-8 conversion is required.  Which is not nearly as often as ManageMaintenance is required.

--- End quote ---

I wouldn't worry about CP-1253 being a subset of CP-1252 in the Smart Quotes range. It's unlikely that CP-1253 text would include any unused characters found in CP-1252. If they do (say, text cut and pasted from CP-1252 Word), no biggie... it's translated correctly.  A bigger problem would be any conflicts between the two (both use a given byte value, but for different glyphs). Have you seen such a thing? That would mean that my patch would have to be extended to be sensitive to the particular input encoding (and even then, how can we tell what the original source was for this presumably cut and pasted text?).

Don't worry about the Greeks not being able to handle a Smart Quotes Euro -- they'll be dropping the Euro soon enough! ;)

Add: I just took a very quick look at the CP-xxxx pages, and at a minimum it appears that 1250, 1251, 1256, and 1257 are not proper subsets or supersets of 1252 (in the x80 - x9F Smart Quotes range). That is, they have different glyphs for some byte codes. This means my patch will have troubles with Smart Quotes pasted in from any of those encodings. It's still an improvement in that the browser won't choke on the bytes, but a failure in that the wrong glyph might be rendered. To pick the right one means that the encoding used for the Smart Quotes needs to be known, and saved with the post, and you know that 99.99% of users aren't going to have the slightest idea what an encoding is, much less which one was in use!
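For anyone who wants to check this without reading the tables by eye, a rough Python sketch follows. It relies on Python's built-in code page tables (assumed here to match the unicode.org mappings), and the helper names are made up for illustration. A byte counts as a conflict only when both code pages define it and map it to different characters.

```python
def decode_or_none(byte_value: int, codepage: str):
    """Return the character for one byte, or None if the code page leaves it undefined."""
    try:
        return bytes([byte_value]).decode(codepage)
    except UnicodeDecodeError:
        return None


def conflicts(cp_a: str, cp_b: str) -> list:
    """List bytes in x80-x9F that both code pages define, but as different characters."""
    clashing = []
    for b in range(0x80, 0xA0):
        a_char = decode_or_none(b, cp_a)
        b_char = decode_or_none(b, cp_b)
        if a_char is not None and b_char is not None and a_char != b_char:
            clashing.append(b)
    return clashing


for cp in ("cp1250", "cp1251", "cp1253", "cp1256", "cp1257"):
    print(cp, [hex(b) for b in conflicts("cp1252", cp)])
```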

Arantor:
It's so non-standard, all the major browsers have been doing it for *years*. Even IE 5 gets that right, let alone IE 6. It's how the specification actually says it should be handled, too.

The problem you're describing, inconsistent handling of invalid characters, is browser-specific: it varies depending on how the given browser handles data in a completely unsupportable situation (since CP-125? cannot be handled by ISO-8859-1, or -2). Some browsers send the character through as-is, some truncate.

You could, theoretically, attempt to fix some of those at the browser level with JavaScript before sending through the content. (It is of little surprise, then, that most of the WYSIWYG editors actually have a 'Paste from Word' entry for this very thing.) The reliability of this seems questionable to me, however.

The question of conversion is another big one, and it's one I tend to gloss over when suggesting conversion, because people invariably seem to re-edit the posts afterwards anyway, which makes it less of an issue.

Converting Latin-? to CP-whatever is almost as nightmarish a situation to be in as anything else, in fact.

The bottom line is that if you're using UTF-8 from the start, there's just not a problem - as evidenced here and in other places, and that's been the case for years too.

AngelinaBelle:
Converting any CP-1252 characters already in the database is something I don't understand.
These characters would have to have been in the database for a long time, if what Arantor says is true (Microsoft browsers as old as IE5 did the conversion correctly during "paste from Word"). Of course, other, non-Microsoft browsers might not have handled, and may still not handle, this in the way the ordinary user would expect... I don't know.

1252 and 1253 seem to use the x80 - x9F codes in exactly the same way.
I have not figured out how SMF uses the translation tables it has for 1252 and 1253.  The mappings don't seem to match those given at unicode.org, and would seem to result in a "strange character" instead of the euro symbol, and for each of the other smartquote characters, unless I completely misunderstand what is going on during conversion to UTF-8.
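One way to get a reference to compare against is to generate the expected UTF-8 bytes for x80 through x9F independently. The Python sketch below does that using Python's built-in cp1252 table (assumed to follow the unicode.org mapping). The function name `reference_table` is hypothetical, and SMF's own tables are not reproduced here.

```python
def reference_table(codepage: str = "cp1252") -> dict:
    """Map each defined byte in x80-x9F to its expected UTF-8 byte sequence."""
    table = {}
    for b in range(0x80, 0xA0):
        try:
            table[b] = bytes([b]).decode(codepage).encode("utf-8")
        except UnicodeDecodeError:
            pass  # byte is undefined in this code page
    return table


for byte_value, utf8 in reference_table().items():
    print(f"0x{byte_value:02X} -> {utf8.hex()}")
```

If a conversion table maps x80 to anything other than the three bytes E2 82 AC, the euro sign will not come out right.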

It is the "convert to UTF8" process that I would investigate.  In addition to the x80 - x9F character questions, it does not cover various code pages that might have been used during "paste from Word" in several other languages. One would need to know WHICH code page to use. Greek. Chinese. Korean?

Arantor:
There are two separate things I indicated there. Browsers as old as IE5 were able to handle the UTF-8 conversion as needed, but what we've seen more recently are WYSIWYG editors with a specific 'paste from Word' option, though I always understood that was to handle formatting, not the characters themselves.

Note that it has always worked on *this* forum without any patches.
