[4981] [1.x, 2.0] handling MS Smart Quotes

MrPhil · April 28, 2012, 06:31:18 PM

Quote from: emanuele on April 28, 2012, 04:15:31 PM
First because SMF already does a lot of checks, secondly because a host changing its PHP setup is rare, but a user changing his host is not so rare.

It just seems odd to me to have SMF go through the labor of testing access modes for various operations, on a regular basis. E.g., create a file with folder 755, then 775, and finally 777 until it works. Any logged error messages could really make forum owners nervous, so I wouldn't recommend doing this often. Invoke this utility in repair_settings.php, so it will be run whenever they change hosts (they need to run repair_settings anyway).
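The "try 755, then 775, then 777 until it works" idea can be sketched roughly as follows. This is only an illustration in Python, not SMF or repair_settings.php code; the function name `first_working_mode` and the probe file name are made up for the example.

```python
import os
import tempfile

# Modes to try, from most restrictive to least restrictive.
CANDIDATE_MODES = (0o755, 0o775, 0o777)


def first_working_mode(folder, modes=CANDIDATE_MODES):
    """Return the first mode under which a probe file can be created in
    `folder`, or None if none of the modes works."""
    original = os.stat(folder).st_mode & 0o777
    probe = os.path.join(folder, "probe.tmp")
    for mode in modes:
        try:
            os.chmod(folder, mode)
            with open(probe, "w") as handle:
                handle.write("ok")
            os.remove(probe)
            return mode
        except OSError:
            # Either chmod or the write failed; try the next, looser mode.
            continue
    try:
        os.chmod(folder, original)  # Put the folder back as we found it.
    except OSError:
        pass
    return None
```

Run once from a repair tool, something like `first_working_mode("/path/to/attachments")` would report the tightest mode that still allows writes, instead of probing on every request.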

Quote
Code
$mbname . ': SMF Database Error!', 'There has been a problem with the database!' . ($db_error == '' ? '' : "\n" . $smcFunc['db_title'] . ' reported:' . "\n" . $db_error) . "\n\n" . 'This is a notice email to let you know that SMF could not connect to the database, contact your host if this continues.');

$mbname is actually the first thing on the email.
How could it be more clear?
A link to the forum?
Where? At the end?

Well, evidently it's not clear enough to many users. Change "SMF could not connect to the database" in the message to "your SMF forum could not connect to its database". That ought to be reasonably clear.

IchBin™ · April 28, 2012, 08:19:43 PM

To the original topic, something related I saw today.
http://www.simplemachines.org/community/index.php?topic=474847

MrPhil · April 30, 2012, 11:01:46 AM

Yet another...

See http://www.simplemachines.org/community/index.php?topic=475480 . The problem is that WYSIWYG mode in the editor (including the presumably vanilla one in this very forum) adds all sorts of extra tags all over the place. It also corrupts hand-entered tags (non-WYSIWYG mode). Going into WYSIWYG mode and back out is not perfectly reversible -- cruft builds up.

Anyway, the WYSIWYG editor either needs some serious fixing, or removal. Besides, I'm not even sure it deserves the appellation "WYSIWYG". I should be able to click on a mode (e.g., font size) and start typing away in the new mode. Doing an immediate in-place preview after highlighting and pressing a button is not WYSIWYG in anyone's book.

Should I open up separate topics for each item I've listed? I seem to have accumulated quite a few bugs in this one topic...

Antechinus · April 30, 2012, 05:05:41 PM

You can just start typing away in whatever mode. I just did.

MrPhil · April 30, 2012, 05:38:24 PM

OK, that's odd. It didn't work when I tried it earlier. I cheerfully withdraw that complaint.

I do notice that when I tested in WYSIWYG mode, and turned it off, there were some end tags dumped into the TEXTAREA window here that I had to delete. So, I guess that is still a problem.

Antechinus · April 30, 2012, 05:58:58 PM

Oh I'm not claiming the thing is perfect. I never use it myself though so I don't personally care. I much prefer the old editor.

emanuele · May 04, 2012, 10:05:10 AM

There are many reports about the WYSIWYG editor. I fixed some of them (in most cases I posted the fix in the topic), so please check whether one of those fixes this specific problem (hopefully I posted a "/me hates WYSIWYG" or similar in every related topic, which should let you find them by searching for WYSIWYG in this and/or the fixed or bogus boards).
And let's not forget the behaviour of the WYSIWYG editor is in several cases browser dependent, so it would be best to ask/specify the browser that causes problems.

Additionally, yes, it would be better to have a topic per report (but a search before posting would be highly appreciated).

MrPhil · May 13, 2012, 11:51:19 AM

Here's a patch for handling MS Smart Quotes: http://www.simplemachines.org/community/index.php?topic=476538.msg3333723#msg3333723

It's lightly tested, but might be a good starting place.

MrPhil · May 17, 2012, 11:20:25 AM

A report of success with the patch: http://www.simplemachines.org/community/index.php?topic=476917.msg3336118#msg3336118 . Any others?

Angelina Belle · May 29, 2012, 02:12:10 PM

SMF 2.0 does seem to try to do windows-1252 and windows-1253 to UTF-8 in function ConvertUtf8.
But I am not sure it uses the correct mappings. I wonder about 0x80 -- the euro sign, for example.

For a list of microsoft code page to UTF-8 mappings, please see http://en.wikipedia.org/wiki/Windows_code_page#List and follow links to information at unicode.org

I have used a little regex to use 1252 and 1253 information for characters 0x80 through 0x9F to convert these to php.
This could easily be repeated for the entirety of every one of these code pages.

The result could be made available as a separate file, so that all those conversion tables would only need to be loaded when UTF-8 conversion is required. Which is not nearly as often as ManageMaintenance is required.

Code (windows-1252)

'0x80' => '0x20ac',    //Euro Sign
'0x81' => '0x0081',    //Undefined in windows-1252
'0x82' => '0x201a',    //Single Low-9 Quotation Mark
'0x83' => '0x0192',    //Latin Small Letter F With Hook
'0x84' => '0x201e',    //Double Low-9 Quotation Mark
'0x85' => '0x2026',    //Horizontal Ellipsis
'0x86' => '0x2020',    //Dagger
'0x87' => '0x2021',    //Double Dagger
'0x88' => '0x02c6',    //Modifier Letter Circumflex Accent
'0x89' => '0x2030',    //Per Mille Sign
'0x8a' => '0x0160',    //Latin Capital Letter S With Caron
'0x8b' => '0x2039',    //Single Left-Pointing Angle Quotation Mark
'0x8c' => '0x0152',    //Latin Capital Ligature Oe
'0x8e' => '0x017d',    //Latin Capital Letter Z With Caron
'0x91' => '0x2018',    //Left Single Quotation Mark
'0x92' => '0x2019',    //Right Single Quotation Mark
'0x93' => '0x201c',    //Left Double Quotation Mark
'0x94' => '0x201d',    //Right Double Quotation Mark
'0x95' => '0x2022',    //Bullet
'0x96' => '0x2013',    //En Dash
'0x97' => '0x2014',    //Em Dash
'0x98' => '0x02dc',    //Small Tilde
'0x99' => '0x2122',    //Trade Mark Sign
'0x9a' => '0x0161',    //Latin Small Letter S With Caron
'0x9b' => '0x203a',    //Single Right-Pointing Angle Quotation Mark
'0x9c' => '0x0153',    //Latin Small Ligature Oe
'0x9e' => '0x017e',    //Latin Small Letter Z With Caron
'0x9f' => '0x0178',    //Latin Capital Letter Y With Diaeresis
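
As a rough cross-check of the listing above, here is a minimal sketch (in Python rather than SMF's PHP, only because its bundled cp1252 codec makes the comparison easy). It copies a few of the entries, converts them to UTF-8 bytes, and compares the result with the codec. The names `WIN1252_HIGH` and `to_utf8` are invented for this example and are not part of SMF.

```python
# A few entries copied from the listing above (byte => Unicode code point).
WIN1252_HIGH = {
    "0x80": "0x20ac",  # Euro Sign
    "0x91": "0x2018",  # Left Single Quotation Mark
    "0x92": "0x2019",  # Right Single Quotation Mark
    "0x93": "0x201c",  # Left Double Quotation Mark
    "0x94": "0x201d",  # Right Double Quotation Mark
    "0x96": "0x2013",  # En Dash
    "0x97": "0x2014",  # Em Dash
}


def to_utf8(byte_hex, table=WIN1252_HIGH):
    """Look up one windows-1252 byte and return its UTF-8 encoding."""
    code_point = int(table[byte_hex], 16)
    return chr(code_point).encode("utf-8")


# Cross-check every copied entry against Python's own cp1252 codec.
for byte_hex in WIN1252_HIGH:
    raw = bytes([int(byte_hex, 16)])
    assert to_utf8(byte_hex) == raw.decode("cp1252").encode("utf-8")
```

With a correct table, byte 0x80 should come out as the three UTF-8 bytes E2 82 AC (the euro sign), which is one way to test the mappings SMF actually uses.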

Windows-1253 is meant to support greek, but also includes symbols in the 0x80-0x9F range.

What symbols it does include in this range are identical to the symbols from the 1252 set.

Code (windows-1253)

'0x80' => '0x20ac',    //Euro Sign
'0x82' => '0x201a',    //Single Low-9 Quotation Mark
'0x83' => '0x0192',    //Latin Small Letter F With Hook
'0x84' => '0x201e',    //Double Low-9 Quotation Mark
'0x85' => '0x2026',    //Horizontal Ellipsis
'0x86' => '0x2020',    //Dagger
'0x87' => '0x2021',    //Double Dagger
'0x89' => '0x2030',    //Per Mille Sign
'0x8b' => '0x2039',    //Single Left-Pointing Angle Quotation Mark
'0x91' => '0x2018',    //Left Single Quotation Mark
'0x92' => '0x2019',    //Right Single Quotation Mark
'0x93' => '0x201c',    //Left Double Quotation Mark
'0x94' => '0x201d',    //Right Double Quotation Mark
'0x95' => '0x2022',    //Bullet
'0x96' => '0x2013',    //En Dash
'0x97' => '0x2014',    //Em Dash
'0x99' => '0x2122',    //Trade Mark Sign
'0x9b' => '0x203a',    //Single Right-Pointing Angle Quotation Mark

MrPhil · May 30, 2012, 02:28:43 PM

One solution frequently given for this problem is to convert your forum to UTF-8. However, I would be concerned that this is relying on possibly non-standard behavior by the browser to convert Smart Quotes bytes into the equivalent UTF-8 character multibytes on input. Does anyone have any documentation guaranteeing this behavior for all reasonably recent browsers? I've seen enough reports of failures with UTF-8 to suspect that this is not universally implemented. How about if UTF-8 is not the page display encoding (e.g., Latin-1 is)? Some browsers actually use CP-1252 when Latin-1 is requested, but again, that's non-standard and risky. Some browsers may simply treat x80 through x9F as Smart Quotes (upon output), regardless of the encoding, which again is non-standard.

I think that SMF should take care of converting this range of single bytes to HTML entities, which is what my patch does, regardless of the encoding used. x80 through x9F should be interpreted as control codes in any encoding except CP-xxxx (Microsoft-specific), where it should do no harm to convert them to entities, even though they would be properly displayed anyway.
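
A byte-to-entity conversion of this kind might look roughly like the sketch below. It is not the actual patch (that is in the linked topic) and it is written in Python rather than PHP; the function name `smart_quotes_to_entities` is hypothetical. It assumes the text is in a single-byte encoding such as Latin-1. In UTF-8 text, bytes 0x80 through 0x9F also occur as parts of multibyte characters, so a byte-wise pass like this one would need to skip valid UTF-8 sequences first.

```python
def smart_quotes_to_entities(raw):
    """Convert a single-byte (Latin-1 style) byte string to text, replacing
    bytes 0x80-0x9F with numeric HTML entities taken from windows-1252."""
    pieces = []
    for byte in raw:
        if 0x80 <= byte <= 0x9F:
            try:
                char = bytes([byte]).decode("cp1252")
            except UnicodeDecodeError:
                # Byte is undefined in windows-1252 (e.g. 0x81): drop it.
                continue
            pieces.append("&#%d;" % ord(char))
        else:
            # Everything else is passed through as Latin-1.
            pieces.append(bytes([byte]).decode("iso-8859-1"))
    return "".join(pieces)
```

For example, `smart_quotes_to_entities(b"\x93Hi\x94")` gives `&#8220;Hi&#8221;`, which a browser renders as curly quotes whatever the page encoding is.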

I just spent a great deal of time and effort with a member who was having trouble with Smart Quotes on a Latin-1 system. From the limited debugging I could get them to do, it sounds like the Smart Quotes were being cut off upon input, which means they didn't even survive to get to the database, and my patch would have no effect. That's yet another behavioral mode, but I'm not sure SMF could do anything about that, especially if it's the browser cutting off the input at the first control code it encounters. It's still possible that it's the database doing the dirty work, in which case the Smart Quotes might be translated upon input by SMF (my fix_SmartQuotes routine could probably be used on input). This still would not handle Smart Quotes already in the database, but that could be taken care of on output by my patch.

There remains the question of what the database itself will do with Smart Quotes when converting text to UTF-8. Will it assume that they are control codes (and leave them alone) or Smart Quotes (to be changed)? Do we need to give a hint by changing Latin-1 fields to CP-1252, etc.?

MrPhil · May 30, 2012, 02:37:39 PM

Quote from: AngelinaBelle on May 29, 2012, 02:12:10 PM
SMF 2.0 does seem to try to do windows-1252 and windows-1253 to UTF-8 in function ConvertUtf8.
But I am not sure it uses the correct mappings. I wonder about 0x80 -- the euro sign, for example.

For a list of microsoft code page to UTF-8 mappings, please see http://en.wikipedia.org/wiki/Windows_code_page#List and follow links to information at unicode.org

I have used a little regex to use 1252 and 1253 information for characters 0x80 through 0x9F to convert these to php.
This could easily be repeated for the entirety of every one of these code pages.

The result could be made available as a separate file, so that all those conversion tables would only need to be loaded when UTF-8 conversion is required. Which is not nearly as often as ManageMaintenance is required.

I wouldn't worry about CP-1253 being a subset of CP-1252 in the Smart Quotes range. It's unlikely that CP-1253 text would include any unused characters found in CP-1252. If they do (say, text cut and pasted from CP-1252 Word), no biggie... it's translated correctly. A bigger problem would be any conflicts between the two (both use a given byte value, but for different glyphs). Have you seen such a thing? That would mean that my patch would have to be extended to be sensitive to the particular input encoding (and even then, how can we tell what the original source was for this presumably cut and pasted text?).

Don't worry about the Greeks not being able to handle a Smart Quotes Euro -- they'll be dropping the Euro soon enough!

Add: I just took a very quick look at the CP-xxxx pages, and at a minimum it appears that 1250, 1251, 1256, and 1257 are not proper subsets or supersets of 1252 (in the x80 - x9F Smart Quotes range). That is, they have different glyphs for some byte codes. This means my patch will have troubles with Smart Quotes pasted in from any of those encodings. It's still an improvement in that the browser won't choke on the bytes, but a failure in that the wrong glyph might be rendered. To pick the right one means that the encoding used for the Smart Quotes needs to be known, and saved with the post, and you know that 99.99% of users aren't going to have the slightest idea what an encoding is, much less which one was in use!
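
The quick look described above can be repeated mechanically. The following is a small sketch in Python, using its bundled code page codecs as a stand-in for the unicode.org tables; the helper names are made up for the example. For each code page it lists the bytes in the x80 - x9F range that are defined in both that page and CP-1252 but stand for different characters.

```python
def high_range(codec):
    """Map each byte 0x80-0x9F to its character in `codec`,
    leaving out bytes the code page does not define."""
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            pass
    return table


def conflicts_with_1252(codec):
    """Bytes defined in both code pages but mapped to different characters."""
    base = high_range("cp1252")
    other = high_range(codec)
    return sorted(b for b in other if b in base and other[b] != base[b])


for name in ("cp1250", "cp1251", "cp1253", "cp1256", "cp1257"):
    print(name, [hex(b) for b in conflicts_with_1252(name)])
```

The cp1253 line should come out empty, matching the observation that 1253 only uses a subset of the 1252 assignments, while the other four each report at least one conflicting byte.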

Arantor · May 30, 2012, 02:40:01 PM

It's so non-standard, all the major browsers have been doing it for *years*. Even IE 5 gets that right, let alone IE 6. It's how the specification actually says it should be handled, too.

The problem you're describing, of inconsistent handling of invalid characters, is browser specific and varies depending on how the given browser handles data in a completely unsupportable situation (since CP-125? cannot be handled by ISO-8859-1, or -2). Some browsers send the character through as-is, some truncate.
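
The "unsupportable situation" is easy to show outside a browser. In this small Python sketch (the helper name `fits_latin1` is made up), curly quotes have a UTF-8 form and a CP-1252 form, but no ISO-8859-1 form at all, so whatever has to squeeze them into a Latin-1 form submission must improvise.

```python
text = "\u201cquoted\u201d"  # a word wrapped in curly double quotes

utf8_bytes = text.encode("utf-8")     # every Unicode character has a UTF-8 form
cp1252_bytes = text.encode("cp1252")  # single bytes 0x93 and 0x94 around the word


def fits_latin1(value):
    """Return True if `value` can be encoded as ISO-8859-1 without loss."""
    try:
        value.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False
```

Here `fits_latin1(text)` is False, while `cp1252_bytes` is `b"\x93quoted\x94"`.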

You could, theoretically, attempt to fix some of those at the browser level with JavaScript before sending through the content. (It is of little surprise, then, that most of the WYSIWYG editors actually have a 'Paste from Word' entry for this very thing.) The reliability of this seems questionable to me, however.

The question of conversion is another big one, and it's one I tend to gloss over when suggesting conversion because invariably people seem to then re-edit the posts afterwards anyway meaning it is simply less of an issue.

Converting Latin-? to CP-whatever is almost as nightmarish a situation to be in as anything else, in fact.

The bottom line is that if you're using UTF-8 from the start, there's just not a problem - as evidenced here and in other places, and that's been the case for years too.

Angelina Belle · May 30, 2012, 03:02:19 PM

Converting any CP-1252 characters already in the database is something I don't understand.
These characters would have to have been in the database for a long time, if what Arantor says is true (Microsoft browsers as old as IE5 did the conversion correctly during "paste from WORD"). Of course, other, non-Microsoft, browsers, might not have handled, and may still not handle, this in the way the ordinary user would expect... I don't know.

1252 and 1253 seem to use the x80 - x9F codes in exactly the same way.
I have not figured out how SMF uses the translation tables it has for 1252 and 1253. The mappings don't seem to match those given at unicode.org, and would seem to result in a "strange character" instead of the euro symbol, and for each of the other smartquote characters, unless I completely misunderstand what is going on during conversion to UTF-8.

It is the "convert to UTF8" process that I would investigate. In addition to the x80 - x9F character questions, it does not cover various code pages that might have been used during "paste from Word" in several other languages. One would need to know WHICH code page to use. Greek. Chinese. Korean?

Arantor · May 30, 2012, 03:05:41 PM

There's two separate things there I indicated. Browsers as old as IE5 were able to handle the UTF-8 conversion as needed, but what we've seen more recently are WYSIWYG editors with a specific 'paste from Word' option, though I always understood that was to handle formatting, not characters themselves.

Note that it has always worked on *this* forum without any patches.

MrPhil · May 30, 2012, 03:10:47 PM

@Arantor, you keep contradicting yourself. First you say that Smart Quotes conversion (for UTF-8) is guaranteed, and then you say different browsers do it in different ways. What I want to know is if proper handling of Smart Quotes (conversion to standard code page's bytes) is guaranteed, at least for the vast majority of browsers in common use (including IE6). It doesn't matter what your personal experience has been -- what counts is whether there are standards and whether they're adhered to. Is this an official W3C specification, or is it up to individual browser authors?

If browsers are guaranteed to have no problem with UTF-8 encoding, when dealing with cut-and-pasted input text (not only Smart Quotes, but also any other encoded source), we move on to the question of why so many users report problems with Smart Quotes text, including those using UTF-8. Then comes the question of whether database conversion to UTF-8 is guaranteed to work properly (for existing posts). Finally, if browsers will handle Smart Quotes properly on input, what's to keep them from applying the same transformations on output? If a given byte is a character on input, it should be the same character on output.

I can promise you that most people are not going to go back and edit their earlier posts to clean up after a conversion to UTF-8. If the conversion did not change control codes/Smart Quotes into legal characters, there will be problems. I can also promise you that very few people are going to use "Paste from Word", even if it's available. Those that do may well be hoping to bring over Word formatting rather than Smart Quotes characters.

Arantor · May 30, 2012, 03:51:14 PM

No, you're just misreading what I'm saying.

UTF-8 has been handled consistently for a long time. This is in the specification.

CP-whatever to ISO is not handled consistently and never has been, because the specification doesn't cover it.

This is the bottom line of what I've been saying. And actually, you'd be surprised about people going back to edit posts - invariably it's reported early on, users make the change and thereafter it stops being a problem.

*shrug* At this point I might as well butt out because you're happy to keep raking over a problem that was solved years ago by everyone else. For my part I have no problem with what SMF does because for the system I actually care about, we haven't got any of this problem, and won't now ever have this exact problem. I just thought I'd try and bring the experience I learned in fighting the problem, but you're content to deal with fringe cases that do not generally present.

Most of the problems caused by misencoding are because people do crazy stuff like dropping in UTF-8 language files when nothing else is UTF-8 or vice versa. So much hassle for new users could actually be solved by simply defaulting UTF-8 to on in the install.

IchBin™ · May 30, 2012, 04:59:50 PM

Quote from: Arantor on May 30, 2012, 03:51:14 PM
So much hassle for new users could actually be solved by simply defaulting UTF-8 to on in the install.

QFT! Couldn't agree more.

Angelina Belle · May 31, 2012, 02:48:18 PM

Clearly, the problem is very bothersome for what seems to be a small number of users.
And I've heard it is a big problem for Chinese Users. This is possibly not due to Smart Quotes CP-1252, but another code page.

SMF, as far as I can tell, has no ability to convert msg bodies based on what MS code page may or may not have been used when the post was created.

Therefore some code ranges might be converted in a way that is not consistent with the "best match" mapping available from unicode.org.
The result could be "strange characters"

Therefore, the only way to fix the problem for this small number of users is to either tell them to find-and-fix any "very old" MS code page mess by hand (unlikely) or to create a freestanding UTF-8 conversion tool that can be given a mapping (very easily generated with a little regexp from the mappings files available at unicode.org) and apply it to "all msgs" (or even "list of msg_ids", for multi-language boards).
Generating lists of messages is a separate problem, but can be done. "all msgs on board=10" for example.
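
A freestanding tool along these lines could be sketched as follows. This is Python for illustration only, not SMF code. It assumes the mapping files use the tab-separated layout `0xNN<TAB>0xNNNN<TAB>#NAME` with an empty middle column for undefined bytes; the sample text, the function names, and the message dictionary are all made up for the example. A real run would load a complete mapping file, since this sample only covers four bytes.

```python
# Hypothetical excerpt in the assumed layout of a code page mapping file.
SAMPLE_MAPPING = """\
# byte\tUnicode\tname
0x80\t0x20AC\t#EURO SIGN
0x81\t      \t#UNDEFINED
0x93\t0x201C\t#LEFT DOUBLE QUOTATION MARK
0x94\t0x201D\t#RIGHT DOUBLE QUOTATION MARK
"""


def parse_mapping(text):
    """Build {byte value: character} from mapping-file lines."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) == 2:  # Skips comments and undefined bytes.
            table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return table


def convert_body(raw, table):
    """Decode one single-byte message body. ASCII passes through, high
    bytes go through the table, unmapped bytes become U+FFFD."""
    return "".join(
        chr(b) if b < 0x80 else table.get(b, "\ufffd") for b in raw
    )


def convert_messages(messages, msg_ids, table):
    """Convert only the listed message ids, e.g. all msgs on one board."""
    return {msg_id: convert_body(messages[msg_id], table) for msg_id in msg_ids}
```

Passing a different table per list of msg_ids is what would allow a multi-language board to be converted board by board.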

Where multiple MS codepages are used in a msg (likely only on forums dealing with classical studies, linguistics, or translations), the whole thing could get even more complicated. But the number would be VERY FEW to nonexistent, and need not be dealt with unless we actually spot that black swan.

MrPhil · May 31, 2012, 05:52:53 PM

Since Smart Quotes characters are not necessarily consistent, would it help to have SMF offer several different possible translations on input, and let the user pick the right one? SMF could use that as a teachable moment to lecture users about cutting and pasting in Word documents. What happens even with UTF-8 translation of Smart Quotes -- since it receives only byte values and not the graphic pixels, does the browser know the encoding that was used? If so, can we query that source encoding?

For translation on output (my patch), any encoding information is long gone, and all we can do is use CP-1252 and hope for the best. I suppose it might be possible if Smart Quotes bytes are encountered, to ask the user if the text and punctuation looks all right, and try another translation table if the user says no, but it's conceivable there could be a different Smart Quotes set for each post (and possibly, even within one post!). I doubt that users are going to like answering questions about each post to fine tune the output translation. Some of the CP-12xx encodings replace punctuation characters with accented letters in the Smart Quotes range.

Damn Microsoft!
