[4981] [1.x, 2.0] handling MS Smart Quotes

MrPhil:
@Arantor, you keep contradicting yourself. First you say that Smart Quotes conversion (for UTF-8) is guaranteed, and then you say different browsers do it in different ways. What I want to know is if proper handling of Smart Quotes (conversion to standard code page's bytes) is guaranteed, at least for the vast majority of browsers in common use (including IE6). It doesn't matter what your personal experience has been -- what counts is whether there are standards and whether they're adhered to. Is this an official W3C specification, or is it up to individual browser authors?

If browsers are guaranteed to have no problem with UTF-8 encoding when dealing with cut-and-pasted input text (not only Smart Quotes, but also any other encoded source), we move on to the question of why so many users report problems with Smart Quotes text, including those using UTF-8. Then comes the question of whether database conversion to UTF-8 is guaranteed to work properly (for existing posts). Finally, if browsers will handle Smart Quotes properly on input, what's to keep them from applying the same transformations on output? If a given byte is a character on input, it should be the same character on output.

I can promise you that most people are not going to go back and edit their earlier posts to clean up after a conversion to UTF-8. If the conversion did not change control codes/Smart Quotes into legal characters, there will be problems. I can also promise you that very few people are going to use "Paste from Word", even if it's available. Those that do may well be hoping to bring over Word formatting rather than Smart Quotes characters.

Arantor:
No, you're just misreading what I'm saying.

UTF-8 has been handled consistently for a long time. This is in the specification.

CP-whatever to ISO is not handled consistently and never has been, because the specification doesn't cover it.
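To make the distinction concrete, here is a minimal Python sketch (Python only for illustration, not SMF code) of how the single byte 0x93, the CP-1252 left "smart" double quote, reads under each encoding. The variable names are made up for this example.

```python
# The byte Word-style text uses for a left double "smart" quote in CP-1252.
SMART_QUOTE_BYTE = b"\x93"

# Under CP-1252 the byte is a printable character: U+201C.
as_cp1252 = SMART_QUOTE_BYTE.decode("cp1252")

# Under ISO-8859-1 the same byte is the C1 control code U+0093,
# which is why it shows up as junk on an ISO-8859-1 page.
as_latin1 = SMART_QUOTE_BYTE.decode("iso-8859-1")

# Under UTF-8 a lone 0x93 is not a valid sequence at all.
try:
    SMART_QUOTE_BYTE.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The UTF-8 encoding of U+201C is an unambiguous three-byte sequence.
utf8_bytes = "\u201c".encode("utf-8")

print(repr(as_cp1252), repr(as_latin1), valid_utf8, utf8_bytes)
```

The point of the sketch is only that the UTF-8 byte sequence has one meaning, while the single-byte value depends on which code page the reader assumes.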

This is the bottom line of what I've been saying. And actually, you'd be surprised about people going back to edit posts - invariably it's reported early on, users make the change and thereafter it stops being a problem.

*shrug* At this point I might as well butt out because you're happy to keep raking over a problem that was solved years ago by everyone else. For my part I have no problem with what SMF does because for the system I actually care about, we haven't got any of this problem, and won't now ever have this exact problem. I just thought I'd try and bring the experience I learned in fighting the problem, but you're content to deal with fringe cases that do not generally present.

Most of the problems caused by misencoding are because people do crazy stuff like dropping in UTF-8 language files when nothing else is UTF-8 or vice versa. So much hassle for new users could actually be solved by simply defaulting UTF-8 to on in the install.
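A small Python sketch of that kind of mismatch, assuming one side of the forum produces UTF-8 and the other side reads it as CP-1252 (or the reverse). The sample string is invented for illustration.

```python
# Text containing a right single "smart" quote (U+2019).
text = "It\u2019s"

# Case 1: UTF-8 bytes (e.g. from a UTF-8 language file) read as CP-1252.
utf8_bytes = text.encode("utf-8")      # b"It\xe2\x80\x99s"
misread = utf8_bytes.decode("cp1252")  # each byte becomes its own character

# Case 2: a CP-1252 byte (0x92) served on a UTF-8 page.
cp1252_bytes = text.encode("cp1252")   # b"It\x92s"
# 0x92 is not valid UTF-8, so a lenient decoder substitutes U+FFFD.
replaced = cp1252_bytes.decode("utf-8", errors="replace")

print(misread)
print(replaced)
```

Neither result is a browser bug: the bytes are intact in both cases, and only the declared encoding is wrong.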

IchBin™:

--- Quote from: Arantor on May 30, 2012, 03:51:14 PM ---So much hassle for new users could actually be solved by simply defaulting UTF-8 to on in the install.

--- End quote ---

QFT! Couldn't agree more. :)

AngelinaBelle:
Clearly, the problem is very bothersome for what seems to be a small number of users.
And I've heard it is a big problem for Chinese users. This is possibly due not to CP-1252 Smart Quotes, but to another code page.

SMF, as far as I can tell, has no ability to convert msg bodies based on what MS code page may or may not have been used when the post was created.

Therefore some code ranges might be converted in a way that is not consistent with the "best match" mapping available from unicode.org.
The result could be "strange characters".

Therefore, the only way to fix the problem for this small number of users is either to tell them to find and fix any "very old" MS code page mess by hand (unlikely), or to create a freestanding UTF-8 conversion tool that can be given a mapping (very easily generated with a little regexp from the mapping files available at unicode.org) and apply it to "all msgs" (or even "list of msg_ids", for multi-language boards).
Generating lists of messages is a separate problem, but can be done: "all msgs on board=10", for example.
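A rough sketch of such a tool, in Python rather than PHP and not tied to SMF's schema. It assumes the mapping file uses the unicode.org layout of "byte, code point, comment" per line, and that the message body is still single-byte text (not already UTF-8). SAMPLE_MAPPING, load_mapping and convert_body are hypothetical names, and the mapping is only a short excerpt.

```python
import re

# A few lines in the style of a unicode.org code page mapping file
# (excerpt for illustration; a real tool would read the whole file).
SAMPLE_MAPPING = (
    "0x85\t0x2026\t#HORIZONTAL ELLIPSIS\n"
    "0x91\t0x2018\t#LEFT SINGLE QUOTATION MARK\n"
    "0x92\t0x2019\t#RIGHT SINGLE QUOTATION MARK\n"
    "0x93\t0x201C\t#LEFT DOUBLE QUOTATION MARK\n"
    "0x94\t0x201D\t#RIGHT DOUBLE QUOTATION MARK\n"
    "0x81\t      \t#UNDEFINED\n"
)

# "0xNN <whitespace> 0xNNNN": undefined entries have no code point and are skipped.
LINE_RE = re.compile(r"^0x([0-9A-Fa-f]{2})\s+0x([0-9A-Fa-f]{4,6})\b")

def load_mapping(text):
    """Build a {byte value: Unicode character} table from mapping-file text."""
    table = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            table[int(match.group(1), 16)] = chr(int(match.group(2), 16))
    return table

def convert_body(raw, table):
    """Convert a single-byte message body to UTF-8 bytes using the table.

    Bytes missing from the table keep their ISO-8859-1 meaning (an assumption).
    """
    chars = [table.get(byte, chr(byte)) for byte in raw]
    return "".join(chars).encode("utf-8")

table = load_mapping(SAMPLE_MAPPING)
print(convert_body(b"\x93Hi\x94", table))
```

Selecting which messages to run it on ("all msgs", a list of msg_ids, one board) is left out here, and a different mapping file could be passed in for boards that used another code page.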

Where multiple MS codepages are used in a msg (likely only on forums dealing with classical studies, linguistics, or translations), the whole thing could get even more complicated.  But the number would be VERY FEW to nonexistent, and need not be dealt with unless we actually spot that black swan.

MrPhil:
Since Smart Quotes characters are not necessarily consistent, would it help to have SMF offer several different possible translations on input, and let the user pick the right one? SMF could use that as a teachable moment to lecture users about cutting and pasting in Word documents. What happens even with UTF-8 translation of Smart Quotes -- since it receives only byte values and not the graphic pixels, does the browser know the encoding that was used? If so, can we query that source encoding?

For translation on output (my patch), any encoding information is long gone, and all we can do is use CP-1252 and hope for the best. I suppose it might be possible if Smart Quotes bytes are encountered, to ask the user if the text and punctuation looks all right, and try another translation table if the user says no, but it's conceivable there could be a different Smart Quotes set for each post (and possibly, even within one post!). I doubt that users are going to like answering questions about each post to fine tune the output translation. Some of the CP-12xx encodings replace punctuation characters with accented letters in the Smart Quotes range.
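As a rough illustration in Python (not the actual patch), the sketch below decodes the same byte under a few Windows code pages, then shows the "assume CP-1252 and hope" fallback. The function names and the list of code pages are made up for this example.

```python
# Candidate Windows code pages to try (an arbitrary illustrative selection).
CODE_PAGES = ["cp1250", "cp1251", "cp1252"]

def decode_candidates(raw):
    """Decode the same bytes under each candidate code page."""
    # errors="replace" yields U+FFFD where a code page leaves a byte undefined.
    return {cp: raw.decode(cp, errors="replace") for cp in CODE_PAGES}

def decode_for_output(raw):
    """Output-side guess: accept valid UTF-8, otherwise assume CP-1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")

# Byte 0x80 is the euro sign in CP-1252 but a Cyrillic letter in CP-1251.
for code_page, text in decode_candidates(b"\x80").items():
    print(code_page, repr(text))

print(decode_for_output(b"It\x92s"))
```

With no record of the source encoding, the fallback has nothing to go on but the bytes, so a post written under CP-1251 would still come out wrong.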

Damn Microsoft!
