[4981] [1.x, 2.0] handling MS Smart Quotes

Started by MrPhil, April 25, 2012, 01:16:42 PM


MrPhil

The support boards for all versions of SMF are clogged with reports of certain characters cutting off the rest of a post, or otherwise apparently causing mischief. The root cause of these problems is that people cut and paste text from Microsoft products (especially Word) that contain MS's "Smart Quotes", which are found only in CP-1252 encoding. My proposal is that all incoming text (from TEXT, TEXTAREA, and possibly other input fields) be scanned for Smart Quotes characters (binary), and any found should be replaced by HTML entities. str_replace() might do the job. Here are all the Smart Quotes:


Smart Quote | Glyph | Closest ASCII | UTF-16 value | HTML entity | SQ Description | Reserved Use
80 | € | C= | 20AC | euro | Euro | reserved control
81 | | | | | | reserved control
82 | ‚ | " | 201A | sbquo | Low-"9" opening quotation mark | Break Permitted Here
83 | ƒ | f | 0192 | fnof [1] or 402 | Florin/script f/folder | No Break Here
84 | „ | " | 201E | bdquo | Low-"99" opening quotation mark | Index
85 | … | ... | 2026 | hellip | Ellipsis | Next Line
86 | † | + | 2020 | dagger | Single dagger | Start of Selected Area
87 | ‡ | ++ | 2021 | Dagger | Double dagger | End of Selected Area
88 | ˆ | ^ | 02C6 | circ | Circumflex ^ accent (combining?) | Character Tabulation Set
89 | ‰ | o/oo | 2030 | permil | o/oo per mille | Character Tabulation with Justification
8A | Š | | 0160 | Scaron [1] or 352 | S + caron accent | Line Tabulation Set
8B | ‹ | < | 2039 | lsaquo | Single left angle quote < (guillemet) | Partial Line Down
8C | Œ | OE | 0152 | OElig | OE ligature | Partial Line Up
8D | | | | | | Reverse Line Feed
8E | Ž | | 017D | Zcaron [1] or 381 | Z + caron accent | Single Shift Two
8F | | | | | | Single Shift Three
90 | | | | | | Device Control String
91 | ‘ | ' | 2018 | lsquo | "6" opening quotation mark | Private Use One
92 | ’ | ' | 2019 | rsquo | "9" closing quotation mark/apostrophe | Private Use Two
93 | “ | " | 201C | ldquo | "66" opening quotation mark | Set Transmit State
94 | ” | " | 201D | rdquo | "99" closing quotation mark | Cancel Character
95 | • | * | 2022 | bull | Solid bullet | Message Waiting
96 | – | - | 2013 | ndash | En-dash | Start of Guarded Area
97 | — | -- | 2014 | mdash | Em-dash | End of Guarded Area
98 | ˜ | ~ | 02DC | tilde | Tilde ~ accent (combining?) | Start of String
99 | ™ | (tm) | 2122 | trade | Trademark TM | reserved control
9A | š | | 0161 | scaron [1] or 353 | s + caron accent | Single Character Introducer
9B | › | > | 203A | rsaquo | Single right angle quote > (guillemet) | Control Sequence Introducer
9C | œ | oe | 0153 | oelig | oe ligature | String Terminator
9D | | | | | | Operating System Command
9E | ž | | 017E | zcaron [1] or 382 | z + caron accent | Privacy Message
9F | Ÿ | | 0178 | Yuml | Y + diaeresis/umlaut accent | Application Program Command

Why call this a bug? Because it's been a festering problem for a long time, and really degrades the public image of SMF that it cannot handle something so common as cutting and pasting in Word document text. Just because your average user is too stupid to realize the difference in encodings is the cause of the problem doesn't mean that we can't work around it. It's also very simple to fix -- just define a function to clean the string and call it from wherever SMF takes in user text. Depending on whether BBCode is recognized, and whether HTML entities work, it might be possible to create either a BBCode for each character, or a BBCode to handle generic HTML entities [ent=nnnn] or [ent=name]. Where BBCode is not processed, replace by ASCII character(s).
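A rough sketch of the kind of cleaning function described above might look like the following. The name clean_smart_quotes is made up for illustration and is not an SMF function. The UTF-8 guard is also an assumption of this sketch: in valid UTF-8 the bytes 0x80 through 0x9F occur as continuation bytes, so replacing them blindly would corrupt multibyte characters. The five unassigned CP-1252 bytes (81, 8D, 8F, 90, 9D) are left untouched.

```php
<?php
// Hypothetical helper, not part of SMF: replace CP-1252 "Smart Quotes" bytes
// (0x80-0x9F) with HTML entities. Assumes single-byte (non-UTF-8) input.
function clean_smart_quotes(string $text): string
{
    static $map = [
        "\x80" => '&euro;',   "\x82" => '&sbquo;',  "\x83" => '&fnof;',
        "\x84" => '&bdquo;',  "\x85" => '&hellip;', "\x86" => '&dagger;',
        "\x87" => '&Dagger;', "\x88" => '&circ;',   "\x89" => '&permil;',
        "\x8A" => '&Scaron;', "\x8B" => '&lsaquo;', "\x8C" => '&OElig;',
        "\x8E" => '&#381;',   "\x91" => '&lsquo;',  "\x92" => '&rsquo;',
        "\x93" => '&ldquo;',  "\x94" => '&rdquo;',  "\x95" => '&bull;',
        "\x96" => '&ndash;',  "\x97" => '&mdash;',  "\x98" => '&tilde;',
        "\x99" => '&trade;',  "\x9A" => '&scaron;', "\x9B" => '&rsaquo;',
        "\x9C" => '&oelig;',  "\x9E" => '&#382;',   "\x9F" => '&Yuml;',
    ];

    // Assumption of this sketch: text that is already valid UTF-8 (including
    // plain ASCII) is returned unchanged, because 0x80-0x9F are legitimate
    // continuation bytes there.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }

    // strtr() with an array replaces every listed byte in one pass.
    return strtr($text, $map);
}
```

It would be called on each incoming field before the text is stored, for example clean_smart_quotes($_POST['message']) (the field name here is only an example).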

kat

I can't see how anyone could ignore such a well-presented, detailed report, Mr. P.

Joshua Dickerson

Quote from: K@ on April 25, 2012, 01:29:56 PM
I can't see how anyone could ignore such a well-presented, detailed report, Mr. P.
Wow, yes... couldn't agree more. If there was an award for best bug report (without the fix), I think this might be it.
Come work with me at Promenade Group



Need help? See the wiki. Want to help SMF? See the wiki!

Did you know you can help develop SMF? See us on Github.

How have you bettered the world today?

kat

Let's keep our fingers crossed, then, ay? ;)

vbgamer45

I think it is a great idea. I have run into those issues where users post those special characters all the time, and I am all for anything that fixes them.
Community Suite for SMF - Take your forum to the next level built for SMF, Gallery,Store,Classifieds,Downloads,more!

SMFHacks.com -  Paid Modifications for SMF

Mods:
EzPortal - Portal System for SMF
SMF Gallery Pro
SMF Store SMF Classifieds Ad Seller Pro

emanuele

A doc file with such characters would at least help with testing. ;)

meh...Devs always want more... :P

BTW: it's always MS's fault!!! >:( :P


Take a peek at what I'm doing! ;D




Need support in Italian?

Help us help you: explain your problem clearly. No, "it doesn't work" is not an explanation!!
1) What you do,
2) what you expect,
3) what you get.

Joshua Dickerson


MrPhil

Wow. Maybe I should try complaining about the other things that irritate me to no end!

  • Fix or remove the SMF database backup
  • Really fix the Settings.php being emptied out problem
  • Get rid of the spurious "your database may require an upgrade" warning
  • Make mod install/uninstall either all or nothing -- refuse if ANY manual edits needed (most people ignore the manual edits) -- and validate that all changes were done properly (no more "undefined index" reports)
  • Ensure that version updates do the database update too -- there are far too many systems out there with way back-level databases (even if it's just the smfVersion entry)
  • Make conversion between encodings foolproof -- too many systems end up half-way changed, or a mix of encodings in the database or language files
  • Fix the calendar so that the min/max limits are deltas to the current year (say, CY - 1 through CY + 5, with hard limits 1970 - 2037), or else automatically update the database, so we don't get reports that "I can't enter a 2012 event -- it's too far in the future!"

I consider all of these to be bugs, rather than enhancements, and they should be fixed in SMF 1 too! Properly dealing with this list will greatly reduce the support load, and greatly improve SMF's image.

Antechinus

The rest sound good, but honestly I'm not sure about this one:

"Make mod install/uninstall either all or nothing -- refuse if ANY manual edits needed (most people ignore the manual edits)........................."

MrPhil

Partially installed and partially uninstalled mods are the bane of SMF. Almost every "undefined index" error involves one. My impression is that people tend to accept partial actions and don't understand that they need to go back and manually edit the failed files. Therefore, I think that fully automatic installs/uninstalls should either do 100% of the job, or refuse to do any.

It would be nice to give some assistance on mod installs/uninstalls where SMF can handle some of the files itself, but we need to prevent people from walking off with the job half done. It might be enough to refuse to take the forum out of maintenance mode (or otherwise unlock it) until they have answered "yes" or "no" to each "Did you successfully manually edit file ________?" Maybe add, "do you swear upon your mother's grave..."?? There's probably no automated way to check that the work was done (else it could have been done fully automatically). If they answer "no" to any question, refuse to unlock the forum until they agree to uninstall/reinstall the mod, and can answer "yes" to all. Or, save a backup of each file affected by the mod, and restore all of them to completely roll back the action. If they say "yes" to a question, compare the current file to its backup, to see if something was done. There's all sorts of things that could be done.
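The "back up each file and restore all of them" idea could be sketched roughly as follows. This is not how SMF's package manager works; apply_edits_atomically and its input format (file path => [search, replace]) are invented here purely for illustration.

```php
<?php
// Hypothetical sketch, not SMF package-manager code: apply a search/replace
// edit to every file, or restore all files if any single edit cannot be made.
function apply_edits_atomically(array $edits): bool
{
    $backups = [];

    foreach ($edits as $path => [$search, $replace]) {
        $original = @file_get_contents($path);

        // The edit fails if the file is unreadable or the search text is absent.
        if ($original === false || strpos($original, $search) === false) {
            // Roll back every file already changed, then refuse the install.
            foreach ($backups as $changedPath => $content) {
                file_put_contents($changedPath, $content);
            }
            return false;
        }

        $backups[$path] = $original;
        file_put_contents($path, str_replace($search, $replace, $original));
    }

    return true;
}
```

The same backups could also be compared against the current files after a claimed manual edit, to check whether anything was changed at all.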

If I had to do SMF over from scratch, I would not eval templates either. We always have to tell people to turn off eval and come back with the right error messages. Do you suppose it's possible to automatically turn off eval if an error is detected (and "eval" is in the message), and possibly purge the non-eval message?

Add: A related problem is that the package manager allows you to install a mod twice. As part of the pre-install check, it should see if the target code is already installed, and if so, refuse to proceed. This would eliminate all reports of "duplicate function defined".

emanuele

Quote from: MrPhil on April 25, 2012, 11:47:36 PM
It would be nice to give some assistance on mod installs/uninstalls where SMF can handle some of the files itself, but we need to prevent people from walking off with the job half done. It might be enough to refuse to take the forum out of maintenance mode (or otherwise unlock it) until they have answered "yes" or "no" to each "Did you successfully manually edit file ________?" Maybe add, "do you swear upon your mother's grave..."?? There's probably no automated way to check that the work was done (else it could have been done fully automatically). If they answer "no" to any question, refuse to unlock the forum until they agree to uninstall/reinstall the mod, and can answer "yes" to all.
And they will simply answer "yes" to all even if they didn't do anything at all...

Quote from: MrPhil on April 25, 2012, 11:47:36 PM
Or, save a backup of each file affected by the mod, and restore all of them to completely roll back the action. If they say "yes" to a question, compare the current file to its backup, to see if something was done. There's all sorts of things that could be done.
And so people will come and complain "the mod doesn't install, I installed it, but the function doesn't show up, etc., etc., etc." negatively affecting SMF's image... ;)

Quote from: MrPhil on April 25, 2012, 05:54:38 PM
Wow. Maybe I should try complaining about the other things that irritate me to no end!
Usually a fix is more appreciated, but a report is as well. :P

Quote from: MrPhil on April 25, 2012, 05:54:38 PM

  • Fix or remove the SMF database backup
http://www.simplemachines.org/community/index.php?topic=474901.0

Quote from: MrPhil on April 25, 2012, 05:54:38 PM
  • Really fix the Settings.php being emptied out problem
I don't remember the "final" decision about this for 2.1...

Quote from: MrPhil on April 25, 2012, 05:54:38 PM
  • Get rid of the spurious "your database may require an upgrade" warning
AFAIR quite difficult...unfortunately.
Unless we get rid of them entirely.

Quote from: MrPhil on April 25, 2012, 05:54:38 PM
  • Make mod install/uninstall either all or nothing -- refuse if ANY manual edits needed (most people ignore the manual edits) -- and validate that all changes were done properly (no more "undefined index" reports)
IMHO: a big no.

Quote from: MrPhil on April 25, 2012, 05:54:38 PM
  • Ensure that version updates do the database update too -- there are far too many systems out there with way back-level databases (even if it's just the smfVersion entry)
AFAIR the SMF version is updated only where changes happen. So if during an update the database doesn't change the version is not incremented. But I may be wrong.

Quote from: MrPhil on April 25, 2012, 05:54:38 PM
  • Make conversion between encodings foolproof -- too many systems end up half-way changed, or a mix of encodings in the database or language files
I wouldn't know what is needed to make it foolproof; maybe someone else has an idea?

Quote from: MrPhil on April 25, 2012, 05:54:38 PM
  • Fix the calendar so that the min/max limits are deltas to the current year (say, CY - 1 through CY + 5, with hard limits 1970 - 2037), or else automatically update the database, so we don't get reports that "I can't enter a 2012 event -- it's too far in the future!"
* emanuele thinks about it, but it could need a few changes...

Quote from: MrPhil on April 25, 2012, 05:54:38 PM
I consider all of these to be bugs, rather than enhancements, and they should be fixed in SMF 1 too!
It is quite unlikely that they will be fixed for 1.x. We can fix some of these in 2.1, but even 2.0 would probably not be updated with the corresponding "fixes".



MrPhil

Quote from: emanuele on April 26, 2012, 05:30:04 AM
And they will simply answer "yes" to all even if they didn't do anything at all...
True, but then they will be caught and their lies exposed when SMF compares the current file against the old backup. If they didn't bother to do any editing, the install/uninstall can then be rolled back, and the user informed why.

Quote
And so people will come and complain "the mod doesn't install, I installed it, but the function doesn't show up, etc., etc., etc." negatively affecting SMF's image... ;)
Well, they were told why the mod didn't install (they didn't make the manual edits). If they want to ****** and moan after that, there's not much we can do. I think it's still better to refuse to install than to allow a half-assed job.

Quote
Usually a fix is more appreciated, but a report is as well. :P
I do offer fixes on many of these -- see my sig, especially "Fixes".


  • Fix or remove the SMF database backup
    Quote
    http://www.simplemachines.org/community/index.php?topic=474901.0
    Just for the record, K@ started that topic after we had a PM gripe session over all the long-festering SMF bugs.

  • Really fix the Settings.php being emptied out problem
    Quote
    I don't remember the "final" decision about this for 2.1...
    I would hope that this would be addressed for all SMF versions, not just 2.1. Come on, guys, the fix is very simple!

  • Get rid of the spurious "your database may require an upgrade" warning
    Quote
    AFAIR quite difficult...unfortunately.
    Unless we get rid of them entirely.
    I offer a fix in my sig. It's probably more complex than need be, but the object is to remember which database levels (smfVersion) are acceptable for the current code version. We have both the code version and the database version available -- it's trivial  to keep the information somewhere (new function?) as to what database versions work for the given code version.

  • Make mod install/uninstall either all or nothing -- refuse if ANY manual edits needed (most people ignore the manual edits) -- and validate that all changes were done properly (no more "undefined index" reports)
    Quote
    IMHO: a big no.
    A Big Yes. Something has to be done to keep people from partially installing mods (or uninstalling them) by ignoring warnings that they need to do manual edits.

  • Ensure that version updates do the database update too -- there are far too many systems out there with way back-level databases (even if it's just the smfVersion entry)
    Quote
    AFAIR the SMF version is updated only where changes happen. So if during an update the database doesn't change, the version is not incremented. But I may be wrong.
    If you're right, that's a stupid way to do it. If we're going to compare smfVersion to the code level and scare users with ominous warnings about mismatched version, we need to update smfVersion to match the code version. Otherwise, add code (see above) to allow still-working earlier smfVersion.

  • Make conversion between encodings foolproof -- too many systems end up half-way changed, or a mix of encodings in the database or language files
    Quote
    I wouldn't know what is needed to make it foolproof; maybe someone else has an idea?
    I'm not sure about this one, but we need some ideas. I see far too many systems where they've got a random mixture of database encoding, language support encoding(s), and display page encodings. For later SMF versions (2.1 or 3.0), perhaps the best solution would be to simply mandate UTF-8 for everything.

  • Fix the calendar so that the min/max limits are deltas to the current year (say, CY - 1 through CY + 5, with hard limits 1970 - 2037), or else automatically update the database, so we don't get reports that "I can't enter a 2012 event -- it's too far in the future!"
    Quote
    * emanuele thinks about it, but it could need a few changes...
    The biggest problem would be to ensure that unchanged min/max entries (years) are interpreted as fixed years and not deltas (value >500, treat as year). Perhaps if the max year <= current year + 1, the admin could be prompted to change the values to deltas? It could even be done automatically, with an email to the admin telling what was done.
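The delta-versus-fixed-year rule for the calendar limits could be sketched like this. The function name, the use of signed deltas (-1 for CY - 1, 5 for CY + 5), and the parameter names are assumptions for illustration; only the ">500 means a year" test and the 1970 - 2037 hard limits come from the proposal above.

```php
<?php
// Hypothetical sketch: interpret a stored calendar limit either as a fixed
// year (legacy value, > 500) or as a signed delta from the current year.
function resolve_calendar_year(int $setting, int $currentYear): int
{
    // Values above 500 cannot be sensible deltas, so treat them as years.
    $year = $setting > 500 ? $setting : $currentYear + $setting;

    // Clamp to the hard limits suggested in the proposal.
    return max(1970, min(2037, $year));
}
```

With this rule an untouched legacy setting such as 2008 keeps meaning the year 2008, while a setting of 5 always means five years ahead of whatever the current year is.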

Quote from: MrPhil on April 25, 2012, 05:54:38 PM
I consider all of these to be bugs, rather than enhancements, and they should be fixed in SMF 1 too!
Quote
It is quite unlikely that they will be fixed for 1.x. We can fix some of these in 2.1, but even 2.0 would probably not be updated with the corresponding "fixes".
They're all serious bugs that greatly detract from SMF's base functionality and sully its reputation, and need to be fixed in all release branches. In most cases, acceptable fixes are quite trivial to implement. They may not be the most elegant code, but they do the job to keep SMF from failing.

MrPhil

To add two more to the list before they slip my mind again...

  • Consolidate hard coded permissions in one function, and during configuration find out if 755, 775, or 777 is the necessary permission for SMF to write to a directory (likewise for files 644, 664, or 666). If 777 is absolutely required, see about automatically changing it back to 755 after the upload operation is done.

    SMF still has hard coded 777 directory permissions, which won't even work on some systems (e.g., running suPHP), and are often a security hazard. Let's get permissions straightened out to use the least expansive permissions for any case.
  • With SMF 2 we seem to be getting lots of reports of problems with the SMF cache. It clearly has problems. Let's disable it by default until it can be fixed.

Hmm. how about a third:
  • Adjust the CSS to put some vertical space between list items.

emanuele

Quote from: MrPhil on April 27, 2012, 10:23:42 AM
  • Consolidate hard coded permissions in one function, and during configuration find out if 755, 775, or 777 is the necessary permission for SMF to write to a directory (likewise for files 644, 664, or 666). If 777 is absolutely required, see about automatically changing it back to 755 after the upload operation is done.

    SMF still has hard coded 777 directory permissions, which won't even work on some systems (e.g., running suPHP), and are often a security hazard. Let's get permissions straightened out to use the least expansive permissions for any case.
There are already 2 or 3 topics about this on this board, and it is tracked in Mantis.
There are a couple of ideas (don't remember if here or in some less public board), nothing definitive.

Quote from: MrPhil on April 27, 2012, 10:23:42 AM
  • With SMF 2 we seem to be getting lots of reports of problems with the SMF cache. It clearly has problems. Let's disable it by default until it can be fixed.
I'm aware of 2 problems:
1) the "/" in the keys (reported and tracked, waiting for a fix);
2) the "random" (but not exactly random) broken cached files that *should* be fixed in 2.1, but we cannot be sure until we have it tested in a (multitude of) "real-world" case(s)...

Is there any other problem with the cache?



MrPhil

Quote from: emanuele on April 27, 2012, 11:19:16 AM
There are already 2 or 3 topics about this on this board, and it is tracked in Mantis.
There are a couple of ideas (don't remember if here or in some less public board), nothing definitive.
Fine, so long as progress is made and soon SMF is only using proper permissions (not just hard coded 777). SMF could do some testing at the top of installation, and build a permissions.php file to hold the defines. Something like

<?php
// consolidate permissions in one place

define ('WRITABLE_FILE', 0664);
define ('WRITABLE_DIR',  0775);

define ('GENERAL_FILE', 0644);
define ('GENERAL_DIR',  0755);

define ('RO_FILE', 0444);
define ('RO_DIR',  0555);
?>


and pull this in from Settings.php or something.

Quote
Is there any other problem with the cache?
I don't know of specific problems, except that I see a lot of "file not found" reports in the support board. Emptying out the cache seems to do the trick, but SMF still needs fixing. If it isn't easily fixed, and doesn't speed up the system all that much, I'd just get rid of it.

emanuele

Quote from: MrPhil on April 27, 2012, 12:10:13 PM
SMF could do some testing at the top of installation
Running tests is one of the proposals, but I'd run them more frequently than just during the installation.

Quote from: MrPhil on April 27, 2012, 12:10:13 PM
Quote
Is there any other problem with the cache?
I don't know of specific problems, except that I see a lot of "file not found" reports in the support board. Emptying out the cache seems to do the trick, but SMF still needs fixing.
Clean SMF without mods?
What cache level?
A list of topics could help in finding a pattern and reduce the possibilities.
The second problem I mentioned (that is actually the most important) should lead to "Parse error"-like messages, but not file not found.

Quote from: MrPhil on April 27, 2012, 12:10:13 PM
If it isn't easily fixed, and doesn't speed up the system all that much, I'd just get rid of it.
Reading your posts, I'm starting to think we should get rid of everything! :P
Just joking, of course. ;)



MrPhil

Quote from: emanuele on April 27, 2012, 01:00:41 PM
Quote from: MrPhil on April 27, 2012, 12:10:13 PM
SMF could do some testing at the top of installation
Running tests is one of the proposals, but I'd run them more frequently than just during the installation.
My proposal is to test once, during installation, to see what permissions are needed for various directory and file operations. Write out a permissions.php file with the defines and use that everywhere permissions are set. Why would more frequent testing be needed? A host system changing their PHP setup will be a very rare event. In such cases, if the user has problems, they can either manually edit the permissions.php file, or run some utility to rebuild it.

P.S. Make sure that this works properly on Windows servers, too, even if we operate with Unix-style "ugo" permissions.

Quote
Quote from: MrPhil on April 27, 2012, 12:10:13 PM
Quote
Is there any other problem with the cache?
I don't know of specific problems, except that I see a lot of "file not found" reports in the support board. Emptying out the cache seems to do the trick, but SMF still needs fixing.
Clean SMF without mods?
What cache level?
A list of topics could help in finding a pattern and reduce the possibilities.
Anyone interested in pursuing this could search for "cache" without "browser" and find all the reports. They could then contact the member with the original problem and ask for further details.

MrPhil

'Nuther one. A simple fix, but perhaps more of an enhancement:
  • If SMF can't connect to the database, it sends out an email. The wording of this email is confusing to noobs and they post here, asking why SMF (this forum) is trying to connect to their database. The forum title may not be available, but perhaps $mbname or $boardurl from Settings.php could be used in the message to clarify what SMF is trying to connect?

kat

Even easier, just have it say "Your forum cannot communicate with your database."

emanuele

Quote from: MrPhil on April 27, 2012, 05:24:15 PM
My proposal is to test once, during installation, to see what permissions are needed for various directory and file operations. Write out a permissions.php file with the defines and use that everywhere permissions are set. Why would more frequent testing be needed? A host system changing their PHP setup will be a very rare event. In such cases, if the user has problems, they can either manually edit the permissions.php file, or run some utility to rebuild it.
First because SMF already does a lot of checks, secondly because a host changing its PHP setup is rare, but a user changing their host is not so rare. :P

Quote from: MrPhil on April 27, 2012, 05:24:15 PM
Anyone interested in pursuing this could search for "cache" without "browser" and find all the reports. They could then contact the member with the original problem and ask for further details.
http://www.simplemachines.org/community/index.php?action=search2;search=cache+-browser

50 pages...sorry too much for me. I give up.

Quote from: MrPhil on April 28, 2012, 12:06:36 PM
'Nuther one. A simple fix, but perhaps more of an enhancement:
  • If SMF can't connect to the database, it sends out an email. The wording of this email is confusing to noobs and they post here, asking why SMF (this forum) is trying to connect to their database. The forum title may not be available, but perhaps $mbname or $boardurl from Settings.php could be used in the message to clarify what SMF is trying to connect?
$mbname . ': SMF Database Error!', 'There has been a problem with the database!' . ($db_error == '' ? '' : "\n" . $smcFunc['db_title'] . ' reported:' . "\n" . $db_error) . "\n\n" . 'This is a notice email to let you know that SMF could not connect to the database, contact your host if this continues.');

$mbname is actually the first thing on the email.
How could it be more clear?
A link to the forum?
Where? At the end?



MrPhil

Quote from: emanuele on April 28, 2012, 04:15:31 PM
First because SMF already does a lot of checks, secondly because a host changing its PHP setup is rare, but a user changing their host is not so rare. :P
It just seems odd to me to have SMF go through the labor of testing access modes for various operations, on a regular basis. E.g., create a file with folder 755, then 775, and finally 777 until it works. Any logged error messages could really make forum owners nervous, so I wouldn't recommend doing this often. Invoke this utility in repair_settings.php, so it will be run whenever they change hosts (they need to run repair_settings anyway).
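The probing order described above (755, then 775, then 777) could look roughly like this. probe_dir_mode is a made-up name, not something in SMF or repair_settings.php, and the sketch assumes the PHP process is allowed to chmod the directory it is testing. On hosts where the directory belongs to a different user (for example, uploaded over FTP), chmod() may fail and a real utility would need to report that instead.

```php
<?php
// Hypothetical probe, not SMF code: return the least permissive mode from
// $modes under which this PHP process can create a file in $dir.
function probe_dir_mode(string $dir, array $modes = [0755, 0775, 0777]): ?int
{
    $probe = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . 'perm_probe_' . uniqid() . '.tmp';

    foreach ($modes as $mode) {
        // Skip modes we are not allowed to set on this directory.
        if (!@chmod($dir, $mode)) {
            continue;
        }
        clearstatcache();

        // Writing a throwaway file is the actual test of "writable".
        if (@file_put_contents($probe, 'x') !== false) {
            @unlink($probe);
            return $mode;
        }
    }

    return null; // none of the candidate modes allowed a write
}
```

The result could then be written out once into the permissions.php defines, so the probing (and any error-log noise from it) only happens at installation or when the repair utility is run.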

Quote
$mbname . ': SMF Database Error!', 'There has been a problem with the database!' . ($db_error == '' ? '' : "\n" . $smcFunc['db_title'] . ' reported:' . "\n" . $db_error) . "\n\n" . 'This is a notice email to let you know that SMF could not connect to the database, contact your host if this continues.');

$mbname is actually the first thing on the email.
How could it be more clear?
A link to the forum?
Where? At the end?
Well, evidently it's not clear enough to many users. Change "SMF could not connect to the database" in the message to "your SMF forum could not connect to its database". That ought to be reasonably clear.
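For illustration only, the reworded notice could be assembled like this. build_db_error_notice is an invented helper, not the function SMF uses to send the email, and appending $boardurl on a last line is just one possible answer to the "where?" question; the rest of the wording follows the existing message quoted above.

```php
<?php
// Hypothetical helper: build the subject and body of the database-down
// notice using the clearer wording, with the forum URL on the last line.
function build_db_error_notice(string $mbname, string $boardurl, string $db_title, string $db_error): array
{
    $subject = $mbname . ': SMF Database Error!';

    $body = 'There has been a problem with the database!'
        . ($db_error === '' ? '' : "\n" . $db_title . ' reported:' . "\n" . $db_error)
        . "\n\n" . 'This is a notice email to let you know that your SMF forum could not connect to its database, contact your host if this continues.'
        . "\n" . 'Forum: ' . $boardurl;

    return [$subject, $body];
}
```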

IchBin™


MrPhil

Yet another...


  • See http://www.simplemachines.org/community/index.php?topic=475480 . The problem is that WYSIWYG mode in the editor (including the presumably vanilla one in this very forum) adds all sorts of extra tags all over the place. It also corrupts hand-entered tags (non-WYSIWYG mode). Going into WYSIWYG mode and back out is not perfectly reversible -- cruft builds up.

    Anyway, the WYSIWYG editor either needs some serious fixing, or removal. Besides, I'm not even sure it deserves the appellation "WYSIWYG". I should be able to click on a mode (e.g., font size) and start typing away in the new mode. Doing an immediate in-place preview after highlighting and pressing a button is not WYSIWYG in anyone's book.

Should I open up separate topics for each item I've listed? I seem to have accumulated quite a few bugs in this one topic...

Antechinus

You can just start typing away in whatever mode. I just did. ;)

MrPhil

OK, that's odd. It didn't work when I tried it earlier. I cheerfully withdraw that complaint  :laugh:

I do notice that when I tested in WYSIWYG mode, and turned it off, there were some end tags dumped into the TEXTAREA window here that I had to delete. So, I guess that is still a problem.

Antechinus

Oh I'm not claiming the thing is perfect. I never use it myself though so I don't personally care. I much prefer the old editor.

emanuele

There are many reports about the WYSIWYG editor, and I fixed some of them (in most cases I posted the fix in the topic). Please check whether one of those fixes this specific problem (hopefully I posted a "/me hates WYSIWYG" or similar in every related topic, which should allow you to find them by searching for WYSIWYG in this board and/or in the fixed or bogus boards).
And let's not forget that the behaviour of the WYSIWYG editor is in several cases browser dependent, so it would be best to ask for/specify the browser that causes problems. ;)

Additionally, yes, it would be better to have a topic per report (but a search before posting would be highly appreciated).



MrPhil

Here's a patch for handling MS Smart Quotes: http://www.simplemachines.org/community/index.php?topic=476538.msg3333723#msg3333723

It's lightly tested, but might be a good starting place.


Angelina Belle

SMF 2.0 does seem to try to do windows-1252 and windows-1253 to UTF-8 in function ConvertUtf8.
But I am not sure it uses the correct mappings. I wonder about 0x80 -- the euro sign, for example.

For a list of microsoft code page to UTF-8 mappings, please see http://en.wikipedia.org/wiki/Windows_code_page#List and follow links to information at unicode.org

I have used a little regex to use 1252 and 1253 information for characters 0x80 through 0x9F to convert these to php.
This could easily be repeated for the entirety of every one of these code pages.

The result could be made available as a separate file, so that all those conversion tables would only need to be loaded when UTF-8 conversion is required.  Which is not nearly as often as ManageMaintenance is required.

Code (windows-1252):
'0x80' => '0x20ac',    //Euro Sign
'0x81' => '0x0081',    //Not assigned in windows-1252; passed through as a control code
'0x82' => '0x201a',    //Single Low-9 Quotation Mark
'0x83' => '0x0192',    //Latin Small Letter F With Hook
'0x84' => '0x201e',    //Double Low-9 Quotation Mark
'0x85' => '0x2026',    //Horizontal Ellipsis
'0x86' => '0x2020',    //Dagger
'0x87' => '0x2021',    //Double Dagger
'0x88' => '0x02c6',    //Modifier Letter Circumflex Accent
'0x89' => '0x2030',    //Per Mille Sign
'0x8a' => '0x0160',    //Latin Capital Letter S With Caron
'0x8b' => '0x2039',    //Single Left-Pointing Angle Quotation Mark
'0x8c' => '0x0152',    //Latin Capital Ligature Oe
'0x8e' => '0x017d',    //Latin Capital Letter Z With Caron
'0x91' => '0x2018',    //Left Single Quotation Mark
'0x92' => '0x2019',    //Right Single Quotation Mark
'0x93' => '0x201c',    //Left Double Quotation Mark
'0x94' => '0x201d',    //Right Double Quotation Mark
'0x95' => '0x2022',    //Bullet
'0x96' => '0x2013',    //En Dash
'0x97' => '0x2014',    //Em Dash
'0x98' => '0x02dc',    //Small Tilde
'0x99' => '0x2122',    //Trade Mark Sign
'0x9a' => '0x0161',    //Latin Small Letter S With Caron
'0x9b' => '0x203a',    //Single Right-Pointing Angle Quotation Mark
'0x9c' => '0x0153',    //Latin Small Ligature Oe
'0x9e' => '0x017e',    //Latin Small Letter Z With Caron
'0x9f' => '0x0178',    //Latin Capital Letter Y With Diaeresis


Windows-1253 is meant to support greek, but also includes symbols in the 0x80-0x9F range.

The symbols it does include in this range are identical to the symbols from the 1252 set.

Code (windows-1253)
'0x80' => '0x20ac',    //Euro Sign
'0x82' => '0x201a',    //Single Low-9 Quotation Mark
'0x83' => '0x0192',    //Latin Small Letter F With Hook
'0x84' => '0x201e',    //Double Low-9 Quotation Mark
'0x85' => '0x2026',    //Horizontal Ellipsis
'0x86' => '0x2020',    //Dagger
'0x87' => '0x2021',    //Double Dagger
'0x89' => '0x2030',    //Per Mille Sign
'0x8b' => '0x2039',    //Single Left-Pointing Angle Quotation Mark
'0x91' => '0x2018',    //Left Single Quotation Mark
'0x92' => '0x2019',    //Right Single Quotation Mark
'0x93' => '0x201c',    //Left Double Quotation Mark
'0x94' => '0x201d',    //Right Double Quotation Mark
'0x95' => '0x2022',    //Bullet
'0x96' => '0x2013',    //En Dash
'0x97' => '0x2014',    //Em Dash
'0x99' => '0x2122',    //Trade Mark Sign
'0x9b' => '0x203a',    //Single Right-Pointing Angle Quotation Mark
Never attribute to malice that which is adequately explained by stupidity. -- Hanlon's Razor

MrPhil

One solution frequently given for this problem is to convert your forum to UTF-8. However, I would be concerned that this is relying on possibly non-standard behavior by the browser to convert Smart Quotes bytes into the equivalent UTF-8 character multibytes on input. Does anyone have any documentation guaranteeing this behavior for all reasonably recent browsers? I've seen enough reports of failures with UTF-8 to suspect that this is not universally implemented. How about if UTF-8 is not the page display encoding (e.g., Latin-1 is)? Some browsers actually use CP-1252 when Latin-1 is requested, but again, that's non-standard and risky. Some browsers may simply treat x80 through x9F as Smart Quotes (upon output), regardless of the encoding, which again is non-standard.

I think that SMF should take care of converting this range of single bytes to HTML entities, which is what my patch does, regardless of the encoding used. x80 through x9F should be interpreted as control codes in any encoding except CP-xxxx (Microsoft-specific), where it should do no harm to convert them to entities, even though they would be properly displayed anyway.
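
To make the idea concrete, here is a minimal sketch of that kind of byte-to-entity replacement. It is written in Python purely for illustration (SMF itself is PHP), and it is not the actual fix_SmartQuotes routine from the patch: the function name smart_quotes_to_entities is made up, and it emits numeric entities rather than the named ones in the table from the first post. It also assumes the input is in a single-byte encoding such as Latin-1 or CP-1252; on raw UTF-8 bytes it would not be safe, because UTF-8 continuation bytes fall in the same 0x80 - 0x9F range.

```python
import re

# Bytes 0x80-0x9F that CP-1252 assigns to printable characters, mapped to
# Unicode code points (taken from the windows-1252 table posted in this thread).
CP1252_HIGH = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
    0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160,
    0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019,
    0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9E: 0x017E, 0x9F: 0x0178,
}


def smart_quotes_to_entities(raw: bytes) -> bytes:
    """Replace CP-1252 bytes 0x80-0x9F with numeric HTML entities.

    Hypothetical helper: bytes with no CP-1252 assignment (0x81, 0x8D,
    0x8F, 0x90, 0x9D) are left untouched.
    """
    def repl(match):
        code_point = CP1252_HIGH.get(match.group()[0])
        if code_point is None:
            return match.group()
        return b"&#%d;" % code_point

    return re.sub(rb"[\x80-\x9f]", repl, raw)
```

Because the output is plain ASCII, the result displays the same whatever encoding the page is served in, which is the point of doing it with entities.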

I just spent a great deal of time and effort with a member who was having trouble with Smart Quotes on a Latin-1 system. From the limited debugging I could get them to do, it sounds like the Smart Quotes were being cut off upon input, which means they didn't even survive to get to the database, and my patch would have no effect. That's yet another behavioral mode, but I'm not sure SMF could do anything about that, especially if it's the browser cutting off the input at the first control code it encounters. It's still possible that it's the database doing the dirty work, in which case the Smart Quotes might be translated upon input by SMF (my fix_SmartQuotes routine could probably be used on input). This still would not handle Smart Quotes already in the database, but that could be taken care of on output by my patch.

There remains the question of what the database itself will do with Smart Quotes when converting text to UTF-8. Will it assume that they are control codes (and leave them alone) or Smart Quotes (to be changed)? Do we need to give a hint by changing Latin-1 fields to CP-1252, etc.?

MrPhil

#31
Quote from: AngelinaBelle on May 29, 2012, 02:12:10 PM
SMF 2.0 does seem to try to do windows-1252 and windows-1253 to UTF-8 in function ConvertUtf8.
But I am not sure it uses the correct mappings. I wonder about 0x80 -- the euro sign, for example.

For a list of Microsoft code page to UTF-8 mappings, please see http://en.wikipedia.org/wiki/Windows_code_page#List and follow links to information at unicode.org

I have used a little regex to use 1252 and 1253 information for characters 0x80 through 0x9F to convert these to php.
This could easily be repeated for the entirety of every one of these code pages.

The result could be made available as a separate file, so that all those conversion tables would only need to be loaded when UTF-8 conversion is required.  Which is not nearly as often as ManageMaintenance is required.

I wouldn't worry about CP-1253 being a subset of CP-1252 in the Smart Quotes range. It's unlikely that CP-1253 text would include any unused characters found in CP-1252. If they do (say, text cut and pasted from CP-1252 Word), no biggie... it's translated correctly.  A bigger problem would be any conflicts between the two (both use a given byte value, but for different glyphs). Have you seen such a thing? That would mean that my patch would have to be extended to be sensitive to the particular input encoding (and even then, how can we tell what the original source was for this presumably cut and pasted text?).

Don't worry about the Greeks not being able to handle a Smart Quotes Euro -- they'll be dropping the Euro soon enough! ;)

Add: I just took a very quick look at the CP-xxxx pages, and at a minimum it appears that 1250, 1251, 1256, and 1257 are not proper subsets or supersets of 1252 (in the x80 - x9F Smart Quotes range). That is, they have different glyphs for some byte codes. This means my patch will have troubles with Smart Quotes pasted in from any of those encodings. It's still an improvement in that the browser won't choke on the bytes, but a failure in that the wrong glyph might be rendered. To pick the right one means that the encoding used for the Smart Quotes needs to be known, and saved with the post, and you know that 99.99% of users aren't going to have the slightest idea what an encoding is, much less which one was in use!
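
For anyone who wants to check this without reading the code page charts by eye, here is a rough sketch that compares the x80 - x9F range across code pages. It is Python rather than PHP and leans on Python's built-in cp125x codecs as the reference tables, which is an assumption (they are not the SMF tables). The helper names high_range and conflicts are made up for this example.

```python
def high_range(codec: str) -> dict:
    """Map each byte 0x80-0x9F to its character in `codec`.

    Bytes the code page leaves unassigned are simply omitted.
    """
    table = {}
    for b in range(0x80, 0xA0):
        try:
            table[b] = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            pass
    return table


def conflicts(codec: str, reference: str = "cp1252") -> dict:
    """Bytes assigned in both code pages, but to different characters."""
    ref, other = high_range(reference), high_range(codec)
    return {
        b: (ref[b], other[b])
        for b in ref.keys() & other.keys()
        if ref[b] != other[b]
    }


for name in ("cp1250", "cp1251", "cp1253", "cp1256", "cp1257"):
    clashing = sorted(conflicts(name))
    print(name, "conflicts with cp1252 at:", [hex(b) for b in clashing])
```

A byte that one code page assigns and the other leaves empty is not counted as a conflict here, only bytes where both define a glyph and the glyphs differ.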

Arantor

It's so non-standard, all the major browsers have been doing it for *years*. Even IE 5 gets that right, let alone IE 6. It's how the specification actually says it should be handled, too.

The problem you're describing, inconsistent handling of invalid characters, is browser-specific and varies depending on how the given browser handles data in a completely unsupportable situation (since CP-125? characters cannot be represented in ISO-8859-1 or -2). Some browsers send the character through as-is, some truncate.

You could, theoretically, attempt to fix some of those at the browser level with JavaScript before sending through the content. (It is of little surprise, then, that most of the WYSIWYG editors actually have a 'Paste from Word' entry for this very thing.) The reliability of this seems questionable to me, however.

The question of conversion is another big one, and it's one I tend to gloss over when suggesting conversion because invariably people seem to then re-edit the posts afterwards anyway meaning it is simply less of an issue.

Converting Latin-? to CP-whatever is almost as nightmarish a situation to be in as anything else, in fact.

The bottom line is that if you're using UTF-8 from the start, there's just not a problem - as evidenced here and in other places, and that's been the case for years too.

Angelina Belle

Converting any CP-1252 characters already in the database is something I don't understand.
These characters would have to have been in the database for a long time, if what Arantor says is true (Microsoft browsers as old as IE5 did the conversion correctly during "paste from Word"). Of course, other, non-Microsoft browsers might not have handled, and may still not handle, this in the way the ordinary user would expect... I don't know.

1252 and 1253 seem to use the x80 - x9F codes in exactly the same way.
I have not figured out how SMF uses the translation tables it has for 1252 and 1253.  The mappings don't seem to match those given at unicode.org, and would seem to result in a "strange character" instead of the euro symbol, and for each of the other smartquote characters, unless I completely misunderstand what is going on during conversion to UTF-8.

It is the "convert to UTF8" process that I would investigate.  In addition to the x80 - x9F character questions, it does not cover various code pages that might have been used during "paste from Word" in several other languages. One would need to know WHICH code page to use. Greek. Chinese. Korean?

Arantor

There are two separate things I indicated there. Browsers as old as IE5 were able to handle the UTF-8 conversion as needed, but what we've seen more recently are WYSIWYG editors with a specific 'paste from Word' option, though I always understood that was to handle formatting, not characters themselves.

Note that it has always worked on *this* forum without any patches.

MrPhil

@Arantor, you keep contradicting yourself. First you say that Smart Quotes conversion (for UTF-8) is guaranteed, and then you say different browsers do it in different ways. What I want to know is if proper handling of Smart Quotes (conversion to standard code page's bytes) is guaranteed, at least for the vast majority of browsers in common use (including IE6). It doesn't matter what your personal experience has been -- what counts is whether there are standards and whether they're adhered to. Is this an official W3C specification, or is it up to individual browser authors?

If browsers are guaranteed to have no problem with UTF-8 encoding, when dealing with cut-and-pasted input text (not only Smart Quotes, but also any other encoded source), we move on to the question of why so many users report problems with Smart Quotes text, including those using UTF-8. Then comes the question of whether database conversion to UTF-8 is guaranteed to work properly (for existing posts). Finally, if browsers will handle Smart Quotes properly on input, what's to keep them from applying the same transformations on output? If a given byte is a character on input, it should be the same character on output.

I can promise you that most people are not going to go back and edit their earlier posts to clean up after a conversion to UTF-8. If the conversion did not change control codes/Smart Quotes into legal characters, there will be problems. I can also promise you that very few people are going to use "Paste from Word", even if it's available. Those that do may well be hoping to bring over Word formatting rather than Smart Quotes characters.

Arantor

No, you're just misreading what I'm saying.

UTF-8 has been handled consistently for a long time. This is in the specification.

CP-whatever to ISO is not handled consistently and never has been, because the specification doesn't cover it.

This is the bottom line of what I've been saying. And actually, you'd be surprised about people going back to edit posts - invariably it's reported early on, users make the change and thereafter it stops being a problem.

*shrug* At this point I might as well butt out because you're happy to keep raking over a problem that was solved years ago by everyone else. For my part I have no problem with what SMF does because for the system I actually care about, we haven't got any of this problem, and won't now ever have this exact problem. I just thought I'd try and bring the experience I learned in fighting the problem, but you're content to deal with fringe cases that do not generally present.

Most of the problems caused by misencoding are because people do crazy stuff like dropping in UTF-8 language files when nothing else is UTF-8 or vice versa. So much hassle for new users could actually be solved by simply defaulting UTF-8 to on in the install.

IchBin™

Quote from: Arantor on May 30, 2012, 03:51:14 PM
So much hassle for new users could actually be solved by simply defaulting UTF-8 to on in the install.

QFT! Couldn't agree more. :)

Angelina Belle

Clearly, the problem is very bothersome for what seems to be a small number of users.
And I've heard it is a big problem for Chinese users. This is possibly due not to Smart Quotes (CP-1252), but to another code page.

SMF, as far as I can tell, has no ability to convert msg bodies based on what MS code page may or may not have been used when the post was created.

Therefore some code ranges might be converted in a way that is not consistent with the "best match" mapping available from unicode.org.
The result could be "strange characters".

Therefore, the only way to fix the problem for this small number of users is to either tell them to find-and-fix any "very old" MS code page mess by hand (unlikely) or to create a freestanding UTF-8 conversion tool that can be given a mapping (very easily generated with a little regexp from the mappings files available at unicode.org) and apply it to "all msgs" (or even "list of msg_ids", for multi-language boards).
Generating lists of messages is a separate problem, but can be done. "all msgs on board=10" for example.
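
As a rough illustration of how small such a tool could be, here is a sketch that turns mapping lines into a lookup table and applies it. It is in Python only to keep it short and runnable on its own; a real tool for SMF would presumably be PHP. The line layout ("0xNN, whitespace, 0xNNNN, then ;Name") is an assumption about the unicode.org files, so the sample is embedded rather than downloaded, and parse_mapping and convert are made-up names.

```python
import re

# Sample lines in the layout assumed here: "0xNN<TAB>0xNNNN<TAB>;Name".
SAMPLE = """\
0x80\t0x20ac\t;Euro Sign
0x82\t0x201a\t;Single Low-9 Quotation Mark
0x85\t0x2026\t;Horizontal Ellipsis
"""

LINE = re.compile(r"^(0x[0-9a-fA-F]{2})\s+(0x[0-9a-fA-F]{4})\s*;?\s*(.*)$")


def parse_mapping(text: str) -> dict:
    """Return {byte value: (Unicode code point, character name)}."""
    mapping = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            byte, code_point, name = match.groups()
            mapping[int(byte, 16)] = (int(code_point, 16), name)
    return mapping


def convert(raw: bytes, mapping: dict) -> str:
    """Decode single-byte text with the table.

    Bytes missing from the table pass through as the code point with the
    same number (i.e. as Latin-1).
    """
    return "".join(
        chr(mapping[b][0]) if b in mapping else chr(b) for b in raw
    )
```

Running convert over each message body selected by the msg_id list, with the table for the chosen code page, would be the whole conversion step; picking the right code page per message is still the hard part.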

Where multiple MS codepages are used in a msg (likely only on forums dealing with classical studies, linguistics, or translations), the whole thing could get even more complicated.  But the number would be VERY FEW to nonexistent, and need not be dealt with unless we actually spot that black swan.

MrPhil

Since Smart Quotes characters are not necessarily consistent, would it help to have SMF offer several different possible translations on input, and let the user pick the right one? SMF could use that as a teachable moment to lecture users about cutting and pasting in Word documents. What happens even with UTF-8 translation of Smart Quotes -- since it receives only byte values and not the graphic pixels, does the browser know the encoding that was used? If so, can we query that source encoding?

For translation on output (my patch), any encoding information is long gone, and all we can do is use CP-1252 and hope for the best. I suppose it might be possible if Smart Quotes bytes are encountered, to ask the user if the text and punctuation looks all right, and try another translation table if the user says no, but it's conceivable there could be a different Smart Quotes set for each post (and possibly, even within one post!). I doubt that users are going to like answering questions about each post to fine tune the output translation. Some of the CP-12xx encodings replace punctuation characters with accented letters in the Smart Quotes range.

Damn Microsoft!

Angelina Belle

Here's an order of operations for dealing with MOST OF these code page problems:

1) Allow the user to optionally apply one or more CP -> UTF-8 mappings in order. EXCEPT do not replace "unassigned" with some default character.  Is it possible to have a post in Korean with some content from one of the Cyrillic code pages?  Or can step 1 be limited to 1 code page?  One of the choices here is Latin 1; Western European (CP-1252)
-- This step should take care of most of the "weird characters" users have complained about by properly translating them to the "best fit" UTF-8, based on the chosen code page they originated from.

2) In all cases, finish by applying the section 0x80 through 0x9F from CP-1252  -> UTF-8
-- This should take care of all of the "post is cut off" or "page is ruined" problems caused by  these control characters (0x80 - 0x9F) appearing in a UTF-8 XHTML document.

3) If there are any remaining "post is cut off" problems, replace all "self-mapped" control characters in this range with a placeholder such as "?".  These are 0x81, 0x8d, 0x8f, 0x90, 0x9d.
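
The three steps above can be sketched roughly as follows. This is Python for illustration only (not SMF code), it handles single-byte code pages only (a multi-byte page such as a Korean one would need real decoding in step 1), and it takes just one code page for step 1. The function name repair and the use of Python's built-in codecs as the mapping tables are assumptions.

```python
# CP-1252 reading of each byte 0x80-0x9F, built from Python's cp1252 codec.
C1_VIA_CP1252 = {}
for b in range(0x80, 0xA0):
    try:
        C1_VIA_CP1252[b] = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        pass  # 0x81, 0x8D, 0x8F, 0x90, 0x9D have no CP-1252 assignment


def repair(raw: bytes, code_page: str = "cp1252") -> str:
    """Hypothetical three-step cleanup of single-byte text."""
    # Step 1: apply the chosen code page byte by byte. Bytes it leaves
    # unassigned keep their own value instead of getting a default character.
    chars = []
    for b in raw:
        try:
            chars.append(bytes([b]).decode(code_page))
        except UnicodeDecodeError:
            chars.append(chr(b))

    # Step 2: anything still in U+0080-U+009F gets the CP-1252 reading.
    chars = [C1_VIA_CP1252.get(ord(c), c) for c in chars]

    # Step 3: control characters that survive both steps become "?".
    return "".join("?" if 0x80 <= ord(c) <= 0x9F else c for c in chars)
```

For example, repair(b"\x88", "cp1251") gives the Euro sign from the Cyrillic table in step 1, while the same byte under "cp1253" is unassigned there and falls through to the CP-1252 circumflex in step 2.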




http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/

Code page mappings which DO NOT ENTIRELY AGREE with CP-1252 -> UTF-8 mappings:
cp-1251 -- Cyrillic, which uses some of these codes differently, but some the same.  The Euro sign is moved to 0x88, for example
cp-1256 -- Arabic -- many differences.
cp-1257 -- Baltic -- several differences

This means it is ALWAYS best to apply a language-specific code page mapping to take care of "weird characters" (MS code page characters that were not correctly mapped to the best-fit UTF-8 code) BEFORE trying to fix the 0x80 - 0x9F "smart-quotes" characters that cause the "post cut off" problems.


More information is available in the readme file http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt


MrPhil

Well, given that people will cut and paste from Word documents until the End of Time, and some browsers will interpret x80 through x9F as control codes (if they haven't fixed up the pasted bytes already), something's got to be done. I don't care if you use my patch or do something else, but something needs to be done about it. Enough cases of breakage are reported, even with UTF-8 forums, to lead me to suspect that not all browsers are changing Smart Quotes characters to good UTF-8 characters upon input. In addition, there are older posts where Smart Quotes may or may not have been properly converted to UTF-8, and forums not in UTF-8. My patch is better than nothing, but there may be better ways to deal with this. Have at it!


Angelina Belle

Phil,

The use of XHTML entities is nice for non-UTF-8 forums.

So there are a couple of options

1) for dealing with current "SmartQuote" problems in non-UTF-8 or in UTF-8 forums -- your patch will work great.  This will deal properly with most of the Microsoft Code Page character problems.

2) for dealing with the UTF-8 conversion for forums where use of other Microsoft code pages will cause "strange characters" after the UTF-8 conversion, a different thing would be needed. This would be important for posts in several non-Latin character sets, including Greek (cp-1253). SMF is currently doing something about the 1252 and 1253 character sets, but I'm not sure WHAT it does, exactly.

Arantor

After discussing this with Oldiesmann, we're not going to fix this in 1.1 or 2.0: 1.1 is about to be EOL'd; 2.0 is only receiving security updates and minor bug fixes that aren't a problem to implement in a patch; both branches have a viable workaround in converting to UTF-8 (and 2.1 is UTF-8 by default in any case, though it also has support for this); and the code is not the easiest to backport either.
