Pasted text with special characters like single/double quotes gets cut off

Started by ArrayInteractive, April 02, 2012, 10:00:25 PM

ArrayInteractive

Hey Folks,

I've seen a few people mention this, but haven't found a solution. So I hope I can add more info.

Many of my users copy large chunks of text from external sources like Word documents and paste them into forum posts. Sometimes special characters like single/double quotes (the curled ones vs. the straight ones) get copied over as well. In the post preview everything displays fine, but once the user hits the Post button, the post content is cut off at the first curly quote.

I figure something is happening to the copy when it is either being written to the database or being prepared for the database. Is there somewhere the special characters could be stripped/converted?

I find the best workaround for now is to tell people to first paste the text into Notepad, then copy it from Notepad into the forum post. This strips all special characters. However, this is much easier said than done for some folks ;)

Any thoughts?

Thanks.
smf 2.0.2


MrPhil

The referenced post gives partially incorrect information. The problem is that Microsoft created their own extension of the Latin-1 character encoding (called CP-1252) that grabs a bunch of reserved control characters and uses them for "smart quotes". If your page is displaying in anything but CP-1252, there's a chance that those characters will be interpreted as control codes, messing up your display. Cutting and pasting does not change character codes while keeping the same glyph -- it just glomps in the same bytes as used in the source, regardless of what those bytes mean in the forum's character encoding.

See my sig > FAQs > Microsoft "Smart Quotes" for more information. The best long-term solution would be a feature (or even mod) in SMF to look through entered text and replace Smart Quotes with HTML entities (or plain quotes), if the forum is not CP-1252 encoded. (This assumes that all the Smart Quotes characters will survive to be examined by PHP code.) The next best solution would be to change Latin-1 encoding to CP-1252, but this won't work if you need non-Latin characters (are using UTF-8).

The referenced post is wrong in that changing to UTF-8 will do nothing to help. The characters in use are incorrect and unusable in either Latin-1 or UTF-8.
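
For anyone who wants to see the byte-level problem described above, here is a minimal Python sketch. It is not SMF code (SMF is PHP), and the sample bytes are an assumption about what a Word paste typically contains. It only shows how the same four "smart quote" bytes are read under CP-1252, Latin-1 and UTF-8.

```python
import unicodedata

# Assumed sample: bytes 0x91-0x94 are the CP-1252 curly quotes.
raw = b"\x93smart\x94 and \x91curly\x92"

# Under CP-1252 the bytes are real punctuation characters.
as_cp1252 = raw.decode("cp1252")

# Under Latin-1 (ISO-8859-1) the same bytes map to U+0091..U+0094,
# which are C1 control characters, not printable glyphs.
as_latin1 = raw.decode("latin-1")
latin1_controls = [c for c in as_latin1 if 0x80 <= ord(c) <= 0x9F]
latin1_categories = {unicodedata.category(c) for c in latin1_controls}

# Under UTF-8 the raw bytes are not a valid sequence at all.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(as_cp1252)
print(latin1_categories)  # "Cc" is the Unicode category for control characters
print(valid_utf8)
```

This only covers the raw bytes. What a browser actually submits from a page that is served as UTF-8 is a separate question.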

kat

Mr. P... Would you mind if I corrected that entry, as per your observations?

Even better, do you wanna do it, if you're able to? ;)



ArrayInteractive

Yup that's definitely the problem. Thanks for the details.

Now to 'try' and explain what's going on and workarounds to my members... <sigh>

:)

MrPhil

Since this is such a common problem (cut and paste Word text with "smart quotes" into SMF, resulting in invalid bytes), I wish that SMF would be updated to always search input text for these characters and replace them by either regular ASCII characters, or HTML entities. How 'bout it, core and mod authors?
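
As a rough illustration of the kind of clean-up being asked for, here is a minimal Python sketch. It is not an SMF patch: the function names and the character table are made up for this example, and it assumes the text has already reached the code as properly decoded Unicode. One function swaps the usual Word punctuation for plain ASCII, the other turns anything a Latin-1 forum cannot store into numeric HTML entities.

```python
# Hypothetical mapping of common Word "smart" punctuation to ASCII.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
})


def to_ascii_punctuation(text: str) -> str:
    """Replace smart punctuation with plain ASCII equivalents."""
    return text.translate(SMART_TO_ASCII)


def to_html_entities(text: str) -> str:
    """Keep Latin-1 characters and turn everything else into &#NNNN; entities."""
    return text.encode("latin-1", errors="xmlcharrefreplace").decode("latin-1")


sample = "we\u2019ll paste \u201cthis\u201d"
print(to_ascii_punctuation(sample))
print(to_html_entities(sample))
```

The entity version keeps the curly look on display, while the ASCII version is the safer choice if the text is later searched or edited as plain text.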

ArrayInteractive

I keep getting lots of member complaints about this on my forum. Hopefully soon I'll go poking around and see if I can make the adjustment. It's probably not that hard to add some extra rules into the string handling. The harder part is probably finding where this actually needs to happen.

Anyone happen to know where the function is that writes posts to the database?

keyrocks

I installed my first SMF project online a few days ago and have a few users helping out as 'beta' testers before we officially launch it. They've all reported the same problem.

Copying and pasting from MS Word into a Notepad file, then from there into the SMF textarea, doesn't always work.
For example, a word with an apostrophe taken from an MS Word document will still cut the text off at the apostrophe.

Test string: Here is a test of the word we'll pasted here to see if the text cuts off after we - we'll - is this showing? Yes it is, on this Forum... but it won't on my SMF 2.0.2 Forum. Why is that?

Storman™

Please don't hijack someone else's topic especially when it's 6 weeks old.

However, that aside I think you'll find the answer to your question in this topic above and the posted links if you reread carefully.

keyrocks

Quote from: Storman on May 18, 2012, 11:43:54 AM
Please don't hijack someone else's topic especially when it's 6 weeks old.
However, that aside I think you'll find the answer to your question in this topic above and the posted links if you reread carefully.

I don't know what you mean by "don't hijack someone else's topic especially when it's 6 weeks old"... the age of the topic should not matter when the problem persists... and hijacking... is there an issue here with posting on a topic started by someone else? I did read the Wiki post and have figured that out just fine, thanks. :)

ArrayInteractive

Hmmm, I thought the wiki article just gave an explanation, not an actual solution for SMF. Will check again. I never did get around to hacking up a fix, and this problem continues to plague my members.


ArrayInteractive

Thanks for the patch. I made the changes, but it did not work in my case, as my forum is set to ISO-8859-1. I'm a little leery about converting to UTF-8, but may give it a go on a rainy day. My users would be very thankful for a fix.

Arantor

The only real fix is to convert to UTF-8, which will prevent it in the future.

MrPhil

This subject has been addressed many times before. The problem is that MS uses code page 1252 with "Smart Quotes" in its PC products. When you cut and paste this text into a different code page, such as ISO-8859-1/Latin-1 or UTF-8, the byte codes are generally interpreted as control codes (the 0x80 through 0x9F range, used by Smart Quotes but standardized as control codes). This results in text being cut off either upon entry or upon display.

It is reported that a number of browsers, when your page is displayed in UTF-8 encoding, will properly recognize this common problem (situation) and translate pasted text into the appropriate UTF-8 byte sequences (giving you the correct graphics). This is why it doesn't work with older posts -- it's too late to translate the Smart Quotes, which are already in the database. This seems to work quite often, but as it is not an official standard, I wouldn't rely on it. You're better off editing text in a plain text editor, or configuring your Word or Outlook to not use Smart Quotes, or just being careful to manually clean up all Smart Quotes before saving a post. You could also manually change your page display encoding to CP-1252 (Windows), which is the same as Latin-1, except that the upper control codes have been replaced by Smart Quotes characters. Most browsers support CP-1252, and some even use CP-1252 when Latin-1 is requested.

Maybe some day SMF will be fixed to do this for you (translate Smart Quotes to something acceptable for the display encoding in use). Apparently some other forum and blog products already do this.
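
To make the two claims above concrete, here is a small Python sketch (again not SMF code; the helper name `cp1252_to_utf8` is made up for this example). It checks that CP-1252 and Latin-1 agree on every byte outside the 0x80-0x9F range, and shows what "the appropriate UTF-8 byte sequences" look like for one Smart Quote.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode CP-1252 bytes as UTF-8 (hypothetical helper)."""
    return raw.decode("cp1252").encode("utf-8")


# Bytes 0x00-0x7F and 0xA0-0xFF mean the same thing in both encodings.
shared = bytes(range(0x00, 0x80)) + bytes(range(0xA0, 0x100))
same_outside_c1 = shared.decode("cp1252") == shared.decode("latin-1")

# A Word-style apostrophe is the single byte 0x92 in CP-1252,
# but a three-byte sequence in UTF-8.
word_bytes = b"we\x92ll"
utf8_bytes = cp1252_to_utf8(word_bytes)

print(same_outside_c1)  # True
print(utf8_bytes)       # b'we\xe2\x80\x99ll'
```

Text already stored with the raw 0x92 byte would need this kind of re-encoding applied explicitly. Switching the page encoding alone does not rewrite what is already in the database.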

Arantor

Quote: Maybe some day SMF will be fixed to do this for you (translate Smart Quotes to something acceptable for the display encoding in use). Apparently some other forum and blog products already do this.

Most of the ones that do, do so by way of the 'Paste from Word' functionality as seen in TinyMCE.

The bottom line is that while you're technically right, most users don't care and most users just want it to work with the minimal amount of fuss - which for the vast majority is to just convert to UTF-8 whereupon the browser should be doing the work, and in the vast majority of cases, it does it just fine.

It's funny how every time this comes up, it stops being a problem after people convert to UTF-8. I would also note that since I migrated my own stuff to be strictly UTF-8 only, I never had any problems with it...

Antechinus


ArrayInteractive

Quote from: Arantor on October 16, 2012, 03:33:58 PM
It's funny how every time this comes up, it stops being a problem after people convert to UTF-8. I would also note that since I migrated my own stuff to be strictly UTF-8 only, I never had any problems with it...

I've found some info on how to do the conversion, but was wondering if there is a standard, tried-and-true thread on how to convert?

