Strange symbols are appearing in my topics

Started by meca1, March 10, 2010, 12:31:42 PM


meca1

I'm not sure if this is the correct place to ask this -- but these strange symbols keep appearing in some of my older topics:

ÃÆ'ƒÂÃ,¢ÃÆ'Ã,¢Ã¢â‚¬Ã...¡Ã‚Ã,¬ÃÆ'Ã,¢Ã¢â‚¬Ã...¾Ã‚Ã,¢t ÃÆ'ƒÂÃ,¢ÃÆ'Ã,¢Ã¢â‚¬Ã...¡Ã‚Ã,¬ÃÆ'Ã,¢Ã¢â‚¬Ã...¾Ã‚Ã,Â

They always appear in older posts where an apostrophe, quotes or dashes should be. Can anyone tell me why? It's very frustrating and time-consuming to go back through all the threads and posts to fix them.... PLEASE HELP!

Kindred

someone is pasting from Word, would be my first guess... in other words, the quotes are special "curly" characters instead of just a simple straight "


but, if this is on your own forum, you should ask for help in the support section for your version of SMF (1.1.x or 2.0).
Also, when reporting a problem that you would like help with, include your SMF version, the mods installed, and the URL to your forum or to where the issue is appearing.
Слaва
Украинi

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."

kat

Hmmmm...


I'm wondering if you may have been hacked.


Do you have any strange files in your forum directory?


As Kindred said, if you tell us which version of SMF you're using, we'd have a clue, at least.

meca1

No, there are no strange posts. And I'm very careful about my members. I check them all out before approval to keep spammers out. I'm hoping this is what you're asking: it says Version SMF 1.1.11

kat

Not "Posts". Files. Are all your files OK?

meca1

Oops - sorry. There are no strange files that I'm aware of. And I haven't noticed anything else that's strange -- just this.

kat

In that case, I'd suspect a language problem.


Can you give us a link to a post showing this? (One that's open to guest-viewing, obviously).

meca1

Thank you Kat - Go to this thread: http://meca1.com/forum/index.php?topic=211.0

It's only open to members but for the moment, I've changed that so guests can see it. Please let me know when you're finished.


meca1

I changed it back for now - please tell me when you want to go in there. Or I can PM you with a password.

kat

Password might be good. We're obviously in different time-zones. :)

perplexed

I've seen this before on other forums, when they have converted to SMF from other forum software, and once for me when I changed hosts.  Can't remember the cause now, a different version of... something *scratches head*

Kindred

oh..  good catch perplexed... it was in the back of my mind too....   it looks like a character failure.

What is the database language set to?
Are the text values (i.e. the apostrophes, quotes, etc.) in the DATABASE correct (i.e. does the failure occur in the database storage or in the retrieval/display)?

MrPhil

This is only where "apostrophe, quotes or dashes" should be? As @Kindred pointed out, that sounds a lot like someone used Word to create their posting, and brought over its (sadly misnamed) "smart quotes" with the text. If that's what happened, you'll have no choice but to edit the posts in question to clean them up with proper characters.

Now, what's different about these "older posts", that it doesn't happen (?) in newer posts? Were these "older posts" imported from another forum system? Was the forum encoding (database and/or page display) changed after these "older" posts? It looks like maybe your old text was stored in UTF-8, but now is being displayed in Latin-1 encoding.
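To make that last guess concrete, here is a minimal Python sketch of how UTF-8 text read back as a single-byte encoding turns into strange symbols, and how repeated rounds make the junk grow. It is only an illustration, not a diagnosis of this particular forum: the helper names `misread` and `undo_misread` are made up for the example, and it uses CP-1252 as the "wrong" encoding because browsers treat pages labelled Latin-1 as CP-1252.

```python
# Sketch: UTF-8 bytes decoded with the wrong single-byte encoding.
# misread() and undo_misread() are hypothetical helper names for this example.

original = "don\u2019t"  # "don't" typed with a curly apostrophe (U+2019)


def misread(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode as UTF-8, then decode those bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def undo_misread(text: str, wrong_encoding: str = "cp1252") -> str:
    """Reverse one round of misread(); fails if the text is not pure mojibake."""
    return text.encode(wrong_encoding).decode("utf-8")


once = misread(original)
twice = misread(once)  # each extra round multiplies the garbage

print(once)   # donâ€™t
print(twice)  # longer still, starting to resemble the symbols in the first post
print(undo_misread(undo_misread(twice)) == original)  # True
```

If the stored text has been through a mix of encodings, or contains bytes that CP-1252 leaves unassigned, the reverse step raises an error instead of repairing anything, so try it on a copy first.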

meca1

MrPhil -- I believe you may have hit it on the head. They were actually text that I had copied and pasted from different things online that were items of interest. The reason they were 'older' is that I had posted them a few months ago. So... if in the future I copy and paste such stuff, and I delete the quotes and dashes at that time and re-type them, will that prevent the problem?

Kindred

you should never paste from Word to online...

If you need to paste, then paste to a text editor (notepad, if you must, but there are better ones like notepad++ and others) before pasting to the online box

meca1


MrPhil

Yeah, if you edit the posts at some point to change the "smart quote" items to normal, properly encoded characters, they should display correctly. The problem is that they won't be as "pretty" as the original text, especially if your forum is running in Latin-x rather than UTF-8.

If you find yourself frequently cutting and pasting from sources that don't match your forum database and display encoding (e.g., UTF-8 or CP-1252 on other pages, and your forum is Latin-1), you may want to consider converting your forum to another encoding (UTF-8). Note that the process assumes that what's currently in your database (for posts) is correct Latin-1 encoded text, not some hybrid mishmash of encodings.  When you cut and paste, you're bringing over the byte codes for the text you see, in whatever encoding that page is displayed in. You're not bringing over an "em-dash", say, you're bringing over the byte code x97 or whatever CP-1252 uses. If that encoding doesn't match your forum, you will experience the strange symbols.
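As a rough illustration of "same bytes, different encoding", the Python sketch below decodes one made-up byte string three ways. The sample bytes are an assumption chosen for the example (CP-1252 curly quotes around "Hi", an em-dash at 0x97, and a curly apostrophe); nothing here is taken from the forum in question.

```python
# Sketch: one byte string, three interpretations.
# 0x93/0x94 are curly double quotes, 0x97 the em-dash and 0x92 the curly
# apostrophe in CP-1252. The sample text itself is invented.
raw = b"\x93Hi\x94 \x97 it\x92s"

as_cp1252 = raw.decode("cp1252")   # what the author of the page intended
as_latin1 = raw.decode("latin-1")  # same bytes become invisible control codes

print(as_cp1252)        # curly quotes, dash and apostrophe show up correctly
print(repr(as_latin1))  # '\x93Hi\x94 \x97 it\x92s'

try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False  # lone bytes in the 0x80-0x9F range are not valid UTF-8

print(valid_utf8)       # False
```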

A forum (and database) in UTF-8 will be able to display any "reasonable" symbol in a post. Note that whether a UTF-8 character can be displayed depends upon the fonts installed on the viewer's computer, not on anything found on your site! I think all the "smart quotes" should be found on just about any PC, so that should be safe. The big problem will be converting cut-and-pasted text from Latin-1 or CP-1252 or whatever to UTF-8. It won't happen automatically. Most browsers do not make it easy to enter characters not found on the keyboard.

  • You can manually edit the posts and approximate the offending characters with something found on your keyboard (crude, but fast). That would work even if you stay with Latin-1.
  • You can stay with a Latin-1 database, but change your page encoding to CP-1252 (if that's what you're primarily copying from). That might require a tweak to the SMF code to list CP-1252 as the page encoding instead of Latin-1 (ISO-8859-1). The "smart quote" characters would go into the database unchanged, and properly display upon retrieval. Note that if you pull in any text from a non-CP-1252 page, this won't work.
  • Change to UTF-8 database and page display encoding. You can learn which original characters are problems (especially "smart quotes") and hopefully replace them with their numeric character references (&# followed by the decimal Unicode code point of that character followed by a semicolon I think will work, e.g. &#8217; for a closing curly quote -- give it a try). For a small set of offending characters, that might be feasible -- at least you'll get the proper text displayed. For massive amounts of text, that's too much work.
  • Convert to UTF-8 (database and page encoding). You can cut and paste to a file on your PC, and run a little utility to search out "smart quotes" and replace them with UTF-8 character codes. They'll probably look very odd on your PC, but you should be able to cut and paste them to your forum posts and have them show up correctly (as UTF-8).
  • Something else?
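For the first and fourth options, the "little utility" could look something like this Python sketch. It is only a starting point under stated assumptions: the function names and the choice of replacement characters are mine, and the second function assumes the saved file really is CP-1252.

```python
# Sketch of a clean-up utility; all names here are hypothetical.

# Keyboard approximations for the most common "smart" punctuation.
ASCII_APPROXIMATIONS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en-dash
    "\u2014": "--",   # em-dash
    "\u2026": "...",  # ellipsis
}
_TABLE = str.maketrans(ASCII_APPROXIMATIONS)


def approximate(text: str) -> str:
    """Option 1: swap smart punctuation for plain keyboard characters."""
    return text.translate(_TABLE)


def cp1252_to_utf8(data: bytes) -> bytes:
    """Option 4: re-encode bytes saved as CP-1252 into UTF-8."""
    return data.decode("cp1252").encode("utf-8")


sample = "\u201cdon\u2019t\u201d \u2014 wait\u2026"
print(approximate(sample))             # "don't" -- wait...
print(cp1252_to_utf8(b"it\x92s"))      # b'it\xe2\x80\x99s'
```

The approximation works whatever encoding the forum uses, since everything it produces is plain ASCII. The re-encoding only helps once the forum itself is in UTF-8.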

Quote: How do I know if it's from Word?
You can't really tell if a Web page was itself cut and pasted from Word. Sometimes you can see quotes that are screwed up (much like the text in your first post). If something online uses "typographically correct" opening and closing quotation marks, suspect that it came from Word. You can look at the page source (View > Page source) and see what page encoding it lists (e.g., CP-1252).
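If you would rather not hunt through the page source by eye, a rough Python sketch like the one below pulls the declared encoding out of a saved page. The function name and the regular expression are my own assumptions, it only looks at the meta tag, and a server can also declare the encoding in an HTTP header that this never sees.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    r'<meta[^>]*charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE
)


def declared_charset(html: str) -> Optional[str]:
    """Return the charset named in a meta tag, lower-cased, or None."""
    match = _META_CHARSET.search(html)
    return match.group(1).lower() if match else None


old_style = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
new_style = '<META charset="UTF-8">'

print(declared_charset(old_style))        # windows-1252
print(declared_charset(new_style))        # utf-8
print(declared_charset("<p>no meta</p>")) # None
```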

@Kindred's warning applies primarily to text directly cut and pasted from Word on a PC into an SMF post (or anything else online, not just SMF). You have no way of really knowing where HTML in a page came from, except by following hints given in the previous paragraph. Of course you'll know whether or not you're cutting and pasting directly from Word!

P.S. You should review your use of material cut and pasted from other Web pages. Make sure it would be considered "fair use" of copyrighted material, and not theft.

lloydb

Working from Word to html has embarrassed me a few times.

But you can turn off the dumb 'smart quotes' stuff. I'm too busy to look it up now but you can find it if you check all your options. Since doing that I have been able to copy from Word and then paste into my html editor without a single problem.

The 3 dots in a row that most people are ending their sentences with these days is also a MS special character. It can make a similar mess. It is rare in anything I do. Can't remember if I have that turned off in Word or if I just go and rub it out after it is in the editor.

MrPhil

Yeah, but most people have no idea that "smart quotes" is even in operation, or that it can be turned off. As long as people use Word to type in text, we're going to have this problem. I have no idea why people would even use Word just to type in text for an SMF post -- maybe it's just force of habit for them? After all, the formatting and stuff isn't even preserved during cut and paste.

The "three dots in a row" is an ellipsis. That's another thing that Word pretties up, along with various dashes and quotation marks.

Anybody up for a mod to look for "smart quote" byte codes and turn them into HTML entities or the appropriate numeric character reference (& #nnnn; where nnnn is the decimal Unicode code point)? Here's the official Microsoft list for CP-1252 (drop the space after each & to get the actual entity):

   Hex byte    Unicode / HTML entity           Name

    80         U+20AC  & euro;                 Euro
    82         U+201A  & sbquo;                Low-9 opening quotation mark
    83         U+0192  & fnof;  & #402; *      Florin/script f/folder     
    84         U+201E  & bdquo;                Low-99 opening quotation mark
    85         U+2026  & hellip;               Ellipsis
    86         U+2020  & dagger;               Single dagger
    87         U+2021  & Dagger;               Double dagger
    88         U+02C6  & circ;                 Circumflex ^ accent (combining?)
    89         U+2030  & permil;               o/oo per mille
    8A         U+0160  & Scaron;  & #352; *    S + caron accent
    8B         U+2039  & lsaquo;               Single left angle quote < (guillemet)
    8C         U+0152  & OElig;                OE ligature
    8E         U+017D  & Zcaron;  & #381; ?    Z + caron accent
    91         U+2018  & lsquo;                6 opening quotation mark
    92         U+2019  & rsquo;                9 closing quotation mark/apostrophe
    93         U+201C  & ldquo;                66 opening quotation mark
    94         U+201D  & rdquo;                99 closing quotation mark
    95         U+2022  & bull;                 Solid bullet
    96         U+2013  & ndash;                En-dash
    97         U+2014  & mdash;                Em-dash
    98         U+02DC  & tilde;                Tilde ~ accent (combining?)
    99         U+2122  & trade;                Trademark TM
    9A         U+0161  & scaron;  & #353; *    s + caron accent
    9B         U+203A  & rsaquo;               Single right angle quote > (guillemet)
    9C         U+0153  & oelig;                oe ligature
    9E         U+017E  & zcaron;  & #382; ?    z + caron accent
    9F         U+0178  & Yuml;                 Y + diaeresis/umlaute accent

* recent addition, may not work on all browsers
? very few, if any, browsers support this

These bytes are not printable characters in Latin-1 and are invalid on their own in UTF-8; I don't know for sure if any other legitimate encoding uses them.
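For anyone tempted by that mod idea, here is the core of it as a Python sketch rather than actual SMF code. It assumes the post text has been read as Latin-1, so the CP-1252 bytes from the table show up as characters in the 0x80 to 0x9F range; the function name `c1_to_entities` is made up for the example.

```python
# Sketch of the "mod" logic only; not SMF code, and the function name is
# hypothetical. Input is text read as Latin-1, where CP-1252 punctuation
# bytes appear as characters U+0080..U+009F.

def c1_to_entities(text: str) -> str:
    """Replace stray CP-1252 bytes with decimal numeric character references."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                real = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # 81, 8D, 8F, 90 and 9D are unassigned in CP-1252: leave as-is.
                out.append(ch)
                continue
            out.append(f"&#{ord(real)};")
        else:
            out.append(ch)
    return "".join(out)


stored = b"\x93Hi\x94 \x97 it\x92s".decode("latin-1")
print(c1_to_entities(stored))  # &#8220;Hi&#8221; &#8212; it&#8217;s
```

Because the output is pure ASCII, it should display the same whether the forum page is served as Latin-1, CP-1252 or UTF-8.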
