Any way to clean up formatting codes?

Started by Krashsite, July 31, 2013, 08:11:02 PM


Sir Osis of Liver


One of my guys is having problems copying text from articles on other sites.  Sometimes "" appear as ??.  Is there a mod or edit that will prevent this?



When in Emor, do as the Snamors.
                              - D. Lister

Colin

"If everybody is thinking alike, then somebody is not thinking." - Gen. George S. Patton Jr.

Colin

MrPhil

Quote: copying text from articles on other sites
Um, that sounds like possible copyright violations (copying more text than is considered "fair use"). Be careful about that.

The source text probably originated in Word or Outlook, and contains "Smart Quotes". These are Windows-1252-specific single-byte characters whose byte values are control codes in Latin-1 and are not valid on their own in UTF-8. If the source page is in UTF-8, or provides a UTF-8 copy for the clipboard (as Word does), having your SMF forum in UTF-8 will often permit successful pasting into the forum. If the text has passed through a non-UTF-8 application, or the application doesn't offer a UTF-8 clipboard load, then who knows what will happen. Most likely the Smart Quotes come over untranslated as single bytes in the x80-x9F range, which will display in UTF-8 as ?-in-black-diamond ("invalid character") or even cause the post to be cut off at that point (what usually happens in Latin-1).

The solution lies in making sure that the source text is UTF-8, and the forum is UTF-8. Any other combination will probably fail if it includes Smart Quotes. Existing Smart Quotes text in a Latin-1 SMF will not be properly converted if you change the forum to UTF-8, so what's done is done. The UTF-8 fix is only for new posts.
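
The byte-level behaviour described above can be checked with a short Python sketch. It is only an illustration and not part of SMF; the variable names are made up, and the sample text is an assumption standing in for whatever was pasted.

```python
# Left and right double Smart Quotes as Unicode characters.
smart = "\u201cquoted\u201d"

# Windows-1252 stores each Smart Quote as one byte in the 0x80-0x9F range.
cp1252_bytes = smart.encode("cp1252")
print(cp1252_bytes)   # b'\x93quoted\x94'

# UTF-8 stores each of them as a three-byte sequence instead.
utf8_bytes = smart.encode("utf-8")
print(utf8_bytes)     # b'\xe2\x80\x9cquoted\xe2\x80\x9d'

# Read as Latin-1, the Windows-1252 bytes become C1 control characters.
as_latin1 = cp1252_bytes.decode("latin-1")
print([hex(ord(c)) for c in as_latin1 if ord(c) > 0x7F])   # ['0x93', '0x94']

# Read as UTF-8, the lone bytes are invalid and become U+FFFD,
# the "?-in-black-diamond" replacement character.
as_utf8 = cp1252_bytes.decode("utf-8", errors="replace")
print(as_utf8 == "\ufffdquoted\ufffd")   # True
```

The same characters survive only when the bytes are produced and read with the same encoding, which is why source and forum both need to be UTF-8.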

Sir Osis of Liver


That's the big worry - if I convert to UTF-8 it may screw up existing posts.  It's an old forum with a large database, around 350K posts.  Simplest thing to do is copy the text into Notepad, then copy it from Notepad to the post editor.  I routinely did that with code I copied out of code tags here, before I started using a macro to reformat.  I've suggested that, will see if he goes with it.

When in Emor, do as the Snamors.
                              - D. Lister

MrPhil

They're not code tags (markup byte(s)). They are characters which use non-standard (Microsoft products only) encoding. If a block of text you copy in contains characters (single bytes) in the range x80 - x9F, they will cause problems when most web browsers try to display them, as they will be interpreted as control codes. Word (and maybe Outlook) provides contents to the clipboard in both Windows-1252 and in UTF-8, so if your forum is (already) UTF-8, there's a good chance that it can take the UTF-8 clipboard and successfully import it, as the offending Smart Quotes characters will already have been changed to proper UTF-8 characters.

Once it gets into the Latin-1 database as Smart Quotes encoded characters (x80 - x9F), the only way out would be to back up (export) the database (as Latin-1, if it asks), drop all the tables, change the database to UTF-8 encoding, and import (restore) the backup, declaring the .sql file to be Windows-1252 encoded. If all goes well, the database is UTF-8 and all its text has been converted from Windows-1252 (Latin-1 + Smart Quotes) to UTF-8. This assumes that MySQL/phpMyAdmin knows how to translate an imported .sql file from Windows-1252 to UTF-8.

To finish, you will have to manually change your forum setting from Latin-1 output to UTF-8 (I seem to recall that means adding a record to the settings table). You might be able to tell SMF to convert to UTF-8 (after importing), and the only thing it ends up doing is adding the settings record (as all tables and fields are already in UTF-8), but I have not tried this. You can optionally change HTML entities to UTF-8 characters -- that will save some room in the database.

And of course, have a good backup (in Latin-1) of the original database that you know how to restore, should something go terribly wrong with all this.
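
If the import tool cannot do the Windows-1252 to UTF-8 translation itself, the exported file could in principle be transcoded beforehand. The following is a rough Python sketch under that assumption; the function names and file names are hypothetical, it has not been tried against an SMF dump, and any charset declarations inside the .sql file (table definitions and the like) would still have to be adjusted separately.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes that were written as Windows-1252 into UTF-8."""
    # Python's cp1252 codec leaves five byte values unassigned
    # (0x81, 0x8D, 0x8F, 0x90, 0x9D). errors="replace" turns those
    # into U+FFFD instead of raising an exception.
    text = raw.decode("cp1252", errors="replace")
    return text.encode("utf-8")


def convert_dump(src_path: str, dst_path: str) -> None:
    """Transcode a whole dump file (hypothetical helper).

    Reads the file into memory in one go, so a very large dump
    may need to be processed in pieces instead.
    """
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(cp1252_to_utf8(raw))
```

A call would look like `convert_dump("backup_latin1.sql", "backup_utf8.sql")`, with both file names being placeholders. Keep the untouched original export as well, since this step rewrites every non-ASCII byte.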

You keep saying "control codes", but from your usage in the first post, it looks like you're referring to Smart Quotes characters. Is the original source of this text Word or Outlook? With some source programs, it might be possible that real "control codes" are embedded in addition to the visible characters, such as codes to shift in and out of italic, or change fonts. This will vary from program to program, especially for native PC programs (not accessed through a browser).

Sir Osis of Liver


Apparently the text is copied from various sources, so formatting will vary. 

My point about copying text from code tags on this forum is that it's not plain text - it carries font face/size format into Wordpad, where I'm editing code as plain text.  Formatting codes are not visible, and they're stripped off when I copy into and out of Notepad. 

What my guy is seeing on his forum is more like what you're describing, characters that browsers don't recognize and display incorrectly.  If it continues to be a problem, maybe I'll try converting the db to UTF-8 per your instructions.  He's running on Hostmonster, and their phpMyAdmin seems somewhat crippled, so I don't know if it'll do the conversion.

When in Emor, do as the Snamors.
                              - D. Lister

MrPhil

If they're not standard HTML somehow coming over, then they will be proprietary formatting codes. I don't think there's any standard for these things. They'd have to be handled on a case-by-case basis. HTML could probably be semi-automated into BBCode (I recall there was a discussion recently about some utilities to do that). If you have Wordpad or whatever in "markup" mode (not "visual"), a converter to BBCode could be written.
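
As a rough idea of what such a converter might look like, here is a minimal Python sketch. It is not one of the utilities mentioned above, and `html_to_bbcode` is a made-up name. It assumes simple, well-formed HTML and handles only bold, italic, underline and links; it uses regular expressions rather than a real HTML parser, so nested or malformed markup may not come out right.

```python
import re

# HTML tag name -> BBCode tag name for simple inline formatting.
SIMPLE_TAGS = {"b": "b", "strong": "b", "i": "i", "em": "i", "u": "u"}


def html_to_bbcode(html: str) -> str:
    """Convert a small subset of HTML to BBCode and drop all other tags."""
    out = html
    for html_tag, bb_tag in SIMPLE_TAGS.items():
        # \b keeps <b> from matching <br>, <i> from matching <img>, etc.
        out = re.sub(rf"<{html_tag}\b[^>]*>", f"[{bb_tag}]", out,
                     flags=re.IGNORECASE)
        out = re.sub(rf"</{html_tag}\s*>", f"[/{bb_tag}]", out,
                     flags=re.IGNORECASE)
    # Links with a double-quoted href become [url=...]...[/url].
    out = re.sub(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a\s*>',
                 r"[url=\1]\2[/url]", out,
                 flags=re.IGNORECASE | re.DOTALL)
    # Anything still in angle brackets is removed.
    return re.sub(r"<[^>]+>", "", out)


sample = '<p>See <b>this</b> and <a href="http://example.com">that</a><br></p>'
print(html_to_bbcode(sample))
# See [b]this[/b] and [url=http://example.com]that[/url]
```

Proprietary formatting codes from native programs would not be touched by this; it only covers the case where the pasted content really is HTML.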

Cutty

There was no problem pasting text from outside our forum for many weeks since we opened in July. All of a sudden the language conflicts began to show up and the only change that was made was adding the security patch issued on 8/12/13, upgrading from 2.0.4 to 2.0.5.

Any suggestions why the change? Could the update have anything to do with it?

I've been reluctant to use the UTF-8 conversion for reasons stated in previous posts of this thread.

Thanks, Cutty

Oldiesmann

How exactly could posts get screwed up by the conversion? I've never heard of that happening, and if it's such a big concern, you should back up your database before attempting the conversion so you don't lose any data if something should go wrong (and it likely won't). If the concern is due to the hosting environment, it might be a good idea to invest in a better host.

The problem is that the characters in question are using a different encoding than your forum, so they're not displayed properly. The only options are to strip the invalid characters or convert to UTF-8 as suggested.
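
For the "strip the invalid characters" option, one possible approach is to clean the text before it is posted. The Python sketch below is an illustration only and not an SMF feature; `ASCII_FALLBACKS` and `clean_pasted_text` are hypothetical names, and the choice of replacements is an assumption. It swaps common Smart Quotes style punctuation for plain ASCII and drops stray control characters.

```python
# Common Windows-1252 punctuation mapped to plain ASCII stand-ins.
ASCII_FALLBACKS = str.maketrans({
    "\u2018": "'", "\u2019": "'",    # single Smart Quotes
    "\u201c": '"', "\u201d": '"',    # double Smart Quotes
    "\u2013": "-", "\u2014": "-",    # en and em dashes
    "\u2026": "...",                 # ellipsis
})


def clean_pasted_text(text: str) -> str:
    """Replace Smart Quotes style punctuation and drop invalid leftovers."""
    replaced = text.translate(ASCII_FALLBACKS)
    # Remove C1 control characters (U+0080-U+009F) and the U+FFFD
    # replacement character; other text is left alone.
    return "".join(
        ch for ch in replaced
        if not 0x80 <= ord(ch) <= 0x9F and ch != "\ufffd"
    )


print(clean_pasted_text("\u201cHi\u201d \u2026 it\u2019s\x93"))
# "Hi" ... it's
```

This loses the typographic quotes for good, so it is a workaround rather than a fix; converting to UTF-8 keeps the original characters.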
Michael Eshom
Christian Metal Fans

Cutty

Quote from: Oldiesmann on November 05, 2013, 11:50:10 AM
How exactly could posts get screwed up by the conversion?

The problem is that the characters in question are using a different encoding than your forum, so they're not displayed properly. The only options are to strip the invalid characters or convert to UTF-8 as suggested.

I can't answer your question, that's why I asked mine after reading reply #3.

I've been aware that I should have a backup handy should I try the conversion to UTF-8, yet you say that I probably won't need it. The suggestion to have a backup ready, in itself, casts at least some doubt on whether it will go off without a glitch. This I can deal with, but how do I explain to forum members who keep insisting that some setting most certainly had to have been changed in the SMF software that the only change was the security patch from 2.0.4 to 2.0.5? Again, there were no encoding conflicts for about 2 months until they began to appear. Our forum has a huge volume of text that was pasted in as a reference archive for reading and the exchange between members is minimal in comparison.

I have thought about the hosting environment as it is not my account, but still, we had no trouble for quite some time so, a change over there maybe? That particular outfit hasn't been too helpful in the past when I inquired about compatibility with installed software not owned by them.

Thanks again, your quick response was much appreciated, Cutty
