Cannot get rid of Â character

Started by bosswhite, May 05, 2013, 01:08:56 PM


bosswhite

When converting to SMF several rogue characters appeared in my SMF database.

I can get rid of most of them, but when I try a search and replace for the character Â (HTML number &#194;, HTML name &Acirc;) it cannot find it, although there are hundreds of them.

Does anybody know how I can get rid of these, please?
I've been down so long now it's beginning to look like up..

YogiBear

A bit of groundwork - what software did you convert from?
SMF v2.1.3  Mods : Snow & Garland v1.4,  PHP  v.7.4.33

MrPhil

Was the old system UTF-8 and now you're running a non-UTF-8 SMF? What you describe sounds like just that. If the database itself is UTF-8, it won't find those characters because they don't exist as discrete characters. If you think you converted to UTF-8, ask your browser what encoding it's displaying the page in.

bosswhite

Quote from: YogiBear on May 05, 2013, 01:29:55 PM
A bit of groundwork - what software did you convert from ?

From PunBB 1.2

Quote from: MrPhil on May 05, 2013, 02:28:29 PM
Was the old system UTF-8 and now you're running a non-UTF-8 SMF? What you describe sounds like just that. If the database itself is UTF-8, it won't find those characters because they don't exist as discrete characters. If you think you converted to UTF-8, ask your browser what encoding it's displaying the page in.

The old system was latin-1-general and the SMF is UTF8 English British. Browser displays Unicode (UTF8).

MrPhil

Did you convert to SMF while leaving it as Latin-1 and then, after confirming it converted OK, convert to UTF-8, or was this all done in one go? If you're seeing a bunch of Â characters, that would usually mean that the page is outputting in UTF-8 but declared to be Latin-1 encoded. If that's not the case, and phpMyAdmin shows Â in the text in UTF-8 fields, something went wrong during the conversions. By any chance did you have to import a database backup during the conversion? I've seen a case where a UTF-8 database was backed up (.sql file) and imported as Latin-1 source (the default in phpMyAdmin), which expanded each accented character into 2 or 3 UTF-8 accented characters. Could that have happened to you?
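
The expansion described above can be reproduced in a few lines. This is only a minimal Python sketch, not anything taken from SMF or phpMyAdmin; the function name `misread_as_latin1` and the sample strings are made up for illustration. It encodes text as UTF-8 and then decodes those same bytes as Latin-1, which is roughly what happens when a UTF-8 dump is imported as Latin-1.

```python
def misread_as_latin1(text: str) -> str:
    """Simulate UTF-8 bytes being interpreted as Latin-1 (ISO-8859-1)."""
    return text.encode("utf-8").decode("latin-1")


# Hypothetical sample strings: pound sign, e-acute, non-breaking space.
samples = ["\u00a35", "caf\u00e9", "a\u00a0b"]

for s in samples:
    # Each non-ASCII character becomes two characters, the first being Â or Ã.
    print(repr(s), "->", repr(misread_as_latin1(s)))
```

Characters that take two bytes in UTF-8 come out as two characters starting with Â or Ã, and three-byte characters such as the euro sign come out as three.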

TheDragon

Did you copy the text from a webmail program (or something similar)?
Also, you can try copying the text from the posts, pasting it into Notepad++, and then pasting it back into your posts.

MrPhil

I doubt that copying from Word or Outlook (or anything else using CP-1252) would cause this problem. Besides, cutting and pasting from Word into Notepad++ and then into SMF isn't going to convert Smart Quotes into legitimate characters.

Arantor

Depends if you use Notepad++'s 'convert to UTF-8 without BOM' option or not.
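
For anyone following along, what such a conversion amounts to can be sketched in Python. This is an illustration under assumptions, not a description of how Notepad++ works internally: it assumes the pasted text arrived as Windows-1252 (CP-1252) bytes, and the sample bytes are made up.

```python
# Curly "smart" quotes as they are stored in Windows-1252 (assumed input).
raw = b"\x93quoted\x94"

# Interpreting the bytes as CP-1252 recovers the real quote characters.
text = raw.decode("cp1252")

# Re-encoding as UTF-8 gives valid multi-byte sequences for the same characters.
utf8 = text.encode("utf-8")

# Python's "utf-8-sig" codec writes the same bytes preceded by a BOM (EF BB BF).
utf8_bom = text.encode("utf-8-sig")

print(repr(text))
print(utf8)
print(utf8_bom)
```

The BOM-free form is usually the safer choice for text that ends up in a database or a PHP file, since the three BOM bytes would otherwise be treated as content.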

bosswhite

Quote from: MrPhil on May 05, 2013, 09:48:05 PM
By any chance did you have to import a database backup during the conversion? I've seen a case where a UTF-8 database was backed up (.sql file) and imported as Latin-1 source (the default in phpMyAdmin), which expanded each accented character into 2 or 3 UTF-8 accented characters. Could that have happened to you?

My site is live with 3860 registered users, 11071 topics and 75958 posts, so I am testing the conversion in a separate forum (forum_2 and database_2) so as to keep the site live until I've sorted it.
So, I made a backup. I exported as ISO-8859-1 and imported into database_2 as ISO-8859-1. I ran the conversion selecting ISO-8859-1 as SMF default language.

The rogue characters were then present (all sorts, \'   \\   \"   Â etc.) so I then converted SMF to UTF8 and loaded UTF8 English as default language hoping it would clear them.
Just as a matter of interest, I did try various options, set SMF as UTF8 during conversion, etc. etc. but always with the same result.

Quote from: TheDragon on May 05, 2013, 09:56:08 PM
did you copy text from a webmail program (or something similar?)

No, just by exporting and importing through mysql.

MrPhil

What did you use to make (and restore) your backups? phpMyAdmin is good, but avoid the SMF admin database backup (even though its bugs have nothing to do with your particular problems). You say you backed up as Latin-1 -- was the database already Latin-1, or was it UTF-8? If it was UTF-8, your backup was UTF-8. Reading it in as Latin-1 would give you double or triple character replacements for original accented characters and would explain the extra Â's. I'm not sure where you went off the tracks in getting all those escaped special characters \', \\, \" -- they're a different issue. Possibly something doubled the escaping backslashes (\) before writing the backup file, or something like that. If you still have the earliest backup (.sql file), you could browse it in a text editor on a PC (in Latin-1 or CP-1252 encoding) and see if it has these undesirable characters.
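
If the backup is too large to check by eye, the same inspection can be roughed out in a script. This is a Python sketch under assumptions: the function names `count_suspects` and `try_undo_mojibake` and both patterns are hypothetical, and it assumes the damage is a single round of UTF-8 bytes misread as Latin-1. MySQL's `latin1` is closer to Windows-1252, so real data may need `cp1252` in place of `latin-1`.

```python
import re

# Â or Ã followed by a character in U+0080..U+00BF: a typical sign of
# UTF-8 misread as Latin-1 (a heuristic, not a proof).
MOJIBAKE = re.compile("[\u00c2\u00c3][\u0080-\u00bf]")
# A backslash followed by a quote or another backslash, as in \' \" \\
ESCAPES = re.compile(r"\\['\"\\]")


def count_suspects(text: str) -> dict:
    """Count likely mojibake pairs and backslash escapes in the text."""
    return {
        "mojibake": len(MOJIBAKE.findall(text)),
        "escapes": len(ESCAPES.findall(text)),
    }


def try_undo_mojibake(text: str) -> str:
    """Reverse one round of UTF-8-read-as-Latin-1; return the input unchanged on failure."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Made-up sample containing one escaped quote and two mojibake pairs.
sample = "It\\'s \u00c2\u00a35 at the caf\u00c3\u00a9"
print(count_suspects(sample))
print(try_undo_mojibake(sample))
```

Try it on a copy of the data only. Text that was never damaged fails the round trip and is returned as it was, but the heuristic can still misfire on unusual content, so spot-check the results before writing anything back.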
