Weird problems, � appears randomly

Started by f00ty, September 01, 2012, 08:27:44 AM

f00ty

I recently converted my SMF database to UTF-8, and since then there have been some weird character problems. In the "latest news" box or in the "latest registered user" field on the SMF front page, Scandic letters sometimes, quite randomly, change to �. Most of the time this doesn't happen. I first saw this in Firefox, and was able to get the same result in IE9 while the problem was occurring. I searched my database but didn't find any � marks in there.

Any ideas what could be the problem?


MrPhil

Is it consistently doing this on the same text? What you are seeing is accented ("Scandic") characters that should be in UTF-8 encoding but are in Latin-1 or CP-1252 (single byte accented characters). When you see this happen again, switch your browser to Latin-1 (ISO-8859-1) or CP-1252 (either might be called "Western European") and see if the correct character shows there. Are there other accented characters that now show up as 2 or 3 character sequences (while in single byte mode)?

It sounds like you have text in certain areas ("latest news" or "latest registered user") that is Latin-1 (or another single byte encoding) rather than UTF-8. In the latter case, is it fixed labels or the actual member names that don't show up correctly? In the former case, are you creating the news text in a word processor and cutting and pasting it into SMF? If so, the text is likely not in UTF-8. In that case, typing it directly into SMF should work.
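A minimal Python sketch of the mechanism described above, assuming the stray text really is Latin-1 bytes landing in a UTF-8 page. The sample word is made up for illustration and is not data from this forum.

```python
# Illustrative only: the sample text is an assumption, not data from the forum.
sample = "Mäki"

latin1_bytes = sample.encode("latin-1")  # one byte per letter, 'ä' is 0xE4
utf8_bytes = sample.encode("utf-8")      # 'ä' takes two bytes, 0xC3 0xA4

# Latin-1 bytes read as UTF-8: the lone 0xE4 byte is not valid UTF-8,
# so it is shown as U+FFFD (the � replacement character).
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# UTF-8 bytes read as Latin-1: each accented letter becomes two characters.
as_latin1 = utf8_bytes.decode("latin-1")

print(as_utf8)    # M�ki
print(as_latin1)  # MÃ¤ki
```

This is why switching the browser to a single byte encoding fixes the � but turns every correctly encoded accented letter on the page into a two-character sequence.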

f00ty

Quote from: K@ on September 01, 2012, 10:20:15 AM
Do you have the utf-8 language files installed?

http://download.simplemachines.org/?smflanguages

Yes.

Quote from: MrPhil on September 01, 2012, 10:38:44 AM

Yes, it is doing it with the same text. It does it with usernames and the news itself (data that is pulled from the database). I use SMF to create the text; I'm not copying it from a text editor.

I tried switching the encoding in my browser, and the problem goes away (it messes up all the other Scandic letters on the page, but the one with the problem gets corrected).

But the funny thing is that it happens randomly; it doesn't appear every time. I haven't figured out the exact pattern for when it replaces the characters with �, but at least it doesn't happen when everything is fine and I then reload the page (pressing F5 or Ctrl+F5). I tried refreshing about 20 times, and everything was fine. But when I left the page alone for about an hour and came back, the problem occurred. I refreshed the page a few times and the problem didn't go away. I waited a couple of minutes and refreshed again, and everything looked as it should.

MrPhil

Are these just some of the accented characters on the page, not all of them? If it were all of them, I would suggest checking what character encoding a page is actually in (using View > Character Encoding) -- perhaps your server is sometimes sending out something to override your UTF-8 encoding with a default Latin-1? That's the only thing I can think of to account for refreshes eventually fixing the problem. If it's just some of the page (some of the time), that can't be it, in which case I'm stumped.

f00ty

Yes, let me explain: at the moment I have disabled the news bubble, so the error occurs in only one place: where the forum tells you who the newest registered user is. Since this information is shown only on the front page, this is the only part that is messed up. Only the username, nothing else on the page. Not even the description before the username, just the username in this one place. All the other usernames, posts, and everything else show up correctly. Even usernames with Scandic letters show up correctly everywhere else. It's just this one place (plus the news bubble, which I have disabled for now). And it occurs quite randomly, as I explained in earlier posts. If I open the user's profile page, everything is shown correctly.

When the problem happens, the encoding of the page is still UTF-8. The same applies to my browser; the view is still in UTF-8. In the source code I cannot find any mention of an ISO encoding when this problem occurs. I just find the � character, which has replaced one of the Scandic letters. And as said, I have searched through the database that SMF uses, and I couldn't find any � marks in there.

I actually remember this happening right after the UTF-8 conversion was made, but when it disappeared, I thought it was just some kind of glitch.

I just noticed that both values, the latest member and the news, are pulled from the same table: "smf_settings". I have left that table on the "InnoDB" engine, while some other tables I have altered to use "MyISAM", and "smf_log_floodcontrol" uses "MEMORY". I did this following this tutorial: http://wiki.simplemachines.org/smf/Performance_enhancements I don't know if this has anything to do with the problem.

MrPhil

Even if the page has a meta tag to order the browser to display it as UTF-8, some servers are misconfigured to force Latin-1 encoding. That's why I asked you to check your browser's View > Character Encoding. If only part of a page is showing up wrong, that's not the problem.
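To check for that kind of override without relying on the browser menu, one could compare the charset named in the HTTP `Content-Type` header with the one in the page's meta tag. The sketch below only parses header values; the sample values are made up, the function name is hypothetical, and fetching the real header from your server is left out.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the lowercased charset named in a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

# Hypothetical header values, as a server might send them.
print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html; charset=UTF-8"))       # utf-8
print(charset_from_content_type("text/html"))                      # None
```

A header charset that differs from the meta tag would affect the whole page, not a single field, which matches the reasoning above.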

If all the problems are related to one table, check in phpMyAdmin the character encoding for the fields in that table. Perhaps the UTF-8 conversion failed to change some tables or fields within tables. All text (character) fields should be "utf8_something", such as utf8_general_ci. The MySQL default is latin1_swedish_ci. If you find a table with Latin-1 fields, do a backup and then use phpMyAdmin to individually change the encoding (collation) on those fields to UTF-8. Don't use the SMF conversion -- it will think that it's already UTF-8 and probably do nothing. Changing the fields in phpMyAdmin should convert the data within them to UTF-8, and fix your problem. By any chance, do these member names predate your conversion to UTF-8?
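As an alternative to clicking through each field, the same check can be sketched as a query against MySQL's `information_schema`, which could be pasted into phpMyAdmin's SQL tab. The schema name `smf`, the helper name, and the sample rows below are assumptions for illustration; only the table name `smf_settings` comes from this thread.

```python
# Query text for MySQL's information_schema; "smf" is a placeholder for
# your own database name.
COLUMN_CHARSET_QUERY = """
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'smf'
  AND TABLE_NAME = 'smf_settings'
  AND CHARACTER_SET_NAME IS NOT NULL
"""

def non_utf8_columns(rows):
    """Given (column, charset, collation) rows, return those not using utf8."""
    return [row for row in rows if not row[1].startswith("utf8")]

# Made-up result rows, only to show what a leftover Latin-1 column looks like.
example_rows = [
    ("variable", "utf8", "utf8_unicode_ci"),
    ("value", "latin1", "latin1_swedish_ci"),
]
print(non_utf8_columns(example_rows))
```

Any row this returns would be a candidate for the per-field collation change described above.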

f00ty

Quote from: MrPhil on September 02, 2012, 11:21:41 AM

Yes, only one field is randomly showing up in the wrong format, as explained earlier.

The "smf_settings" table is in "utf8_unicode_ci" format like all my other tables, except the PrettyUrls tables, which are in "utf8_general_ci" format. I also checked the exact field the data is pulled from, the "latestRealName" field. Everything seems to be fine: the page is in UTF-8 format and the Scandic letters are showing up fine. This member created his/her account after the UTF-8 conversion. All users, whether registered before or after the UTF-8 conversion, show up fine on every page, except in this one field on the front page.

MrPhil

This is a real puzzler. Are there any consistencies in how member names fail? Any with accented characters near the tail end, that are cut off at the accented character (turning it into a "?")? I'm wondering if some code is counting the number of characters and then truncating at that many bytes. That should result in member names being truncated, possibly in the middle of a multibyte accented character. The other thing would be some sort of check on member names before printing them out, and the code assumes Latin-1 character set, corrupting accented characters. It's not all accented characters? Can you narrow it down to a specific set?
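A small Python sketch of the truncation idea, assuming code that counts characters but cuts bytes. The member name and the helper name are hypothetical.

```python
def truncate_bytes(name, max_bytes):
    """Cut the UTF-8 encoding of name at max_bytes, as byte-oriented code would."""
    return name.encode("utf-8")[:max_bytes].decode("utf-8", errors="replace")

name = "Järvinen"  # hypothetical member name

print(name[:2])                 # Jä  (cut after two characters)
print(truncate_bytes(name, 2))  # J�  (cut after two bytes splits the 'ä')
```

If this were the cause, the name would come out shortened, with a single damaged character at the cut point.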

Do these names that show invalid characters display OK elsewhere in SMF? If they didn't, I would guess that the members cut and pasted them from elsewhere (Latin-1 or CP-1252 character set) rather than typing them in. This is surprisingly common, especially if their keyboards don't support specific Scandinavian characters. However, I would expect these names to fail consistently and everywhere they're used. If not, I'm out of ideas for now. Maybe one of the developers would have an idea where to look -- I don't have time right now to go digging through the code.

f00ty

A new member registered, so the user who used to show up in the "Latest registered member" field isn't there anymore.

However, I created a new user myself (typed this into SMF):  äöÄÖ-åÅëécóüúùèlkòasdGGä
And this was the result once the bug appeared:  ����-����c�����lk�asdGG�
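The test name above can be checked against the Latin-1 theory with a short Python sketch. It assumes the name reached the page as Latin-1 bytes inside otherwise UTF-8 output; that is a guess about the mechanism, not something confirmed in this thread.

```python
import unicodedata

# Strings copied from the post above; NFC keeps each accented letter as a
# single code point so it can be encoded as Latin-1.
typed = unicodedata.normalize("NFC", "äöÄÖ-åÅëécóüúùèlkòasdGGä")
seen = "����-����c�����lk�asdGG�"

# Assumption: Latin-1 bytes were decoded as UTF-8 with invalid bytes replaced.
simulated = typed.encode("latin-1").decode("utf-8", errors="replace")

# Compare the simulation with what the forum actually displayed.
print(simulated == seen)
print(len(typed), len(simulated))
```

Each accented letter maps to exactly one � and the plain ASCII letters survive, which fits single byte text being read as UTF-8 rather than a name being cut short.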

As said, this is the ONLY place (plus the news bubble) where the problem happens. EVERY other character on the page is OK. And this happens QUITE RANDOMLY, but I have to wait some time for it to happen. For example, if I refresh the page multiple times in a row, it doesn't happen. But if I wait 5 minutes and refresh, it MAY happen.

MrPhil

On the page where you type in your member name, do a View > Character Encoding and see if for some strange reason you're in Latin-1 on that page. It should be in UTF-8. If it is, I can't imagine how the name would be converted to Latin-1 encoding after entry. Do you have any mods installed that affect member names?

f00ty

The encoding is UTF-8. And as said, this happens randomly. That's the real mind bender to me. I don't have any mods that affect member names.
