Missing Topic content in languages other than default

Started by jonesH, November 02, 2013, 08:27:34 PM

jonesH

I am running this: SMF 2.05, AevaMedia 1.4w, SimplePortal 2.3.5, and a few languages (among them German). I do most work on an iPad with latest iOS.

I post a Topic in English, and users of each of the other languages tell me that the content of what I posted is missing (sometimes entirely, sometimes partially). I confirmed that by switching languages and when I tried to Modify, the editor box was empty!

On one occurrence, I went back to English, found the topic, deleted the content, saved, retyped it, saved again, and from there on it functioned normally in all languages.

On another occurrence, this trick didn't do it, so I switched to one of the other languages, typed in the missing content, saved, and from there on all was good again.

In both of these cases, there had been several modifications to the topic (contents) before settling on a final version. But other topics went through such revisions and they never vanished in other languages. I see the same query strings in the address bar, and I suppose the content of the posting is somewhere on disk, since it shows up properly in English.

I haven't tried posting something similar in another language to see whether the content is likewise missing in languages other than the original posting language.

What could this be?

Arantor


jonesH

Unfortunately, it's a private forum and I'm not at liberty to open it up. I could, though, perhaps clone it, rid it of content and members, and go from there.

Meanwhile, methinks how about dropping the language mods and reinstalling one-by-one, maybe it's just one that's causing this?

Arantor

I am not interested in the content. I am interested in seeing what the site does because what I want to check I can check without you having to open it up, and it's likely faster than my asking you a very technical question that you may or may not know the answer to without going and looking at the database directly...

MrPhil

Quote from: jonesH on November 02, 2013, 08:27:34 PM
the content of what I posted is missing (sometimes entirely, sometimes partially).
Just checking: by any chance is the content (text) being copied and pasted from Word, Outlook, or some other word processor that uses Windows-125x encoding with MS "Smart Quotes"? You mentioned using an iPad, but was the ultimate source a Windows machine? Unless Smart Quotes are properly translated over, they are interpreted as control codes which usually cut off the text at that point.

Quote
On one occurrence, I went back to English, found the topic, deleted the content, saved, retyped it, saved again, and from there on it functioned normally in all languages.
If in retyping it, you replaced Smart Quotes with normal quotes, that could explain it.

If you are having problems with Smart Quotes, converting your forum to UTF-8 encoding will usually fix the problem (with new posts). Word and Outlook provide both Windows-125x and UTF-8 versions of the clipboard, so a UTF-8 forum will get the UTF-8 clipboard with already-translated proper UTF-8 characters. Note that if the text was copied first to some other editor, it might still have the original Smart Quotes (as control codes) and UTF-8 won't make any difference. Posts already in the SMF database will not be fixed, and have to be manually edited (look for all forms of ", ', ..., --, etc. and overtype them with ASCII characters).
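For anyone who wants to see this at the byte level, here is a minimal Python sketch, assuming the pasted text arrives as Windows-1252 bytes. The sample string is made up, and how a given forum or browser actually reacts to such bytes may differ.

```python
# Hypothetical sample: text with curly "Smart Quotes", as Word would produce.
text = "He said \u201chello\u201d"

# Windows-1252 stores each curly quote as a single byte in the 0x80-0x9F range.
cp1252_bytes = text.encode("cp1252")

# Read as Latin-1 (ISO-8859-1), those same bytes are C1 control codes, not quotes.
as_latin1 = cp1252_bytes.decode("latin-1")
control_chars = [hex(ord(ch)) for ch in as_latin1 if 0x80 <= ord(ch) <= 0x9F]

# Read as UTF-8, the bytes are not valid at all.
try:
    cp1252_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Encoded as UTF-8 from the start, each curly quote is a three-byte sequence.
utf8_bytes = text.encode("utf-8")

print(cp1252_bytes, control_chars, valid_utf8, utf8_bytes)
```

The same two characters are therefore one byte each in Windows-1252 and three bytes each in UTF-8, which is why the encoding in force at paste time matters.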

jonesH

It is quite possible there's some truncation of some kind involved here. This only happened on occasions where multiple edits happened to the content of the topic. Now the edits were done on the iPad, and only using the built-in editor.

On one occasion I typed everything up in the iPad's own Notes, then copy-pasted it into the topic editor, and while it showed up nicely in English, it did not in other languages. But then I switched to German, did the exact same thing, and it worked in every language! Very funny. I should perhaps mention that some of the formatting was lost when copy-pasting, for example the CRLF line breaks - it's all just a blob of text.

The database is hosted on GoDaddy (with what I believe are their default settings), the charset is utf8 unicode, and the connection collation is utf8_unicode_ci. The site is on ssl, and I made some very minor cosmetic changes (css). I used to also have a theme mod installed ("Balanced") which I deleted for lack of further interest. Most of the admin work I carry out on Windows 7 Pro, but postings and such mostly on i-something.

The languages are also utf8.

Kindred

the thing is --- unless you have some whacky mods, changing the language should have absolutely NO effect on the posts...
The language change just switches the fixed text strings of the system....

Although - now that I think about it....

Do you have English or English UTF-8?
Слaва
Украинi

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."

Arantor

This is why I asked for a link. It's multilanguage, and language packs can force the encoding which can mess it up.

jonesH

I just checked and all of the language mods are -utf8, while everything *.english.* are without -utf8.

OK then, I guess a good idea might be to drop the language mods and redo without -utf8?

Arantor

No, UTF-8 is the way it should be... But just having UTF-8 files is no good if the system isn't configured for it.

Kindred

right....   so what that suggests is that you need to switch your full forum to use UTF8, including English and the database

MrPhil

Yes, the database AND all the language files AND the output must all be the same encoding (presumably UTF-8 for your case). You didn't just bring in UTF-8 non-English language files into a Latin-1 forum, did you? If you had Latin-1 accented characters in your database and then dumped UTF-8 non-ASCII characters into a still-Latin-1 database, you're going to have a mess on your hands. If your database was pure ASCII for Latin-1, you won't have a mixture of Latin-1 and UTF-8 accented characters, but you still can't easily convert the database to UTF-8 because it will think those multibyte UTF-8 accented characters are multiple Latin-1 bytes to be converted. You might have to export your Latin-1 database to .sql, clear the database, convert all fields to UTF-8 text, and import the .sql as a UTF-8-encoded backup.
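The "multiple Latin-1 bytes" problem can be sketched in a few lines of Python. This is only an illustration with a made-up sample word, and it uses Python's latin-1 codec as a stand-in for what the database does, which may differ in detail.

```python
# Hypothetical sample word with a German umlaut and sharp s.
original = "Grüße"

# What a UTF-8 language pack sends: two bytes each for ü and ß.
utf8_bytes = original.encode("utf-8")

# A Latin-1 field sees each of those bytes as its own character.
misread = utf8_bytes.decode("latin-1")

# Converting that field to UTF-8 then encodes each misread character again.
double_encoded = misread.encode("utf-8")

# Reversing the two steps recovers the text, provided nothing else touched it.
recovered = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

print(len(utf8_bytes), len(double_encoded), recovered)
```

The 7 original bytes grow to 11 after the second conversion, and each accented letter shows up as two odd characters instead of one.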

Retyping rather than copy and paste: " is a real ASCII quotation mark, etc.

Copy and paste in Smart Quotes text while in German-utf8: the forum is accepting UTF-8 input at this point, so Smart Quotes get turned into the proper UTF-8 punctuation characters. The conversion is actually done in Word, and the UTF-8 clipboard is copied in. Still stored in the database as a series of Latin-1 accented characters.

Copy and paste in Smart Quotes text while in English: the forum accepts the text as Windows-1252 and stores it in the database as these non-standard characters. When trying to display the text in any language or encoding, the Smart Quotes act as control characters and truncate the text there.

Does this sound like what's happening?

jonesH

It sounds like we're onto something here! Give me a chance to wrap my head around this, do some work and see how it all works out. I will report back here when I get a more solid story.

Thank you all so much for the great help so far!

jonesH

OK, so I poked around in the database and the findings are that:

- the server is set to charset UTF-8 Unicode (utf8) and I don't seem to have the rights to change this
- the connection collation was set to utf8_unicode_ci
- the database collation and the fields are set to utf8_general_ci

In light of this, I changed the server connection collation to the same utf8_general_ci


So when I look at the MySQL System Variables table in phpMyAdmin, I see that:

- everything relating to the charset is now utf8
- all the collation settings are now utf8_general_ci

(Never had any Latin1 or other such settings before, as I believe some utf8 flavor is the default).

Mine is a very low traffic forum, so I will have to wait and see if these changes are for the better.

MrPhil

utf8_unicode_ci differs from utf8_general_ci only in minor sorting (collation) details. Both are still UTF-8 encoding. What you need to check is that each text field is UTF-8. Ideally they should all be the same collation (general, unicode, etc.) to avoid certain MySQL errors of mixed collation.

It sounds like your forum is operating in Latin-1 when in English, and UTF-8 for all other languages. Can you check whether or not your site is Latin-1 or UTF-8 while in English? As long as English text does not include any non-ASCII characters (accents, special punctuation, etc.), it is equivalent to UTF-8 (both Latin-1 and UTF-8 share the ASCII character set x00-x7F in common). It would be a problem if while using Latin-1, you pasted in Word content with Smart Quotes (Windows-125x encoding). That would put invalid bytes into the database that would be interpreted as control codes during output, usually truncating the post at that point. Pasting in Word content during UTF-8 is usually not a problem, as the UTF-8 encoded clipboard will normally be used.

If you find that you have indeed been not only using Latin-1 for English, but also have been pasting in Smart Quotes, you should fully convert to UTF-8 (using english-utf8 language pack), your database should be UTF-8, and your pages should always be encoded UTF-8. However, old posts will still contain the Smart Quotes, and will need to be manually edited.

Maybe someone would like to come up with a utility that reads each post, etc., and translates codes x80-x9F to an appropriate UTF-8 character (assuming they are originally Smart Quotes)? Of course, this applies only to UTF-8 databases. Windows-1252 would be the default origin, but the forum owner could override this if they have reason to believe that other versions of Windows-125x were used.
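A rough Python sketch of what the core of such a utility might look like. It assumes the post text has already been read out of a UTF-8 database as a string, and that any characters in the U+0080-U+009F range really were Smart Quotes and similar punctuation. The function name and the sample post are hypothetical; reading from and writing back to the SMF tables is left out.

```python
def fix_c1_controls(text: str, source_encoding: str = "cp1252") -> str:
    """Replace C1 control characters (U+0080-U+009F) with the characters
    the same byte values represent in source_encoding."""
    fixed = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                # Reinterpret the single byte in the assumed Windows code page.
                ch = bytes([code]).decode(source_encoding)
            except UnicodeDecodeError:
                # Byte is unassigned in that code page (e.g. 0x81); leave it as-is.
                pass
        fixed.append(ch)
    return "".join(fixed)


# Hypothetical post body with Smart Quotes stored as control codes.
broken = "It\x92s \x93fine\x94 \x85"
print(fix_c1_controls(broken))
```

The `source_encoding` parameter is the override mentioned above: pass another Windows-125x code page name if the posts are believed to come from one. Test on a backup first.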

jonesH

I'm back with some news (hopefully good).

Inspecting the HTML, it turned out that the pages were rendered in ISO-8859-1 (or Latin-1, basically). The default language was English (ISO), so I installed English_utf8 and set it as the default (via the radio button in settings).
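The check of the page source can also be scripted. A minimal Python sketch, assuming the page declares its encoding in a `<meta>` tag; the function name and sample markup are hypothetical, and the HTTP Content-Type header (not inspected here) can also carry a charset.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_CHARSET_RE = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def declared_charset(html: bytes):
    """Return the charset named in the first matching <meta> tag, or None."""
    match = _CHARSET_RE.search(html)
    return match.group(1).decode("ascii").lower() if match else None


# Hypothetical page head, similar to what a Latin-1 forum page might send.
sample = b'<head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /></head>'
print(declared_charset(sample))
```

Running it on a saved copy of a page before and after switching language packs shows whether the declared encoding actually changed.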

BUT: my impression is that the software does not actually distinguish between default English and UTF-8, as indicated by the fact that when choosing Language in Profile only one English choice showed up (I assume, the default) and that the forum never really seemed to change to the new, UTF-8 version as per the radio button choice.

Anyway. I ftp'd into Themes\default\languages and moved aside all the original *.english.php and left the *.english_utf8.php version in place.

The forum did not seem to break in any way, and results of very frugal testing look promising. At the very least, the forum's pages themselves and several test-postings from various users were all in UTF-8. The only language choices available now are the _utf8 versions, including the (one) english.

What else should I know/do/expect?

MrPhil

OK, all your language files are now UTF-8. What is your database (should be UTF-8, including all individual text fields)? What are all the pages displayed in (should be UTF-8)? If all the table fields are now UTF-8, is all data in them true UTF-8, or did anyone put non-ASCII characters in a post while in English? Those would either be unconverted and thus illegal UTF-8 (show up as ?-in-black-diamond) or unconverted control codes (were Smart Quotes) and will probably still truncate the post there, or maybe something else. Multibyte UTF-8 characters that got put into a Latin-1 text field presumably were erroneously converted to multiple multibyte UTF-8 characters (a mess). If your English text was pure ASCII, it should be OK. If no UTF-8 non-ASCII characters were put into a Latin-1 text field before it was converted to UTF-8, it should be OK.
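To triage old posts, a rough classifier along these lines might help. It is a Python sketch that assumes you can get at the raw bytes of each post body (for example from an .sql export); the function name is hypothetical.

```python
def classify(raw: bytes) -> str:
    """Roughly classify stored bytes as 'ascii', 'utf-8', or 'not-utf-8'."""
    if raw.isascii():
        # Pure ASCII reads the same in Latin-1 and UTF-8.
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Probably Latin-1 or Windows-125x bytes that were never converted.
        return "not-utf-8"
    return "utf-8"


print(classify(b"plain text"))
print(classify("Grüße".encode("utf-8")))
print(classify("Grüße".encode("latin-1")))
print(classify(b"\x93quoted\x94"))
```

Note that text which was erroneously converted twice is still valid UTF-8, so it will be reported as 'utf-8' here and has to be spotted by eye.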

Be prepared for reports of odd characters, invalid characters, old posts still truncated, and whatnot.

jonesH

MrPhil: a few members and I posted and replied back and forth, switching languages in the process, and so far all seems good. Let's hope it holds, giving me a bit of time to read up on some related topics.

In any case, I'd like to thank you and all the others for the help and guidance you've given me. Much appreciated.
