Undefined index: character_set

Started by MobileCS, August 26, 2016, 01:22:04 PM

MobileCS

I'm using SMF 2.0.11 and, out of nowhere, I'm starting to see these warnings in my error log.

PHP Notice:  Undefined index: character_set in /www/example.com/httpdocs/forum/Sources/Subs-Post.php on line 1220
PHP Notice:  Undefined index: utf8 in /www/example.com/httpdocs/forum/Sources/Subs-Post.php on line 1263
PHP Notice:  Undefined index: utf8 in /www/example.com/httpdocs/forum/Sources/Subs-Post.php on line 1265

Line 1220:
$charset = $custom_charset !== null ? $custom_charset : $context['character_set'];

Line 1263:
if ($hotmail_fix && ($context['utf8'] || function_exists('iconv') || $context['character_set'] === 'ISO-8859-1'))

Line 1265:
if (!$context['utf8'] && function_exists('iconv'))

I do not have my forum converted to UTF-8. Any ideas why this would all of a sudden pop up? I've never seen this before.

Illori

I think it is an issue we have not found a resolution for, nor do we know exactly what causes it.

MobileCS

OK, I also saw a bunch of these warnings:

PHP Notice:  Undefined index: server in /www/example.com/httpdocs/forum/Sources/QueryString.php on line 477

Line 477:
if (!empty($modSettings['queryless_urls']) && (!$context['server']['is_cgi'] || @ini_get('cgi.fix_pathinfo') == 1 || @get_cfg_var('cgi.fix_pathinfo') == 1) && ($context['server']['is_apache'] || $context['server']['is_lighttpd']))

After a closer look at the logs, these errors (and the previous ones) were only coming from two different IP addresses, and they have since stopped.

Shambles

Do those addresses resolve to Googlebot by any chance (e.g. 66.249.95.255)?


richardwbb

Quote from: Illori on August 26, 2016, 01:39:53 PM
I think it is an issue we have not found a resolution for, nor do we know exactly what causes it.

I have seen that Settings.php had omitted the '$db_character_set = 'utf8';' from version 2.0.9 or so.

Since then I had mangled characters inside the database, [https://en.wikipedia.org/wiki/Mojibake]

When I set the forum language to UTF-8 with a language pack, typed à á ä ò ó ö è é ë and posted it [I mean, put it in the database], and then set the forum language to ISO-8859-15, I got latin1 encoding, e.g. 'Ã©' [é], or worse: 'ÃƒÂ©' <- double encoding.

Then I left the forum in ISO-8859-15, typed à á ä ò ó ö è é ë and put the UTF-8 language back, and got the same thing, e.g. 'Ã©' [é], or worse: 'ÃƒÂ©' <- double encoding.

So I put back the $db_character_set = 'utf8'; in Settings.php [From this point every message was displayed correctly but already malformed messages did not transform back]

From there, all wrongly encoded characters inside the database had to be repaired manually. I am able to assist with that.

Grepping for 'character_set' in the PHP files of SMF [this is why I tell my story here] gave me the impression that the boolean operators on language-specific encoding are contradictory. From my point of view, ISO-8859-15 isn't working at all. Everything must be UTF-8, and the $db_character_set = 'utf8' MUST be in Settings.php, but I didn't see that in a fresh 2.0.12 install.

Sorry for the topic hijack [though I believe I am being helpful here]. My solution is to set everything to UTF-8, close the forum, take the database, encode it properly, put it back and never touch those settings again [there is no need for that either with UTF-8].

It took me a long time [weeks] to learn how to fix the encoding of a mojibaked, double-encoded and gremlin-ridden database file. But, on a live forum, I wrote a script myself and was able to repeat the cleaning/repairing process. If your forum isn't live there are easier ways to do this.
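The kind of repair script mentioned above is not shown in the thread. As a rough sketch only, and assuming the damage is UTF-8 text that was misread as win-1252 and stored again (one or more times), the layers can be peeled off like this in Python. The function names are hypothetical, and text that legitimately contains sequences such as 'Ã©' would be changed too, so this should only ever run on a copy of the data.

```python
def undo_one_layer(text: str) -> str:
    """Reverse one round of 'UTF-8 bytes read as win-1252' (hypothetical helper)."""
    try:
        # Turn the wrongly decoded characters back into the original bytes,
        # then decode those bytes as the UTF-8 they really were.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind (or not cleanly reversible): leave untouched.
        return text


def fix_mojibake(text: str, max_rounds: int = 3) -> str:
    """Peel off up to max_rounds layers of double encoding."""
    for _ in range(max_rounds):
        repaired = undo_one_layer(text)
        if repaired == text:
            break
        text = repaired
    return text


print(fix_mojibake("\u00c3\u0192\u00c2\u00a9"))  # double-encoded input 'ÃƒÂ©'
print(fix_mojibake("\u00c3\u00a9"))              # single layer 'Ã©'
```

Characters that fall on the few byte values win-1252 leaves undefined cannot be reversed this way, which may be one source of the "beyond repair" remainder described later in the thread.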

Also, if possible, run a test copy of a forum, play with the language files [ISO-8859-15 and UTF-8], type some à á ä ò ó ö è é ë in a personal message with yourself as the receiver, and have a look at what happens on screen. Then look that message up in the MySQL database export (the .sql file) with a text editor. If you look at a damaged live database you might also need a hex editor, though I found a way to repair without one. An ISO-8859-15 'ë' and a UTF-8 encoded 'ë' aren't the same byte sequence in the SQL export file.
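The byte difference can be checked without a hex editor. A minimal Python sketch (Python is an assumption here; the thread itself only shows PHP and SQL):

```python
# The same character, encoded two ways.
char = "\u00eb"  # 'ë'

iso_bytes = char.encode("iso8859_15")  # single-byte ISO-8859-15
utf8_bytes = char.encode("utf-8")      # multi-byte UTF-8

print(iso_bytes.hex())   # eb
print(utf8_bytes.hex())  # c3ab
```

One byte in the ISO export, two bytes in the UTF-8 export: that is the difference to look for in the .sql file.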
If my post in this topic looks ambiguous to you, then I'm with Murphy's law and General Stupidity. In other words, trial and error.

Arantor

Quote: I have seen that Settings.php had omitted the '$db_character_set = 'utf8';' from version 2.0.9 or so.

Depends a certain amount on whether the forum was installed with UTF-8 initially or installed as ISO and later converted, and whether you have the correct files installed.

Quote: When I set the forum-language to UTF-8 with a language pack,

This also needs to match what the database physically is.

Quote: but I didn't see that in a fresh 2.0.12 install.

Again, fresh install with UTF-8 or fresh install without UTF-8?

Quote: my solution is to set everything to UTF-8, close the forum, take the database, encode it properly and put it back and never touch those settings again

Pretty much UTF-8 everywhere all the time. 2.1 installs as UTF-8 and upgrades to UTF-8.

Quote: Also, if possible, run a test-copy of a forum, play with the language files [ISO-8859-15 and UTF-8]

The process is to install the correct language files to match what you've actually installed in the first place.

richardwbb

1; I downloaded [http://download.simplemachines.org/index.php?thanks;filename=smf_2-0-12_install.tar.gz] so I don't know what that would be. This wasn't upgrading so I assume it was UTF-8. I do know that phpMyAdmin is reporting for the live forum at the ISP;

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
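+--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | utf8mb4            |
| character_set_connection | utf8mb4            |
| character_set_database   | latin1             |
| character_set_filesystem | binary             |
| character_set_results    | utf8mb4            |
| character_set_server     | latin1             |
| character_set_system     | utf8               |
| collation_connection     | utf8mb4_unicode_ci |
| collation_database       | latin1_swedish_ci  |
| collation_server         | latin1_swedish_ci  |
+--------------------------+--------------------+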

The MySQL server I have installed for testing purposes is saying;

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+-------------------+
| Variable_name            | Value             |
+--------------------------+-------------------+
| character_set_client     | latin1            |
| character_set_connection | latin1            |
| character_set_database   | latin1            |
| character_set_filesystem | binary            |
| character_set_results    | latin1            |
| character_set_server     | latin1            |
| character_set_system     | utf8              |
| collation_connection     | latin1_swedish_ci |
| collation_database       | latin1_swedish_ci |
| collation_server         | latin1_swedish_ci |
+--------------------------+-------------------+
10 rows in set (0.00 sec)

The one thing I know for certain is that I had set up my.cnf in such a way that phpMyAdmin reports exactly the same. My assumption is that the MySQL database at the ISP is in a latin1 collation.

2; I believe anyone should be able to use a UTF-8 SMF forum install, and that SMF by itself should agree with MySQL on what encoding to send. I learned that switching the language to or from ISO-8859-15 gave opposite results. By that I mean: switching the language template encoding flipped the displayed characters, so 'ë' gave 'Ã«'. I think it would be a good idea to let SMF decide, because I expect a Site Admin not to be aware of the ISP's MySQL settings, and I do not know whether ISPs already accept UTF-8. The only reason I can think of is that on a live MySQL server in latin1, the only way to go to UTF-8 is to not touch the existing latin1 content in the database. So a MySQL exported SQL file still needs to be UTF-8 encoded, which I know mine was. The only difference was that what was entered through SMF by Windows clients in win-1252 encoding resulted in gremlins [https://dancingmammoth.com/2010/10/28/removing-utf8-gremlins/], double encoding and some irreparable Mojibake [though manually repairable with little effort; it gave me the impression that data sent by the client to the database in win-1252 encoding was mistaken for latin1 encoding <- but I am aware I am lacking in-depth knowledge on this one].

3; Please tell me what I have done.

4; I sincerely think the user shouldn't be able to decide what encoding to use talking to MySQL.

5; I do not understand why SMF talks differently to MySQL when the user changes the encoding to ISO-8859-15 [or 'non-UTF-8', if SMF is able to cope with non-UTF-8 anyway]. I think a 'wget -O - -o /dev/null --save-headers 'url-of-server' | grep Content-Type' will tell what encoding to use, and surprisingly, Apache 2.2 or higher talks UTF-8 and won't change its mind with 'AddDefaultCharset' or 'AddCharset'.

Quote from: Arantor on November 05, 2016, 08:20:46 PM
1; Depends a certain amount on whether the forum was installed with UTF-8 initially or installed as ISO and later converted, and whether you have the correct files installed.

2; This also needs to match what the database physically is.

3; Again, fresh install with UTF-8 or fresh install without UTF-8?

4; Pretty much UTF-8 everywhere all the time. 2.1 installs as UTF-8 and upgrades to UTF-8.

5; The process is to install the correct language files to match what you've actually installed in the first place.

Kindred

SMF 2.0.x installs as non-UTF-8 by default...

Arantor

Quote: This wasn't upgrading so I assume it was UTF-8

Nope. Whether you're using UTF-8 on a fresh install is a product of 'did you tick the box on installation' or not.

Quote: I do know that phpMyAdmin is reporting for the live forum at the ISP;

It is reporting nothing of any consequence or relevance here; merely that without any instruction to the contrary, MySQL is set to use Latin1 (aka ISO-8859-1). This has been the default in MySQL since basically forever. It also indicates that phpMyAdmin has issued the command to MySQL to treat that connection as UTF-8. Which means a whole host of things, but none of them are relevant to this.

Quote: I believe anyone should be able to use a UTF-8 SMF forum install and that SMF by itself agrees with MySQL what encoding to send. I learned that switching language from or to ISO-8859-15 gave opposite results, I mean by that; switching the language template encoding gave the opposite displayed characters, so 'ë' gave 'Ã«',

Everyone should use UTF-8 and exclusively that, but because for many years, MySQL didn't have the facilities SMF would have needed, SMF actually defaults to ISO encoding.

You learned, also, that you switched languages in a way that didn't bother to actually convert the data.

If you take a UTF-8 accented E, tell both sides of the equation that they're using ISO but the data hasn't actually been converted (as in, the data is still physically UTF-8), this is exactly what you get. An accented e in UTF-8 takes 2 bytes. Those two bytes mean one thing when all parties agree that it's UTF-8, and they mean something completely different when read as ISO, i.e. two separate characters, which is what you get displayed.
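A minimal Python sketch of that effect (Python is an assumption; SMF itself is PHP). The bytes are never changed, only the label used to read them:

```python
# One accented character, stored as UTF-8.
utf8_bytes = "\u00e9".encode("utf-8")  # 'é' becomes two bytes: c3 a9

# Read the very same bytes back, but claim they are ISO-8859-1.
misread = utf8_bytes.decode("iso8859_1")

print(utf8_bytes.hex())  # c3a9
print(misread)           # Ã©
print(len(misread))      # 2
```

Each of the two bytes is a complete character in ISO-8859-1, so one 'é' turns into the pair 'Ã©'.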

If you want to use UTF-8, make sure everything is UTF-8 - you can't just change individual parts of the equation randomly. Either 1) install the forum from scratch, and tick the UTF-8 box during installation, or 2) install the forum from scratch (meaning it will use ISO-8859-1) and then convert it using SMF's admin area which converts the database for you. In either case you need the UTF-8 language files since the forum is then UTF-8. If you try to change bits of it yourself, it will go wrong.

Quote: 3; Please tell me what I have done.

How, exactly, should I work out whether you ticked a box during installation of your forum? My guess is that you didn't tick it, meaning that SMF internally will assume ISO for everything and doing anything other than using SMF's own conversion option will break characters.

Quote: 4; I sincerely think the user shouldn't be able to decide what encoding to use talking to MySQL.

There is no place the user can do this - except by 1) directly editing the configuration file that is fairly explicitly commented as 'don't edit this by hand' or 2) installing language files that are wrong for the forum.

This last one is largely a side effect of how things were done years ago. When you have non-UTF-8, every byte is one character, but since there are more than 256 possible characters and only 256 slots, you have to denote which set you're using. Which is why the language pack itself indicates the relevant encoding - so the ISO Greek language pack knows to fetch the Greek characters and thus display the Greek characters, or the Arabic encoding knows to display the Arabic ones, and so on.
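To make the "which set" point concrete, here is a small Python sketch (an illustration only, not anything SMF runs) that decodes one and the same byte under three ISO-8859 code pages:

```python
# A single byte with the value 0xE1 (decimal 225).
raw = bytes([0xE1])

# The byte only becomes a character once a code page is chosen.
for codec in ("iso8859_1", "iso8859_7", "iso8859_6"):
    print(codec, raw.decode(codec))

western = raw.decode("iso8859_1")  # Latin small a with acute
greek = raw.decode("iso8859_7")    # Greek small alpha
arabic = raw.decode("iso8859_6")   # Arabic letter feh
```

The database stores the same byte in all three cases; only the encoding announced by the language pack decides which glyph the reader sees.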

richardwbb

I've just installed a 2.0.12 forum from the ground up, and I did indeed notice that it wrote the db_character_set setting to Settings.php. I'm also learning now why things are the way they are. I still don't understand what happened, but I understand what Arantor is saying. I never tried to find out from my ISP when their database went to UTF-8, and now that I think about what I did in the past, it turns out that the UTF-8 setting was there in Settings.php, put there by me during an update, of course. So that might as well have been the problem. Yet at this point I'm still not able to see why it is a user setting. Also, I inherited the forum from someone who probably installed it in ISO-8859-15, because it appears to me that back then the SMF installer didn't yet ask whether to use UTF-8 or not. My point is that I would like to help other people avoid their forum running into mangled characters, or messages dropped at the first special char. It is hard to fix a wrongly encoded database, and the damage can be substantial before it is noticed. [And I know of some well-known sites with mangled characters, for example linux.com.]

And not to forget, I've converted win-1252 encoding to utf8. I'm looking to learn how to avoid encoding damage in the database. I assume the win-1252 encoding got there because SMF was expecting UTF-8. Then I learned that what I had was a mix of Mojibake and double encoding, and, mysteriously, a small percentage of characters that were beyond repair, or that converted back to something I could only verify because I know the language and the sentence it was written in, and so what it should have been. I also learned that people are able to find a lot of characters on their computers that I am not aware of how to input.

Quote from: Kindred on November 06, 2016, 02:14:47 PM
SMF 2.0.x installs as non-UTF-8 by default...

I am not able to look back into the SQL export of the database, because I have found errors in it that have been present longer than I have been the Site Admin of that forum. What I do know is that, I guess, I left it turned on at an upgrade [though I never upgraded, I always started over; I have not yet learned how to upgrade and keep all the modifications intact, and I assume this is not possible].

I also know that I've turned it off, and that I also changed the language template. What I didn't do was formulate my question properly: I lacked knowledge about what a UTF-8 implementation means globally, and have always been looking at SMF, the ISP, and a phpMyAdmin-exported database <- which has to be mysqldumped[!]. The only thing I really looked after, considering it the most important, was making all the users' posts survive through time. Also, the ISP wasn't of much help, because I was unable to formulate my question, and the one topic I remember from here dates back to 2013 [click]

But I have been playing with those settings, which basically are db_character_set=utf-8 in Settings.php and the switching of language templates. I never saw why a template in a different encoding would change the way messages are stored in the database.

So you have explained to me that MySQL is set to use Latin1, which I can understand, but it is not clear to me what you mean by 'phpMyAdmin has issued the command to MySQL to treat that connection as UTF-8'. So I ask what this UTF-8 checkbox is needed for. I have learned that with SMF made UTF-8 aware but using the wrong language template [that would be ISO-8859-15], and then, in my serious attempt to repair this [on a Linux box this time], setting the template back to UTF-8 and posting an 'à á ä ò ó ö è é ë ï' message, made it show up like Latin, or the other way around, depending on where you last had a consistent setup. For my part, I can no longer tell exactly how that went. I *do* know that putting SMF into UTF-8 awareness [checkbox] didn't make it refuse a language template in another encoding [though I know that *can* be a correct way to look at things]. However, when changing the language template [and testing the message 'à á ä ò ó ö è é ë ï' with both language templates], I've seen it switch back and forth [the options I had were choosing the language template and omitting the db_character_set=utf8 in Settings.php], but it always went like this;

attempt 1;
first test message; [proper] <- it always starts here of course, being proper

attempt 2;
first test message; [not proper]
second test message; [proper] <- same goes for another test message of course

attempt 3;
first test message; [proper] <- Ola!
second test message; [not proper]
third test message; [proper]

So Settings.php receives a db_character_set through a checkbox, for the aware or not-so-aware Site Admin, and there can be two language template settings, UTF-8 and non-UTF-8, with the posts going [proper] - [not proper] - [proper].

I know two things;

There has been double encoding in a database that talks UTF-8
There have been two 'switches' within SMF: the UTF-8 checkbox and the choice of language template.

Therefore I say that when the three attempts with the test messages show consistent behaviour, and SMF is able to talk either UTF-8 or Latin to the database, then a quick test when installing or updating SMF could determine which language template has to be used and how to talk to the database, leaving the displaying of characters to the encoding of that file and the [Apache] webserver.

I believe this would have spared me the trouble I went through. Let me know what you think, though, or why my assumption is wrong that the dual switch and the ability to speak both encodings would suffice for SMF to work things out by itself, as I have learned [as far as I can oversee] that any wrongly encoded data in the database can only be converted back manually.

And another thing: I still don't get why I found win-1252 encoding in there. A UTF-8 speaking SMF/MySQL would no longer know the difference, while I do, as a native speaker of the language the data was typed in. But both this double encoding and the win-1252 'gremlins' *seem* to have destroyed a percentage of the data, and I would like to learn how that was possible [though then I might still be able to repair it, I see].

And as I wrote already, I am willing to help prevent other Site Admins from getting stuck with a mangled database. It can be irreparable, and I have already seen friends dropping all messages and starting over <- and that just isn't necessary at all.

You wrote 'You learned, also, that you switched languages in a way that didn't bother to actually convert the data.' I regret that I missed any reading on what that meant. From the web browser and the webserver I had already learned that a page can have meta charset=utf-8 and everything still looks the same. With SMF it was the other way around, and I had to learn MySQL to learn more. I still don't understand everything about this, and I am writing all this while the database I am responsible for has been repaired.

You wrote 'you can't just change individual parts of the equation randomly.' You are right, that is what I did. In the end I even learned to omit the db_character_set setting in Settings.php, and I put it back manually [what happened between those two steps, I don't know; all I know is that all of a sudden nearly everything, but *not* everything, went to double encoding and/or latin1 encoding, which I am unable to explain, and I still found UTF-8 encoded data predating my start as Site Admin, and *that* shouldn't be possible in the first place].

You wrote 'How, exactly, should I work out whether you ticked a box during installation of your forum? My guess is that you didn't tick it, meaning that SMF internally will assume ISO'. Arantor, you are right, from that point I have a clear memory. How I got there I am not sure; it isn't the UTF-8 checkbox, if you ask me. But if you say this is not possible, I am willing to learn your reasoning for that.

You wrote 'There is no place the user can do this - except by 1) directly editing the configuration file that is fairly explicitly commented as 'don't edit this by hand' or 2) installing language files that are wrong for the forum.'

You are correct that I shouldn't be editing Settings.php [though I never RTFM'd anything, hence it isn't RTM, I suppose [though I will if it has caught my interest]]. As for the wrong installation of language files, I wonder why that is possible, and I add to this: what is the point of the UTF-8 checkbox again?

You wrote 'This last one is largely a side effect of how things were done years ago.'

Can you explain to me what you mean by that? I am glad I have knowledge of the ASCII table and know that 8 bits means 256 possibilities, since I started with an 8-bit computer [and it took me a while to see the difference between extended ASCII, which doesn't seem to exist, and latin and win-1252 encoding [which isn't a bad encoding, just like UTF-8, in my opinion], and that a computer in the old days had 7 bits for its own machine code]. They should have been exercising UTF-8 right after the need for character 256, I think. [And latin and win-1252 encoding lack Russian and Japanese [this I 'awk'-ed manually, because it totalled two posts], though they go a long way in displaying characters like the em dash, smileys, spades etc.]

Quote from: Arantor on November 06, 2016, 02:21:31 PM
Nope. Whether you're using UTF-8 on a fresh install is a product of 'did you tick the box on installation' or not.


Arantor

Quote: Yet at this point I'm still not able to see why it is a user setting

The things in Settings.php are not considered user settings. But it has to be listed somewhere so SMF knows how to deal with the things going on, because SMF cannot just magically know before connecting to the database whether to use UTF-8 or not. Something has to tell it, and this is the safest place to do that. It's only a user option during install because a) not all hosts could support it for the longest time and b) there used to be performance tradeoffs even if you could use it, because UTF-8 is bigger for anything outside plain ASCII and slower to process than ISO anything. Of course, in the last decade advancements in computing have largely made this problem go away, but the fact remains it used to be a viable proposition to not use UTF-8.
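The size side of that tradeoff is easy to measure. A rough Python sketch (the sample strings and the helper name are made up for illustration) comparing byte counts in ISO-8859-1 and UTF-8:

```python
def encoded_sizes(text: str) -> tuple:
    """Return (ISO-8859-1 byte count, UTF-8 byte count) for the same text."""
    return len(text.encode("iso8859_1")), len(text.encode("utf-8"))


for sample in ("forum", "\u00e9", "d\u00e9j\u00e0 vu"):
    iso_size, utf8_size = encoded_sizes(sample)
    print(repr(sample), iso_size, utf8_size)
```

Plain ASCII text costs the same in both; every accented Western character costs one extra byte in UTF-8, and scripts outside Latin-1 cannot be stored in ISO-8859-1 at all.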

Quote: I assume the win-1252 encoding got there because SMF was expecting UTF-8

I think you misunderstood what I said, which was that SMF put UTF-8 characters into the database and everything else is really treating them as ISO-8859-1 (which is not the same as Win-1252, though in most cases it's 'close enough').

Quote: what is the point of the UTF-8 checkbox again?

See above. There was a time during SMF's history when not going UTF-8 was a viable choice. That is also the reason we (I, personally, in fact) made the choice to not do that in 2.1, such that 2.1 only installs UTF-8.

Quote: it can be known what language template has to be used and how to talk to the database, leaving the displaying of characters, with the encoding of that file and the [Apache] webserver.

It's actually really simple. You have to know how to talk to the database from the off. As in, literally, the first command issued to the server after connecting is 'use UTF-8'. You need to know that pretty much immediately and the best place to put that is in Settings.php.
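As a sketch of that idea only: in MySQL the 'use UTF-8' instruction is the `SET NAMES` statement, and the decision has to come from configuration that is readable before any query runs. The helper below is hypothetical Python, not SMF's actual PHP code; it just builds the statements a client would send first, based on a `db_character_set` value like the one in Settings.php.

```python
import re


def connection_setup_statements(db_character_set):
    """Return the SQL to send right after connecting (hypothetical helper)."""
    if not db_character_set:
        # Nothing configured: say nothing and let the server default apply.
        return []
    # The name is interpolated into SQL, so accept only a plain identifier.
    if not re.fullmatch(r"[A-Za-z0-9_]+", db_character_set):
        raise ValueError("unexpected character set name: %r" % db_character_set)
    return ["SET NAMES " + db_character_set]


print(connection_setup_statements("utf8"))  # ['SET NAMES utf8']
print(connection_setup_statements(None))    # []
```

The point of the sketch is the ordering: the value has to be known from a file on disk, because the database cannot be asked how to talk to it until the talking has already started.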

Quote: They should have been exercising UTF-8 right after the need for character 256 I think.

Oh, it really, really isn't that simple. When the ISO encodings first came out, space was tight. I'm talking back in the 1980s when individual bytes were a concern. Which is why all the ISO encodings are single bytes per character.

The UTF family came along much later on, first with UCS-2 (two bytes per character, as used throughout the internals of Windows NT) and UCS-4 which later became UTF-16 and UTF-32. UTF-8 is relatively new in a lot of ways, and is much more complex to process because you don't know at the start how many bytes per character. You can't know until you start to examine the characters how many characters there are.

The first byte in a character tells you how many bytes make up that character, based on the arrangement of the leading bits of that byte. This is massively more complex (and they trimmed it down: UTF-8 was originally supposed to allow 6-byte characters, now it only allows 4-byte characters as a maximum).
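A small Python sketch of how the first byte announces the length (the function name is made up; it checks the bit pattern only, not every validity rule of UTF-8):

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Number of bytes in the character that starts with lead_byte."""
    if lead_byte < 0x80:        # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: start of a 2-byte character
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: start of a 3-byte character
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: start of a 4-byte character
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is not used at all.
    raise ValueError("not a valid first byte: 0x%02x" % lead_byte)


for char in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    first = char.encode("utf-8")[0]
    print(hex(first), utf8_sequence_length(first))
```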

Quote: extended ASCII, which doesn't seem to exist and latin, and win-1252 encoding

All of the ISO encodings, plus Win-1251 through Win-1253 are all types of Extended ASCII. Officially, ASCII denotes the values for 0 to 127. Everything above that is officially extended, just in different ways, like how IBM defines code pages like CP 437 which defined most of the higher characters to be text characters for drawing boxes on the screen, or 850 which covered most of the things needed by France and Germany (and is much the same as ISO-8859-1 is today). Or how we have ISO-8859-6 which defines the top 128 values as a subset of Arabic.

UTF-8 is merely a different kind of encoding which specifies that the top set of values indicate continuation. Values 0 to 127 remain the same as ASCII, with all the values above indicating that they are parts of larger characters - the first byte tells you how many bytes follow for the rest of the character (with values 0-127 always indicating a one byte character)
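The three roles a byte can play in UTF-8 can be sketched in a few lines of Python (the function name and labels are illustrative assumptions, not standard terminology from any library):

```python
def classify(byte: int) -> str:
    """Label a single byte of UTF-8 data by its role."""
    if byte < 0x80:
        return "ascii"         # 0xxxxxxx: same meaning as in ASCII
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx: belongs to the character before it
    return "lead"              # 11xxxxxx: starts a multi-byte character


roles = [classify(b) for b in "a\u00e9".encode("utf-8")]
print(roles)  # ['ascii', 'lead', 'continuation']

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
same = "forum".encode("ascii") == "forum".encode("utf-8")
print(same)   # True
```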

The real problem can come in with Win-1251 and Win-1252, which have a habit of using quote symbols that tie into the range above 128 (in such a way that they look like the second byte in a block of bytes forming one character), but since they don't have a correct preceding byte, they can end up damaging content.
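A Python sketch of that failure, assuming text with Windows "smart quotes" that was saved as win-1252 and is later read as UTF-8 (the helper name is hypothetical):

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Curly double quotes around 'hi', encoded the Windows way.
quoted = "\u201chi\u201d".encode("cp1252")
print(quoted)                 # b'\x93hi\x94'
print(is_valid_utf8(quoted))  # False

# A lenient reader swaps each stray byte for U+FFFD; the quotes are lost.
lossy = quoted.decode("utf-8", errors="replace")
print(lossy == "\ufffdhi\ufffd")  # True
```

Bytes 0x93 and 0x94 sit in the continuation range, so a UTF-8 reader sees the tail of a character with no beginning. Depending on the software, the result is a replacement character, a truncated message, or a rejected insert.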

I'm not 100% clear on exactly what damage you had or what repairs are being done but all the issues sound like there was a mismatch between what the database was using and what SMF thought the database was using - and this can happen if the language packs mismatch since the language packs indicate for ISO which codepoints mean what. It is the language pack that knows to differentiate between ISO-8859-1 and ISO-8859-6 for example - because they're single byte encoded, the database does not need to care so it leaves it to the language pack to make the decision. But for ISO-anything vs UTF-anything, you need to get it right otherwise it's going to cause you trouble.

richardwbb

Quote from: Arantor on November 09, 2016, 06:13:40 PM
The things in Settings.php are not considered user settings. But it has to be listed somewhere so SMF knows how to deal with the things going on, because SMF cannot just magically know before connecting to the database whether to use UTF-8 or not. Something has to tell it, and this is the safest place to do that. It's only a user option during install because a) not all hosts could support it for the longest time and b) there used to be performance tradeoffs even if you could use it, because UTF-8 is bigger for anything outside plain ASCII and slower to process than ISO anything. Of course, in the last decade advancements in computing have largely made this problem go away, but the fact remains it used to be a viable proposition to not use UTF-8.

Arantor, I respect your thorough explanation, though you lost me a little. Is it right that there is more to it than Latin1 and UTF-8 for SMF? And can you please explain to me why SMF, or any other application that talks to a MySQL database, has to be told what encoding to use? Is it because I only used Latin15, which isn't Latin1, and then UTF-8, that I fail to see the technical background of how SMF has to interpret what is entrusted to the database by its users [from all over the world]? Or do I understand correctly from your words that in SMF 2.1, SMF will be UTF-8 by itself? I ask because I am in the process of installing SMF 2.0.12 from scratch, and I noticed that it is indeed the UTF-8 checkbox that is responsible for the '$db_character_set = 'utf8';', and that the language the original SMF package comes with is Latin1? This made me wonder, so I typed a test message, 'à á ä ò ó ö è é ë ï', installed a UTF-8 language template, and re-read my test message, which now said; 'à á ä ò ó ö è é ë ï'.

Switching back to Latin1 also reverted the testing message, so I'm guessing a little, since I have no reason to investigate, that these are Latin15 characters, Latin1 encoded, right? Then I asked myself whether there is any reason for an application to be aware of the encoding set through MySQL [the settings that the ISP is responsible for]. If I let something talk UTF-8 to it, MySQL accepts the data sent to it and will talk back UTF-8 [if the collation setting is UTF-8]. And take an ISP with an [old] database: let's assume most ISPs were at Latin1 and changed the collation, so internally it still talks the way it was installed a long time ago. The settings I posted in my last post, on what phpMyAdmin sees and what MySQL says, suggest this is so. Again, I have no reason to investigate this.

Now I wonder, why is it possible to select UTF-8 if SMF comes with Latin1 only? Shouldn't it at least come with two language templates and switch to the one that the UTF-8 checkbox, checked or not checked, requires? I found no reason to leave UTF-8 turned on while letting SMF run with a Latin1 language template, though I expect this to work, and to be something the site admin of that forum simply has not set correctly. Isn't that a pitfall most users don't see? I admit I don't RTM for almost everything, but I also had a hard time back in 2013 using Google or the SMF search to find the required part of the manual, or a support topic with enough pointers for me to be able to continue without needing help. I also prefer not to post a question without trying first myself. And I am still not satisfied with what I know now about what I did wrong. Also, is it true that checking the UTF-8 checkbox and keeping the old database, which obviously can't be UTF-8, will change all non-Latin1 characters to at least two characters starting with Ã?

Shouldn't SMF be backward compatible when parsing content coming back from MySQL? I mean, when a user reads a post written before the UTF-8 checkbox was checked, shouldn't every 'Ã' sequence be decoded to its character? I know this is possible. I do not know if there can be reasons why it isn't this way. I only feel that site admins with a running SMF forum [and a lot of history in the form of a database] see this checkbox, and that the media has taught people in general that UTF-8 is the future and that it is safe. So I wonder where this 'Ã' + exotic character should be converted back.

I did use SMF's convert database to UTF-8 function. Of course I knew I had to back up everything first, and it didn't give a proper result, though I can't remember what that was.

And then I found a UTF-8 decoder, 'https://software.hixie.ch/utilities/cgi/unicode-decoder/utf8-decoder', and if you go up one directory on that site, it comes with the Perl script. With the UTF-8 library and a Linux box, I have it running now on my localhost website for testing purposes.

So I had found about sixty thousand 'Ã' in a database of a hundred thousand posts, and not everything was properly decoded by the decoder link I posted. After again spending a lot of time, I used 'https://dancingmammoth.com/2010/10/28/removing-utf8-gremlins', because I had learned that I was not suffering double encoding for everything, and it didn't seem to be completely Mojibake to me. Also, finding a proper hex editor to learn the byte encoding wasn't easy. Anyhow, I just gave it a try [though I don't agree with the name 'gremlins', there are more converter scripts named like this] and I noticed that I had reduced the 'Ã's by two thirds. Then I saw double encodings, for example:

variÃÆ'Ã,«teit
variÃƒÂ«teit
variÃ«teit
variëteit

The above example is a three step double encoding-[something] to UTF-8 'ë'.
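
A minimal Python sketch of how such layers can stack up and be peeled off again, assuming each layer came from UTF-8 bytes being misread as Windows-1252 (cp1252). The names garble, ungarble and repair are made up for this illustration; they are not part of SMF or of the decoder scripts linked above.

```python
from typing import Tuple


def garble(text: str) -> str:
    """Add one layer of mojibake: UTF-8 bytes misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def ungarble(text: str) -> str:
    """Remove one layer: recover the raw bytes, then read them as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


def repair(text: str, max_passes: int = 10) -> Tuple[str, int]:
    """Peel off layers until a pass fails or changes nothing."""
    passes = 0
    while passes < max_passes:
        try:
            fixed = ungarble(text)
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the text no longer looks like misread UTF-8
        if fixed == text:
            break  # plain ASCII: nothing left to undo
        text = fixed
        passes += 1
    return text, passes


# One layer turns the two UTF-8 bytes of 'ë' into two visible characters.
assert garble("ë") == "Ã«"

# Three layers, then three repair passes to get the word back.
damaged = garble(garble(garble("variëteit")))
assert repair(damaged) == ("variëteit", 3)
```

Python's cp1252 codec rejects the five byte values that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so on real data a pass can stop early. That may be one reason some strings still need manual replacements afterwards.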

In my case, I had to de-gremlin six times. After that I still had a couple of thousand 'Ã' in the database, and none of them could be converted back to proper characters by the UTF-8 decoder I linked.

Here I used sed and awk:

sed "s/Ã«/ë/g" foo > bar
awk '{gsub(/BelgiÃÆ'Įââ,¬â,,¢ÃƒÆ'ââ,¬Â ÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢ÃƒÆ'ƒâ€Ã, ÃÆ'Ã,¢Ã¢ââ,¬Å¡Ã,¬Ã¢ââ,¬Å¾Ã,¢ÃÆ'Įââ,¬â,,¢ÃƒÆ'Ã,¢Ã¢ââ,¬Å¡Ã,¬Ã‚Ã, ÃÆ'ƒÂÃ,¢ÃÆ'Ã,¢Ã¢â‚¬Ã...¡Ã‚Ã,¬ÃÆ'Ã,¢Ã¢â‚¬ïÂ/,"België",$0)}1' < foo > bar

[In the awk example, I left in the disturbance that remained in the database after de-gremlining six times, so I figure that the UTF-8 decoder read the above as proper. It didn't get shorter. Similar disturbances in the database [most of them] did convert back, see the 'variÃÆ'Ã,«teit' example; each pass halved the 'Mojibake'.]

Since it was taking a long time to find every 'Ã' by hand, I decided to use iconv:

iconv -t win1252 -f iso-8859-15 foo > bar

This gave an error at the first non-convertible character. On a 60 MB database, that was first at 6 MB, where it got stuck about twenty times; then it went up to 40 MB, and after that it was a cakewalk to fix everything.
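
Stopping at the first character the target encoding lacks means one run per problem. A small Python sketch that lists every such character in one go instead, assuming the text is already held as a decoded string. The helper name unconvertible is hypothetical, and the offsets are character positions in the string, not byte offsets in the file.

```python
def unconvertible(text, target="cp1252"):
    """Return (offset, character) for every character the target encoding lacks."""
    problems = []
    for offset, ch in enumerate(text):
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            problems.append((offset, ch))
    return problems


# Accented Latin letters fit in Windows-1252; Cyrillic and Japanese do not.
sample = "België и 日本"
assert unconvertible(sample) == [(7, "и"), (9, "日"), (10, "本")]
```

With a list like this, the Russian and Japanese strings can be set aside up front instead of being discovered one failed run at a time.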

A side note: the output 'bar' of iconv is not usable. But opening 'bar' and scrolling to the last line + END showed something similar to 'readable text + vari'.

Notice that the 'ÃÆ'Ã,«teit' of 'variÃÆ'Ã,«teit' was truncated [and everything after it]. Still, an iconv-converted file without such errors [after thirty to forty sed/awk replacements by me on the original database, so replacement 1 = foo > foo-2, replacement 2 = foo-2 > foo-3, etc.] gave a properly UTF-8 encoded database.

And now I use the '$db_character_set = 'utf8';' with the UTF-8 language template and I am sure now that this Mojibake, double encoding or gremlin, won't come back.

But I still fail to see why my database suffered win-1252 encoding, besides that the users used Windows. And they used Windows 7 a lot. And Windows 7 has the 'chcp' tool and it reported codepage 850. For the command prompt. chcp /? also gave a '?' for 'ï' with Windows. On the other hand Grub for Linux with a UTF-8 fresh install also is able to use '?' for accented characters. And I fail to see why this is so. I always had respect for the way Microsoft translated their English versions to all other languages and now this '?', that not many people know about.

But to get back to the subject, can you tell me where the win-1252 encoding came from? All I know is that our users are mostly Dutch, using the American International keyboard [a Dutch keyboard has been made, but very few are sold by computer manufacturers], and Belgian, and those people use accents more often than Dutch people. Also, I used iconv and told it to encode to win-1252 [an educated guess by me, based on a successful de-gremlin], and it failed to translate Russian and Japanese. I replaced those strings with their Latin1 counterparts and sed/awk'd them after the thirty to forty runs. In other words, Japanese and Russian people would probably need to find another decoding strategy.

Then I converted the win-1252 encoded file back to UTF-8 with iconv, and noticed that its byte length was exactly the byte length of my file after thirty to forty sed/awk runs. So I knew then almost for sure that I had repaired it correctly. Also, I searched for 'Ã' in the database file and there were zero. Then I knew everything was right again, so I replaced the Japanese and Russian with their UTF-8 [because the encoding used there wasn't accepted by iconv, since I had asked it to convert to win-1252], and I am running this database now without problems.
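
The two checks described above (round trip through win-1252 and a search for leftover 'Ã') can be sketched in Python roughly as follows. This is an illustration under assumptions: check_repair is a hypothetical helper, it works on a decoded string rather than on the dump file, and it compares the round-tripped text itself rather than only its byte length, which is a slightly stricter test.

```python
def check_repair(text, legacy="cp1252"):
    """Summarise how clean a repaired text looks."""
    try:
        # Same idea as converting to win-1252 and back again.
        survives = text.encode(legacy).decode(legacy) == text
    except UnicodeEncodeError:
        survives = False  # contains characters the legacy encoding lacks
    return {
        "markers_left": text.count("Ã"),
        "survives_round_trip": survives,
        "utf8_length": len(text.encode("utf-8")),
    }


clean = check_repair("België")
assert clean == {"markers_left": 0, "survives_round_trip": True, "utf8_length": 7}

# A leftover layer of mojibake still shows up in the marker count.
assert check_repair("BelgiÃ«")["markers_left"] == 1
```

A zero marker count is only a hint: a language that legitimately uses 'Ã' would be counted as well.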

And this is where my story stops. I'm still under the impression that SMF can know on its own whether it should check the UTF-8 box at install, and the only caveat I see is that there must be a language that requires 'Ã'. I'm lost on your explanation of why this setting has to be put in correctly, and also, I'm not aware of a successful UTF-8 conversion attempt by SMF. If it is successful, or is for nine out of ten, shouldn't it be in the installer? Also, the installer could offer to copy all tables to another three-letter + underscore prefix and give a revert script. That way people have a way to try things without knowing the ins and outs, which from what I read here is what most people do, and indeed I don't read the manual.

Quote
I think you misunderstood what I said, that SMF put UTF-8 characters into the database and everything else is really treating them as ISO-8859-1 (which is not the same as Win-1252 though in most cases it's 'close enough')

You are right, and I understand what you say, but I fail to understand why. What other encoding is to be expected than win-* with so many Windows clients? My forum has one Linux user, but that is me, and one Mac. And chcp didn't come up with any codepage other than 850 somehow. I've read writings on the internet by people I feel are biased, saying that Microsoft has been the culprit, but I view their encoding scheme as just as good or bad as any other. Also, I read that UTF-8 can produce errors. I fail to see how that is possible, since I learned that everything converts to UTF-8, but the result can be that you are 'further away from your house' [Dutch expression].


Quote
See above. There was a time during SMF's history when not going UTF-8 was a viable choice. There is also the reason we (I, personally, in fact) made the choice to not do that in 2.1, such that 2.1 only installs UTF-8.

Can you tell me what the pre-UTF-8 SMF forum posts with non-ASCII characters will look like?

Quote
It's actually really simple. You have to know how to talk to the database from the off. As in, literally, the first command issued to the server after connecting is 'use UTF-8'. You need to know that pretty much immediately and the best place to put that is in Settings.php.

I agree with what you are saying; however, I didn't know what a database or PHP was when I started. I managed to fail at a WAMP server installation, and now with Linux I have found a box that talks back to me.

Quote
Oh, it really, really isn't that simple. When the ISO encodings first came out, space was tight. I'm talking back in the 1980s when individual bytes were a concern. Which is why all the ISO encodings are single bytes per character.

The UTF family came along much later on, first with UCS-2 (two bytes per character, as used throughout the internals of Windows NT) and UCS-4 which later became UTF-16 and UTF-32. UTF-8 is relatively new in a lot of ways, and is much more complex to process because you don't know at the start how many bytes per character. You can't know until you start to examine the characters how many characters there are.

The first byte in a character tells you how many bytes make up that character, based on the arrangement of the first three bits of the number. Because this is massively more complex (and they trimmed it down, UTF-8 was originally supposed to allow 6-byte characters, now it only allows 4-byte characters as a maximum)

I get your first two paragraphs, but your third is unclear to me.
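
For the third paragraph of that quote, a minimal Python sketch of how the first byte of a UTF-8 character announces its length. It assumes current UTF-8, where a character takes at most four bytes and the pattern of the leading one to five bits decides the length. The function names are made up for illustration.

```python
def utf8_sequence_length(lead):
    """Return how many bytes the character starting with this byte occupies."""
    if lead >> 7 == 0b0:
        return 1  # 0xxxxxxx: plain ASCII
    if lead >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte; it can never start a character.
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 character")


def split_characters(data):
    """Cut a UTF-8 byte string into one chunk per character."""
    chunks = []
    position = 0
    while position < len(data):
        length = utf8_sequence_length(data[position])
        chunks.append(data[position:position + length])
        position += length
    return chunks


# 'v' takes one byte, 'ë' two, '€' three.
assert split_characters("vë€".encode("utf-8")) == [b"v", b"\xc3\xab", b"\xe2\x82\xac"]
```

This only looks at the lead byte. A real decoder also checks that the following bytes are continuation bytes and rejects a few lead values, such as 0xC0 and 0xC1, that match a pattern but are not allowed.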

Quote
All of the ISO encodings, plus Win-1251 through Win-1253 are all types of Extended ASCII. Officially, ASCII denotes the values for 0 to 127. Everything above that is officially extended, just in different ways, like how IBM defines code pages like CP 437 which defined most of the higher characters to be text characters for drawing boxes on the screen, or 850 which covered most of the things needed by France and Germany (and is much the same as ISO-8859-1 is today). Or how we have ISO-8859-6 which defines the top 128 values as a subset of Arabic.

UTF-8 is merely a different kind of encoding which specifies that the top set of values indicate continuation. Values 0 to 127 remain the same as ASCII, with all the values above indicating that they are parts of larger characters - the first byte tells you how many bytes follow for the rest of the character (with values 0-127 always indicating a one byte character)

The real problem can come in with Win-1251 and Win-1252 which has a habit of using quote symbols that tie into the range above 128 (in such a way that they are the second character in a block of bytes forming one character), but since they don't have a correct preceding character, they can end up damaging content.

I'm not 100% clear on exactly what damage you had or what repairs are being done but all the issues sound like there was a mismatch between what the database was using and what SMF thought the database was using - and this can happen if the language packs mismatch since the language packs indicate for ISO which codepoints mean what. It is the language pack that knows to differentiate between ISO-8859-1 and ISO-8859-6 for example - because they're single byte encoded, the database does not need to care so it leaves it to the language pack to make the decision. But for ISO-anything vs UTF-anything, you need to get it right otherwise it's going to cause you trouble.

I understand your first and second paragraphs. As for your third paragraph, you are probably right that there can be damage with win-* encoding, but can you explain what you mean by 'quote symbols that tie into the range above 128'? This didn't translate very well to my native language. Also, what is an example of an incorrect preceding character? Can you explain how it damages content? Do I understand from this that this is why people feel that Microsoft encoding is bad? Can you tell if it was Microsoft's encoding scheme that resulted in irreparable database damage?
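
One way to make 'quote symbols that tie into the range above 128' concrete, as a Python sketch. It assumes the quote symbols meant are the curly quotes that Windows-1252 stores as the single bytes 0x91 to 0x94; inspect_cp1252 is a hypothetical helper name.

```python
def inspect_cp1252(ch):
    """Return (byte value, falls in UTF-8 continuation range, valid UTF-8 alone)."""
    raw = ch.encode("cp1252")
    byte = raw[0]
    in_continuation_range = 0x80 <= byte <= 0xBF
    try:
        raw.decode("utf-8")
        valid_alone = True
    except UnicodeDecodeError:
        valid_alone = False
    return byte, in_continuation_range, valid_alone


# Each curly quote is one byte that UTF-8 reserves for the 2nd, 3rd or 4th
# byte of a character, so on its own it has no valid lead byte in front.
assert inspect_cp1252("‘") == (0x91, True, False)
assert inspect_cp1252("’") == (0x92, True, False)
assert inspect_cp1252("“") == (0x93, True, False)
assert inspect_cp1252("”") == (0x94, True, False)

# Read as UTF-8, the apostrophe in "don’t" becomes the replacement character.
assert b"don\x92t".decode("utf-8", errors="replace") == "don\ufffdt"
```

In that last line the 't' before the quote byte is the 'incorrect preceding character': a continuation byte is only valid after a lead byte, and 't' is plain ASCII. A strict decoder stops there, and a lenient one substitutes or drops the byte, which is how content can end up damaged.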

I didn't mention yet that I found some control codes, but the UTF-8 decoder converted these back properly. I also had some unreadable characters [dropped by the UTF-8 decoder] that appeared to be meant as 'space', so I let sed/awk replace them with ''. There was also some garbled encoding that I could safely delete; it seemed like transmission failures, about five or so in a 60 MB file.

For your fourth paragraph, could it be that I live in a country that uses Latin15 and that SMF was in Latin1? Can you explain to me how SMF was then able to store an 'ë', for example, without being made UTF-8 aware? And how is it possible that SMF takes responsibility for that, if it is true that what you send to the database also comes back in the same byte sequence? I know that my six conversion steps are the repair of the result I got by trying 'the two switches'. For the rest, I am also wondering if my ISP upgraded something, because ISPs tend to upgrade without notice, and that wouldn't be the first time something that was running got borked somewhere. I asked them to notify us before they upgrade, but they just don't. And finally there is PHP 7 appearing, and they decided to leave PHP 5 alone. I also don't respect PHP's way of implementing strict standards; I respect Apache and MySQL, and I respect this forum for its knowledge.
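
On the Latin1 versus Latin15 question, a short Python sketch, assuming 'Latin15' here means ISO-8859-15. It compares the two tables byte by byte and shows that 'ë' is the same single byte in both, so a database that just stores and returns bytes hands back the same letter under either one.

```python
# Byte values that decode to different characters in the two encodings.
differing = [
    byte
    for byte in range(256)
    if bytes([byte]).decode("latin-1") != bytes([byte]).decode("iso8859-15")
]

# Only eight positions differ (the euro sign and a few letters such as Š and Œ).
assert differing == [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]

# 'ë' is not among them: it is the single byte 0xEB in both encodings.
stored = "ë".encode("latin-1")
assert stored == "ë".encode("iso8859-15") == b"\xeb"

# Whatever single-byte reading is applied on the way out, the letter survives.
assert stored.decode("latin-1") == stored.decode("iso8859-15") == "ë"
assert stored.decode("cp1252") == "ë"
```

The same byte only turns into something else once one side starts treating it as part of a UTF-8 sequence.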

There is one thing I want to let you know.

I talked about the test message and what came back:

'à á ä ò ó ö è é ë ï' <> 'à á ä ò ó ö è é ë ï'

Using the UTF-8 decoder, the first accented character, 'à', doesn't come back. Every single 'Ã' is nothing; I think it says 'escape', and so it did. Haha.

However, that letter has been lost somehow, and that is an error. I failed to investigate this, but I have the old database, so I am able to reproduce it, both for the mangled database I had and for the loss of the accented character.
If my post in this topic looks ambiguous to you, then I'm with Murphy's law and General Stupidity. In other words, trial and error.
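
On the lost 'à': one possible explanation, which is an assumption and not something confirmed in this topic, is that the second byte of 'à' in UTF-8 is 0xA0, which Windows-1252 and Latin1 read as a no-break space. If anything along the way turns that no-break space into an ordinary space, or trims it, a lone 'Ã' is left that no decoder can pair up again. A Python sketch with a hypothetical helper name:

```python
def undo_layer(text):
    """Try to undo one layer of UTF-8-read-as-cp1252; return None if impossible."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# 'à' is the bytes C3 A0 in UTF-8; misread, that is 'Ã' plus a no-break space.
garbled = "à".encode("utf-8").decode("cp1252")
assert garbled == "Ã\u00a0"
assert undo_layer(garbled) == "à"

# Once the no-break space has become a plain space, the pair is broken.
flattened = garbled.replace("\u00a0", " ")
assert undo_layer(flattened) is None
```

The other letters in the test message end in visible second characters (for example 'Ã«' for 'ë'), which would fit with only the 'à' going missing.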
