Problems with Acute Encoding

Started by sriedel, June 25, 2004, 07:22:09 PM


sriedel

Hello there,

I run a German community forum using SMF 1.0 Beta 5 Public.

Don't ask me why, but some of my users are used to typing an acute accent character "´" instead of the normal apostrophe (single quote) character. OK, that is a problem of those users ;) and not of the BBS, but on my specific setup HTML special chars get escaped twice, which leads to a literal "&acute;" instead of "´" in the message subjects and bodies.

Example:

The user types:
Quote:
single quote '
double quote "
ampersand &
accent acute ´
accent grave `
accent circumflex ^

After submitting the form via "preview" or "post" you get:
Quote:
single quote '
double quote "
ampersand &
accent acute &acute;
accent grave `
accent circumflex ^

You always get this result: a) in the message preview, b) in the textarea field of the preview message form, and c) in the resulting message when you post it directly without previewing it.

After posting the message you have the following content in the "body" field of the corresponding record of the "forum_messages" table:
Quote:
single quote &#039;<br />double quote &quot;<br />ampersand &amp;<br />accent acute &amp;acute;<br />accent grave `<br />accent circumflex ^<br />

As you can see from the last quote, the accent acute seems to be "double encoded". I cannot imagine how this happens, as it doesn't happen to the ampersand character, but it looks as if something like this takes place: htmlspecialchars(htmlspecialchars("´")) -> htmlspecialchars("&acute;") -> "&amp;acute;"
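
The escaping chain above can be sketched in Python rather than PHP. This is only an illustration under two assumptions: `html.escape` stands in for PHP's `htmlspecialchars` (the two write the single quote differently), and the text already contains the entity `&acute;` by the time the forum code escapes it, which nothing in the thread has confirmed at this point.

```python
import html

# A single escape leaves the raw character alone, because only
# &, <, >, " and ' are treated as special.
once = html.escape("accent acute ´")

# If the text already arrives as the entity "&acute;", that same single
# escape rewrites its ampersand, giving the value seen in the database.
already_entity = html.escape("accent acute &acute;")

# A plain ampersand is escaped exactly once in both scenarios, which is
# why the "ampersand &" line looks correct.
ampersand = html.escape("ampersand &")

print(once)            # accent acute ´
print(already_entity)  # accent acute &amp;acute;
print(ampersand)       # ampersand &amp;
```

Seen this way, the stored `&amp;acute;` does not need two rounds of escaping on the server; one round is enough if the character was turned into an entity before the server ever saw it.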

Most confusing, all this does NOT happen on most SMF installations I tested, including this board. But it happens on my machine, maybe it is something OS/PHP/Apache related? Here is my current config:
Quote:
Forum Version: SMF 1.0 Beta 5 Public
PHP Version: 4.3.3
MySQL Version: 4.0.15-Max
Server Version: Apache/2.0.48 (Linux/SuSE)
GD Version: bundled (2.0.15 compatible)

Do you have any ideas what happens here?


Best Regards,

Stefan Riedel

[Unknown]

Can I have a link to your forum?  What character encoding are you using?  Have you tried UTF-8?

-[Unknown]

Elissen

I have exactly the same thing with an Icelandic member of my forum. I'll try to get some info from him as well.

andrea

I bet those members have wrong character encoding turned on.

Andrea Hubacher
Ex Lead Support Specialist
www.simplemachines.org




sriedel

Feel free to test the described behaviour at our forum: http://www.diegrenzwoelfe.de/forum/

The default template of my forum is an only slightly modified Classic YaBB SE Theme, so at least regarding the template everything should be fine, and the charset of the resulting pages is ISO-8859-1. I don't think it has anything to do with the codeset.

BTW: It happens not only to "those" users, but to all users, including myself. I was just talking of "those users" with regard to using the accent acute character, as you normally don't use that char in either German or English ;)

Just compose a message, put some accent acute chars "´" in the subject and/or the message body and do a preview, and you should see those pesky "&acute;" things...

[Unknown]

You are using German, correct?  See what happens if you edit this file:

Themes/default/languages/index.german.php

And replace this:
$txt['lang_character_set'] = 'ISO-8859-1';

With this:
$txt['lang_character_set'] = 'UTF-8';

This could cause problems in the long run, but it's a good way to test what the problem is.

-[Unknown]

sriedel

Correct, the default language of the forum is German. But I already tried the whole thing with my profile set to English and it is the same. I also tried what you suggested, [Unknown]: changing the 'lang_character_set' of German to UTF-8 doesn't help either.

But I wonder about one thing regarding codesets/encodings. In IE there is an "Encoding" menu where you can turn automatic detection on and off and choose a specific encoding manually. If everything works well, you should have a checkmark at a common encoding, right? If I check this *here*, in the SMF forum, I see a checkmark beside "Western (ISO)" while "Automatic" is on. If I check the detected codeset in *my* forum, I see a checkmark beside a greyed out (!) "Latin 9 (ISO)" -- no matter if my profile is set to English/ISO-8859-1 or German/UTF-8 (as you recommended).

Ok, this codeset thing gets interesting. I was wondering what happens at the HTTP level, so I checked the HTTP headers sent by my server/forum. Here is what I got:
Quote:
> wget -S http://www.diegrenzwoelfe.de/forum/index.php

1 HTTP/1.1 200 OK
2 Date: Sat, 26 Jun 2004 16:44:02 GMT
3 Server: Apache/2.0.48 (Linux/SuSE)
4 X-Powered-By: PHP/4.3.3
5 Set-Cookie: PHPSESSID=20a5e83733505d6a99b64757dfb51f02; path=/
6 Expires: Mon, 26 Jul 1997 05:00:00 GMT
7 Cache-Control: no-cache, must-revalidate
8 Pragma: no-cache
9 Last-Modified: Sat, 26 Jun 2004 16:44:02 GMT
10 Connection: close
11 Content-Type: text/html; charset=ISO-8859-15

Compare this to what I get from this forum (SMF), and I think we are closing in on the mystery:
Quote:
> wget -S http://www.simplemachines.org/community/index.php

1 HTTP/1.1 200 OK
2 Date: Sat, 26 Jun 2004 16:44:44 GMT
3 Server: Apache/1.3.31 (Unix) mod_auth_passthrough/1.8 PHP/4.3.3 mod_log_bytes/1.2 mod_bwlimited/1.4 mod_ssl/2.8.18 OpenSSL/0.9.7a
4 X-Powered-By: PHP/4.3.3
5 Set-Cookie: PHPSESSID=ef54337151701e3152de0b3b9174f589; path=/
6 Expires: Mon, 26 Jul 1997 05:00:00 GMT
7 Cache-Control: private
8 Pragma: no-cache
9 Last-Modified: Sat, 26 Jun 2004 16:44:44 GMT
10 Connection: close
11 Content-Type: text/html

As you can see, the SMF webserver only sends a "Content-Type" without a charset, while my server additionally sends "charset=ISO-8859-15". To my knowledge, ISO-8859-15 is ISO-8859-1 plus Euro sign. This shouldn't be a problem for IE6, but maybe it is?
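
ISO-8859-15 actually differs from ISO-8859-1 in a few more positions than the one that carries the Euro sign, and the acute accent is one of the characters it drops. The sketch below uses only Python's built-in codecs to list the differing bytes and to check whether "´" can be encoded in each charset. That a browser then falls back to sending "&acute;" for a character it cannot encode is a plausible explanation, but it is an assumption and not something confirmed in this thread.

```python
# Bytes in the upper half that decode to different characters
# under ISO-8859-1 and ISO-8859-15.
differing = [
    b for b in range(0xA0, 0x100)
    if bytes([b]).decode("iso-8859-1") != bytes([b]).decode("iso-8859-15")
]


def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


for b in differing:
    latin1 = bytes([b]).decode("iso-8859-1")
    latin9 = bytes([b]).decode("iso-8859-15")
    print(f"0x{b:02X}: {latin1} -> {latin9}")

print(can_encode("´", "iso-8859-1"))   # True
print(can_encode("´", "iso-8859-15"))  # False
```

Byte 0xB4 is the acute accent in ISO-8859-1 but a different letter in ISO-8859-15, so a form submitted in ISO-8859-15 has no byte available for "´".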

BTW, the HTTP headers are identical whether the default language of the forum is set to German or English.
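
The header comparison above can also be done programmatically. This is a small sketch that reads the charset parameter out of a Content-Type value with Python's standard `email.message.Message`; the helper name `charset_from_content_type` is made up for this illustration, and the two sample values are copied from the wget output quoted above.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value,
    or None if the header carries no charset."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


# Header values as quoted from the two servers.
mine = charset_from_content_type("text/html; charset=ISO-8859-15")
smf = charset_from_content_type("text/html")

print(mine)  # iso-8859-15
print(smf)   # None
```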

sriedel

Wow, I think I found the reason for those problems.   :D

I was really wondering about those HTTP headers I quoted in my last message. Where does this ISO-8859-15 charset come from? And why is there no difference when I change from English with ISO-8859-1 to German with UTF-8 ([Unknown]'s suggestion)? Even more suspicious: where does that ISO-8859-15 in the headers come from when a look at the HTML source reveals the correct HTTP-EQUIV tags, with ISO-8859-1 when English is set and UTF-8 when German is set?

Searching my local Apache2 config files for "ISO-8859-15" revealed the source of my problem: there was a directive "AddDefaultCharset ISO-8859-15" in my mod_mime-defaults.conf. The apache docs say:
Quote:
This directive specifies the name of the character set that will be added to any response that does not have any parameter on the content type in the HTTP headers. This will override any character set specified in the body of the document via a META tag. A setting of AddDefaultCharset Off disables this functionality. AddDefaultCharset On enables Apache's internal default charset of iso-8859-1 as required by the directive. You can also specify an alternate charset to be used.

I set the value of this directive to "Off", which is the Apache default anyway, and now everything is perfectly fine, from the correct encodings shown by IE to the correct escaping of the Acute character used by "those users" ;)
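
For anyone hunting for the same directive, here is a rough Python sketch of the search described above. The function name `find_default_charset` and the sample text are made up for this illustration; the sample is not the real SuSE file. Commented-out lines are skipped, and the match ignores case because Apache directive names are case-insensitive.

```python
import re
from typing import List, Tuple

# Matches an active (not commented-out) AddDefaultCharset directive.
DIRECTIVE = re.compile(r"^\s*AddDefaultCharset\s+(\S+)", re.IGNORECASE)


def find_default_charset(config_text: str) -> List[Tuple[int, str]]:
    """Return (line number, value) for every active AddDefaultCharset line."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        match = DIRECTIVE.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


# Hypothetical excerpt for demonstration only.
sample = """\
# default character set
#AddDefaultCharset UTF-8
AddDefaultCharset ISO-8859-15
"""

print(find_default_charset(sample))  # [(3, 'ISO-8859-15')]
```

To scan real files, read each one and pass its text to the function; which files to read depends on the distribution's Apache layout.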

BTW: Where the heck did this "AddDefaultCharset ISO-8859-15" come from when Apache says "Off" is the default value? All I can say is that this was a fresh install of a SuSE 9.0 Professional box...  ::)

Thanks to [Unknown] and Andrea for pointing in the right direction!  8)

[Unknown]

I'm not sure where it came from; maybe you installed a package that set that?  Perhaps a European one?

-[Unknown]

sriedel

I am sure that this is the config of a fresh SuSE 9.0 install.

The following is a quote of the SuSE Apache 2 FAQ:
Quote:
You can adjust the default character set in mod_mime-defaults.conf
AddDefaultCharset UTF-8
                  ^^^^^
to the character set you actually use. For example, you could set it to ISO-8859-1 (Latin1), which was the default before SUSE LINUX 9.0.

Note that this setting is effective / needed only for responses where the HTTP headers do not already contain a parameter on the content type (as in META tags). It is explained here: http://httpd.apache.org/docs-2.0/mod/core.html#adddefaultcharset. Furthermore, note that the setting can also be applied in the scope of a virtual host.

Most important, the statement "this setting is effective / needed only for responses where the HTTP headers do not already contain a parameter on the content type (as in META tags)" is plain wrong. The Apache2 docs state explicitly that this directive overrides any charset setting coming from META tags (see my last message for that quote).
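
The precedence at issue can be sketched in a few lines of Python: a charset in the HTTP Content-Type header wins, and the META tag is only consulted when the header has none. The names `charset_of` and `effective_charset` and the sample markup are made up for this illustration, and the regular expression only handles the simple HTTP-EQUIV form with the attributes in this order; it is not a general HTML parser.

```python
import re
from email.message import Message
from typing import Optional

# Captures the content attribute of a simple HTTP-EQUIV Content-Type META tag.
META_CONTENT = re.compile(
    r'<meta\s+http-equiv=["\']content-type["\']\s+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def charset_of(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def effective_charset(header_value: str, html_source: str) -> Optional[str]:
    """Prefer the charset from the HTTP header; fall back to the META tag."""
    from_header = charset_of(header_value)
    if from_header:
        return from_header
    match = META_CONTENT.search(html_source)
    return charset_of(match.group(1)) if match else None


# Hypothetical page head resembling what the forum sends for English.
page = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />'

print(effective_charset("text/html; charset=ISO-8859-15", page))  # iso-8859-15
print(effective_charset("text/html", page))                       # iso-8859-1
```

With AddDefaultCharset active, the first call reflects what the browser saw; with it set to Off, the second call does.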

[Unknown]

So tell them, because they should know they are wrong about that ^_^.  Happens, you know?

-[Unknown]
