Language question and offer

Vector’s Shadow · April 07, 2013, 01:58:48 AM

I have installed SMF 2.0.4 in an attempt to create a bilingual forum. I didn't pay much attention to the fact that the other language had to be UTF-8 (as decided by me) while the default English wasn't. So I have now downloaded the UTF-8 package for English. It looks exactly like the regular package, the only difference being that it claims to be UTF-8 in index.english-utf8.php. I don't want to install this and then try chasing down and removing the default English files. All I really want to do with the English version is to have the overused (and abused!) punctuation marks, like the straight single quote, exchanged for better representations, like ‘ and ’, for example. I would have done just that, but, naturally, nothing is ever easy: titles, for instance, get sanitized, so I get text like "Who&rsquo;s Online" in the titlebar code when I do this. I need to use characters instead of entities.

So what's the point of having a package that differs only in what the forum thinks the package is? A more practical question: Can I just change the corresponding string in the default language files to en_US.UTF-8 and start using UTF-8 symbols for quotes and dashes (saving with UTF-8 encoding, obviously), or would that not be a kosher way to do this?

@This Forum admins: Also, it's really annoying to see that when my users will (nobody on the forum yet) try to use Help in the language other than English, they will get a rather nasty empty page that tells them exactly nothing useful for many topics. I could create reasonably high-quality translations for some of those pages, especially shorter ones, and could make an attempt to make a small number of existing translations less "techy" (since they are, after all, for the users, but are written as if for veteran users/admins).

kat · April 07, 2013, 06:38:46 AM

I've passed this on to the Localization team.

emanuele · April 07, 2013, 06:50:29 AM

Quote from: Vector's Shadow on April 07, 2013, 01:58:48 AM
So what's the point of having a package that differs only in what the forum thinks the package is? A more practical question: Can I just change the corresponding string in the default language files to en_US.UTF-8 and start using UTF-8 symbols for quotes and dashes (saving with UTF-8 encoding, obviously), or would that not be a kosher way to do this?

When you installed SMF did you select the "UTF8" checkbox?

Quote from: Vector's Shadow on April 07, 2013, 01:58:48 AM
@This Forum admins: Also, it's really annoying to see that when my users will (nobody on the forum yet) try to use Help in the language other than English, they will get a rather nasty empty page that tells them exactly nothing useful for many topics. I could create reasonably high-quality translations for some of those pages, especially shorter ones, and could make an attempt to make a small number of existing translations less "techy" (since they are, after all, for the users, but are written as if for veteran users/admins).

It's a wiki. As soon as you reach 10 messages here on the forum (an anti-spam measure, sorry, though I think it can be bypassed), you will be able to edit those pages. Well, at least some of them; it may be necessary to add you to a group, but that wouldn't be a big issue, just ask.

Vector’s Shadow · April 07, 2013, 04:59:13 PM

Quote from: K@ on April 07, 2013, 06:38:46 AM
I've passed this on to the Localization team.

Thanks.

Quote from: emanuele on April 07, 2013, 06:50:29 AM
When you installed SMF did you select the "UTF8" checkbox?

Yes, I did. But I'm talking about the files found in smf_2-0-4_english-utf8.zip, so that shouldn't make a difference, as I understand it.

Quote from: emanuele on April 07, 2013, 06:50:29 AM
It's a wiki. As soon as you reach 10 messages here on the forum (an anti-spam measure, sorry, though I think it can be bypassed), you will be able to edit those pages. Well, at least some of them; it may be necessary to add you to a group, but that wouldn't be a big issue, just ask.

Thank you for pointing it out. I realized that there's a posting limit when I saw how few options I could set in the profile. I saw this mod in a couple of other forums. On one hand, it's great for limiting spam; on the other, it teaches newcomers that you prefer quantity over quality (even if that's not true)...

Kindred · April 07, 2013, 08:11:37 PM

10 posts is hardly quantity.....

As for your comments, I guess I am unsure what you actually mean.
The files for the UTF8 versions of the languages have UTF8 characters in them, while the other ones do not. I do not think that any of the normal language files use HTML entities.

Oldiesmann · April 07, 2013, 08:15:00 PM

The normal ones do use HTML entities, but only for certain characters (generally any unicode characters). The UTF-8 files don't.

Arantor · April 07, 2013, 08:17:22 PM

Quote
I do not think that any of the normal language files use HTML entities.

Actually a lot of them do, especially the non-English ones. But even the English ones have some in. And yes, even the UTF-8 ones might. I've spent great amounts of time with the language files.

The biggest reason for not just fudging the language pack item is in fact the database. You have to make sure the database is properly set to whatever you're actually using: if you didn't set UTF-8 on install and didn't convert, you'll mostly be using ISO-8859-1, which is a single-byte set, as opposed to UTF-8, which is a multi-byte set. The consequence is primarily that *all* characters outside the basic set (A-Z, a-z, 0-9, standard punctuation) can be damaged in saving. Even things like £ and € can be damaged: neither is part of the original ASCII set, and € is not part of ISO-8859-1 either.
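
To make the single-byte versus multi-byte distinction concrete, here is a minimal sketch in plain PHP. It only counts bytes and characters; it is not SMF code, and the helper names `byteCount` and `charCount` are made up for this illustration. The characters are written as byte escapes so the result doesn't depend on how the file itself is saved.

```php
<?php
// UTF-8 byte sequences, written as escapes.
$pound = "\xC2\xA3";      // U+00A3 POUND SIGN
$euro  = "\xE2\x82\xAC";  // U+20AC EURO SIGN
$apos  = "\xE2\x80\x99";  // U+2019 RIGHT SINGLE QUOTATION MARK

// Hypothetical helper: number of bytes, which is what a single-byte set would count.
function byteCount(string $s): int
{
    return strlen($s);
}

// Hypothetical helper: number of characters when the bytes are read as UTF-8.
function charCount(string $s): int
{
    return (int) preg_match_all('/./su', $s);
}

foreach (array('pound' => $pound, 'euro' => $euro, 'apostrophe' => $apos) as $name => $char) {
    printf("%s: %d byte(s), %d character(s)\n", $name, byteCount($char), charCount($char));
}
```

Each of these is one character but two or three bytes in UTF-8, so anything that treats the stored bytes as a single-byte set sees several characters where one was meant.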

Vector’s Shadow · April 07, 2013, 09:15:23 PM

Quote from: Kindred on April 07, 2013, 08:11:37 PM
10 posts is hardly quantity.....

When you want quality posts, 1 useful post is much better than 10 semi-useful ones. The requirement to make a certain number of posts means that visitors are encouraged to make a number of semi-useful and even downright meaningless posts just to meet the limit. This teaches them from the get-go that it's OK to make such posts and is preferred to meaningful ones.

Quote from: Kindred on April 07, 2013, 08:11:37 PM
As for your comments, I guess I am unsure what you actually mean.[...]

Easy. When I see a file labeled as using UTF-8, I expect multibyte characters in it. The English UTF-8 file pack contains none of them, even though it potentially could. E. g., all apostrophes could have been replaced with the correct typographical symbol for it instead of the straight one, "\'". But that's not the case. Files in UTF-8 package all contain only ASCII characters. So I'm saying, what's the point of having the UTF-8 package in the first place if it uses no UTF-8 characters? I downloaded it expecting to see them in place of ASCII punctuation.

Arantor · April 07, 2013, 09:21:12 PM

Quote
Easy. When I see a file labeled as using UTF-8, I expect multibyte characters in it.

Even when most of its content won't be multibyte anyway. Even when some of the entity use is not for byte safety.

Quote
all apostrophes could have been replaced with the correct typographical symbol for it instead of the straight one

No, they can't. Firstly, you have to deal with the fact that 99% of the ' characters are for PHP's benefit. Secondly, you have to deal with the fact that there's an awful lot of cases of inline markup, which may or may not use " or ' in it and there's no way to automatically generate the correct code in that situation.

There are also lots of fun issues with people editing files through their cPanel and generating damaged files out of it.

Quote
what's the point of having the UTF-8 package in the first place if it uses no UTF-8 characters?

Mostly because the language system in SMF has always been a bit... naive.

It is far from impossible to have the database in a different format to what the language files indicate. That's about the only reason for the nature of it being that way - so that it forces the *database connection* to be UTF-8.

Trust me when I say I sympathise with what you're going through. I still need to go back through my language files again for that reason.

Vector’s Shadow · April 07, 2013, 09:35:23 PM

Quote from: Arantor on April 07, 2013, 09:21:12 PM
Quote
all apostrophes could have been replaced with the correct typographical symbol for it instead of the straight one

No, they can't. Firstly, you have to deal with the fact that 99% of the ' characters are for PHP's benefit. Secondly, you have to deal with the fact that there's an awful lot of cases of inline markup, which may or may not use " or ' in it and there's no way to automatically generate the correct code in that situation.

There are also lots of fun issues with people editing files through their cPanel and generating damaged files out of it.

Whoa, are we on the same page here? I'm talking about cases like Who's Online being written as Who’s Online (look closely at the differences in the appearance of the apostrophe). I'm pretty sure that's what UTF-8 is for in the first place. There are legitimate uses of the straight apostrophe as part of PHP, obviously, but I'm not suggesting changes like

Code:

$txt['index'] = 'blah';

to

Code:

$txt[‘index’] = ‘blah’;

That would just throw a syntax error.

Quote from: Arantor on April 07, 2013, 09:21:12 PM
Quote
what's the point of having the UTF-8 package in the first place if it uses no UTF-8 characters?

Mostly because the language system in SMF has always been a bit... naive.

It is far from impossible to have the database in a different format to what the language files indicate. That's about the only reason for the nature of it being that way - so that it forces the *database connection* to be UTF-8.
[...]

Oh, I see. Thanks for that.

Arantor · April 07, 2013, 09:39:25 PM

Quote
Whoa, are we on the same page here? I'm talking about cases like Who's Online being written as Who’s Online (look closely at the differences in the appearance of the apostrophe). I'm pretty sure that's what UTF-8 is for in the first place.

Actually, it's really not, but it's one legitimate use.

Here's the thing, and why we're not entirely on the same page.

$txt['who_online'] = 'Who\'s Online'; (not entirely accurate but illustrative)

How many apostrophes are there?

Or more relevantly:

Code:

$txt['email_variables'] = 'In this message you can use a few &quot;variables&quot;.  Click <a href="' . $scripturl . '?action=helpadmin;help=emailmembers" onclick="return reqWin(this.href);" class="help">here</a> for more information.';

Code:

$txt['maintain_done'] = 'The maintenance task \'%1$s\' was executed successfully.';

Code:

$txt['avatar_max_width_external'] = 'Maximum width of external avatar<div class="smalltext">(0 for no limit)</div>';

Which of these should be using proper typographical symbols? See, this is what's actually in the language files (those just happened to be the first examples I found in Admin.english.php in 2.0.4)

That's why it can't safely be done automatically - which means you need someone to go through and change every single instance MANUALLY. There are quite literally THOUSANDS of entries to do by hand. It's an insanely tedious job.

Oh, and even if you do do it, it won't always work properly between the various non UTF-8 instances going on.

This also doesn't account for all the cases where content will be dynamically pulled into forms and so on and needs additional processing.

Vector’s Shadow · April 07, 2013, 09:59:17 PM

Quote from: Arantor on April 07, 2013, 09:39:25 PM
Actually, it's really not, but it's one legitimate use.

Mea culpa, didn't mean to make such a sweeping statement.

Quote from: Arantor on April 07, 2013, 09:39:25 PM
Here's the thing, and why we're not entirely on the same page.

$txt['who_online'] = 'Who\'s Online'; (not entirely accurate but illustrative)

How many apostrophes are there?

Exactly one. Just use a text editor and replace without regex:
Find: "\'s"
Replace: "’s"

That won't work though because titles are sanitized, so...
Find: "\'s"
Replace: "'s"
and that should work.

Quote from: Arantor on April 07, 2013, 09:39:25 PM
Or more relevantly:
Code:
$txt['email_variables'] = 'In this message you can use a few &quot;variables&quot;.  Click <a href="' . $scripturl . '?action=helpadmin;help=emailmembers" onclick="return reqWin(this.href);" class="help">here</a> for more information.';

This one is way too easy:
With regex (I'm not sure this is the right regex, but it gives you the idea)
Find: "\"(.*)?\""
Replace: "“$1”"

You can go even further and define, for easier localization,

Code (php):

$txt['left_outer_quote'] = '&ldquo;';
$txt['left_inner_quote'] = '&lsquo;';
$txt['right_inner_quote'] = '&rsquo;';
$txt['right_outer_quote'] = '&rdquo;';

and then replace with '.$txt['left_outer_quote'].$1.$txt['right_outer_quote'].'

Quote from: Arantor on April 07, 2013, 09:39:25 PM
Code:
$txt['maintain_done'] = 'The maintenance task \'%1$s\' was executed successfully.';

Code:
$txt['avatar_max_width_external'] = 'Maximum width of external avatar<div class="smalltext">(0 for no limit)</div>';

Almost the same replacement as above, just a different search string; for the first one, both \' should have been double quotes anyhow, and the second one has nothing to replace in it.

So that's if you don't want to do this manually, but localizers need to touch each string anyhow, since they are supposed to translate every string.
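
For reference, a rough sketch of what the double-quote regex suggested above does when run as-is against the `email_variables` string as it appears in Admin.english.php (with `&quot;` entities around "variables"). The value of `$scripturl` here is a placeholder, and the curly quotes are written as UTF-8 byte escapes; this is an illustration, not a tested migration script.

```php
<?php
$scripturl = 'http://example.com/index.php'; // placeholder value for illustration

$text = 'In this message you can use a few &quot;variables&quot;.  Click <a href="' . $scripturl . '?action=helpadmin;help=emailmembers" onclick="return reqWin(this.href);" class="help">here</a> for more information.';

$ldquo = "\xE2\x80\x9C"; // U+201C LEFT DOUBLE QUOTATION MARK
$rdquo = "\xE2\x80\x9D"; // U+201D RIGHT DOUBLE QUOTATION MARK

// The pattern from the post: "(.*)?" with a greedy .*
$result = preg_replace('/"(.*)?"/', $ldquo . '$1' . $rdquo, $text, -1, $count);

echo $result, "\n";
echo 'Replacements made: ', $count, "\n";
```

The greedy match runs from the quote after `href=` to the quote after `help`, so the curly quotes end up inside the `<a>` tag while `&quot;variables&quot;` is left alone. A non-greedy pattern would still hit the attribute quotes, so each match would need checking by hand.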

Arantor · April 07, 2013, 10:05:48 PM

I see where you're going (and believe me, I thought of it before entering this debate) but there are still all kinds of crazy edge cases for something like this. Like content that can and will be pulled by the forum in a non UTF-8 context. (SSI comes to mind) Straight up entity replacement would be OK, but using UTF-8 characters would be a no-no there.

* Arantor should note, he has done localisation, in translating English US to English UK. Even that is a massive enough job and that requires almost no changes.

Quote
and then replace with '.$txt['left_outer_quote'].$1.$txt['right_outer_quote'].'

Off the top of my head I can think of places that's 1) going to miss cases and 2) likely to break something else in the process.

Seriously, please stop trying to find 'easy answers'. I've been at this for a long time and there aren't any - you might if you're lucky get a modest fraction of the way there, but you won't get nearly as far as you think you will. If you're so convinced there are, why not apply them yourself? You could even submit them as a pull request to SMF 2.1 on Github.

I would note, I spent not even 2 minutes finding those examples - that was just in the first file I found.

There are also *numerous* instances in the language files that are actually explicitly commented with 'use numeric entities in the next line(s)'.
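
As a rough illustration of what "use numeric entities" means for a string: the sketch below turns every non-ASCII character of a UTF-8 string into its decimal numeric entity and leaves ASCII (including PHP and HTML quoting) alone. The function name `toNumericEntities` is hypothetical and this is not how SMF itself does it; it is only meant to show the target format.

```php
<?php
// Hypothetical helper: replace each non-ASCII UTF-8 character with &#NNNN;
function toNumericEntities(string $utf8): string
{
    $converted = preg_replace_callback('/[^\x00-\x7F]/u', function (array $m): string {
        // Bytes of one UTF-8 character (2 to 4 bytes), re-indexed from 0.
        $bytes = array_values(unpack('C*', $m[0]));
        $n = count($bytes);

        // The lead byte keeps its low (7 - n) bits; each later byte adds 6 bits.
        $codePoint = $bytes[0] & (0xFF >> ($n + 1));
        for ($i = 1; $i < $n; $i++) {
            $codePoint = ($codePoint << 6) | ($bytes[$i] & 0x3F);
        }

        return '&#' . $codePoint . ';';
    }, $utf8);

    // preg_replace_callback returns null on invalid UTF-8; fall back to the input.
    return $converted ?? $utf8;
}

echo toNumericEntities("Who\xE2\x80\x99s Online"), "\n";
```

Numeric entities stay pure ASCII in the file, so they survive whatever encoding the file, the database connection or an email ends up using, at the cost of being harder to read and edit.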

Vector’s Shadow · April 07, 2013, 10:32:44 PM

There are easy answers? LOL, I didn't know.

I also didn't know this was a debate... I was just trying to help you understand what I was trying to say. Have I really failed at it that badly?

Quote from: Arantor on April 07, 2013, 10:05:48 PM
There are also *numerous* instances in the language files that are actually explicitly commented with 'use numeric entities in the next line(s)'.

Isn't that just for cases where you use XML/RSS and whatnot? HTML has those entities defined in the DTD, and XHTML, in the form in which it's used by the forum software, also seems to have them there. They validate. In addition, it just means to not use non-numeric entities, but UTF-8 characters might, in fact, be OK there.

Quote from: Arantor on April 07, 2013, 10:05:48 PM
Quote
and then replace with '.$txt['left_outer_quote'].$1.$txt['right_outer_quote'].'

Off the top of my head I can think of places that's 1) going to miss cases and 2) likely to break something else in the process.

The same can be said for anything. Not a good reason for someone who wants to try to stop in his tracks. You must check your work, yes. So what? Developers are inherently lazy? Been there, done that.

Note, too, that it was just an example/suggestion and not the final code.

Quote from: Arantor on April 07, 2013, 10:05:48 PM
[...]If you're so convinced there are, why not apply them yourself? You could even submit them as a pull request to SMF 2.1 on Github.

...which is why I asked in the first place. I said,

Quote from: Vector's Shadow on April 07, 2013, 01:58:48 AM
A more practical question: Can I just change the corresponding string in the default language files to en_US.UTF-8 and start using UTF-8 symbols for quotes and dashes (saving with UTF-8 encoding, obviously), or would that not be a kosher way to do this?

I haven't used SSI much and thus don't know of their challenges. I've always considered SSI to be just a toy that's way too easy to break to be of any serious use. Do tell what can't be done with it in terms of character encoding, I'm very interested.

Arantor · April 07, 2013, 10:45:31 PM

Quote
There are easy answers? LOL, I didn't know.

Well, you're looking for them

Quote
Isn't that just for cases where you use XML/RSS and whatnot?

AJAX (which affects quoting of posts), oh and emails.

Quote
In addition, it just means to not use non-numeric entities, but UTF-8 characters might, in fact, be OK there.

No, it's quite specific:

Code:

// Use numeric entities in the below string.

9 instances of that in index.english.php

EmailTemplates.english.php even has this to say:

Code:

// Since all of these strings are being used in emails, numeric entities should be used.

Though it should be noted that the developers who wrote those lines of code have long since disappeared off the radar so it's not even possible to ask them about them.

These are cases where they are explicitly outside of what the DTD will mandate.

Quote
Not a good reason for someone who wants to try to stop in his tracks. You must check your work, yes. So what? Developers are inherently lazy? Been there, done that.

Oh, if you want to, go nuts. Just be prepared for the work involved. I repeat: I am not talking hypothetically about this stuff. I am known around here for working on what is by now almost certainly the *most* modified SMF installation ever. Like this evening, I have been removing the old warning system and replacing it with a new one.

A couple of weeks ago I finished up mostly rewriting the language-items editor inside the admin panel (yes, you can actually edit the language items directly from there if you so choose). I would also note that I spent some time actually removing all non-UTF-8 handling (except from email, where that's actually needed), so what I have is straight UTF-8, can't be anything else. And I still haven't been back through everything to figure out what can be safely made UTF-8 characters and what can't, and what can safely be made non numeric entities and what can't. But there are plenty of places where even entities are inherently problematic.

Quote
I haven't used SSI much and thus don't know of their challenges. I've always considered SSI to be just a toy that's way too easy to break to be of any serious use. Do tell what can't be done with it in terms of character encoding, I'm very interested.

Oh, it's way too easy to break because no-one knows how to actually use it properly.

In the case of using it though, in virtually every case, it's dumping content into a page that is not using the forum wrapper - which means the content type meta tag isn't introduced, which means it's dumping UTF-8 (or whatever it's defined to use) and has absolutely no idea if that's correct or not. More importantly, it just assumes it is.

Now, if you're using it on an external page, that's fine provided you ensure the external page has the correct encoding - or you use the forum wrapper.

As for being a toy, this very site uses it a surprising amount. IIRC the mod site is practically based on it, but it's been years since I looked at the code there, it's possible it uses the other site integration tools. The front page on this site definitely uses SSI though, as does everything that isn't the forum on http://simpledesk.net/ (I know about that one because I helped write it, like the documentation area code that is ugly but still uses SSI)
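
A small sketch of the mismatch described above, assuming the forum emits UTF-8 and the external page is read as ISO-8859-1. The helper `readAsLatin1` is hypothetical; it re-encodes each byte as if it were a separate ISO-8859-1 character, which is roughly what a browser does when the page's declared charset is wrong.

```php
<?php
// Hypothetical helper: treat every byte as one ISO-8859-1 character and
// return the result as UTF-8, to show what a mis-declared page would display.
function readAsLatin1(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $code = ord($byte);
        $out .= $code < 0x80
            ? $byte                                                   // ASCII passes through
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F)); // U+0080..U+00FF as 2 bytes
    }
    return $out;
}

$title = "Who\xE2\x80\x99s Online"; // UTF-8, with U+2019 as the apostrophe
$seen  = readAsLatin1($title);

printf(
    "%d character(s) intended, %d displayed\n",
    preg_match_all('/./su', $title),
    preg_match_all('/./su', $seen)
);
```

The usual remedy on the external page is to declare the charset the forum uses before any output, for example with PHP's `header('Content-Type: text/html; charset=UTF-8');`, or to use the forum wrapper as mentioned above.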

Vector’s Shadow · April 07, 2013, 11:05:31 PM

Quote from: Arantor on April 07, 2013, 10:45:31 PM
Quote
There are easy answers? LOL, I didn't know.

Well, you're looking for them

If that's how it comes across, well, I guess I need to work on my presentation.

Quote from: Arantor on April 07, 2013, 10:45:31 PM
Quote
In addition, it just means to not use non-numeric entities, but UTF-8 characters might, in fact, be OK there.

No, it's quite specific:

Code:
// Use numeric entities in the below string.

9 instances of that in index.english.php

EmailTemplates.english.php even has this to say:
Code:
// Since all of these strings are being used in emails, numeric entities should be used.

Though it should be noted that the developers who wrote those lines of code have long since disappeared off the radar so it's not even possible to ask them about them.

Are you saying it's that hard to deduce by reading the code? It's not like it's compiled...

Bottom line.

What I don't understand is why, at least according to you, it is such a problem to use some UTF-8 characters in an English pack while every other language has no such issue with them. After all, translating into, say, Chinese, you can't not use special characters, and, chances are, you will be using a UTF-8 pack. So why is it so problematic with the English language pack? In terms of the way things get programmed, what stops SMF from using UTF-8 strings everywhere and converting them to ASCII or whatever on the fly for, say, e-mail generation? And what if the e-mail is also in Chinese? I'm afraid I have more questions than answers after this.

Arantor · April 07, 2013, 11:11:27 PM

Quote
Are you saying it's that hard to deduce by reading the code? It's not like it's compiled...

There is a lot of code. Finding some of the references is almost by luck at times. Excluding the language entries, you're still in need of wading through something like 100,000 lines of code to track down where things are used. And it's not always obvious where one effect will be applied elsewhere in the code.

Quote
What I don't understand is why, at least according to you, it is such a problem to use some UTF-8 characters in an English pack while every other language has no such issue with them.

You didn't even answer the earliest question: is your database in UTF-8?

Quote
After all, translating into, say, Chinese, you can't not use special characters

Actually, you can do it all with numeric entities. The UTF-8 language packs don't do that, and they have been demonstrated to be buggy.

Bottom line: if you're so convinced it should be easy, go do it. I tried it, it didn't end that well, but what do I know?

EDIT: It really doesn't help that two separate issues have been thoroughly conflated beyond all separation.

Vector’s Shadow · April 07, 2013, 11:23:26 PM

Are you looking for a fight? I never said I doubted your prowess. So don't make it look like I did. Take it easy.

But I am not without experience myself, just perhaps not enough with this particular piece of software. So I'm trying to see what's different about it. And you make it sound like it's an impossible mission.

I did answer the question. Read:

Quote from: Vector's Shadow on April 07, 2013, 04:59:13 PM
Quote from: emanuele on April 07, 2013, 06:50:29 AM
When you installed SMF did you select the "UTF8" checkbox?

Yes, I did. But I'm talking about the files found in smf_2-0-4_english-utf8.zip, so that shouldn't make a difference, as I understand it.

More explicitly: My database is UTF-8. That's irrelevant. The question is about files and has absolutely nothing to do with the database.

Vector’s Shadow · April 07, 2013, 11:34:51 PM

Quote from: emanuele on April 07, 2013, 06:50:29 AM
Quote from: Vector's Shadow on April 07, 2013, 01:58:48 AM
@This Forum admins: Also, it's really annoying to see that when my users will (nobody on the forum yet) try to use Help in the language other than English, they will get a rather nasty empty page that tells them exactly nothing useful for many topics. I could create reasonably high-quality translations for some of those pages, especially shorter ones, and could make an attempt to make a small number of existing translations less "techy" (since they are, after all, for the users, but are written as if for veteran users/admins).
It's a wiki. As soon as you reach 10 messages here on the forum (an anti-spam measure, sorry, though I think it can be bypassed), you will be able to edit those pages. Well, at least some of them; it may be necessary to add you to a group, but that wouldn't be a big issue, just ask.

Sorry for double-posting, really should've made two different topics, but...

Quote
You do not have permission to edit this page, for the following reason:

You do not have permission to edit pages in the Translations namespace.

Looks like I will need to be added.

Arantor · April 07, 2013, 11:51:53 PM

Quote
More explicitly: My database is UTF-8. That's irrelevant. The question is about files and has absolutely nothing to do with the database.

Not to your original question, it wasn't.

Quote
Are you looking for a fight? I never said I doubted your prowess. So don't make it look like I did. Take it easy.

Well, you're telling me how easy it all is. So I'm saying to you that you should try it. Some of it will work, some of it not so much. And the bits that don't work properly are going to require tracing through the code to see what should have been used.

It's not an impossible mission, but it's far more complex than you seem to believe. I'm trying to warn you off spending a vast amount of time on what is, in the scheme of things, a minor aesthetic change.
