iPhone smileys (emojis) cut off posts

Started by hebron, July 30, 2014, 05:41:38 AM


jonesH

The truncation happens all the same on my setup and posting from an iPad (so it's the iOS rather than the iPhone alone). The lost text seems to include the space preceding the first emoticon encountered.

Arantor

OK, so I'll do some analysis on this later tonight when I'm back at home, and I'll pass the details to the team to incorporate into the next patch since this strikes me as a fairly important detail.

I would not blindly throw things like HTML Purifier at it since it neuters other things that you might legitimately want to add as an admin.

JBlaze

There's a nice conversion table located here that could help.
Jason Clemons
Former Team Member 2009 - 2012

Arantor

I don't get why those are futzing around with it though, they're not high enough to be pushing UTF-8mb4 as far as I can see.

Need to sit and watch this one with a packet sniffer to get actual traffic and a bytesafe catcher during post processing on the server to see what is going on - but I'm not at home and don't have any iDevices with me.

JBlaze

I'm not sure why SMF is truncating the emoji characters and anything after it, but I do know that iOS uses a specific font that is pre-installed on the device to display and utilize the emojis.

Once the issue with the message being truncated is fixed, in order to display them properly, a third-party font must be used. There are a few that can be found on GitHub.

Arantor

It's truncating it because it doesn't like it. It smells like an encoding issue since SMF is jittery around invalid codes for crap like that, which is why I want to apply a packet sniffer and watch what physical bytes are sent and exactly why SMF is choking on it.

As for display, that's a separate ball game since for iOS users it'll already be legitimate. Right now I'm simply interested in why it's choking.

hebron

It would be nice to display them correctly, but the most pressing issue is not cutting off the posts. I've had a few users who have had their whole introduction post lost because they started it with "Hi (emoji)". Not a good first impression to give to new users.

As for this bug/issue, should I file it in the bug system?

Arantor

Don't worry, I got this one. I'll report it on the official tracker once I have more information.

hebron

Quote from: ‽ on July 30, 2014, 05:57:30 PM
Don't worry, I got this one. I'll report it on the official tracker once I have more information.
Thank you all for your attention on this :)

hebron


Arantor

Sorry, I forgot to mention about this.

I worked out what it was. Fixing it is, however, a much larger problem.

Technical explanation: most normal UTF-8 stuff we deal with is 3 bytes or fewer per character, whereas the iPhone emoji are 4 bytes. MySQL's standard utf8 character set doesn't cover 4-byte encoding, so we have to specifically use a thing called utf8mb4.
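To make the byte counts concrete, here is a minimal Python sketch (not SMF code; the function names are made up for illustration). It shows that characters up to U+FFFF need at most 3 bytes in UTF-8, while emoji from the supplementary planes need 4.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF, i.e. needs a 4-byte UTF-8 sequence."""
    return any(ord(ch) > 0xFFFF for ch in text)


# Illustrative samples: ASCII, Latin-1 supplement, a BMP symbol, and an emoji.
samples = {
    "letter A": "A",
    "e-acute": "\u00e9",
    "euro sign": "\u20ac",
    "grinning face": "\U0001F600",
}

for name, ch in samples.items():
    print(f"{name}: U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```

A check like `needs_utf8mb4` is one way to spot which posts would be affected before they reach the database.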

Now, if the whole database were converted to utf8mb4 everywhere, and the SET NAMES query updated to use utf8mb4, there's a reasonable chance it would work as expected.
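As a rough sketch of what that conversion involves, the Python helper below only builds the SQL statements one might run; it does not connect to anything. The database name, the table list and the `utf8mb4_unicode_ci` collation are assumptions for illustration, and a real SMF install would need every table covered and tested on a copy first.

```python
from typing import List


def utf8mb4_statements(database: str, tables: List[str],
                       collation: str = "utf8mb4_unicode_ci") -> List[str]:
    """Build (but do not execute) the statements for a utf8mb4 conversion."""
    # Change the database default so newly created tables inherit utf8mb4.
    stmts = [
        f"ALTER DATABASE `{database}` CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    # Convert each existing table, including its text columns.
    for table in tables:
        stmts.append(
            f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    # The connection charset is per-connection, so the application has to
    # send this every time it connects.
    stmts.append("SET NAMES utf8mb4;")
    return stmts


# Hypothetical database and table names, used purely as an example.
for stmt in utf8mb4_statements("forum_db", ["smf_messages", "smf_topics"]):
    print(stmt)
```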

Part of me is wondering about converting it to an entity form before saving it to the database (to avoid all of the above), but that would fall foul of other checks and balances inside SMF.
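For illustration only, the entity idea could look something like this in Python. This is not how SMF does it, and `astral_to_entities` is a hypothetical name. Each character above U+FFFF is replaced with a decimal HTML entity, so only characters of 3 bytes or fewer reach the database.

```python
def astral_to_entities(text: str) -> str:
    """Replace every character above U+FFFF with a decimal HTML numeric entity."""
    return "".join(
        f"&#{ord(ch)};" if ord(ch) > 0xFFFF else ch
        for ch in text
    )


message = "Hi \U0001F600!"
print(astral_to_entities(message))
```

The trade-off is that anything measuring or searching the stored text now sees the entity rather than the character, which is the kind of thing other checks could trip over.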

hebron

Thank you for the explanation :) I did have some luck using an HTML-purifier, however I did not manage to implement it correctly. You mentioned that it might also clean out some stuff that the admin wanted to insert, which is a very good point. What are your thoughts about writing an "SMF-compliant" "cleaner" to at least stop the posts from getting cut?

I've tried sending these symbols to my outlook.com mailbox; it does not display the symbol correctly, however it does not cut the post. It instead shows an unfilled black square. Same symbol I saw on my old Blackberry whenever someone would send me some of these emoticons.

Arantor

There's already a cleaner involved in the bowels of SMF, you do not need to implement your own cleaner to validate HTML or anything. In any case if some kind of sanitiser is fixing this, I'd be tempted to call it a bug because it shouldn't be pruning the characters.

MySQL only does it because 4-byte UTF-8 characters are not supported in its standard utf8 character set encoding or collation.

An unfilled black square indicates a valid character with no glyph available to render it; the client has correctly recognised the character, as it should.
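A small Python simulation of the symptom, under the assumption that the server is running in non-strict mode and silently drops everything from the first 4-byte character onwards when storing into a 3-byte utf8 column. This imitates the observed result only; it is not MySQL's actual code, and it does not model the extra loss of the preceding space reported earlier in the thread.

```python
def simulate_utf8mb3_truncation(text: str) -> str:
    """Cut the text at the first character that needs a 4-byte UTF-8 sequence."""
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            # Everything from the first supplementary-plane character is lost.
            return text[:index]
    return text


post = "Hi \U0001F600 I'm new here, nice to meet you all."
print(repr(simulate_utf8mb3_truncation(post)))
```

This matches the "Hi (emoji)" introduction posts that ended up almost empty.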

hebron

I read a couple of posts on MySQL and utf8mb4, I think I'm going to do some experimentation on my virtual test setup. I found an article explaining the conversion from utf8 to utf8mb4, and it seems like quite a tedious task. http://mathiasbynens.be/notes/mysql-utf8mb4

I'll set up a test bed with SMF, MySQL (or MariaDB as I am currently using) with utf8mb4 and do some testing :)

Arantor

You will need to modify Load.php as a minimum to update the SET NAMES query. Everything else... off hand no idea.

hebron

Quote from: ‽ on August 05, 2014, 05:54:33 AM
You will need to modify Load.php as a minimum to update the SET NAMES query. Everything else... off hand no idea.
Thanks for the hint, I rather enjoy digging in PHP code :) So it will be a fun thing to play with after the wife and kids have gone to bed and I've grabbed myself a beer. Let you know how it goes when I have found the time for it :)

Arantor

Have fun :) The big problem is that there are all kinds of edge cases with the UTF-8 handling that people may not remember or be aware of. Let me know how you get on :)

filmstarr

If it helps I've experienced the exact same issue, but only on my SMF 2.0.8 test installation. My (albeit heavily modified) SMF 2.0 RC3 installation works absolutely fine with iPhone emoji :-\

Arantor

Interesting, since that code has not changed since then... I do know a future-proof patch has been discussed by developers however.

Meanwhile there are *dozens* of security holes that have been fixed since RC3. You really should investigate upgrading.

filmstarr

Quote from: Arantor on September 03, 2014, 06:16:04 PM
Meanwhile there are *dozens* of security holes that have been fixed since RC3. You really should investigate upgrading.
Thanks, yes I've been pulling in a number of the changes, but the code in my site is so heavily modified that it would take me a few days to sort it all out. It's non-public facing though, so less of a concern! :)
