Feed descriptions can truncate in the middle of a UTF-8 character. [2.0.15]

Started by ajv, October 22, 2018, 04:32:53 PM


I've encountered a weird issue where, under possibly-rare circumstances, SMF RSS feeds can truncate their item-description fields in the middle of a UTF-8 character. The relevant part of the feed looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="0.92" xml:lang="en-US.utf8">
                <title>Example Forum</title>
                <description><![CDATA[Live information from Example Forum]]></description>
                        <title>Example Title</title>
                        <description><![CDATA[first part of post text...]]></description>    <---------- THIS PART HERE
                        <author>[email protected]</author>
                        ....and so on...

The "first part of post text" section above is what I'm talking about. As far as I can tell, it's a truncated version of the post's body. I had a case earlier today where the truncation happened in the middle of a right single quotation mark (U+2019, hex E28099), a three-byte character. The truncation happened after the first byte, so the description field ended with an E2 byte, which is not valid UTF-8. I can't find the culprit in the code, but I'm guessing that whatever performs that truncation does so after N bytes instead of N characters, and calculates N wrong.
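To illustrate the byte-versus-character guess, here is a minimal Python sketch. It is not SMF's code (SMF is PHP, and I haven't found the real truncation routine); the function names and the tiny byte limits are hypothetical. It shows how cutting the encoded bytes at a fixed length can leave a lone E2 byte, and one way to back off to a character boundary instead.

```python
RIGHT_SINGLE_QUOTE = "\u2019"  # encodes to the three bytes E2 80 99


def truncate_bytes_naive(text: str, limit: int) -> bytes:
    """Cut the UTF-8 encoding after `limit` bytes, ignoring character boundaries."""
    return text.encode("utf-8")[:limit]


def truncate_bytes_safe(text: str, limit: int) -> bytes:
    """Cut after at most `limit` bytes, backing up so no character is split."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return data
    cut = limit
    # Continuation bytes have the bit pattern 10xxxxxx. If the byte just past
    # the cut is one, the cut is inside a character, so step back to its start.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One ASCII character followed by many right single quotes.
body = "a" + RIGHT_SINGLE_QUOTE * 200
naive = truncate_bytes_naive(body, 2)  # ends one byte into the first quote
safe = truncate_bytes_safe(body, 2)    # drops the partial quote entirely
```

With a limit of 2, the naive cut keeps the "a" plus the leading E2 byte, which fails validation, while the safe cut keeps only the "a".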

I noticed the problem because Firefox's Live Bookmarks don't handle it gracefully and refuse to load the feed. To reproduce, create a post consisting of one ASCII character followed by a couple hundred UTF-8 right single quotes; then curl the feed (which will show a garbage character) or try to load it in a Live Bookmark (which will fail).
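For anyone who wants to script the reproduction, here is a rough Python sketch. The helper names are hypothetical, and the sample bytes at the bottom are a made-up stand-in for a feed fragment, not real SMF output. One helper builds the post body described above; the other reports where a byte string stops being valid UTF-8.

```python
from typing import Optional


def make_repro_body(quote_count: int = 200) -> str:
    """One ASCII character followed by `quote_count` right single quotes (U+2019)."""
    return "a" + "\u2019" * quote_count


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset where UTF-8 decoding first fails, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Simulated description payloads (assumed shapes, for illustration only):
# one cut a single byte into a right single quote, one left intact.
broken = b"<![CDATA[a\xe2]]>"
intact = "<![CDATA[a\u2019]]>".encode("utf-8")

broken_offset = first_invalid_utf8_offset(broken)
intact_offset = first_invalid_utf8_offset(intact)
```

To check a real feed, save the curl output to a file, read it in binary mode, and pass the bytes to first_invalid_utf8_offset; a non-None result points at the stray byte.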

Presumably any multi-byte UTF-8 character in a position that crosses the truncation line will trigger the bug; the above is just the first reliable string I found.