Simple Machines Community Forum

SMF Development => Bug Reports => Topic started by: AliG on December 12, 2018, 07:36:21 AM

Title: RSS encoding issue
Post by: AliG on December 12, 2018, 07:36:21 AM
Hi,
there is an issue with RSS. When I use the BEL character (Ctrl-G on the command line, 0x07) in a post, it breaks the RSS feed. Shouldn't it be encoded somehow?

This is the char. Now, if you try to open the RSS feed, it will report an encoding issue and you won't be able to use RSS for some time.
Title: Re: RSS encoding issue
Post by: Arantor on December 12, 2018, 07:38:11 AM
Why are you using the BEL character in the first place?
Title: Re: RSS encoding issue
Post by: AliG on December 12, 2018, 07:39:42 AM
When you post a batch file that contains it in a CODE block ...
Title: Re: RSS encoding issue
Post by: Arantor on December 12, 2018, 08:10:15 AM
Well... it's legal UTF-8 as evidenced by how it survives in the post itself.

It's definitely an edge case in RSS feeds, as it isn't supported in XML 1.0 though it is in XML 1.1 but listed as highly discouraged.

The problem is... what to actually encode it as in that situation? I could conceivably see it becoming a numeric entity but there's no guarantee it would be handled correctly by RSS feed parsers.

Is it a bug? I think so, but it's also one of those horrendous corner cases that wouldn't normally ever come up (I didn't think anyone emitted literal BELs in batch files as there should be better ways to express that), and if I'm truly honest part of me thinks the correct fix is to actually break your use case anyway. I'd argue that the character should be converted at post save time to U+2407 which is the Unicode glyph that actually has visible content, which is kind of important in a forum, but sucks for your use case.

The alternative is to simply attach the file rather than copy paste it into a code block.
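
The U+2407 idea above can be sketched roughly as follows. This is a minimal illustration in Python, not SMF code: the helper name `show_controls` and the choice to leave tab, line feed, and carriage return untouched are assumptions made for the example. It relies on the fact that the Unicode Control Pictures block (U+2400 to U+241F) mirrors the C0 control characters one for one, so BEL (0x07) maps to U+2407.

```python
# Hypothetical helper: none of these names come from SMF itself.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return

# Map every other C0 control character (U+0000..U+001F) to its visible
# counterpart in the Control Pictures block (U+2400..U+241F).
CONTROL_PICTURES = {
    cp: 0x2400 + cp for cp in range(0x20) if cp not in ALLOWED_CONTROLS
}


def show_controls(text: str) -> str:
    """Replace invisible control characters with visible Unicode glyphs."""
    return text.translate(CONTROL_PICTURES)
```

With this, `show_controls("echo \x07")` yields `"echo \u2407"`: the post stays readable and feed-safe, but a reader can no longer copy a working BEL out of it.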
Title: Re: RSS encoding issue
Post by: AliG on December 12, 2018, 08:16:41 AM
Can it be encoded like &#7; or &#x07; ?
Title: Re: RSS encoding issue
Post by: Arantor on December 12, 2018, 08:22:15 AM
That was what I meant about numeric entity form, but there's no guarantee it would be handled correctly anyway. Especially as the core content is emitted as CDATA fields where entity form is "supposed" to not be used. So, no, probably not.

I still think the correct solution here is to attach the batch file rather than trying to fix a problem that ultimately is more specification level than anything else.
Title: Re: RSS encoding issue
Post by: AliG on December 12, 2018, 11:13:23 AM
I understand your point but usually people put some short snippets into the CODE section.
It is a really rare situation. It can be marked as solved if there is no simple solution for it.
Title: Re: RSS encoding issue
Post by: Arantor on December 12, 2018, 11:30:06 AM
True, but they don't usually put BELs in there ;)

It's ultimately not my call to make, though.
Title: Re: RSS encoding issue
Post by: Sesquipedalian on December 19, 2018, 02:09:18 AM
According to the W3C spec (https://www.w3.org/TR/xml/#NT-Char), the only low ASCII control characters allowed in XML 1.0 are tab, line feed, and carriage return. Apparently this is true even when the characters are represented as entities (e.g. as &#7; for the BELL character). All the control characters except NULL can be used in XML 1.1, but virtually no one uses XML 1.1 and the RSS spec requires XML 1.0. So there is simply no way to include a BEL character in an RSS feed.
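
A quick way to see this rule in action is to feed a few variants to an XML 1.0 parser. The sketch below uses Python's standard `xml.etree.ElementTree` (which is backed by expat); the helper `is_well_formed` and the sample strings are made up for illustration, and other feed readers may well fail in different ways.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the XML 1.0 parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


literal = "<title>beep \x07</title>"               # raw BEL character
entity = "<title>beep &#7;</title>"                # BEL as a numeric reference
cdata = "<title><![CDATA[beep \x07]]></title>"     # raw BEL inside CDATA
tab = "<title>beep &#9;</title>"                   # tab is an allowed control

for name, doc in [("literal", literal), ("entity", entity),
                  ("cdata", cdata), ("tab", tab)]:
    print(name, is_well_formed(doc))
```

Only the tab variant parses; the literal BEL, the numeric reference, and the CDATA-wrapped BEL are all rejected as not well-formed.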

The only improvement we could make in SMF would be to strip out the disallowed characters when generating the feed.
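
As a rough sketch of what such stripping could look like (in Python for illustration, not SMF's actual code; the function name `strip_xml10_disallowed` is hypothetical), the character class below is the complement of the `Char` production in the XML 1.0 spec linked above: tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF are kept, and everything else is dropped.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML10_DISALLOWED = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml10_disallowed(text: str) -> str:
    """Remove characters that cannot appear in an XML 1.0 document."""
    return _XML10_DISALLOWED.sub("", text)
```

Applied to feed content just before output, `strip_xml10_disallowed("beep\x07!")` gives `"beep!"`, while ordinary text, tabs, and line breaks pass through unchanged. The stored post itself is not modified, only what goes into the feed.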