RSS encoding issue

Started by AliG, December 12, 2018, 07:36:21 AM


AliG

Hi,
there is an issue with RSS. When I use the BEL character (Ctrl-G on the command line, 0x07) in a post, it breaks the RSS feed. Shouldn't it be encoded somehow?

This is the char. Now, if you try to open the RSS feed, it will report an encoding error, and you won't be able to use RSS for some time.

Arantor

Why are you using the BEL character in the first place?

AliG

It happens when you post a batch file that contains it inside a CODE block ...

Arantor

Well... it's legal UTF-8 as evidenced by how it survives in the post itself.

It's definitely an edge case in RSS feeds: the character isn't allowed in XML 1.0, and although XML 1.1 permits it, its use there is highly discouraged.

The problem is... what to actually encode it as in that situation? I could conceivably see it becoming a numeric entity but there's no guarantee it would be handled correctly by RSS feed parsers.

Is it a bug? I think so, but it's also one of those horrendous corner cases that wouldn't normally ever come up (I didn't think anyone emitted literal BELs in batch files as there should be better ways to express that), and if I'm truly honest part of me thinks the correct fix is to actually break your use case anyway. I'd argue that the character should be converted at post save time to U+2407 which is the Unicode glyph that actually has visible content, which is kind of important in a forum, but sucks for your use case.

The alternative is to simply attach the file rather than copy paste it into a code block.
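
For illustration, the U+2407 conversion suggested above could be sketched roughly like this. It is in Python rather than SMF's PHP, and `visualize_controls` is a made-up name, not anything in SMF. It relies on the Unicode "Control Pictures" block, where U+2400 plus the control code gives the visible glyph, so BEL (0x07) becomes U+2407.

```python
# Hypothetical helper, not part of SMF: replace C0 control characters
# with their visible "Control Pictures" glyphs (U+2400 + code point).
KEEP = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return are left alone

# Translation table for str.translate: code point -> replacement code point.
CONTROL_PICTURES = {cp: 0x2400 + cp for cp in range(0x20) if cp not in KEEP}


def visualize_controls(text: str) -> str:
    """Return text with disallowed C0 controls swapped for visible glyphs."""
    return text.translate(CONTROL_PICTURES)


if __name__ == "__main__":
    # The BEL in this batch-style line becomes the visible U+2407 glyph.
    print(visualize_controls("echo \x07"))
```

As noted above, this makes the post display something sensible but changes the bytes, so a batch file copied back out of the post would no longer ring the bell.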

AliG

Can it be encoded like &#7; or &#x07; ?

Arantor

That was what I meant about numeric entity form, but there's no guarantee it would be handled correctly anyway. Especially as the core content is emitted as CDATA fields where entity form is "supposed" to not be used. So, no, probably not.

I still think the correct solution here is to attach the batch file rather than trying to fix a problem that ultimately is more specification level than anything else.

AliG

I understand your point but usually people put some short snippets into the CODE section.
It is a really rare situation. It can be marked as solved if there is no simple solution for it.

Arantor

True, but they don't usually put BELs in there ;)

It's ultimately not my call to make, though.

Sesquipedalian

According to the W3C spec, the only low ASCII control characters allowed in XML 1.0 are tab, line feed, and carriage return. Apparently this is true even when the characters are represented as entities (e.g. as &#7; for the BELL character). All the control characters except NULL can be used in XML 1.1, but virtually no one uses XML 1.1 and the RSS spec requires XML 1.0. So there is simply no way to include a BEL character in an RSS feed.

The only improvement we could make in SMF would be to strip out the disallowed characters when generating the feed.
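
A rough sketch of what that stripping could look like, in Python rather than SMF's PHP. The function names and the `<description>` wrapper are assumptions for illustration, not SMF's actual feed code. The character class is the complement of the `Char` production in the XML 1.0 spec, and the standard-library parser is used only to show that the feed item parses once the BEL is gone.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML10_DISALLOWED = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml10_disallowed(text: str) -> str:
    """Drop literal characters that XML 1.0 does not allow (hypothetical helper)."""
    return _XML10_DISALLOWED.sub("", text)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library XML 1.0 parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


post_body = "echo \x07 beep"  # a post containing a literal BEL
broken_item = f"<description><![CDATA[{post_body}]]></description>"
fixed_item = (
    f"<description><![CDATA[{strip_xml10_disallowed(post_body)}]]></description>"
)

if __name__ == "__main__":
    print(is_well_formed(broken_item))  # the literal BEL is rejected
    print(is_well_formed(fixed_item))   # parses once the BEL is stripped
```

This only removes literal characters. A numeric reference such as `&#7;` outside a CDATA section would still be rejected by an XML 1.0 parser and would need separate handling.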
