JS preview/final post length check mismatch; Msg HTML truncated after ~64KiB?

Started by mermshaus, May 09, 2013, 05:19:25 PM


mermshaus

Hi.

Sorry, this is a rather uninformed post. I don't have the motivation right now to investigate in-depth.

I encountered both issues as a user on a 2.0.1 installation.


First issue: If code with \n line endings is pasted into the "new message" textarea, the JS-based preview function seems to count a line ending as 1 byte, whereas the server-side code that is executed when the message is finally posted seems to count a line ending as 2 bytes. The reason: many browsers (tested with Firefox and Chromium) convert all line endings to \r\n while generating the POST request. This behaviour might be explained in, e.g., RFC 2616 (ietf.org/rfc/rfc2616.txt), section 3.7.1:

Quote: When in canonical form, media subtypes of the "text" type use CRLF as
the text line break. HTTP relaxes this requirement and allows the
transport of text media with plain CR or LF alone representing a line
break when it is done consistently for an entire entity-body.

This means that it is possible to generate a JS preview of a message with \n line endings that stays within the per message byte limit (no error notification) whereas the very same message will be rejected by the server on submission because the \n → \r\n conversion done by the browser leads to an increased byte length that is no longer smaller than the allowed maximum length.
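The mismatch can be illustrated with a minimal sketch. It assumes that the browser rewrites every line break to \r\n on submission (as observed above) and that the limit is counted in UTF-8 bytes. The helper names and the 19-byte limit are made up for illustration; none of this is SMF code.

```python
def submitted_form_value(text: str) -> str:
    """Rewrite every line break as CRLF, as the tested browsers appear to do."""
    # Collapse existing CRLF and lone CR to LF first so nothing gets doubled.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.replace("\n", "\r\n")


def byte_length(text: str) -> int:
    """Length in bytes, assuming UTF-8 encoding."""
    return len(text.encode("utf-8"))


limit = 19  # hypothetical per-message byte limit
draft = "line one\nline two\n"  # textarea content with \n line endings

preview_length = byte_length(draft)
posted_length = byte_length(submitted_form_value(draft))

print("preview:", preview_length, "bytes, within limit:", preview_length <= limit)
print("posted: ", posted_length, "bytes, within limit:", posted_length <= limit)
```

Each line break adds one byte on submission, so a message with about 1,000 lines grows by roughly 1,000 bytes between preview and post.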


Second issue: Is it correct that SMF 2.0.1 saves a rendered version (HTML) of a message's BBCode in a database field (?) with a maximum length of 64 KiB? I skimmed for about half an hour through the code on GitHub (latest version, however) but haven't been able to verify this. At least, there doesn't seem to be a field for rendered HTML code in the messages table.

I tried to submit a BBCode-heavy 50 KB message on a forum with a 50 KB per message limit. The message was accepted by the system but the rendered HTML output of the message was truncated after about three quarters of the content. Display stopped inside a [tt] environment with a "&" character being the last one. That character probably indicates an HTML entity (e. g. &nbsp;) cut in half. (Sorry, I forgot to check the HTML code of the page to be certain.)

To summarize: I assume that it is possible to write a message with heavy BBCode formatting whose HTML representation exceeds some kind of size limit although the BBCode representation does not. (Simple example to illustrate: [b]Hello world![/b] is shorter than <strong>Hello world!</strong>.)


I am pretty sure that both issues are known (and possibly even fixed). Any thoughts, tips and pointers are very much appreciated.

Thanks!

Marc

Arantor

Quote: Many browsers (tested with Firefox and Chromium) will convert all line endings to \r\n while generating the POST request.

This is primarily operating system, not browser, dependent - CRLF is the line ending for Windows, LF is the line ending for Linux and CR is the old line ending for Mac, so HTTP relaxing the requirement is a cross-platform acceptance.

Quote: Second issue: Is it correct that SMF 2.0.1 saves a rendered version (HTML) of a message's BBCode in a database field (?) with a maximum length of 64 KiB?

It saves it in almost pure bbcode form. It's complicated. When saving, linebreaks are effectively stripped and converted to br tags. I forget exactly why this is done, though. (Also, in the following when I use KB, I mean what you call KiB. The database doesn't care about power of 10 affectations, it strictly works on power of 2 boundaries and the specifics of the boundaries are indeed around that)

But yes, the maximum length by default is 64KB. If you specify a limit larger than 64KB in the admin panel, it should expand the database table to a much larger size. The theoretical limit is 16384KB; the practical limit without specific configuration is typically about 950KB. That practical limit comes from the size of one query packet, which is 1024KB including the whole query and its escaping characters; if few escapes are required, the limit edges up towards 1024KB.

Quote: To summarize: I assume that it is possible to write a message with heavy BBCode formatting whose HTML representation exceeds some kind of size limit although the BBCode representation does not.

Correct.

Quote: I am pretty sure that both issues are known (and possibly even fixed). Any thoughts, tips and pointers are very much appreciated.

Both are known, neither is fixed in 2.0.x, and it is doubtful they are fixed in 2.1, as fixing them has other consequences.

Notably if you expand the size to the next step up, certain options for searching become unavailable to you.

The only reason you're noticing it is because you're high enough up that you're making messages that will overflow occasionally, without being big enough to overflow all the time.

Quickest fix will be to go to Admin > Forum > Posts and Topics > Post Settings and adjust the maximum size of the field to beyond 65536, which will cause the table to be restructured. Note that if you use a fulltext search index, it should not permit the change (because fulltext is not supported on mediumtext columns, only on text columns).

Any further questions, please do ask :)
Holder of controversial views, all of which my own.


mermshaus

That is a good answer, thanks a lot. :)

Quote: This is primarily operating system, not browser, dependent - CRLF is the line ending for Windows, LF is the line ending for Linux and CR is the old line ending for Mac, so HTTP relaxing the requirement is a cross-platform acceptance.

That's what I would have thought, too. But based on the fact that I am using a Linux system, I'd go with the \r\n conversion being a platform-agnostic decision made by browser developers. Unfortunately, I find it rather difficult to search for these kinds of topics as "\n" or "\r\n" or "firefox" and the like are disadvantageous or at least very unspecific search terms. The protocol itself should allow all three kinds of line endings (in "canonical form" apparently not mixed, though) but at least some browser implementations seem to prefer \r\n for some reason.

Quote: It saves it in almost pure bbcode form. It's complicated. When saving, linebreaks are effectively stripped and converted to br tags.

I assume there are more actions involved because replacing linebreaks by br tags wouldn't have increased the size of my 50 KB (any definition ;)) message (about 1,000 lines) by more than 4-8 KB. But I start to understand the specifics of the architecture and why these issues arise. Tough problem.

Just a thought: It is, of course, possible to determine whether a planned INSERT statement would be truncated by the DBMS. So you could display an error message to the user instead of silently discarding parts of the content/losing data. Sorry, I am certainly not the first one to think of that and I only know 2.0.1. ;) The content of the error message would be interesting: "Although your message technically fits our size limitations, it doesn't fit our technical implementation."
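A minimal sketch of such a check, under the assumption that the application already knows the column capacity (for example from its configuration) and that the message is stored as UTF-8. The names `ensure_fits` and `MessageTooLongError` are hypothetical and not part of SMF; 65,535 bytes is the capacity of a MySQL TEXT column.

```python
TEXT_COLUMN_MAX_BYTES = 65_535  # capacity of a MySQL TEXT column (2**16 - 1)


class MessageTooLongError(ValueError):
    """Raised when the stored form of a message would not fit the column."""


def ensure_fits(stored_body: str, max_bytes: int = TEXT_COLUMN_MAX_BYTES) -> bytes:
    """Return the encoded body, or raise instead of letting the DBMS truncate it."""
    encoded = stored_body.encode("utf-8")
    if len(encoded) > max_bytes:
        raise MessageTooLongError(
            f"stored form needs {len(encoded)} bytes, column holds {max_bytes}"
        )
    return encoded


try:
    # 32,768 two-byte characters: short in characters, too long in bytes.
    ensure_fits("ü" * 32_768)
except MessageTooLongError as error:
    print("rejected:", error)
```

The check has to run on the text exactly as it will be stored (after any line break and entity conversion), not on the text the user typed.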

Quote: [Explanations regarding practical limits]

Interesting. I guess the JS preview could be disabled for messages of such huge sizes or you could split the payload into multiple packages. Well, such considerations are more or less academic anyway. I know. :)

Quote: Notably if you expand the size to the next step up, certain options for searching become unavailable to you.

I understand. That is probably not an option.

Arantor

Quote: but at least some browser implementations seem to prefer \r\n for some reason.

Probably because for some years the dominant browsers were on Windows platforms. SMF will still be harmonising them all later on.

Quote: I assume there are more actions involved because replacing linebreaks by br tags wouldn't have increased the size of my 50 KB (any definition ;)) message (about 1,000 lines) by more than 4-8 KB.

Well, it generates an XHTML br tag which is 6 bytes for line breaks, but other characters are entity encoded, namely: ' " < > and & which will generate 6, 6, 4, 4 and 5 byte signatures respectively. Then there's the matter of multi-byte character sets; the limit is a hard 64KB in byte terms not in character terms.
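To get a feel for how quickly this adds up, here is a rough sketch of the growth. It is not SMF's actual preprocessing; it only applies the replacements listed above. The exact entity used for the apostrophe (&#039;) is an assumption chosen to match the stated 6 bytes, and the sample text is made up.

```python
# Replacement strings matching the byte counts listed above (6, 6, 4, 4, 5).
# The apostrophe entity (&#039;) is an assumption, not confirmed SMF behaviour.
ENTITIES = {
    "'": "&#039;",
    '"': "&quot;",
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
}
BR_TAG = "<br />"  # XHTML line break, 6 bytes


def stored_form(message: str) -> str:
    """Entity-encode the special characters, then turn line breaks into br tags."""
    unified = message.replace("\r\n", "\n").replace("\r", "\n")
    # Encode character by character so an inserted "&" is never encoded twice.
    encoded = "".join(ENTITIES.get(ch, ch) for ch in unified)
    return encoded.replace("\n", BR_TAG)


def byte_length(text: str) -> int:
    """Length in bytes, assuming UTF-8 (multi-byte characters count in full)."""
    return len(text.encode("utf-8"))


# Hypothetical code-heavy message of 1,000 lines.
sample = 'if (a < b && c > d) { say("hi"); }\n' * 1000
raw = byte_length(sample)
stored = byte_length(stored_form(sample))
print(f"raw: {raw} bytes, stored: {stored} bytes, growth: {stored - raw} bytes")
```

Pasted source code is the worst case here, since it tends to be full of quotes, angle brackets and ampersands.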

Quote: Just a thought: It is, of course, possible to determine whether a planned INSERT statement would be truncated by the DBMS.

Would it be feasible and practical to perform a query every time a post is made or edited to establish the current physical length of the table? While a relatively cheap query, it's the fact that it's an additional query every post/every edit that soon mounts up.

Quote: Sorry, I am certainly not the first one to think of that and I only know 2.0.1.

I know you're also 3 patches behind current 2.0.x release ;)

Quote: The content of the error message would be interesting: "Although your message technically fits our size limitations, it doesn't fit our technical implementation."

How do you explain that to users?

Quote: I understand. That is probably not an option.

The odds are better than average that it will not have any effect on your forum and its configuration, and there's a chance it will actually encourage better search performance by changing how things are done.

