BBCode: a philosophical discussion

Started by MrPhil, March 08, 2010, 03:27:30 PM

MrPhil

It occurred to me that a great deal of code in SMF is involved with handling Bulletin Board Code (BBCode) in posts and signatures, as well as catching attempts to put HTML into posts (or allowing it for certain privileged users). Naturally, all this code can lead to many bugs and problems.

So, to start a discussion, what exactly is the purpose of BBCode? If there are a few HTML constructs which can be harmful, wouldn't it be easier to allow HTML in general, and just filter out possibly harmful HTML tags? Are there that many harmful HTML tags that it's better to roll your own system (BBCode)? Are there HTML tags that could be allowed, but certain "attributes" need to be screened out?

Do most people just use the editor buttons, or do they actually type out the tags (I'm in the latter group -- I rarely use the buttons)? [ and ] don't require a shift, while < and > do -- is that significant to anyone? XHTML is supposed to use HTML with lowercase tags and certain tags are self-closing -- would that be sufficiently confusing to users that they should not be allowed to directly inject HTML? Like HTML, are most/many BBCode implementations case insensitive? Are BBCode tags so much simpler than HTML that it's worthwhile to use? What is the history of BBCode -- does it actually date back to pre-HTML times (Usenet, etc.)?

Most BBCode tags seem to map pretty much 1:1 with HTML. I suppose that if you wanted to add a "macro" capability to your posts (capabilities not inherent in HTML), you could keep [ ]-style tags for that. How about "modern" HTML practices, where <font> etc. tags are deprecated and the use of CSS is encouraged? BBCode certainly acts more like old-fashioned CSS-less HTML. Is there enough of an advantage to being able to say [font color=red] rather than having to write <span style="color: red;">? Does the use of BBCode encourage bad HTML-coding habits? How would you best make use of custom CSS definitions if HTML could be used directly?

I'm just curious about what others think of the divergence of BBCode and "best practice" HTML. If you were writing something like SMF from scratch, would you include some kind of BBCode, or would you allow HTML and just screen out selected tags/attributes (or, permit a list of tags and disable the rest)? There doesn't seem to be any official standard for BBCode (unlike, say, W3C), so everyone's implementation is a little different, and somewhat of a subset of HTML capabilities.

Arantor

The purpose of bbcode is twofold.

First up, it is simpler than HTML in a number of ways, and it's relatively sane. You don't ever have to worry about whether a tag has to be closed as such ([br], [hr]). [img] is a lot simpler to use than an <img> is for many users.

Secondly, security. The key one is <script>, of course, not to mention <iframe>. General practice suggests that instead of trying to sanitise or filter (since there will always be Yet Another Exception to filtering), you approach the process by 'regurgitation', i.e. you go through the post, parse only what you know is safe, and leave alone the stuff that you don't understand, such as [thistagthatisntarealbbcodetag].
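To make the 'parse only what you know' idea concrete, here is a minimal Python sketch. It is not SMF's parse_bbc (which is PHP and far more involved); the SAFE_TAGS table and the render_bbcode name are assumptions made up for illustration. Everything the user typed is escaped first, and only tags found in the whitelist are turned into HTML.

```python
import html
import re

# Hypothetical whitelist: BBCode tag name -> (opening HTML, closing HTML).
SAFE_TAGS = {
    "b": ("<strong>", "</strong>"),
    "i": ("<em>", "</em>"),
    "u": ('<span style="text-decoration: underline;">', "</span>"),
}

# Matches simple parameterless tags such as [b] or [/b].
TAG_RE = re.compile(r"\[(/?)([a-z]+)\]", re.IGNORECASE)


def render_bbcode(text: str) -> str:
    """Escape all user input, then translate only whitelisted tags."""
    escaped = html.escape(text, quote=True)

    def replace(match: re.Match) -> str:
        closing, name = match.group(1), match.group(2).lower()
        if name not in SAFE_TAGS:
            return match.group(0)  # unknown tag: leave it as literal text
        return SAFE_TAGS[name][1 if closing else 0]

    return TAG_RE.sub(replace, escaped)


print(render_bbcode("[b]hi[/b] <script>alert(1)</script> [notatag]"))
```

Note that this sketch makes no attempt to balance or auto-close tags, so an unclosed [b] yields an unclosed <strong>.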

Going on to your point, the average forum user cares not for CSS, cares not for <font>. They just want to be able to set their writing to be red. They don't - and shouldn't - have to care *how* it becomes red, simply that it is, and for most users I know, [red] or [color=red] is a lot simpler to follow than even the old <font> tag, let alone a span with inline style.

There's no standard for bbcode - perhaps there should be. But that also limits then what a mod author can do (having written a number of interesting tags myself) if you're restricted to a standard.

The other benefit is that as styles and times change - from <font> to <span style="..."> - the bbcode doesn't have to. It's still the same bbcode.

MrPhil

All interesting points. I guess that my two chief objections to a separate BBCode implementation are

1) We (SMF) are constantly getting questions like "How do I add borders to my table? How do I add alt text and a title attribute to this image? Why isn't there a line break tag like <br>? How do I insert an em-dash?" Etc. With straight HTML, you want such goodies? Type 'em in! Not to mention, "How do you use such and such a BBCode tag, since it's so much different than HTML?"

2) There is a massive amount of code in SMF to keep people from using straight HTML, and to provide equivalent services in BBCode. Slow, and subject to bugs and conflicts. My point is that wouldn't it be easier to search for and disable a few forbidden tags (script, iframe, perhaps html, head, body, etc.) than to reimplement the functionality of HTML?

A "simplified" HTML where you don't have to worry about whether or not to close an HTML tag ([hr] rather than <hr> vs. <hr/>), indeed might make life simpler for non-HTML-coders trying to write a post. However, if W3C validation is important, the code might check for /> in HTML or lack of it for certain tags in XHTML. Carried far enough, I suppose it might bring you back to nearly as much code and complexity as BBCode, but hopefully not. The downside to the simplicity of BBCode is that there's always some standard HTML feature or attribute that someone needs, and has to customize their BBCode parser to get.

I suppose that BBCode could be written to just change square brackets [ ] to angle brackets < > (or < />) and there you have a completely compatible, say, img tag, but why not then take the extra step and just let an <img> tag be put in? Could we compromise with a basic [img] tag, for instance, that makes some default attributes, and the self-closing "/" if necessary, but otherwise allows all the <img> tag attributes without having additional code to support them? Of course, it would have to understand enough of what was given for attributes to know what defaults to add, so in the end that might not accomplish much in the way of space/time savings. But the upside is that nobody has to mod the BBCode to get, say, an alt= and a title= into the [img] tag -- just throw them in if you want them. We would allow only lower case tag names, and disable everything else, so it's mostly XHTML-compatible from the start.

Are there that many HTML tags which should be disallowed? Other than <script> and <iframe>, what is there that is potentially dangerous? It just seems that it would be a lot easier and faster to check for a few banned tag names than to replicate most of the function of all the unbanned tags.

As for the average forum user not really caring whether it's <font> or CSS, I can see the point. Perhaps I'm concerned that as an HTML coder, this use of (basically) old-fashioned HTML tags is instilling bad habits in anyone who now (or in the future) finds themselves writing genuine HTML code for pages. It's kind of like teaching introductory programming skills with BASIC, and then complaining that the students' more advanced work is filled with GOTOs and unstructured spaghetti code. Presumably, if you're interested in proper W3C validation of pages, you can write the BBCode-to-(X)HTML translation to use proper CSS rather than deprecated HTML tags.

That brings up a point -- does BBCode protect a user against writing invalid HTML? I don't think it does -- you could just as easily get nested tags out of order in BBCode as you can in direct HTML. How much BBCode support would be needed to detect improperly nested tags, or missing closing tags? How much effort should be made to protect a user from themselves? Should all post input be WYSIWYG, and forbid direct insertion of BBCode tags?
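As a rough answer to the 'how much support would be needed' question, detecting bad nesting or missing closing tags takes little more than a stack. The Python sketch below is only an illustration under stated assumptions: the SELF_CLOSING set and the nesting_errors name are invented here, and it reports problems rather than repairing them the way a forum parser might.

```python
import re

# Matches [tag], [/tag] and [tag=param].
TAG_RE = re.compile(r"\[(/?)([a-z]+)(?:=[^\]]*)?\]", re.IGNORECASE)

# Hypothetical set of tags that never take a closing tag.
SELF_CLOSING = {"br", "hr"}


def nesting_errors(text: str) -> list[str]:
    """Return human-readable nesting problems found in a BBCode string."""
    errors = []
    stack = []  # names of currently open tags, innermost last
    for match in TAG_RE.finditer(text):
        closing, name = bool(match.group(1)), match.group(2).lower()
        if name in SELF_CLOSING:
            continue
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()  # properly nested close
        elif name in stack:
            errors.append(f"[/{name}] closes before [{stack[-1]}] is closed")
            while stack.pop() != name:  # discard everything up to the match
                pass
        else:
            errors.append(f"[/{name}] has no matching opening tag")
    errors.extend(f"[{name}] is never closed" for name in stack)
    return errors


print(nesting_errors("[b][i]text[/b][/i]"))
```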

We could preserve square bracket terms [] for SMF-specific "macros" and shortcuts, such as red text via [red] rather than a full <font> or <span>+style tag. This would also allow coders to add custom capability to forums not found in plain HTML (and there would be no need to worry about being restricted to a BBCode standard). The only downside I can think of is that (non-programmer) users could become confused about when to use HTML style and when to use BBCode style.

I suppose there's the possibility that HTML coding standards could change once again in the future, leaving direct HTML tags in the lurch. Even the act of changing a forum from HTML to XHTML or transitional to strict could cause problems for old (existing) posts. Can BBCode be "abstract" enough to be cleanly translated to any level of (X)HTML/CSS, while still simple enough for most people to use?

So, if we grant that it's considered a "good" thing to insulate users from the underlying HTML+CSS,  maybe the discussion should be one of "how do we..."

1) permit the full range of HTML tag attributes (e.g., alt= and title= in an image tag) without tons of additional code that simply "translates" from BBCode to HTML

2) simplify and reduce the code needed to give so much HTML function to BBCode

3) abstract formatting and presentation enough that any reasonable future changes to (X)HTML and CSS can be accommodated without changes to a post

Are the goals incompatible?

Arantor

1. With the exception of the table, you can out of the box.

SMF has [br], and [img] supports parameters including height, width and alt. As for table borders... the reason that isn't available is that the attempt to auto-complete tags is convoluted enough without having to factor in an optional parameter.

em-dash, just copy/paste the character or use the Entity tag mod.

2. No, as MySpace found to their cost, multiple times. No matter how smart a filter implementation is, rest assured it will be broken through some unforeseen fringe case. If in doubt, assume everything is invalid until you prove it isn't.

Don't assume that 'harmless' tags are harmless. Even <a> could be exploited if you didn't sanitise content. Classic case in point: one system I know used to sanitise for links beginning with javascript: and filtering them out. Great... until you realise that many browsers will accept something like java\0script (the \0 being a 'null character') and treat it as the word javascript. Suddenly that good intention of security is now useless.
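The null-character case can be sketched in a few lines of Python. Both function names and the ALLOWED_SCHEMES set are assumptions for illustration, not code from any real forum. The first check mimics the blacklist that was fooled; the second accepts a URL only if it contains no control characters or whitespace and starts with a scheme that is explicitly allowed.

```python
import re

# Hypothetical list of schemes a forum might allow in links and images.
ALLOWED_SCHEMES = {"http", "https", "ftp"}
SCHEME_RE = re.compile(r"^([a-z][a-z0-9+.-]*):", re.IGNORECASE)


def naive_blacklist_ok(url: str) -> bool:
    """Blacklist approach: reject only what is known to be bad."""
    return not url.strip().lower().startswith("javascript:")


def whitelist_ok(url: str) -> bool:
    """Whitelist approach: accept only what is known to be good."""
    # Refuse control characters and whitespace outright.
    if any(ord(ch) < 0x21 or ord(ch) == 0x7F for ch in url):
        return False
    match = SCHEME_RE.match(url)
    return match is not None and match.group(1).lower() in ALLOWED_SCHEMES


sneaky = "java\x00script:alert(1)"
print(naive_blacklist_ok(sneaky))  # the blacklist lets this through
print(whitelist_ok(sneaky))        # the whitelist refuses it
```

This version also refuses relative URLs, which may or may not be what a given forum wants.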

Basically, it's a toss-up between 'exclude only what you know is unsafe' and 'include only what you know is safe' - the former is dangerously easy to fool.


Some of the other arguments - about <font> versus inline styling. User doesn't care and shouldn't have to care, that's one of the great advantages; it's an instant abstraction system. That way if the admin *does* care, they can have it be XHTML compliant without fear of it not being. Last I checked, SMF 2 at least was using span with inline styling for one-off formatting, which is compliant if not perfect.

I'd argue SMF's bbcode is already a lot of the way there to being that abstract. The bulk of the styling stuff already is anyway (ignoring glow, shadow, marquee all of which I suggested disappear from being built in by default in SMF 2.1). Things like img, iurl, url and anchor are reasonably abstract though they map fairly directly to HTML.


Thing is, I don't actually know many people who want to use alt in an img tag. I've abused its presence in the past, but I can't recall ever actually thinking, "yes, this really should have an alt attribute". That's part of the minefield of user-supplied content; users aren't and shouldn't have to be programmers to be able to do stuff.

When I'm posting, I'm fully aware that img has width, height and alt. Sometimes I've used the first two in order to achieve a specific look or something else, but even though the programmer and 'doing it right' mantra in me says it should, I don't actually care about providing an alt tag. Neither would the average user, and it should be down to the abstraction layer to deal with that for me.


Your three goals are not totally incompatible, but they're not readily compatible either.

1. As I've hinted at above, it's actually fairly rare to need the full range of attributes in HTML. Those who actually need them are probably looking at the wrong platform anyway. Other platforms, such as WordPress, do allow access to the HTML, but generally it's only for the principal author to publish with, just as SMF allows admins to publish with raw HTML. If you have an environment where everyone and anyone can post, you generally don't give them free rein - I don't remember WP allowing commenters to have full HTML, for example.

2. It's actually fairly simple already compared to the X/HTML spec, IMO. I'm not sure how much simpler it can get without being oversimplified.

3. That's the nature of the beast. You either abstract away the generalities to an intermediate form like bbcode or you have much more hassle to pay later when standards can and do change.

HTML is a long lived standard, and it's pretty ingrained, and pretty abused. XHTML has been a standard for over a decade (yes, it's that old) and yet a lot of places still don't worry so much about it.


I do see where you're coming from, but really a forum is a place for people to communicate. The 'average' user doesn't care two hoots about semantics, or the finer points. They just want it to work, and that's why the WYSIWYG editors are becoming more commonplace, because it abstracts it even more.

In all the time I've been working with forums and posting on them, there are very few times I've ever needed raw HTML, and mostly they concern tables for a very very specific look. I'm not saying it doesn't need improvement, I'm just saying that I think bbcode is the way to go rather than opening the gates to something closer to HTML.

SlammedDime

I'll add only a bit to this... and say I agree with MrPhil...

[copied from a discussion elsewhere]

Quote: I would propose completely ditching BBCode and sticking with plain old HTML.  This has lots of advantages (especially if we decide to drop support for PHP4, which I really wish we would with the next version), the biggest being PHP's DOM class.  You can easily load up any user's post into a DOM object which will fix most, if not all, of the HTML to be correct, then you can html_entity the entire post to blacklist everything, and then using a whitelist (our current BBCodes), you can enable the things a user is allowed to have.  The other option, once you have a DOM object, is to cycle through all of the nodes and convert in a single pass using a stack to track objects.
...
It would drastically speed up the load time of posts as you could do this parsing when a post is made, and store it in the database as 'ready to display', then no extra calls to parsebbc are made during load time, and loading a post from the database into an edit field should be as simple as calling htmlspecialchars on it...   
SlammedDime
Former Lead Customizer
BitBucket Projects
GeekStorage.com Hosting
                      My Mods
SimpleSEF
Ajax Quick Reply
Sitemap
more...
                     

Arantor

I still say that trying to blacklist doesn't protect you that far because no matter how clever the blacklisting or filtering is, whitelisting is safer.

I have yet to see something that allows fully user supplied content to use the full range of HTML safely.

The technical implementation would be hugely simplified, sure, but I'd point out that there's a reason you have systems like MediaWiki that don't allow raw HTML by default and instead use wikisyntax (which bracket-linking aside, is not any simpler to use for more advanced formatting) or forums use bbcode.

For over a decade, content systems that allow just-anyone to post content have used other formats.

I can't see the suggestion proposed here being practical, to be honest because I would strongly envisage the whitelist being too generous by accident, or being so limited it's not worth the effort.

SlammedDime

The point is that you blacklist everything (html_entity_encode) and then reverse the blacklisting for items that are on the whitelist, or put the post into a DOM class, cycle through the nodes and entity_encode anything that isn't on the whitelist.  The advantage of a DOM class is that it will fix the malformed HTML that is commonly used for exploits, after which you let only whitelisted elements through and turn everything else into entities.

I would be a fan of the DOM method rather than just entity-encoding everything, simply because by doing that you only have to run through the DOM nodes in O(n+m) time (n being the number of nodes, and m being the total number of attributes, if any, across those nodes), whereas when encoding everything and then running it through a whitelist, you're passing over the entire document in O(n*m) time, where n is the length of the string and m is the number of tags you're whitelisting.
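The single-pass idea can be sketched in Python with the standard library's html.parser. This is only an illustration of the approach, not the PHP DOM class being discussed: the ALLOWED table, the VOID_TAGS set and the class and function names are all assumptions. Unlike a real DOM loader, this event-based parser does not repair malformed or unbalanced markup.

```python
import html
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attribute names allowed on that tag.
ALLOWED = {"b": set(), "i": set(), "img": {"alt"}}
VOID_TAGS = {"img", "br", "hr"}  # tags written without a closing tag


class WhitelistSanitizer(HTMLParser):
    """Emit whitelisted tags as HTML and everything else as entities."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.allowed:
            # Not whitelisted: show the original tag text as plain text.
            self.out.append(html.escape(self.get_starttag_text()))
            return
        kept = "".join(
            f' {name}="{html.escape(value or "")}"'
            for name, value in attrs
            if name in self.allowed[tag]
        )
        self.out.append(f"<{tag}{kept} />" if tag in VOID_TAGS else f"<{tag}{kept}>")

    def handle_startendtag(self, tag, attrs):
        # Treat <img ... /> exactly like <img ...>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag not in self.allowed:
            self.out.append(html.escape(f"</{tag}>"))
        elif tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(html.escape(data))


def sanitize(source, allowed=ALLOWED):
    parser = WhitelistSanitizer(allowed)
    parser.feed(source)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(sanitize('<b onclick="alert(1)">Three<i>test2</i></b>', {"b": set()}))
```

Each tag and attribute is visited once, which is the O(n+m) behaviour described above. Comments and doctype declarations are silently dropped by this sketch.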

Arantor

I realise that, but I'd be concerned that you'll end up with a whitelist that's too generous, and allows things it shouldn't.

SlammedDime

One of the things I plan on writing for my PHP Framework is a DOM based parser that allows HTML... perhaps once you see the implementation, you may change your mind.  If implemented correctly for a forum setting, I think it can be quite effective.  My goal is to write a post that utilizes each and every one of SMF's BBCodes, and then make the same post out of HTML.  Then I will create a DOM parser that validates the HTML similar to how SMF validates the bbcode and benchmark them.  I'm hoping that my implementation will be faster, and just as safe, and I have no problem publishing it or showing the code for anyone to try and break.

Arantor

I look forward to seeing it because if it can do as promised without risk of vulnerabilities, I would be interested to see it; parse_bbc is a surprisingly scary monster of a function, though easy enough to extend.

On that note, what about custom bbcode in the manner that SMF provides? Right now I can (and have) done some dynamic things that evaluate on parsing. How would this be accounted for? I'd hate to write some of those out in full HTML when a bbcode shortcut is far simpler, as well as being easier to make permission dependent.

SlammedDime

Quote from: Arantor on April 06, 2010, 06:15:50 AM
I look forward to seeing it because if it can do as promised without risk of vulnerabilities, I would be interested to see it; parse_bbc is a surprisingly scary monster of a function, though easy enough to extend.

On that note, what about custom bbcode in the manner that SMF provides? Right now I can (and have) done some dynamic things that evaluate on parsing. How would this be accounted for? I'd hate to write some of those out in full HTML when a bbcode shortcut is far simpler, as well as being easier to make permission dependent.
That is something that has to be taken into account and I have some ideas floating around, but nothing solid.  And I haven't really played with the DOM class yet.  It very well may be possible to create your own DOM objects that aren't in the DTD, and if that's the case, where you have
[mycustombbcode width=40 height=50]sometext[/mycustombbcode]
it would be very possible to do
<mycustombbcode width="40" height="50">sometext</mycustombbcode>
and then when looping through the DOM nodes, simply do a 'replace' on that node with, (for example)
<img width="40" height="50" src="sometext" />

Arantor

Hmm, the idea of a preg_replace_callback (which is what it would have to be, IMO) doesn't fill me with glee to be honest.

SlammedDime

Shouldn't need to do preg_replace at all... in fact I am hoping to avoid PCRE at all costs except maybe for validation of the contents of an attribute or the text inside a node.  Are you understanding what the DOM class is about?  Its purpose is to read in an XML or HTML document and put all <tags> into nodes, which then have attributes, childnodes, and contents... <mycustombbcode> is a tag, so when you come across that tag, you look it up in the table... does it have a replace?  yes, okay, replace that node with a new node of whatever the replacement is, and filter in the attributes and contents accordingly.  No preg_ anything should be required at all (again, except for anything filtering the actual contents of the tag/attributes, as opposed to the tag itself).

Arantor

Yes, I know the purpose of the DOM class. I don't think you follow what I mean though.

I'm not talking about a literal 1:1 replacement. Let's take the worst example I have in mind, the Aion Syndication mod. It's an ugly mod, to be sure, but it serves as a perfect example of what I mean.

One syntax it supports is [item=123456789]some random text[/item] or something like that. I forget the subtle little details. Anyway you have that, and that's passed over to a function to deal with. It's not just a simple replacement: it needs to be passed to a user function, parameters intact, and the resultant content is a far cry from what it started with, and certainly cannot be directly computed from the parameters given, unlike other tags such as url and img.
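The difference between a fixed replacement and a tag that has to call user code can be sketched as a handler registry. This Python sketch is hypothetical throughout: the bbcode_tag decorator, the render function and the fake item lookup are invented for illustration and are not taken from the Aion Syndication mod or from SMF. It uses a regular expression to find tags purely to keep the example short; the same dispatch could be driven from a node walk instead.

```python
import html
import re
from typing import Callable

# Hypothetical registry: tag name -> function(param, content) returning HTML.
HANDLERS: dict[str, Callable[[str, str], str]] = {}


def bbcode_tag(name: str):
    """Decorator that registers a handler for [name=param]content[/name]."""
    def register(func):
        HANDLERS[name] = func
        return func
    return register


@bbcode_tag("item")
def render_item(param: str, content: str) -> str:
    # Stand-in for a lookup against an external item database (made up).
    fake_db = {"123456789": "Sword of Examples"}
    title = fake_db.get(param, "Unknown item")
    return (f'<span class="item" title="{html.escape(title)}">'
            f"{html.escape(content)}</span>")


TAG_RE = re.compile(r"\[([a-z]+)=([^\]]*)\](.*?)\[/\1\]", re.IGNORECASE | re.DOTALL)


def render(text: str) -> str:
    """Escape everything, handing known tags to their registered handler."""
    out, pos = [], 0
    for match in TAG_RE.finditer(text):
        out.append(html.escape(text[pos:match.start()]))
        name, param, content = match.group(1).lower(), match.group(2), match.group(3)
        handler = HANDLERS.get(name)
        # Unknown tags stay as escaped literal text.
        out.append(handler(param, content) if handler else html.escape(match.group(0)))
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)


print(render("See [item=123456789]my <loot>[/item]!"))
```

The output of render_item depends on data outside the post, which is the point being made: it cannot be computed from the tag's parameters alone.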

SlammedDime

Ok, here's what I have so far... a full class that can read in HTML content, fix the source to be XHTML compliant, clean up nasty stuff inside of it, remove any javascript: or vbscript: stuff in it, and then spit it back out as the equivalent of an htmlspecialchars'd version... with some modifications... and all in under 120 lines of code (take away the brackets and function names and the actual coding work comes to under 80). Here are some examples...

Input text:
<b test2=4 onclick="javascript:alert(/xss/)">Three<i><img src="test" />test2</i></b>

Output:
Nothing whitelisted (HTML source):
&lt;b test2=&quot;4&quot; onclick=&quot;nojavascript...alert(/xss/)&quot;&gt;Three&lt;i&gt;&lt;img src=&quot;test&quot; /&gt;test2&lt;/i&gt;&lt;/b&gt;

Only <b> is allowed, but cannot have any attributes:
<b>Three&lt;i&gt;&lt;img src=&quot;test&quot; /&gt;test2&lt;/i&gt;</b>

Only <img> is allowed but can only have 'alt' as an attribute:
&lt;b test2=&quot;4&quot; onclick=&quot;nojavascript...alert(/xss/)&quot;&gt;Three&lt;i&gt;<img />test2&lt;/i&gt;&lt;/b&gt;

Now I'm working on advancing the code to allow for some of the same stuff that SMF allows in the bbcode, such as validating attribute contents, before_content, after_content, etc.

Edit: and so far, not one regular expression in sight, except for the basic cleaning of the HTML to remove the general bad stuff... no regex's are used when it comes to parsing the HTML for whitelist/blacklist.

Arantor

I'm very curious to see that.

Note that if you want to do validation as powerfully as SMF does, I doubt you'll avoid regex.
