[4623] Auto-linking of some URLs is broken

nathanb · April 22, 2010, 06:22:28 PM

http://www.facebook.com/#!/video/video.php?v=386443877426&ref=mf

Everything after the # isn't linked.

Thanks!
Nathan

Arantor · April 22, 2010, 06:23:52 PM

! isn't supposed to be in URLs from what I remember of the URL format.

Blame Facebook for not following HTTP as a standard.

nathanb · April 22, 2010, 09:30:05 PM

Read section 2.3 of RFC 2396, Unreserved Characters. An exclamation point is legal in a URL. Blame SMF for failing to parse a valid URL correctly.

Arantor · April 22, 2010, 09:36:05 PM

OK, just rechecked, and yes it's not got a reserved status.

However, there is actually a very good reason I can think of why it isn't included.

I've seen this before where people post, say, a YouTube link with exclamation marks like so:
You gotta see this it's awesome http://www.youtube.com/?v=somevideoid!!!!!!!!!!!!!!!!!!

And boom, now it's broken if it honours it.

Lesser of two evils, really.
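
Editor's sketch of the trade-off discussed above, in Python for illustration only. The two patterns and the `find_urls` helper are hypothetical and are not SMF's actual auto-linking code. The idea shown is one possible middle ground: allow "!" inside the URL, then trim punctuation that trails the match.

```python
import re

# Hypothetical patterns for illustration; these are not SMF's actual regexes.
# NARROW leaves "!" out of the allowed characters, WIDE includes it.
NARROW = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@$&'()*+,;=%]+")
WIDE = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")

# Punctuation that is far more likely to end a sentence than a URL.
TRAILING = "!.,;:?'"


def find_urls(text, pattern=WIDE, trim=True):
    """Return the URLs found in text, optionally trimming trailing punctuation."""
    urls = []
    for match in pattern.finditer(text):
        url = match.group(0)
        if trim:
            # Drop sentence punctuation stuck to the end of the match.
            url = url.rstrip(TRAILING)
        urls.append(url)
    return urls


facebook = "http://www.facebook.com/#!/video/video.php?v=386443877426&ref=mf"
youtube = "You gotta see this it's awesome http://www.youtube.com/?v=somevideoid!!!!"

print(find_urls(facebook, NARROW, trim=False))  # stops right before the "!"
print(find_urls(facebook))                      # keeps the whole URL
print(find_urls(youtube))                       # trailing "!" run is trimmed
```

The cost of trimming is that a URL which genuinely ends in "!" would lose it, so this is still a compromise rather than a spec-exact parse.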

nathanb · April 22, 2010, 10:17:30 PM

Wait, so violating the URI-parsing spec is, in your eyes, less evil than doing the right thing? As in, people who post a valid URL will have it break, and that's OK with you?

Kids these days...

At least it shouldn't be hard to patch on my own board.

Norv · April 23, 2010, 12:34:51 AM

Please, keep a friendly tone in what you'd like to say, or do not say it.

You may want to note that actually, rfc 2396 is obsolete, superseded by rfc 3986: http://tools.ietf.org/html/rfc3986

We will look into the issue. Thank you for the report!

nathanb · April 23, 2010, 01:09:22 AM

Quote from: Norv on April 23, 2010, 12:34:51 AM
You may want to note that actually, rfc 2396 is obsolete, superseded by rfc 3986: http://tools.ietf.org/html/rfc3986

That is true...in rfc3986 the exclamation point is a delimiter character, which means that it is a legal URI character with special meaning. It should be percent-encoded if it is not being used in its special context.

While an argument could be made either way (that Facebook is or is not using the ! character in the context it was intended to be used), it seems reasonable to assert that keeping Facebook URLs from being broken is a worthwhile feature.

Arantor · April 23, 2010, 05:29:54 AM

I'm not here to debate whether violating spec is the right thing to do; there are so many violations of specification it isn't funny (Apache doesn't implement all of HTTP 1.1, for example).

I'm just giving you a reason why the regex might not include a !. (Oh, and btw, I stopped being a kid when I turned 18 -- 8 years ago.)

Norv · July 18, 2010, 08:02:23 AM

I am not certain what to do with this report. I can see arguments both ways.

Do Facebook URLs really (or, still) look this way? (I don't use Facebook)

nathanb · July 18, 2010, 09:14:29 AM

Quote from: Norv on July 18, 2010, 08:02:23 AM
Do Facebook URLs really (or, still) look this way? (I don't use Facebook)

They do indeed.

See for example here or here (on that second one the poster put explicit url tags around the whole link to keep SMF from breaking it).

Joshua Dickerson · November 22, 2010, 10:14:16 PM

We should follow the standard. If a character is allowed to be in a URL without escaping, we should allow it. Even if Facebook or any other site uses it often.

Norv · November 22, 2010, 11:57:33 PM

Yes.
This should be documented as currently unsupported in RC4.

nathanb · February 10, 2011, 11:24:55 AM

FWIW, Tim Bray wrote a post on this topic. It appears that the hashbang is actually being used appropriately in these URLs, meaning that not only does autolinking these urls make sense from a practicality perspective but a spec perspective as well.

Arantor · February 10, 2011, 11:32:41 AM

It's used appropriately because Google says it's appropriate, rather than the spec saying it, and the entire article is actually slanted as pointing out how bad an idea it is, but I can see where you're coming from.

I still think it's a bad idea to include it, but I guess that there's not really any alternative (see my example above of what is likely to happen to less technical users... fortunately, they at least have the url bbcode button that might avoid the worst of the hassles for that group of users)

Joshua Dickerson · February 10, 2011, 03:15:58 PM

Tracked: http://dev.simplemachines.org/mantis/view.php?id=4623

Joshua Dickerson · February 22, 2011, 02:27:45 PM

Apparently, these characters are reserved as sub-delimiters. So, they should be allowed in a URI without percent-encoding. It is already tracked, but the developer fixing this bug should take note of all of the delimiters and sub-delimiters before resolving it.

Quote
2.2. Reserved Characters

URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.


reserved = gen-delims / sub-delims

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
           / "*" / "+" / "," / ";" / "="

The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI.
URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent. Percent-
encoding a reserved character, or decoding a percent-encoded octet
that corresponds to a reserved character, will change how the URI is
interpreted by most applications. Thus, characters in the reserved
set are protected from normalization and are therefore safe to be
used by scheme-specific and producer-specific algorithms for
delimiting data subcomponents within a URI.

A subset of the reserved characters (gen-delims) is used as
delimiters of the generic URI components described in Section 3. A
component's ABNF syntax rule will not use the reserved or gen-delims
rule names directly; instead, each syntax rule lists the characters
allowed within that component (i.e., not delimiting it), and any of
those characters that are also in the reserved set are "reserved" for
use as subcomponent delimiters within the component. Only the most
common subcomponents are defined by this specification; other
subcomponents may be defined by a URI scheme's specification, or by
the implementation-specific syntax of a URI's dereferencing
algorithm, provided that such subcomponents are delimited by
characters in the reserved set allowed within that component.

URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
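
Editor's sketch: the character sets quoted above can be written out directly, which makes it easy to check where "!" falls. This is Python for illustration only; `classify` is a hypothetical helper, and `FRAGMENT_CHARS` is transcribed from the fragment and pchar rules in RFC 3986 (sections 3.3 and 3.5), leaving out percent-encoded triplets.

```python
import string

# Character sets transcribed from RFC 3986, sections 2.2 and 2.3.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
RESERVED = GEN_DELIMS | SUB_DELIMS
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# fragment = *( pchar / "/" / "?" ), pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# Percent-encoded triplets are not modelled here.
FRAGMENT_CHARS = UNRESERVED | SUB_DELIMS | set(":@/?")


def classify(ch):
    """Return the RFC 3986 class of a single character (hypothetical helper)."""
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    if ch in UNRESERVED:
        return "unreserved"
    if ch == "%":
        return "percent-encoding marker"
    return "not allowed unencoded"


for ch in "!#~ä":
    print(repr(ch), classify(ch), "| allowed in fragment:", ch in FRAGMENT_CHARS)
```

A matcher built from these sets would accept "!" after the "#" in the Facebook URL, since "!" is a sub-delimiter and sub-delimiters are permitted in the fragment.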

Aleksi "Lex" Kilpinen · February 23, 2011, 03:08:33 AM

On a related note, Scandinavian letters like ä and ö will break a URL in SMF, although they are actually valid. For example, http://www.tänään.fi/ is a valid URL and a working website.

( ADDED: See, http://dev.simplemachines.org/mantis/view.php?id=4623 Mantis understands it, SMF doesn't. )

Joshua Dickerson · February 23, 2011, 03:45:39 AM

Two URLs with regular expressions we might be able to use: http://daringfireball.net/2010/07/improved_regex_for_matching_urls and http://www.mattfarina.com/2009/01/08/rfc-3986-url-validation

Norv, according to the aforementioned links (sorry, I really wanted to use a big word), RFC 1738 is the correct one for URLs. RFC 3986 matches all URIs.

LexArma, from my understanding, and someone please correct me if I am wrong, those characters are invalid according to the cited RFCs. Either there is another RFC which I didn't find in my searching, or browsers are applying a workaround like the one used for Asian characters.

AH! I just read further down that blog post (Matt Farina's) and found out about IRIs which are outlined in RFC 3987 and broken down in a Wikipedia article. Also there is Internationalized Domain Names. IRIs are not a spec yet, but I guess we should get ready for them anyway.

I'd have to do more searching for IRI regular expressions to match that. I don't think there are many.

Aleksi "Lex" Kilpinen · February 23, 2011, 03:58:41 AM

Quote
A standard approach to dealing with multilingual domain names was agreed by the IETF in March 2003. It is defined in RFCs 3490, 3491, 3492 and 3454, and is based on Unicode 3.2. One refers to this using the term Internationalized Domain Name or IDN.

http://www.faqs.org/rfcs/rfc3490.html
http://www.faqs.org/rfcs/rfc3491.html
http://www.faqs.org/rfcs/rfc3492.html
http://www.faqs.org/rfcs/rfc3454.html
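
Editor's sketch of how an IDN host name maps to plain ASCII, which is one way an auto-linker could handle a host like the tänään.fi example earlier in the thread. This uses Python's built-in "idna" codec, which implements RFC 3490; the two helper names are hypothetical, and nothing here reflects how SMF actually handles such hosts.

```python
def to_ascii_host(host):
    """Convert a Unicode host name to its ASCII (Punycode) form."""
    # The "idna" codec encodes each dot-separated label separately.
    return host.encode("idna").decode("ascii")


def to_unicode_host(host):
    """Convert an ASCII (Punycode) host name back to Unicode."""
    return host.encode("ascii").decode("idna")


ascii_host = to_ascii_host("www.tänään.fi")
print(ascii_host)                   # the non-ASCII label becomes an "xn--" label
print(to_unicode_host(ascii_host))  # round-trips to the original host name
```

Only the host part is converted this way. Non-ASCII characters in the path or query are a separate matter covered by IRIs (RFC 3987).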

Illori · November 18, 2011, 08:27:18 AM

seems fixed in 2.0
