[4623] Auto-linking of some URLs is broken

Started by nathanb, April 22, 2010, 06:22:28 PM

nathanb

http://www.facebook.com/#!/video/video.php?v=386443877426&ref=mf

Everything after the # isn't linked.

Thanks!
Nathan
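
For reference, a minimal sketch of how a standard URL parser splits the reported URL. It uses Python's `urllib.parse` purely as an illustration (SMF itself is not involved here), and shows that everything after the `#`, including the `!`, is one fragment component of a single URL:

```python
from urllib.parse import urlsplit

# The URL from the report above.
url = "http://www.facebook.com/#!/video/video.php?v=386443877426&ref=mf"

parts = urlsplit(url)

# The path is just "/"; the "!" and everything following it
# belong to the fragment, which is still part of the same URL.
print("path:    ", parts.path)
print("query:   ", parts.query)
print("fragment:", parts.fragment)
```

An auto-linker that stops at the `#` therefore links only the scheme, host, and path, and drops the fragment.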

Arantor

! isn't supposed to be in URLs from what I remember of the URL format.

Blame Facebook for not following HTTP as a standard.
Holder of controversial views, all of which my own.


nathanb

Read section 2.3 of rfc 2396, Unreserved Characters. An exclamation point is legal in a URL. Blame SMF for failing to parse a valid URL correctly :)

Arantor

OK, I just rechecked, and yes, it doesn't have reserved status.

However, there is actually a very good reason I can think of why it isn't included.

I've seen this before where people post, say, a YouTube link with exclamation marks like so:
You gotta see this it's awesome http://www.youtube.com/?v=somevideoid!!!!!!!!!!!!!!!!!!

And boom, now it's broken if it honours it.

Lesser of two evils, really.
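
One possible middle ground, sketched here in Python rather than in SMF's own code: accept `!` inside the URL but trim punctuation that trails the match. The pattern `URL_RE`, the `TRAILING` set, and the function name `find_urls` are all hypothetical and are not taken from SMF's parser.

```python
import re

# Hypothetical, deliberately loose pattern: scheme followed by any run of
# characters that are not whitespace, angle brackets, or double quotes.
URL_RE = re.compile(r"https?://[^\s<>\"]+")

# Punctuation that is far more likely to end a sentence than a URL.
TRAILING = "!?.,;:'"


def find_urls(text):
    """Return the URLs found in text, with trailing punctuation trimmed."""
    urls = []
    for match in URL_RE.finditer(text):
        # "!" stays when it is in the middle of the URL; only a run of
        # punctuation at the very end of the match is removed.
        urls.append(match.group(0).rstrip(TRAILING))
    return urls


print(find_urls("http://www.facebook.com/#!/video/video.php?v=386443877426&ref=mf"))
print(find_urls("it's awesome http://www.youtube.com/?v=somevideoid!!!!!!"))
```

The trade-off is that a URL which legitimately ends in one of the `TRAILING` characters would be cut short, so this is a heuristic and not a spec-compliant parse.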

nathanb

Wait, so violating the URI-parsing spec is, in your eyes, less evil than doing the right thing? As in, people who post a valid URL will have it break, and that's OK with you?

Kids these days...

At least it shouldn't be hard to patch on my own board.

Norv

Please, keep a friendly tone in what you'd like to say, or do not say it.

You may want to note that actually, rfc 2396 is obsolete, superseded by rfc 3986: http://tools.ietf.org/html/rfc3986

We will look into the issue. Thank you for the report!
To-do lists are for deferral. The more things you write down the later they're done... until you have 100s of lists of things you don't do.

File a security report | Developers' Blog | Bug Tracker


Also known as Norv on D* | Norv N. on G+ | Norv on Github

nathanb

Quote from: Norv on April 23, 2010, 12:34:51 AM
You may want to note that actually, rfc 2396 is obsolete, superseded by rfc 3986: http://tools.ietf.org/html/rfc3986
That is true...in rfc3986 the exclamation point is a delimiter character, which means that it is a legal URI character with special meaning. It should be percent-encoded if it is not being used in its special context.

While an argument could be made either way (that Facebook is or is not using the ! character in the context it was intended to be used), it seems reasonable to assert that keeping Facebook URLs from being broken is a worthwhile feature.

Arantor

I'm not here to debate whether violating spec is the right thing to do; there are so many violations of specification it isn't funny (Apache doesn't implement all of HTTP 1.1, for example).

I'm just giving you a reason why the regex might not include a !. (Oh, and btw, I stopped being a kid when I turned 18 -- 8 years ago :P)

Norv

I am not certain what to do with this report. I can see arguments both ways.

Do Facebook URLs really (or, still) look this way? (I don't use Facebook)

nathanb

Quote from: Norv on July 18, 2010, 08:02:23 AM
Do Facebook URLs really (or, still) look this way? (I don't use Facebook)
They do indeed.

See for example here or here (on that second one the post put explicit url tags around the whole link to keep SMF from breaking it).

Joshua Dickerson

We should follow the standard. If a character is allowed in a URL without escaping, we should allow it, regardless of whether Facebook or any other site uses it often.
Come work with me at Promenade Group



Need help? See the wiki. Want to help SMF? See the wiki!

Did you know you can help develop SMF? See us on Github.

How have you bettered the world today?

Norv

Yes.
This should be documented as currently unsupported in RC4.

nathanb

FWIW, Tim Bray wrote a post on this topic. It appears that the hashbang is actually being used appropriately in these URLs, meaning that not only does autolinking these urls make sense from a practicality perspective but a spec perspective as well.

Arantor

It's used appropriately because Google says it's appropriate, rather than the spec saying it, and the entire article is actually slanted as pointing out how bad an idea it is, but I can see where you're coming from.

I still think it's a bad idea to include it, but I guess that there's not really any alternative (see my example above of what is likely to happen to less technical users... fortunately, they at least have the url bbcode button that might avoid the worst of the hassles for that group of users)

Joshua Dickerson

Apparently, these characters are reserved as sub-delimiters, so they should be allowed in a URI without percent-encoding. It is already tracked, but the developer fixing this bug should take note of all of the delimiters and sub-delimiters before resolving it.
Quote
2.2.  Reserved Characters

   URIs include components and subcomponents that are delimited by
   characters in the "reserved" set.  These characters are called
   "reserved" because they may (or may not) be defined as delimiters by
   the generic syntax, by each scheme-specific syntax, or by the
   implementation-specific syntax of a URI's dereferencing algorithm.
   If data for a URI component would conflict with a reserved
   character's purpose as a delimiter, then the conflicting data must be
   percent-encoded before the URI is formed.

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

   The purpose of reserved characters is to provide a set of delimiting
   characters that are distinguishable from other data within a URI.
   URIs that differ in the replacement of a reserved character with its
   corresponding percent-encoded octet are not equivalent.  Percent-
   encoding a reserved character, or decoding a percent-encoded octet
   that corresponds to a reserved character, will change how the URI is
   interpreted by most applications.  Thus, characters in the reserved
   set are protected from normalization and are therefore safe to be
   used by scheme-specific and producer-specific algorithms for
   delimiting data subcomponents within a URI.

   A subset of the reserved characters (gen-delims) is used as
   delimiters of the generic URI components described in Section 3.  A
   component's ABNF syntax rule will not use the reserved or gen-delims
   rule names directly; instead, each syntax rule lists the characters
   allowed within that component (i.e., not delimiting it), and any of
   those characters that are also in the reserved set are "reserved" for
   use as subcomponent delimiters within the component.  Only the most
   common subcomponents are defined by this specification; other
   subcomponents may be defined by a URI scheme's specification, or by
   the implementation-specific syntax of a URI's dereferencing
   algorithm, provided that such subcomponents are delimited by
   characters in the reserved set allowed within that component.

   URI producing applications should percent-encode data octets that
   correspond to characters in the reserved set unless these characters
   are specifically allowed by the URI scheme to represent data in that
   component.  If a reserved character is found in a URI component and
   no delimiting role is known for that character, then it must be
   interpreted as representing the data octet corresponding to that
   character's encoding in US-ASCII.
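
As a rough illustration of "taking note of all of the delimiters and sub-delimiters", the sketch below builds one character class from the `gen-delims` and `sub-delims` sets quoted above, plus the unreserved characters and `%` from the same RFC. It is written in Python, not SMF's code, and the names `URI_CHARS_RE` and `has_only_uri_chars` are hypothetical. It only checks which characters appear; it does not validate the full URI grammar.

```python
import re
import string

# Character sets from RFC 3986, section 2.2 (quoted above).
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="

# Unreserved characters (RFC 3986, section 2.3) and "%" for percent-encoding.
UNRESERVED = string.ascii_letters + string.digits + "-._~"

ALLOWED = UNRESERVED + "%" + GEN_DELIMS + SUB_DELIMS

# re.escape makes every character safe to place inside [...].
URI_CHARS_RE = re.compile("[" + re.escape(ALLOWED) + "]+")


def has_only_uri_chars(candidate):
    """True if candidate consists solely of characters RFC 3986 permits."""
    return URI_CHARS_RE.fullmatch(candidate) is not None


print(has_only_uri_chars("http://www.facebook.com/#!/video/video.php?v=386443877426&ref=mf"))
print(has_only_uri_chars("http://example.com/a b"))
```

Deriving the class from the RFC's lists, instead of typing characters by hand, makes it harder to leave one out, which is how `!` appears to have been missed.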

Aleksi "Lex" Kilpinen

Related note: scandics like ä and ö will break a URL in SMF, although they are actually valid. For example, http://www.tänään.fi/ is a valid URL and a working website.

( ADDED: See, http://dev.simplemachines.org/mantis/view.php?id=4623 Mantis understands it, SMF doesn't. )
Slava
Ukraini!
"Before you allow people access to your forum, especially in an administrative position, you must be aware that that person can seriously damage your forum. Therefore, you should only allow people that you trust, implicitly, to have such access." -Douglas

How you can help SMF

Joshua Dickerson

Two URLs with regular expressions we might be able to use: http://daringfireball.net/2010/07/improved_regex_for_matching_urls and http://www.mattfarina.com/2009/01/08/rfc-3986-url-validation

Norv, according to the aforementioned links (sorry, I really wanted to use a big word), RFC 1738 is the correct one for URLs. RFC 3986 matches all URIs.

LexArma, from my understanding, and someone please correct me if I am wrong, those characters are invalid according to the cited RFCs. Either there is another RFC which I didn't find in my searching, or browsers are doing a workaround like the one used for Asian characters.

AH! I just read further down that blog post (Matt Farina's) and found out about IRIs which are outlined in RFC 3987 and broken down in a Wikipedia article. Also there is Internationalized Domain Names. IRIs are not a spec yet, but I guess we should get ready for them anyway.

I'd have to do more searching for IRI regular expressions to match that. I don't think there are many.

Aleksi "Lex" Kilpinen

Quote
A standard approach to dealing with multilingual domain names was agreed by the IETF in March 2003. It is defined in RFCs 3490, 3491, 3492 and 3454, and is based on Unicode 3.2. One refers to this using the term Internationalized Domain Name or IDN.

http://www.faqs.org/rfcs/rfc3490.html
http://www.faqs.org/rfcs/rfc3491.html
http://www.faqs.org/rfcs/rfc3492.html
http://www.faqs.org/rfcs/rfc3454.html
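
A small sketch of what IDN handling could look like for an auto-linker: convert the host part to its ASCII form before building the link target. This uses Python's built-in `idna` codec, which implements RFC 3490, only as an illustration. The function name `to_ascii_url` is hypothetical, and the sketch assumes the URL has a host and ignores ports and user info.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ascii_url(url):
    """Return url with its host name converted to the ASCII (IDNA) form."""
    parts = urlsplit(url)
    # The "idna" codec applies the ToASCII operation from RFC 3490
    # to each label of the host name.
    ascii_host = parts.hostname.encode("idna").decode("ascii")
    # Assumption: no port or user info, so the host is the whole netloc.
    return urlunsplit(parts._replace(netloc=ascii_host))


print(to_ascii_url("http://www.tänään.fi/"))
```

The visible link text can keep the original characters, while the ASCII form goes into the link target.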
