Simple Machines Community Forum

SMF Development => Bug Reports => Fixed or Bogus Bugs => Topic started by: nathanb on April 22, 2010, 06:22:28 PM

Title: [4623] Auto-linking of some URLs is broken
Post by: nathanb on April 22, 2010, 06:22:28 PM
http://www.facebook.com/#!/video/video.php?v=386443877426&ref=mf

Everything after the # isn't linked.

Thanks!
Nathan
Title: Re: Auto-linking of some URLs is broken
Post by: Arantor on April 22, 2010, 06:23:52 PM
! isn't supposed to be in URLs from what I remember of the URL format.

Blame Facebook for not following HTTP as a standard.
Title: Re: Auto-linking of some URLs is broken
Post by: nathanb on April 22, 2010, 09:30:05 PM
Read section 2.3 of rfc 2396, Unreserved Characters. An exclamation point is legal in a URL. Blame SMF for failing to parse a valid URL correctly :)
Title: Re: Auto-linking of some URLs is broken
Post by: Arantor on April 22, 2010, 09:36:05 PM
OK, just rechecked, and yes it's not got a reserved status.

However, there is actually a very good reason I can think of why it isn't included.

I've seen this before where people post, say, a YouTube link with exclamation marks like so:
You gotta see this it's awesome http://www.youtube.com/?v=somevideoid!!!!!!!!!!!!!!!!!!

And boom, now it's broken if it honours it.

Lesser of two evils, really.
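
For what it's worth, the two cases do not have to be mutually exclusive. Below is a minimal sketch, not SMF's actual parser: it lets "!" match inside a URL and only peels punctuation off the very end of the match. The names URL_PATTERN, TRAILING_PUNCTUATION, split_url and autolink are all hypothetical, and the [url] BBCode output is just for illustration.

```python
import re

# Characters assumed to end a sentence more often than they end a URL.
# This set is an illustrative guess, not taken from any spec or from SMF.
TRAILING_PUNCTUATION = "!?.,;:'\""

# Hypothetical, deliberately loose pattern: a scheme followed by any run of
# characters that are not whitespace or angle brackets. "!" and "#" both match.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def split_url(candidate):
    """Split a matched candidate into (url, trailing_punctuation)."""
    url = candidate.rstrip(TRAILING_PUNCTUATION)
    return url, candidate[len(url):]


def autolink(text):
    """Wrap each detected URL in [url] tags, leaving trailing punctuation outside."""
    def replace(match):
        url, trailing = split_url(match.group(0))
        return "[url]" + url + "[/url]" + trailing

    return URL_PATTERN.sub(replace, text)


if __name__ == "__main__":
    print(autolink("http://www.facebook.com/#!/video/video.php?v=386443877426&ref=mf"))
    print(autolink("it's awesome http://www.youtube.com/?v=somevideoid!!!!"))
```

The trade-off is that a URL which genuinely ends in "!" would lose that last character, but that looks like a much rarer case than either of the two above.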
Title: Re: Auto-linking of some URLs is broken
Post by: nathanb on April 22, 2010, 10:17:30 PM
Wait, so violating the URI-parsing spec is, in your eyes, less evil than doing the right thing? As in, people who post a valid URL will have it break, and that's OK with you?

Kids these days...

At least it shouldn't be hard to patch on my own board.
Title: Re: Auto-linking of some URLs is broken
Post by: Norv on April 23, 2010, 12:34:51 AM
Please, keep a friendly tone in what you'd like to say, or do not say it.

You may want to note that actually, rfc 2396 is obsolete, superseded by rfc 3986: http://tools.ietf.org/html/rfc3986

We will look into the issue. Thank you for the report!
Title: Re: Auto-linking of some URLs is broken
Post by: nathanb on April 23, 2010, 01:09:22 AM
Quote from: Norv on April 23, 2010, 12:34:51 AM
You may want to note that actually, rfc 2396 is obsolete, superseded by rfc 3986: http://tools.ietf.org/html/rfc3986
That is true...in rfc3986 the exclamation point is a delimiter character, which means that it is a legal URI character with special meaning. It should be percent-encoded if it is not being used in its special context.

While an argument could be made either way (that Facebook is or is not using the ! character in the context it was intended to be used), it seems reasonable to assert that keeping Facebook URLs from being broken is a worthwhile feature.
Title: Re: Auto-linking of some URLs is broken
Post by: Arantor on April 23, 2010, 05:29:54 AM
I'm not here to debate whether violating the spec is the right thing to do; there are so many violations of specifications that it isn't funny (Apache doesn't implement all of HTTP 1.1, for example).

I'm just giving you a reason why the regex might not include a !. (Oh, and btw, I stopped being a kid when I turned 18 -- 8 years ago :P)
Title: Re: Auto-linking of some URLs is broken
Post by: Norv on July 18, 2010, 08:02:23 AM
I am not certain what to do with this report. I can see arguments both ways.

Do Facebook URLs really (or, still) look this way? (I don't use Facebook)
Title: Re: Auto-linking of some URLs is broken
Post by: nathanb on July 18, 2010, 09:14:29 AM
Quote from: Norv on July 18, 2010, 08:02:23 AM
Do Facebook URLs really (or, still) look this way? (I don't use Facebook)
They do indeed.

See for example here (http://www.thephorum.net/index.php/topic,4295.msg88267.html#msg88267) or here (http://www.thephorum.net/index.php/topic,4295.msg88236.html#msg88236) (on that second one the post put explicit url tags around the whole link to keep SMF from breaking it).
Title: Re: Auto-linking of some URLs is broken
Post by: Joshua Dickerson on November 22, 2010, 10:14:16 PM
We should follow the standard. If a character is allowed to be in a URL without escaping, we should allow it. Even if Facebook or any other site uses it often.
Title: Re: Auto-linking of some URLs is broken
Post by: Norv on November 22, 2010, 11:57:33 PM
Yes.
This should be documented as currently unsupported in RC4.
Title: Re: Auto-linking of some URLs is broken
Post by: nathanb on February 10, 2011, 11:24:55 AM
FWIW, Tim Bray wrote a post on this topic (http://www.tbray.org/ongoing/When/201x/2011/02/09/Hash-Blecch). It appears that the hashbang is actually being used appropriately in these URLs, meaning that not only does autolinking these urls make sense from a practicality perspective but a spec perspective as well.
Title: Re: Auto-linking of some URLs is broken
Post by: Arantor on February 10, 2011, 11:32:41 AM
It's used appropriately because Google says it's appropriate, rather than the spec saying it, and the entire article is actually slanted as pointing out how bad an idea it is, but I can see where you're coming from.

I still think it's a bad idea to include it, but I guess that there's not really any alternative (see my example above of what is likely to happen to less technical users... fortunately, they at least have the url bbcode button that might avoid the worst of the hassles for that group of users)
Title: Re: Auto-linking of some URLs is broken
Post by: Joshua Dickerson on February 10, 2011, 03:15:58 PM
Tracked: http://dev.simplemachines.org/mantis/view.php?id=4623
Title: Re: [4623] Auto-linking of some URLs is broken
Post by: Joshua Dickerson on February 22, 2011, 02:27:45 PM
Apparently, these characters are reserved as sub-delimiters. So, they should be allowed in URI without percent encoding. It is already tracked, but the developer fixing this bug should take note of all of the delimiters and sub-delimiters before resolving it.
Quote from: RFC 3986 (URI Generic Syntax, January 2005)
2.2.  Reserved Characters

   URIs include components and subcomponents that are delimited by
   characters in the "reserved" set.  These characters are called
   "reserved" because they may (or may not) be defined as delimiters by
   the generic syntax, by each scheme-specific syntax, or by the
   implementation-specific syntax of a URI's dereferencing algorithm.
   If data for a URI component would conflict with a reserved
   character's purpose as a delimiter, then the conflicting data must be
   percent-encoded before the URI is formed.

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

   The purpose of reserved characters is to provide a set of delimiting
   characters that are distinguishable from other data within a URI.
   URIs that differ in the replacement of a reserved character with its
   corresponding percent-encoded octet are not equivalent.  Percent-
   encoding a reserved character, or decoding a percent-encoded octet
   that corresponds to a reserved character, will change how the URI is
   interpreted by most applications.  Thus, characters in the reserved
   set are protected from normalization and are therefore safe to be
   used by scheme-specific and producer-specific algorithms for
   delimiting data subcomponents within a URI.

   A subset of the reserved characters (gen-delims) is used as
   delimiters of the generic URI components described in Section 3.  A
   component's ABNF syntax rule will not use the reserved or gen-delims
   rule names directly; instead, each syntax rule lists the characters
   allowed within that component (i.e., not delimiting it), and any of
   those characters that are also in the reserved set are "reserved" for
   use as subcomponent delimiters within the component.  Only the most
   common subcomponents are defined by this specification; other
   subcomponents may be defined by a URI scheme's specification, or by
   the implementation-specific syntax of a URI's dereferencing
   algorithm, provided that such subcomponents are delimited by
   characters in the reserved set allowed within that component.

   URI producing applications should percent-encode data octets that
   correspond to characters in the reserved set unless these characters
   are specifically allowed by the URI scheme to represent data in that
   component.  If a reserved character is found in a URI component and
   no delimiting role is known for that character, then it must be
   interpreted as representing the data octet corresponding to that
   character's encoding in US-ASCII.
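
To make the character sets above easier to check against, here is a rough sketch that writes them out as data. GEN_DELIMS and SUB_DELIMS are copied from the quoted grammar; UNRESERVED comes from section 2.3 of the same RFC (letters, digits, and "-", ".", "_", "~"). The function name disallowed_characters is hypothetical, and this is only a character-level check, not a validation of the full URI grammar (for instance, it does not check that "%" is followed by two hex digits).

```python
import string

# Copied from the RFC 3986 section 2.2 grammar quoted above.
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RESERVED = GEN_DELIMS + SUB_DELIMS

# RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = string.ascii_letters + string.digits + "-._~"

# "%" is included because it introduces a percent-encoded octet.
ALLOWED = set(RESERVED + UNRESERVED + "%")


def disallowed_characters(url):
    """Return the characters in url outside the sets above, in order of first appearance."""
    seen = []
    for ch in url:
        if ch not in ALLOWED and ch not in seen:
            seen.append(ch)
    return seen


if __name__ == "__main__":
    # The URL from the original report uses only allowed characters.
    print(disallowed_characters("http://www.facebook.com/#!/video/video.php?v=386443877426&ref=mf"))
```

Whoever reworks the autolink regex could compare its character class against these sets to see which legal characters it currently rejects.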
Title: Re: [4623] Auto-linking of some URLs is broken
Post by: Aleksi "Lex" Kilpinen on February 23, 2011, 03:08:33 AM
Related note, scandics like ä and ö will break a URL in SMF, although they are actually valid. For example http://www.tänään.fi/ is a valid URL and a working website.

( ADDED: See, http://dev.simplemachines.org/mantis/view.php?id=4623 Mantis understands it, SMF doesn't. )
Title: Re: [4623] Auto-linking of some URLs is broken
Post by: Joshua Dickerson on February 23, 2011, 03:45:39 AM
Two URLs with regular expressions we might be able to use: http://daringfireball.net/2010/07/improved_regex_for_matching_urls and http://www.mattfarina.com/2009/01/08/rfc-3986-url-validation

Norv, according to the aforementioned links (sorry, I really wanted to use a big word), RFC 1738 (http://tools.ietf.org/html/rfc1738) is the correct one for URLs. RFC 3986 matches all URIs.

LexArma, from my understanding, and someone please correct me if I am wrong, those characters are invalid according to the cited RFCs. Either there is another RFC which I didn't find in my searching, or browsers are applying a workaround like the one used for Asian characters.

AH! I just read further down that blog post (Matt Farina's) and found out about IRIs, which are outlined in RFC 3987 (http://tools.ietf.org/html/rfc3987) and broken down in a Wikipedia article (http://en.wikipedia.org/wiki/Internationalized_Resource_Identifier). There are also Internationalized Domain Names (http://en.wikipedia.org/wiki/Internationalized_domain_name). IRIs are not a spec yet, but I guess we should get ready for them anyway.

I'd have to do more searching for IRI regular expressions to match that. I don't think there are many.
Title: Re: [4623] Auto-linking of some URLs is broken
Post by: Aleksi "Lex" Kilpinen on February 23, 2011, 03:58:41 AM
Quote
A standard approach to dealing with multilingual domain names was agreed by the IETF in March 2003. It is defined in RFCs 3490, 3491, 3492 and 3454, and is based on Unicode 3.2. One refers to this using the term Internationalized Domain Name or IDN.

http://www.faqs.org/rfcs/rfc3490.html
http://www.faqs.org/rfcs/rfc3491.html
http://www.faqs.org/rfcs/rfc3492.html
http://www.faqs.org/rfcs/rfc3454.html
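
As a rough illustration of what those RFCs do to a host name, here is a minimal sketch using Python's built-in "idna" codec, which implements the RFC 3490 conversion. The function name to_ascii_url is hypothetical, and the sketch assumes the URL has no user:password part (it would be dropped).

```python
from urllib.parse import urlsplit, urlunsplit


def to_ascii_url(url):
    """Rewrite the host of url in its ASCII ("xn--") form; other parts are left unchanged."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # The built-in "idna" codec applies the RFC 3490 ToASCII conversion per label.
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host
    if parts.port is not None:
        netloc += ":" + str(parts.port)
    # Assumption: no userinfo in the URL; this sketch does not preserve it.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(to_ascii_url("http://www.tänään.fi/"))
```

An autolinker could match the Unicode form in the post text and use the converted form for the actual href, though whether that is the right approach for SMF is an open question.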

Title: Re: [4623] Auto-linking of some URLs is broken
Post by: Illori on November 18, 2011, 08:27:18 AM
seems fixed in 2.0
Title: Re: [4623] Auto-linking of some URLs is broken
Post by: Aleksi "Lex" Kilpinen on November 18, 2011, 08:56:41 AM
Scandics are still broken though...

http://www.ääkkösiä.fi/
Title: Re: [4623] Auto-linking of some URLs is broken
Post by: Illori on November 18, 2011, 08:58:58 AM
maybe worth opening a separate report for that one... or adding it to one of the other url bugs.