[2.1 RC3] "Automatically link posted URLs" setting does not work

davidhs · February 23, 2021, 03:13:26 PM

The "Automatically link posted URLs" setting (Administration Center > Configuration > Features and Options > Bulletin Board Code) does not work in SMF 2.1 RC3 or in any earlier version back to 2.1 Beta 3. It works in 2.1 Beta 2.

The regular expression that matches URLs written without BBC was changed in 2.1 Beta 3, so I suppose the problem is in this regular expression.

In 2.1 RC3 the file is:

Code (Sources\Subs.php (line 2340))

<?php

			if (!empty($modSettings['autoLinkUrls']))
			{
				// Are we inside tags that should be auto linked?
				$no_autolink_area = false;
				if (!empty($open_tags))
				{
					foreach ($open_tags as $open_tag)
						if (in_array($open_tag['tag'], $no_autolink_tags))
							$no_autolink_area = true;
				}

				// Don't go backwards.
				// @todo Don't think is the real solution....
				$lastAutoPos = isset($lastAutoPos) ? $lastAutoPos : 0;
				if ($pos < $lastAutoPos)
					$no_autolink_area = true;
				$lastAutoPos = $pos;

				if (!$no_autolink_area)
				{
					// An &nbsp; right after a URL can break the autolinker
					if (strpos($data, '&nbsp;') !== false)
					{
						$placeholders['<placeholder non-breaking-space>'] = '&nbsp;';
						$data = strtr($data, array('&nbsp;' => '<placeholder non-breaking-space>'));
					}

					// Parse any URLs
					if (!isset($disabled['url']) && strpos($data, '[url') === false)
					{
						// For efficiency, first define the TLD regex in a PCRE subroutine
						$url_regex = '(?(DEFINE)(?<tlds>' . $modSettings['tld_regex'] . '))';

						// Now build the rest of the regex
						$url_regex .=
						// 1. IRI scheme and domain components
						'(?:' .
							// 1a. IRIs with a scheme, or at least an opening "//"
							'(?:' .

								// URI scheme (or lack thereof for schemeless URLs)
								'(?:' .
									// URL scheme and colon
									'\b[a-z][\w\-]+:' .
									// or
									'|' .
									// A boundary followed by two slashes for schemeless URLs
									'(?<=^|\W)(?=//)' .
								')' .

								// IRI "authority" chunk
								'(?:' .
									// 2 slashes for IRIs with an "authority"
									'//' .
									// then a domain name
									'(?:' .
										// Either the reserved "localhost" domain name
										'localhost' .
										// or
										'|' .
										// a run of IRI characters, a dot, and a TLD
										'[\p{L}\p{M}\p{N}\-.:@]+\.(?P>tlds)' .
									')' .
									// followed by a non-domain character or end of line
									'(?=[^\p{L}\p{N}\-.]|$)' .

									// or, if no "authority" per se (e.g. "mailto:" URLs)...
									'|' .

									// a run of IRI characters
									'[\p{L}\p{N}][\p{L}\p{M}\p{N}\-.:@]+[\p{L}\p{M}\p{N}]' .
									// and then a dot and a closing IRI label
									'\.[\p{L}\p{M}\p{N}\-]+' .
								')' .
							')' .

							// Or
							'|' .

							// 1b. Naked domains (e.g. "example.com" in "Go to example.com for an example.")
							'(?:' .
								// Preceded by start of line or a non-domain character
								'(?<=^|[^\p{L}\p{M}\p{N}\-:@])' .
								// A run of Unicode domain name characters (excluding [:@])
								'[\p{L}\p{N}][\p{L}\p{M}\p{N}\-.]+[\p{L}\p{M}\p{N}]' .
								// and then a dot and a valid TLD
								'\.(?P>tlds)' .
								// Followed by either:
								'(?=' .
									// end of line or a non-domain character (excluding [.:@])
									'$|[^\p{L}\p{N}\-]' .
									// or
									'|' .
									// a dot followed by end of line or a non-domain character (excluding [.:@])
									'\.(?=$|[^\p{L}\p{N}\-])' .
								')' .
							')' .
						')' .

						// 2. IRI path, query, and fragment components (if present)
						'(?:' .

							// If any of these parts exist, must start with a single "/"
							'/' .

							// And then optionally:
							'(?:' .
								// One or more of:
								'(?:' .
									// a run of non-space, non-()<>
									'[^\s()<>]+' .
									// or
									'|' .
									// balanced parentheses, up to 2 levels
									'\(([^\s()<>]+|(\([^\s()<>]+\)))*\)' .
								')+' .
								// Ending with:
								'(?:' .
									// balanced parentheses, up to 2 levels
									'\(([^\s()<>]+|(\([^\s()<>]+\)))*\)' .
									// or
									'|' .
									// not a space or one of these punctuation characters
									'[^\s`!()\[\]{};:\'".,<>?«»“”‘’/]' .
									// or
									'|' .
									// a trailing slash (but not two in a row)
									'(?<!/)/' .
								')' .
							')?' .
						')?';

						$data = preg_replace_callback('~' . $url_regex . '~i' . ($context['utf8'] ? 'u' : ''), function($matches)
						{
							$url = array_shift($matches);

							// If this isn't a clean URL, bail out
							if ($url != sanitize_iri($url))
								return $url;

							$scheme = parse_url($url, PHP_URL_SCHEME);

							if ($scheme == 'mailto')
							{
								$email_address = str_replace('mailto:', '', $url);
								if (!isset($disabled['email']) && filter_var($email_address, FILTER_VALIDATE_EMAIL) !== false)
									return '[email=' . $email_address . ']' . $url . '[/email]';
								else
									return $url;
							}

							// Are we linking a schemeless URL or naked domain name (e.g. "example.com")?
							if (empty($scheme))
								$fullUrl = '//' . ltrim($url, ':/');
							else
								$fullUrl = $url;

							// Make sure that $fullUrl really is valid
							if (validate_iri((strpos($fullUrl, '//') === 0 ? 'http:' : '') . $fullUrl) === false)
								return $url;

							return '[url=&quot;' . str_replace(array('[', ']'), array('&#38;#91;', '&#38;#93;'), $fullUrl) . '&quot;]' . $url . '[/url]';
						}, $data);
					}

?>

(line 2463 has UTF-8 characters)

For example, if I write this:

Code

http://www.simplemachines.org1/
https://www.simplemachines.org2/
ftp://www.simplemachines.org3/
ftps://www.simplemachines.org4/
[url]http://www.simplemachines.org5/[/url]
[url=http://www.simplemachines.org6/]Home of SMF6[/url]

I see this:

Quote from: 2.1 Beta 2 and previous
http://www.simplemachines.org1/
https://www.simplemachines.org2/
ftp://www.simplemachines.org3/
ftps://www.simplemachines.org4/
http://www.simplemachines.org5/
Home of SMF6

Quote from: 2.1 RC3
http://www.simplemachines.org1/ -- without link
https://www.simplemachines.org2/ -- without link
ftp://www.simplemachines.org3/ -- without link
ftps://www.simplemachines.org4/ -- without link
http://www.simplemachines.org5/
Home of SMF6

shawnb61 · February 23, 2021, 03:21:34 PM

Thanks for the report.

Yes, known issue, it's a dupe of:
https://www.simplemachines.org/community/index.php?topic=576638

And it is up on GitHub as:
https://github.com/SimpleMachines/SMF2.1/issues/6497

shawnb61 · February 23, 2021, 03:32:17 PM

Two layers to the issue, I think...

One is it doesn't appear to like invalid domain extensions. E.g., .org1 is not valid. That is likely a good check - it's not a valid URL.

The other is that it can sometimes get confused when there are encoded special characters at the very end, as in the original bug report. That can fail on valid URLs.
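
To illustrate the first point, here is a minimal sketch that reuses only the "scheme + authority" part of the regex quoted above. It is not the SMF implementation: the three-entry TLD list and the helper name has_autolinkable_url() are assumptions made up for this example (SMF reads the real list from $modSettings['tld_regex']).

```php
<?php
// ASSUMPTION: a shortened, hypothetical TLD list for illustration only.
// In SMF the real list comes from $modSettings['tld_regex'].
$tld_regex = 'com|net|org';

// Define the TLD list once as a PCRE subroutine, as the quoted code does.
$url_regex = '(?(DEFINE)(?<tlds>' . $tld_regex . '))' .
	// URL scheme, colon, and two slashes
	'\b[a-z][\w\-]+://' .
	// a run of IRI characters, a dot, and a TLD
	'[\p{L}\p{M}\p{N}\-.:@]+\.(?P>tlds)' .
	// followed by a non-domain character or end of line
	'(?=[^\p{L}\p{N}\-.]|$)';

// Hypothetical helper: true if the text contains a URL this regex would link.
function has_autolinkable_url($text, $url_regex)
{
	return preg_match('~' . $url_regex . '~iu', $text) === 1;
}

// ".org1" is not in the TLD list, and ".org" is followed by a digit,
// so the lookahead fails and nothing matches.
var_dump(has_autolinkable_url('http://www.simplemachines.org1/', $url_regex)); // bool(false)

// A digit in the domain label is fine as long as the TLD itself is valid.
var_dump(has_autolinkable_url('http://www.simplemachines1.org/', $url_regex)); // bool(true)
```

Under these assumptions, only the TLD decides whether a URL is linked, which is consistent with a made-up extension such as .org1 being left as plain text.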

davidhs · February 23, 2021, 04:51:23 PM

Quote from: shawnb61 on February 23, 2021, 03:21:34 PM
Thanks for the report.

Yes, known issue, it's a dupe of:
https://www.simplemachines.org/community/index.php?topic=576638

And it is up on GitHub as:
https://github.com/SimpleMachines/SMF2.1/issues/6497

I am sorry, I searched before posting but I did not find it.

Quote from: shawnb61 on February 23, 2021, 03:32:17 PM
Two layers to the issue, I think...

One is it doesn't appear to like invalid domain extensions. E.g., .org1 is not valid. That is likely a good check - it's not a valid URL.

The other is that it can sometimes get confused when there are encoded special characters at the very end, as in the original bug report. That can fail on valid URLs.

Again...

I use these invalid URLs (really, invalid domains) in a test for one of my mods (the mod searches for URLs in a post and writes them all at the beginning or at the end of the post). Until now this worked, so I did not think the problem was the validity of the URLs.

I have now tested with:

Code

http://www.simplemachines.org/
https://www.simplemachines.org/
ftp://www.simplemachines.org/
ftps://www.simplemachines.org/
http://www.simplemachines1.org/
https://www.simplemachines2.org/
ftp://www.simplemachines3.org/
ftps://www.simplemachines4.org/

and it works:

Quote
http://www.simplemachines.org/
https://www.simplemachines.org/
ftp://www.simplemachines.org/
ftps://www.simplemachines.org/
http://www.simplemachines1.org/
https://www.simplemachines2.org/
ftp://www.simplemachines3.org/
ftps://www.simplemachines4.org/

(I must change my tests to use valid domains!)

But... there is an inconsistency: if I use the url/iurl BBC, it always works (with valid domains and with invalid domains).

shawnb61 · February 23, 2021, 05:21:30 PM

BBCs in general don't do edits on content... You can do all kinds of illogical things with BBC.

But if someone put it there on purpose, SMF shouldn't second-guess them.

OTOH, the auto-link is put there by SMF.
