• Welcome to Simple Machines Community Forum. Please login or sign up.
December 07, 2021, 05:09:13 AM

News:

SMF 2.0.18 has been released! Please update. Read more.


[2.0.12] Search highlight substitution bug breaking HTML tags?

Started by testr, December 27, 2016, 01:55:36 PM

Previous topic - Next topic

testr

Hi SMF, I noticed a bug with search highlighting where broken HTML is being produced. This is happening on 3 different SMF installations that are running 2.0.12, including this board.

Minimal repro steps, plus resulting screenshot attached:

1) Perform a forum search for the terms: AS ON
2) Notice broken text and user-visible partial class/tags in certain thread title text, e.g.
SMF 2.0.x Support / As admin and as user, I get logged out when I switch forums ass="highlight">on my own board
3) View source and notice broken HTML tags being generated:
<str<strong class="highlight">on</strong>g class="highlight">As</str<strong class="highlight">on</strong>g> admin and
<str<strong class="highlight">on</strong>g class="highlight">as</str<strong class="highlight">on</strong>g> user, I get logged
out when I switch forums <strong class="highlight">on</strong> my own board


Would appear to be an issue with the HTML output of the highlighter going through some type of further text replacement, breaking the tags? I couldn't find any posts about this bug, is this a known issue?

Arantor

This is not a new issue; it's been on the (old) tracker for years and no-one had a good way to fix it yet :(
No good deed goes unpunished
All helpful urges should be circumvented

testr

Ah, yes I do see a couple previous threads now, however the issue doesn't seem to have been identified there. Here is the relevant code:

foreach ($context['key_words'] as $query)
{
// Fix the international characters in the keyword too.
$query = strtr($smcFunc['htmlspecialchars']($query), array('\\\'' => '\''));

$body_highlighted = preg_replace_callback('/((<[^>]*)|' . preg_quote(strtr($query, array('\'' => '&#38;#039;')), '/') . ')/i' . ($context['utf8'] ? 'u' : ''), 'search_highlight__preg_callback', $body_highlighted);
$subject_highlighted = preg_replace('/(' . preg_quote($query, '/') . ')/i' . ($context['utf8'] ? 'u' : ''), '<strong class="highlight">$1</strong>', $subject_highlighted);
}


This code is problematic for one because it's looping over the plaintext search terms, and with each one replacing it with HTML, then repeating the process. Clearly if any HTML from an earlier loop matches the text in a later loop, then the HTML will become malformed.

The entire text -> html replacement needs to be done in a single step, it cannot be done in a loop like this and produce proper results. I am not a PHP developer but a search reveals something possibly like this for multi-replace: hxxp:stackoverflow.com/a/535146 [nonactive]

testr

Actually I think str_replace is also the wrong thing to use, probably need something like strtr to do multiple replacements in one step: hxxp:php.net/strtr [nonactive]

Anyways, I bring this up because the current behavior seems to me like it could potentially contain obscure XSS vulnerabilities on account of the broken tags, especially since search accepts GET requests and is not protected by CSRF tokens.

live627

Bug amnesia? I was unaware of this odd behavior.

strtrr() will definitely fix this.

Arantor

It's already using strtr, except for the bits that aren't case sensitive for which strtr doesn't help you... which is why we didn't fix this pre 2.0 final, we certainly knew about it pre 2.0 final. The problem is that it has to run the replacements in parallel and there's no version of strtr that is case insensitive.

Doesn't look like it got onto Mantis, which is weird, but a quick search yielded http://www.simplemachines.org/community/index.php?topic=413837.0 as a report in 2010.
No good deed goes unpunished
All helpful urges should be circumvented

testr

The bug seems to only be manifesting in matching thread subject titles, not in matching post body. Are there different case-sensitivity or other requirements between those two?

I am not smart enough to understand the regexes used in that part of the code, but there definitely seems to be a significantly differing codepath between the two cases. Body (line 2067) is using the search_highlight__preg_callback func as a regex callback and using strtr, for example, whereas subject (line 2068) is just using the simple preg_replace.

Do you understand why the body highlighting code isn't suffering from this same issue? I can't really figure it out.

I was going to suggest an alternate way of addressing the problem, but if the issue could be solved with all existing code it'd probably be for the best.

live627

Is the fix staring right at us?

Code (Find) Select
$subject_highlighted = preg_replace('/(' . preg_quote($query, '/') . ')/i' . ($context['utf8'] ? 'u' : ''), '<strong class="highlight">$1</strong>', $subject_highlighted);

Code (Replace with) Select
$subject_highlighted = preg_replace_callback('/((<[^>]*)|' . preg_quote(strtr($query, array('\'' => '&#039;')), '/') . ')/i' . ($context['utf8'] ? 'u' : ''), 'search_highlight__preg_callback', $subject_highlighted);
//~ $subject_highlighted = preg_replace('/(' . preg_quote($query, '/') . ')/i' . ($context['utf8'] ? 'u' : ''), '<strong class="highlight">$1</strong>', $subject_highlighted);


It's just the same code the body uses... :o

Arantor

No good deed goes unpunished
All helpful urges should be circumvented

live627


Arantor

Same as the OP described, the find/replace stuff where it replaces the HTML that has been replaced already, e.g. a search that includes the word class as a second or third word of the string being searched for, so class will be inserted by the first word match and then broken by the second or third.
No good deed goes unpunished
All helpful urges should be circumvented

live627


Arantor

No good deed goes unpunished
All helpful urges should be circumvented

Arantor

OK, so there's some weirdness that's definitely not the kind you're thinking of and it does partially but not completely fix the problem.

In all cases, the post is literally just the words 'strong on ron class lass'

Capture1 - only 'as' is highlighted when I think I'd expect other things to be highlighted.

Capture2 - search for 'class lass class' yields the visible expected result... but...

Capture3 - the markup shows nested strong tags.

Capture4 - more weirdness.

Now you might argue 'contrived example' and it is but it still demonstrates there is something wrong here. Just not the bug there used to be.
No good deed goes unpunished
All helpful urges should be circumvented

live627


live627

This code probably fixes the issue. Seems to work ok in my tests...

Code (Fnd) Select
foreach ($context['key_words'] as $query)
{
// Fix the international characters in the keyword too.
$query = strtr($smcFunc['htmlspecialchars']($query), array('\\\'' => '\''));

$body_highlighted = preg_replace_callback('/((<[^>]*)|' . preg_quote(strtr($query, array('\'' => '&#039;')), '/') . ')/i' . ($context['utf8'] ? 'u' : ''), 'search_highlight__preg_callback', $body_highlighted);
$subject_highlighted = preg_replace('/(' . preg_quote($query, '/') . ')/i' . ($context['utf8'] ? 'u' : ''), '<strong class="highlight">$1</strong>', $subject_highlighted);
}


Code (Replace with) Select

$body_highlighted = highlight($message['body'], $context['key_words']);
$subject_highlighted = highlight($message['subject'], $context['key_words']);


Code (Append to file) Select
/**
* Highlighting matching string
*
* @param string $text Text to search through
* @param string $words List of keywords to search
*
* @return string Text with highlighted keywords
*/
function highlight($text, $words)
{
$words = implode('|', array_map('preg_quote', $words));
$highlighted = preg_filter('/' . $words . '/i', '<span class="highlight">$0</span>', $text);
if (!empty($highlighted))
{
$text = $highlighted;
}

return $text;
}

Advertisement: