2 character minimum in search

Started by GunSlingerBoy, June 23, 2012, 07:34:29 PM


Arantor

In which case, as per your bug report, you should ignore the results until you retest it on your own forum. As already noted, this forum runs Sphinx - an option not readily available - and I know from experience that it has an atypical configuration.

I'm not sure which of Sphinx's APIs it's using, for instance, and it may even be the unreliable one that Sphinx itself discontinued after realising how buggy it was.
Holder of controversial views, all of which my own.

Douglas

My comments from another thread, and reasons why we need to change the search params to a "total number of characters minimum", rather than eliminating single characters.

Quote
Trying to find out why, when I do an "include" in index.template.php, it's returning a one. I'm searching "include returns a 1" and I get hit with this:

Each word must be at least two characters long.

We may want to drop this restriction, because sometimes there ARE things people are searching for that have single characters.

Now, I can understand (and fully support) if there's a minimum number of characters (say... 3), but don't limit the word search to two or more characters, please. :)

EDIT: Okay, so I did include returns "1" and got the same thing... so now I'm up sh*t's creek with this restriction.
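For illustration, a minimal sketch of the "total number of characters minimum" idea, under stated assumptions: `query_meets_total_minimum` is a hypothetical helper, not an SMF function, the threshold of 3 is the figure suggested above, and `mb_strlen` stands in for SMF's `$smcFunc['strlen']`.

```php
<?php
// Hypothetical helper, not part of SMF: accept a query when the combined
// length of its words reaches a minimum, even if some words are one character.
function query_meets_total_minimum(string $query, int $minTotal = 3): bool
{
    // Split on whitespace and discard empty pieces.
    $words = preg_split('/\s+/u', trim($query), -1, PREG_SPLIT_NO_EMPTY);

    $total = 0;
    foreach ($words as $word) {
        $total += mb_strlen($word, 'UTF-8');
    }

    return $total >= $minTotal;
}

var_dump(query_meets_total_minimum('include returns a 1')); // bool(true)
var_dump(query_meets_total_minimum('a'));                    // bool(false)
```

This only decides whether the query is accepted; it says nothing about how well a one-character word can be matched afterwards.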
Doug Hazard
* Full Stack (Web) Developer for The Catholic Diocese of Richmond
(20+ Diocesan sites, 130+ Church sites & 24 School sites)
* HBCUAC.org Web Developer, the NAIA's only HBCU Athletic Conference
* Former Sports Photographer and Media Personality and Former CFB Historian
* Tech Admin for one 2.9M+ post and one 11.6M+ post sites. Used to own a 1M+ post site.
* WordPress Developer (Junkie / Guru / Maven / whatever)

kat


Arantor

Even though you can still remove the restriction and STILL not find what you were searching for, only now you get dozens of unrelated, irrelevant matches, because that's obviously a better use of your time. ::)

MrPhil

If you restrict each search TERM (one word OR a "-enclosed phrase) to a minimum of 2 or 3 characters, that would make everyone happy. You could still have single letters, e.g., "include returns a 1" and only that phrase should be returned. Is there some reason SMF is breaking down phrases into single words and looking for them individually? If there is, only if all words are found, in order, should it be considered a hit. In the database, each post should be one long string, so I don't see why we can't search for a phrase at a time. On the other hand, if we're actually looking at some index with only single words, SMF needs to then check whether they all appear in the right order.
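A rough sketch of what is described above, not SMF's actual search code: the minimum length applies to each term (a bare word or a "-enclosed phrase), and a phrase only counts as a hit when it appears as one contiguous string. All three function names are hypothetical, and `post_matches` is assumed to run only on the candidate posts that a word index has already returned, not on the whole messages table.

```php
<?php
// Hypothetical helpers, not SMF code.

// Split a query into terms: a double-quoted phrase is one term, any other
// run of non-space characters is one term.
function split_search_terms(string $query): array
{
    preg_match_all('/"([^"]*)"|(\S+)/u', $query, $matches, PREG_SET_ORDER);

    $terms = array();
    foreach ($matches as $match) {
        // Group 1 is filled for a quoted phrase, group 2 for a bare word.
        $term = $match[1] !== '' ? $match[1] : ($match[2] ?? '');
        if (trim($term) !== '') {
            $terms[] = trim($term);
        }
    }
    return $terms;
}

// Keep terms whose length reaches the minimum. The minimum applies to the
// whole term, so a phrase may still contain one-character words.
function filter_short_terms(array $terms, int $minLength = 2): array
{
    return array_values(array_filter($terms, function ($term) use ($minLength) {
        return mb_strlen($term, 'UTF-8') >= $minLength;
    }));
}

// A post is a hit only if every term occurs in it; a phrase has to appear
// as one contiguous string, which also enforces word order.
function post_matches(string $post, array $terms): bool
{
    foreach ($terms as $term) {
        if (mb_stripos($post, $term, 0, 'UTF-8') === false) {
            return false;
        }
    }
    return true;
}

$terms = filter_short_terms(split_search_terms('"include returns a 1" a template'));
print_r($terms); // the phrase and "template" survive; the lone "a" is dropped
```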

Arantor

Honestly did you not read what I posted?

Consider this site. It has 3 million posts. Its database must run into the gigabytes. Do you seriously intend that SMF should go through even 1GB of data to perform a search? Just consider how long that would take to perform, and how much hurt that's going to put on the server.

The ONLY solution is to build an index of the words, much smaller and faster to traverse to give you meaningful results.

In any case, you can't even rely on your treasured exact match to work. It only takes someone putting a double space into the original and it's never going to work. (SMF converts all double spaces into a space followed by an nbsp entity.)
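To make the double-space problem concrete, here is a small sketch. It assumes the stored post body contains the literal `&nbsp;` entity as described above; `normalize_spaces` is a hypothetical helper, not an SMF function, and it would have to be applied to both the stored text and the typed phrase, which is extra work on every comparison.

```php
<?php
// Hypothetical helper, not SMF code: collapse whitespace variants so that
// "a  b" stored as "a &nbsp;b" compares equal to the typed phrase "a b".
function normalize_spaces(string $text): string
{
    // Treat the nbsp entity and the raw U+00A0 character as ordinary spaces.
    $text = str_replace(array('&nbsp;', "\u{00A0}"), ' ', $text);

    // Collapse any run of whitespace into a single space.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

$stored = 'include returns &nbsp;a 1'; // what the post body might contain
$typed  = 'include returns a 1';       // what the user searches for

// The raw comparison misses, the normalized one hits.
var_dump(strpos($stored, $typed) !== false);                                     // bool(false)
var_dump(strpos(normalize_spaces($stored), normalize_spaces($typed)) !== false); // bool(true)
```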

emanuele

You know... the best way to understand whether something is what people really want is to give them the tool and let them figure it out by themselves.

The following should just strip the "single-char words" out of the search and give a notice about the letters removed.

Search.php
Code (find)
// Trim everything and make sure there are no words that are the same.
Code (add before)
$context['search_ignored'] = array();


Code (find)
// Don't allow very, very short words.
elseif ($smcFunc['strlen']($value) < 2)
{
	$context['search_errors']['search_string_small_words'] = true;
	unset($searchArray[$index]);
}

Code (replace with)
// Don't allow very, very short words.
elseif ($smcFunc['strlen']($value) < 2)
{
	$context['search_ignored'][] = $value;
	unset($searchArray[$index]);
}



Search.template.php
The CSS will certainly not work because it comes from 2.1; as for the markup... no idea.
Code (find)
function template_results()
{
	global $context, $settings, $options, $txt, $scripturl, $message;

Code (add after)
	if (!empty($context['search_ignored']))
		echo '
	<div id="search_results">
		<div class="cat_bar">
			<h3 class="catbg">
				', $txt['generic_warning'], '
			</h3>
		</div>
		<span class="upperframe"><span></span></span>
		<div class="roundframe">
			<p class="noticebox">', $txt['search_warning_ignored_word' . (count($context['search_ignored']) == 1 ? '' : 's')], ': ', implode(', ', $context['search_ignored']), '</p>
		</div>
		<span class="lowerframe"><span></span></span>
	</div><br />';



Code (find)
	if (!empty($context['search_errors']))
		echo '
	<p class="errorbox">', implode('<br />', $context['search_errors']['messages']), '</p>';

Code (add before)
	if (!empty($context['search_ignored']))
		echo '
	<p class="noticebox">', $txt['search_warning_ignored_word' . (count($context['search_ignored']) == 1 ? '' : 's')], ': ', implode(', ', $context['search_ignored']), '</p>';



Search.english.php
Code (find)
$txt['search_adjust_query'] = 'Adjust Search Parameters';
Code (add after)
$txt['search_warning_ignored_word'] = 'This term has been ignored in your search';
$txt['search_warning_ignored_words'] = 'These terms have been ignored in your search';


Take a peek at what I'm doing! ;D




Do you need support in Italian?

Help us to help you: explain your problem properly. No, "it doesn't work" is not an explanation!!
1) What you do,
2) what you expect,
3) what you get.

Elmacik

How can we say that search engines are not clever enough to match "doesn't" to "does not"? They can do it pretty well, and much more besides. I say that instead of throwing a warning in the user's face, doing what other search engines do and dropping those characters is better when possible.
Home of Elmacik

Kindred

ummmm....   elmacik,

When one of our forums has resources like Google to catalog the searches and run against "exact match", then go for it...   until then, a single search like that on my forum would likely kill the server. As for matching doesn't to does not...   that would, of course depend upon cataloging and matching EVERY abbreviation/contraction with the non-abbreviation/contraction.


So, which would you rather have
-- a search that tells you "sorry, that won't work"
-- a search which drops characters from your search, without telling you and thus brings up a set of results which may not actually match your intended search.
Glory to Ukraine

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."

Thantos

Quote from: Kindred on June 29, 2012, 09:19:37 AM
ummmm....   elmacik,

When one of our forums has resources like Google to catalog the searches and run against "exact match", then go for it...   until then, a single search like that on my forum would likely kill the server. As for matching doesn't to does not...   that would, of course depend upon cataloging and matching EVERY abbreviation/contraction with the non-abbreviation/contraction.


So, which would you rather have
-- a search that tells you "sorry, that won't work"
-- a search which drops characters from your search, without telling you and thus brings up a set of results which may not actually match your intended search.

How about c) A search which drops the characters, runs the search, tells you it dropped the characters, and displays the results from the search it could do.

CircleDock

Quote from: Kindred on June 29, 2012, 09:19:37 AM
When one of our forums has resources like Google to catalog the searches and run against "exact match", then go for it...   until then, a single search like that on my forum would likely kill the server. As for matching doesn't to does not...   that would, of course depend upon cataloging and matching EVERY abbreviation/contraction with the non-abbreviation/contraction.


So, which would you rather have
-- a search that tells you "sorry, that won't work"
-- a search which drops characters from your search, without telling you and thus brings up a set of results which may not actually match your intended search.

Actually, all it requires is to index "doesn't" as "doesnot", i.e. replace apostrophes that separate the letters "n" and "t" with the letter "o"...
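As a sketch of that rule, and only of that rule: `normalize_nt` is a hypothetical function, not SMF or Sphinx code, and it assumes the same normalization is applied when a post is indexed and when a query is parsed.

```php
<?php
// Hypothetical normalizer, not SMF code: rewrite the "n't" ending as "not"
// so that the same token is produced at index time and at query time.
function normalize_nt(string $word): string
{
    $word = mb_strtolower($word, 'UTF-8');

    // Accept the plain apostrophe and the typographic one (U+2019).
    return preg_replace("/n['\u{2019}]t\\b/u", 'not', $word);
}

echo normalize_nt("doesn't"), "\n"; // doesnot
echo normalize_nt("can't"), "\n";   // canot (one "n", so not the same token as "cannot")
echo normalize_nt("we'll"), "\n";   // we'll (unchanged, the rule only covers n't)
```

The output shows what the rule does and does not cover: "doesn't" and a pre-joined "doesnot" meet, while "can't" and other contractions are left to separate handling.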

emanuele

And all the possible (correct and incorrect) grammatical variants of any other word?



Kindred

ah, but CircleDock, n't is NOT the only contraction out there...

What about "we'll" - does that need to be cataloged as "we will" or maybe "we shall". How about "we're" or "you're"?
And then you get into possessives or (mis)uses of the apostrophe

What about British versus American spelling?
Should armor also return armour?

Elmacik

Well Kindred, I wasn't actually talking about SMF search only, but about search logic itself, because Arantor said that search engines are not clever enough to match "doesn't" to "does not". They can match a whole lot of words and phrases you can't even imagine. But of course I agree with what you say: none of this is efficient to implement in a forum search. Nevertheless, I still say that dropping the characters that need to be ignored is better, and it won't necessarily return a useless set of results, because "a brown kitten" will match "brown kitten" and it still makes sense without the "a".

Arantor

Oh, so you're going to split hairs. None of the search engines supported by SMF do this. Google does not do it reliably either; what it does do is not matching as such, but treating them as misspellings and comparing them against a known lexicon.

Elmacik

Ah yes, "reliable" is really very relative. Simple matches are done very well and most searches return reasonable suggestions for your string. Yeah, you can say that it's not perfect and I'd agree with that, but it's quite unfair and absurd to say "not clever enough to match doesn't to does not".

Arantor

It's not unfair at all. Google does not consistently or reliably have the two being the same thing at all.

For example, I just did a search on 'it doesn't make sense' without the extra quotes. The first result back matches 'doesn't make sense' at one point in the page, but it doesn't match against 'does not make sense' earlier in the same page.

I cannot find a search variation where 'doesn't' gives me the same results (even vaguely) as 'does not', and the same goes for the other contractions.

If you have a search setup where you have precise control over it, you can define your own (e.g. with Sphinx) so that you define that 'does not' and 'doesn't' are treated the same, but then you're doing it manually, knowing the specific cases related to your body of text to be tokenised and processed.
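A minimal sketch of that kind of manual definition, written as plain PHP and not as Sphinx configuration. The table contents and the `tokenize_with_equivalents` function are hypothetical; the point is only that every pair has to be listed by hand for the body of text in question.

```php
<?php
// Hypothetical, hand-maintained equivalence table (not Sphinx configuration
// and not SMF code). Every entry has to be chosen by a person who knows the
// text being indexed.
$equivalents = array(
    "doesn't" => 'does not',
    "don't"   => 'do not',
    "we're"   => 'we are',
);

// Tokenise text after expanding the listed contractions, so the same tokens
// come out whichever form the author or the searcher used.
function tokenize_with_equivalents(string $text, array $equivalents): array
{
    $text = strtr(mb_strtolower($text, 'UTF-8'), $equivalents);

    // Split on anything that is not a letter, a digit or an apostrophe.
    return preg_split('/[^\p{L}\p{N}\']+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(tokenize_with_equivalents("It doesn't make sense", $equivalents));
print_r(tokenize_with_equivalents('It does not make sense', $equivalents));
// Both calls give the tokens: it, does, not, make, sense
```

Anything missing from the table (for instance "we'll") passes through untouched, which is exactly the manual upkeep being described.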

Elmacik

So you disprove your own statement: search engine systems do not really need a human brain to do simple matches. As you say, you can even set it up for yourself with a custom configuration, not to mention what greater search technologies can do. The real statement should be "not doing it does not necessarily mean not being able to", and not "not clever enough to match". And I still say "brown kitten" makes pretty good sense for a search like "a brown kitten".

Arantor

No, far from disproving my own statement, I backed it up by trying to prove it further: search engine systems still cannot do it reliably without humans intervening and doing it manually!

You're actually arguing to agree with what I'm saying: Google and most other systems will quite happily match 'brown kitten' with 'a brown kitten' because they are MORE THAN HAPPY to disregard the 'a' for being too short!

And Google frequently does, in my experience. However, please note that Google runs on a distributed network into the millions of computers, as opposed to a single server that any of us are running a forum on, and that we can't necessarily make it as 'smart' as Google given the limited resources we have...

Elmacik

C'mon, dropping chars like "a", "an" really doesn't require scientific ultrasonic supreme high-end servers like Google's. That's my 0.2$. Other than that, yeah, we can say it's nothing without human interference.
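A small sketch of that kind of dropping, based on a stop-word list and not on word length. `drop_stop_words` and its two-word default list are hypothetical, not SMF code; the ignored words are returned so they can be reported to the user.

```php
<?php
// Hypothetical stop-word filter, not SMF code. The default list is illustrative.
function drop_stop_words(string $query, array $stopWords = array('a', 'an')): array
{
    $kept = array();
    $ignored = array();

    foreach (preg_split('/\s+/u', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (in_array(mb_strtolower($word, 'UTF-8'), $stopWords, true)) {
            $ignored[] = $word; // remembered so a notice can list it
        } else {
            $kept[] = $word;
        }
    }

    return array('kept' => $kept, 'ignored' => $ignored);
}

$result = drop_stop_words('a brown kitten');
echo implode(' ', $result['kept']), "\n";     // brown kitten
echo implode(', ', $result['ignored']), "\n"; // a
```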
