Simple Machines Community Forum

SMF Development => Feature Requests => Topic started by: GunSlingerBoy on June 23, 2012, 07:34:29 PM

Title: 2 character minimum in search
Post by: GunSlingerBoy on June 23, 2012, 07:34:29 PM
As an admin and a regular user on several SMF forums, it annoys me when I have to omit characters like "a", numbers, and such when they are critical to finding the posts and/or threads.
Title: Re: 2 character minimum in search
Post by: Arantor on June 23, 2012, 07:39:58 PM
They're not critical. In almost every case such really common words are automatically excluded, even when SMF doesn't have complete control over it.
Title: Re: 2 character minimum in search
Post by: Kindred on June 23, 2012, 11:47:17 PM
That request is just plain silly....

how is "a" critical in a search?

searching for "treatise petunias" will bring up the post with the title or contents: "A treatise on petunias"

searching for just "a" would bring up nearly every post....

Even google excludes words like "a" and "an"....
Title: Re: 2 character minimum in search
Post by: MrPhil on June 24, 2012, 11:58:43 AM
A word that is a single letter or even two letters is usually considered just "noise" and should be ignored. If it's something important to look for as part of a phrase, put it in quotation marks: "i have to omit".
Title: Re: 2 character minimum in search
Post by: kat on June 24, 2012, 12:00:43 PM
Quote from: Kindred on June 23, 2012, 11:47:17 PM
That request is just plain silly....

Funnily enough, I agree with GunSlingerBoy, totally.

If you're searching for certain pieces of code, as I sometimes do, here, it's a pain in the arse. :P
Title: Re: 2 character minimum in search
Post by: Arantor on June 24, 2012, 12:02:01 PM
Even then most systems actually omit it anyway. Certainly Sphinx does by default (min length 3), MySQL FULLTEXT does (min length 3), and IIRC even a SMF custom index has a minimum length of 3 as well.

I can't imagine a piece of code that would meaningfully be found with a couple of characters either...
Title: Re: 2 character minimum in search
Post by: kat on June 24, 2012, 12:13:34 PM
For a coder, it's probably easier, because you'll know which bits can be omitted, without it affecting the result.

Us mere plebs don't have that advantage. :P
Title: Re: 2 character minimum in search
Post by: Kindred on June 24, 2012, 06:12:17 PM
I re-state.....

how is "a" critical in a search?

searching for "treatise petunias" will bring up the post with the title or contents: "A treatise on petunias"

searching for just "a" would bring up nearly every post....

Even google excludes words like "a" and "an"....
Title: Re: 2 character minimum in search
Post by: kat on June 25, 2012, 06:53:39 AM
You misunderstand me... (Not surprisingly) :P

What I mean, is when you get a quoted error message, you often get single characters. Especially if that message contains ">", "/", "a=" and the like.

If you copy that error and paste it into "Search", you get that bloody annoying error and a non-coder won't know which characters to omit.

I get around it, myself, by using the "Site search" thingy, on Google (Or whatever).

Does that make more sense?
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 07:11:32 AM
Most of those characters won't be searchable anyway... all the search backends discard non word characters to some degree, some more than others.
Title: Re: 2 character minimum in search
Post by: kat on June 25, 2012, 07:17:46 AM
Yeah, I realise that. But, if you perform such a search, it would be good for it to just get on with it and not respond with that damned annoying message.

Can you see where I'm coming from?
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 07:20:29 AM
I can, except that experience suggests it won't pan out like that anyway.

What would be the point of not throwing the error, only to return with no responses, or worse, so many irrelevant responses?

Human time is not cheap, and invariably you're being kept at bay from having many many many pointless searches to wade through.
Title: Re: 2 character minimum in search
Post by: kat on June 25, 2012, 07:32:33 AM
Well, you have more knowledge about how these things work than I do, obviously.

I just saw the original post and thought "Yeah! That pisses me off, too! I wonder if something couldn't be done, to improve that!" and that was that. Would, perhaps, an option to "Search exact term/phrase" work better for that kinda thing, assuming the single/double character restriction was circumvented?

(As you know, I know diddly-squat about all this coding stuff. I'm just kinda tossing things around, to see if such things could work and are practicable.)

If, as Kindred seems to be implying, I'm talking total bollox (He may well be 100% correct, obviously), because it just ain't possible, I'd be fine, with that, as long as what you're saying's impossible, is what I actually mean to be asking for. (I'm not the best at expressing such things, as you may have noticed)
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 07:41:04 AM
Well, it isn't exactly possible, that's sort of the point.

Of the 4 methods of searching available for SMF, 2 of them will forcibly remove non word characters entirely and will strip words of one or two letters as well, and dealing with this is not something SMF can do, we're talking reconfiguration on a nasty technical level. (And one of these methods also comes with a very large list of words it will ignore anyway, of which things like 'the' are auto-excluded even though they're not 2 letters)

Of the other 2, that's a bit more fixable, but there are much greater issues at stake here.

Humour me. Do a search here on 'sp'. That's SimplePortal, right? Or not, as a search will quickly tell you. And therein lies the problem: possible or not, the 2 character minimum is there to prevent you getting flooded with many many many more results than you actually want or need.

As far as phrase matching goes, you're still sort of stuffed because the search backends aren't smart enough to cope with that properly (except Sphinx, and *maybe* MySQL FULLTEXT though that's way too slow to be usable anyway), so it still wouldn't solve your problem anyway without a total rewrite of the search system.
Title: Re: 2 character minimum in search
Post by: kat on June 25, 2012, 07:52:26 AM
I can see your point. Especially about the "SP" thing.

But, with the facility to "Search exact phrase", would that, particularly with error message searches, obviate that?

If it could be done, so that "Exact term/phrase" means exactly that and nothing else, at all, would that be a possibility?

If I search for "A brown kitten" and expected just that and NOT "brown kitten", could that happen?
Title: Re: 2 character minimum in search
Post by: emanuele on June 25, 2012, 08:09:52 AM
K@ have you tried using the double quotes around the exact phrase?
http://www.simplemachines.org/community/index.php?action=search2;search="A brown kitten"
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 08:10:14 AM
That's the point, it won't at all.

IIRC, with the exception of unindexed, the 'a' in 'a brown kitten' won't even be matched in the first place, so it won't get into the pass of tokenisation to match against phrases.

OK, search engine theory time. This is, broadly, how all of the search backends (except unindexed) work.

You start off going through all the posts, looking for sequences of words, a word being defined as multiple word-characters joined together. You store each of the words you find and the position in which you found them. Thus, you have your index - a list of words, a list of the positions they were in and the posts that contain them.

When you perform a search, you do the same thing: you break the search query up into groups of word-characters, before looking those up in your big list of words.

Matching words on their own, then, is simply a case of looking them up in the index and seeing which posts match the most words together.

Matching phrases is much the same except that you're looking for words in the same post and in contiguous positions.
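
The steps above can be sketched roughly as follows. This is only a toy illustration, not SMF's actual indexer: the function names (tokenize, build_index, match_phrase) and the sample posts are made up for the example, and a real backend would store the index in tables rather than a PHP array.

```php
<?php
// Illustrative sketch only; names and data are hypothetical, not SMF code.

/** Split text into lowercase word tokens (runs of letters/digits). */
function tokenize(string $text): array
{
    preg_match_all('/[a-z0-9]+/', strtolower($text), $m);
    return $m[0];
}

/** Build the index: word => [postId => [positions...]]. */
function build_index(array $posts): array
{
    $index = [];
    foreach ($posts as $postId => $body) {
        foreach (tokenize($body) as $pos => $word) {
            $index[$word][$postId][] = $pos;
        }
    }
    return $index;
}

/** Return IDs of posts that contain the words at contiguous positions. */
function match_phrase(array $index, string $phrase): array
{
    $words = tokenize($phrase);
    if ($words === [] || !isset($index[$words[0]])) {
        return [];
    }
    $hits = [];
    foreach ($index[$words[0]] as $postId => $starts) {
        foreach ($starts as $start) {
            $ok = true;
            foreach ($words as $offset => $word) {
                // Each following word must sit exactly $offset places later.
                if (!in_array($start + $offset, $index[$word][$postId] ?? [], true)) {
                    $ok = false;
                    break;
                }
            }
            if ($ok) {
                $hits[] = $postId;
                break;
            }
        }
    }
    return $hits;
}

$posts = [
    1 => 'A treatise on petunias',
    2 => 'My kitten is brown',
    3 => 'I saw a brown kitten today',
];
$index = build_index($posts);
print_r(match_phrase($index, 'brown kitten')); // only post 3 has the words adjacent
```

Post 2 contains both words but not next to each other, so only the position check separates a phrase match from a plain word match.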


This is how, ultimately one would match 'brown kitten'. But the 'a' will have been dropped long before it makes it into the index. Meaning that even if you could ignore the two-character requirement, there is no way to actually find it in the index, because it's not there.

In essence, when you search for 'a brown kitten', it's internally being parsed to 'brown kitten' before it gets thrown at the index.
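
To make that concrete, here is a minimal sketch of the parsing step, assuming a simple "letters and digits" definition of a word. The helper name search_words and the default minimum of 2 are assumptions for illustration (2 mirrors the forum's error message; each backend has its own setting).

```php
<?php
// Hypothetical helper, not SMF code: the same filter would be applied
// both when building the index and when parsing the query.
function search_words(string $text, int $minLength = 2): array
{
    preg_match_all('/[a-z0-9]+/', strtolower($text), $m);
    // Keep only words long enough to have been indexed in the first place.
    return array_values(array_filter(
        $m[0],
        fn(string $w): bool => strlen($w) >= $minLength
    ));
}

echo implode(' ', search_words('a brown kitten')), "\n"; // brown kitten
```

Because the 'a' never reaches the index, lifting the check on the query side alone would not make it findable.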

Now, you can relax the rules around how indexes are built, sure, you can alter the criteria for what a word is and what 'word characters' are, how many letters etc. However, you need to go round and rebuild the index (which for any sizeable forum is a LARGE task), assuming it's even possible (MySQL FULLTEXT, I'm looking at you) and assuming that the word you're searching for is even going to be matched ('a' will be excluded by MySQL FULLTEXT even if you change the minimum word length there because of its commonality)

If you do decide to relax these restrictions, you will - guaranteed - have more white noise coming back in searches, for a vague and possible improvement to phrase searching. Even though it's still broken by the various mashing that goes on to deal with escaped characters, accented characters, inline bbc etc...
Title: Re: 2 character minimum in search
Post by: kat on June 25, 2012, 08:24:27 AM
Quote from: emanuele on June 25, 2012, 08:09:52 AM
K@ have you tried using the double quotes around the exact phrase?
http://www.simplemachines.org/community/index.php?action=search2;search="A brown kitten"

I have, yeah. It still abhors single characters, if I remember rightly. :(

Thanks, for that, Pete. I'd assumed that that's how they worked, kinda.

What I was wondering, probably in an "Ideal world" scenario, is whether it could be made, so that if you selected "Exact phrase", the indexer thingy would ignore other settings and look just for that exact phrase, regardless of what that phrase contained, and nothing else.

Or, am I living in Cloud Cuckoo Land?
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 08:36:38 AM
That's broadly how it works in the unindexed scenario, which doesn't have such an index but just goes through, brute force, looking for things. But as you can imagine, it's *slow*.

Which is why you build indexes that can be searched much more quickly. But the search system and the indexer have to work the same way otherwise the effort of building the index is wasted.
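
For comparison, a brute-force search can be sketched like this. It is a simplified stand-in for the unindexed method, with a made-up function name and sample data; the real thing would run against the database rather than a PHP array.

```php
<?php
// Sketch of an unindexed search: check every post body in turn.
// brute_force_search() is a hypothetical name for illustration.
function brute_force_search(array $posts, string $phrase): array
{
    $hits = [];
    foreach ($posts as $postId => $body) {
        // Case-insensitive substring test; the cost grows with the
        // total amount of text, because nothing is precomputed.
        if (stripos($body, $phrase) !== false) {
            $hits[] = $postId;
        }
    }
    return $hits;
}

$posts = [
    1 => 'I saw a brown kitten today',
    2 => 'The brown kitten slept',
];
print_r(brute_force_search($posts, 'a brown kitten')); // post 1 only
```

The single letter is honoured here because the raw text is compared directly, which is exactly what makes it slow on a large forum.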
Title: Re: 2 character minimum in search
Post by: kat on June 25, 2012, 08:45:54 AM
Ah, yeah. The old speed consideration.

Lots to search through, here, I s'pose.

I'll have to stick with the ol' site search, then, ay? :)
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 08:49:33 AM
Sounds like a plan to me :)
Title: Re: 2 character minimum in search
Post by: emanuele on June 25, 2012, 08:51:17 AM
I wonder if I'm special somehow...

(http://img10.imageshack.us/img10/3852/screen68.png)
Title: Re: 2 character minimum in search
Post by: kat on June 25, 2012, 09:00:36 AM
Oh, you're special, all right. ;)

The difference, is that with engines, when you use the "exact phrase" option, should you include the word "a", it ignores that letter, if it appears as part of a word, rather than as a word unto itself, doesn't it?
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 09:08:33 AM
What is highlighted is not what is searched ;)
Title: Re: 2 character minimum in search
Post by: kat on June 25, 2012, 09:12:52 AM
(http://www.katzy.dsl.pipex.com/Smileys/matchstick/shakehead.gif)(http://www.katzy.dsl.pipex.com/Smileys/matchstick/faint.gif)
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 09:15:45 AM
Quote
it ignores that letter, if it appears as part of a word, rather than as a word unto itself, doesn't it?

Yes, that's the problem. It's too short to be a word on its own as far as search engines are generally concerned. The usual rule (certainly Sphinx and MySQL FULLTEXT default) is 3 letters.

But let me just drop another complexity into the mix: '

Is "doesn't" a word? Should it be searchable as such? If that's true, that would imply we should include the ' as part of a word, but then that screws up searching on forums that have code, e.g. $txt['string'] = 'string' - 'string' is not the same as string then...

And that's before you realise that SMF doesn't store the ' as a ' but as a &#039; code.
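
A small sketch of why that matters, assuming the stored form is what PHP's htmlspecialchars() produces with ENT_QUOTES, and using a toy tokenizer (the words() helper is made up for the example, not SMF's):

```php
<?php
// Assumption: the apostrophe is stored as an HTML entity, equivalent to
// what htmlspecialchars() with ENT_QUOTES produces.
$typed  = "doesn't";
$stored = htmlspecialchars($typed, ENT_QUOTES);
echo $stored, "\n"; // doesn&#039;t

// Toy tokenizer: a word is a run of letters/digits.
function words(string $text): array
{
    preg_match_all('/[a-z0-9]+/', strtolower($text), $m);
    return $m[0];
}

print_r(words($typed));  // doesn, t
print_r(words($stored)); // doesn, 039, t
```

So the indexer and the query parser have to agree not only on whether ' is a word character, but also on which form of the text they are looking at.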
Title: Re: 2 character minimum in search
Post by: kat on June 25, 2012, 09:20:38 AM
I'd've imagined that a search motor would see the word "Doesn't", realise that it's a contraction of "Does not" and say "OK, I'll search both, in-situ, and put up the relevant results, along with a bit of pr0n, coz that's what they really wanna see".

At least, I'd want it to display something, rather than saying "Can't do that. Single letter words don't exist, sucker, so go do it all, again, so I can spit-out another error, just to piss you off".
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 09:24:21 AM
Oh no, no, no. Most search engines just are not that clever and would be thoroughly confused with such.

Search theory is still very young.
Title: Re: 2 character minimum in search
Post by: kat on June 25, 2012, 09:33:46 AM
Sounds like we need a good coder to sort this one out, then... ;)

Maybe, if we can refine it and get it to work well, we could flog it to Google. ;)
Title: Re: 2 character minimum in search
Post by: emanuele on June 25, 2012, 09:41:56 AM
Quote from: Arantor on June 25, 2012, 09:08:33 AM
What is highlighted is not what is searched ;)

Right, what I clicked on was the link I posted in my previous post:
Quote from: emanuele on June 25, 2012, 08:09:52 AM
K@ have you tried using the double quotes around the exact phrase?
http://www.simplemachines.org/community/index.php?action=search2;search="A brown kitten"

That would be the same as typing:
Code: [Select]
"A brown kitten"

in the search box.
That is in fact the "exact phrase" thing you were asking about.
Title: Re: 2 character minimum in search
Post by: kat on June 25, 2012, 09:49:40 AM
Yeah, I know. The major engines do it, that way.

What I'm saying, is "Could it be done so that ONLY that exact phrase would appear and nothing else, at all".

Seems not, as things stand. :(
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 09:51:37 AM
Quote
That is in fact the "exact phrase" thing you were asking about.

Yes, but it still disregards the 'a' part of that. It still only searches on 'brown kitten', even if it highlights 'a brown kitten', and after thousands of support posts with Sphinx, I learned the hard way that having exact matching on single character words is actually a recipe for disaster.
Title: Re: 2 character minimum in search
Post by: emanuele on June 25, 2012, 10:06:26 AM
Let's come back a second to the original request:
Quote from: GunSlingerBoy on June 23, 2012, 07:34:29 PM
As an admin and a regular user on several SMF forums, it annoys me when I have to omit characters like "a", numbers, and such when they are critical to finding the posts and/or threads.

He is not annoyed by the fact that "a" is not searched; he is annoyed by the fact that he gets the error "Each word must be at least two characters long".

So even silently dropping the single-character "words" and searching on the rest would probably be an acceptable solution to him. Of course, searching for
Code: [Select]
a e i o u

would return an error too, but that would probably be less irritating.
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 10:11:13 AM
Therein lies the question: would it really be less irritating?

Would it really be less irritating to search for something you *know* is there and find that it can't find it?
Title: Re: 2 character minimum in search
Post by: emanuele on June 25, 2012, 11:08:56 AM
Disclaimer:
1) I know nothing about searches, all I'm saying it's just based on the material in this topic. ;)
2) it's too hot here and my brain is crashing every two seconds... ::)

Now we are forcing the user to remove what SMF cannot use in a search; silently hiding it would at least allow him to get a result.
Is it the result he was expecting? Well, considering the relevance of an "a" in a search, the probability that it is seems rather high, I think. Additionally: if he wants to do the search he has to remove the "a" and he will get anyway only the result SMF will be able to provide him. Nothing more, nothing less.
Is that the result he was expecting? It will certainly be the same result he would get with SMF silently dropping the single chars, I imagine.
Instead, if he encloses the search string in double quotes, SMF searches for it without telling the user anything (so from his perspective it is searching for the entire string).

So, given that the "a" makes no difference whether it is present or not, I'd expect the search to return the same result either way (of course my experience here is basically null).

So, instead of saying "before being able to search anything you have to remove all the things SMF doesn't want to search" (I know it's not that, but from the user's perspective it is), wouldn't it be much friendlier to say: "We have searched without considering the "a" because searching for such a piece of information would be meaningless anyway"? (That, AFAIR, is what Google used to do a long time ago, but it has now stopped.)

It would be one less click and one less edit for the user searching for something (an improvement, I'd say), and the result would be the same anyway.

Am I completely off track?
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 11:17:28 AM
Quote
Now we are forcing the user to remove what SMF cannot use in a search; silently hiding it would at least allow him to get a result.

You get a result, maybe. It probably won't be the result expected, but that's dependent on a lot of things like search method too.

Quote
Additionally: if he wants to do the search he has to remove the "a" and he will get anyway only the result SMF will be able to provide him

Again it depends on the search method.

Quote
Am I completely off track?

I don't think you're completely off track but at the same time I'm not convinced that it's actually meaningful to silently allow that search too. Especially as the exact behaviour is dependent on the search engine used, which will not be consistent.

Here's the thing. Certain engines will treat that search string differently. Some will silently ignore 'a brown kitten' and parse it as 'brown kitten', even for phrase matching. Some will attempt to match it as 'a brown kitten' but the index won't have the a in it (certain versions of MySQL FULLTEXT) and so just return nothing at all.

Some will treat it as a literal and try to match it literally, but if there happened to be an extra space in there, all bets are off anyway when phrase matching because of everything else going on.

I honestly believe that if you change the behaviour of the front end to silently accept one-character searches, you're unfairly giving an expectation that it'll be matched in phrases etc. when it won't.

The correct compromise as far as I'm concerned would be to drop the single character for searching purposes, and tell the user that's what you've done, as you've outlined. However, that raises its own problems with respect to hitting the server unnecessarily hard because in all likelihood the user will then proceed to reword their query anyway - and as it happens in neither case would the OP actually get the result he wants...
Title: Re: 2 character minimum in search
Post by: emanuele on June 25, 2012, 11:30:46 AM
Quote from: Arantor on June 25, 2012, 11:17:28 AM
in neither case would the OP actually get the result he wants...

That's life... :P
Title: Re: 2 character minimum in search
Post by: MrPhil on June 25, 2012, 11:55:09 AM
Something interesting...
I'm guessing that in #3, since the 'i' in kitten was already matched, that word is 'taken' and the 'kitten' pattern is not applied against it.

It makes sense to ignore noise words (shorter than some minimum), but it would be nice to simply tell the user that short words are being ignored, rather than giving them a dope slap and telling them to try again. Using a quoted phrase, my expectation would be not that it be broken up into a list of words and individually matched (including the 'a'), but that the entire phrase be matched. Why else would I add quotation marks, except to indicate that I want the whole thing taken as an intact unit? BTW, that's how Google does it. (They do exclude punctuation within the target text, so possibly internally they are splitting it up, but only return cases where the subterms are adjacent and in the correct order, and all there.). I would say that SMF's search is broken because it breaks up a quoted phrase into individual words (including short words) and matches individually rather than if they are all present in the correct order.
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 11:57:52 AM
How, exactly, did you validate the above? Did you do it here or on a base SMF installation?

Here uses Sphinx, with some slightly atypical configuration therein.
Title: Re: 2 character minimum in search
Post by: MrPhil on June 25, 2012, 12:19:56 PM
Here, in this very topic.
Title: Re: 2 character minimum in search
Post by: Arantor on June 25, 2012, 12:21:55 PM
In which case, as per your bug report, you should ignore the results until you retest it on your own forum. As already noted, this forum runs Sphinx - an option not readily available - and I know from experience that it has an atypical configuration.

I'm not sure which API of Sphinx it's using, for instance, and one of those APIs may even be the unreliable one that even Sphinx itself discontinued after realising how buggy it was.
Title: Re: 2 character minimum in search
Post by: Douglas on June 28, 2012, 02:47:23 PM
My comments from another thread, and reasons why we need to change the search params to a "total number of characters minimum", rather than eliminating single characters.

Quote
Trying to find out why when I do an "include" in index.template.php, it's returning a one.  I'm searching "include returns a 1" and I get hit with this:

Each word must be at least two characters long.

We may want to drop this restriction, because sometimes there ARE things people are searching for that have single characters.

Now, I can understand (and fully support) if there's a minimum number of characters (say... 3), but don't limit the word search to two or more characters, please. :)

EDIT: Okay, so I did include returns "1" and got the same thing... so now I'm up sh*t's creek with this restriction.
Title: Re: 2 character minimum in search
Post by: kat on June 28, 2012, 02:55:09 PM
Amen, to that, Brother Douglas.
Title: Re: 2 character minimum in search
Post by: Arantor on June 28, 2012, 05:14:47 PM
Even though you can still remove the restriction and STILL not find what you were searching for, only now you get dozens of unrelated, irrelevant matches, because that's obviously a better use of your time. ::)
Title: Re: 2 character minimum in search
Post by: MrPhil on June 28, 2012, 05:25:32 PM
If you restrict each search TERM (one word OR a "-enclosed phrase) to a minimum of 2 or 3 characters, that would make everyone happy. You could still have single letters, e.g., "include returns a 1" and only that phrase should be returned. Is there some reason SMF is breaking down phrases into single words and looking for them individually? If there is, only if all words are found, in order, should it be considered a hit. In the database, each post should be one long string, so I don't see why we can't search for a phrase at a time. On the other hand, if we're actually looking at some index with only single words, SMF needs to then check whether they all appear in the right order.
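
A rough sketch of what I mean, purely as an illustration: parse_terms() is a made-up function, not anything in SMF, and the minimum of 3 is just the example figure from above.

```php
<?php
// Hypothetical parser illustrating the proposal; not SMF code.
// The length check applies to each whole TERM, not to each word.
function parse_terms(string $query, int $minLength = 3): array
{
    // A term is either a "quoted phrase" or a run of non-space characters.
    preg_match_all('/"([^"]+)"|(\S+)/', $query, $matches, PREG_SET_ORDER);
    $kept = [];
    $ignored = [];
    foreach ($matches as $m) {
        // Group 1 is filled for quoted phrases, group 2 for bare words.
        $term = $m[1] !== '' ? $m[1] : $m[2];
        if (strlen($term) >= $minLength) {
            $kept[] = $term;
        } else {
            $ignored[] = $term;
        }
    }
    return ['kept' => $kept, 'ignored' => $ignored];
}

print_r(parse_terms('"include returns a 1" a'));
// kept: the whole quoted phrase; ignored: the bare "a"
```

This only covers parsing the query; whether the backend can then actually find the phrase is a separate question.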
Title: Re: 2 character minimum in search
Post by: Arantor on June 28, 2012, 05:43:01 PM
Honestly did you not read what I posted?

Consider this site. It has 3 million posts. Its database must run into the gigabytes. Do you seriously intend that SMF should go through even 1GB of data to perform a search? Just consider how long that would take to perform, and how much hurt that's going to put on the server.

The ONLY solution is to build an index of the words, much smaller and faster to traverse to give you meaningful results.

In any case you can't even rely on your treasured exact-match to work. It only requires someone to put a double space in to the original and it's never going to work. (SMF converts all double spaces into space followed by an nbsp entity)
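
A quick way to see it, assuming the conversion is a plain "two spaces become space plus &nbsp;" replacement as described (the str_replace call below is only a stand-in for whatever SMF really does at save time):

```php
<?php
// Assumption for illustration: storage rewrites each pair of spaces as a
// space followed by the &nbsp; entity.
$typed  = 'include' . str_repeat(' ', 2) . 'returns a 1'; // double space after "include"
$stored = str_replace('  ', ' &nbsp;', $typed);

echo $stored, "\n"; // include &nbsp;returns a 1

// The phrase as a searcher would type it, with a single space:
var_dump(strpos($stored, 'include returns a 1') !== false); // bool(false)
```

A literal substring comparison fails even though the post visibly contains the phrase.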
Title: Re: 2 character minimum in search
Post by: emanuele on June 28, 2012, 06:05:44 PM
You know... the best way to understand whether something is what people really want is to give them the tool and let them figure it out by themselves.

This should just strip the "single-char words" out of the search and give a notice about the letters removed.

Search.php
Code: (find) [Select]
// Trim everything and make sure there are no words that are the same.
Code: (add before) [Select]
$context['search_ignored'] = array();
Code: (find) [Select]
// Don't allow very, very short words.
elseif ($smcFunc['strlen']($value) < 2)
{
$context['search_errors']['search_string_small_words'] = true;
unset($searchArray[$index]);
}
Code: (replace with) [Select]
// Don't search on very, very short words; remember them so the template can report them.
elseif ($smcFunc['strlen']($value) < 2)
{
	$context['search_ignored'][] = $value;
	unset($searchArray[$index]);
}

Search.template.php
The css will not work for sure because it comes from 2.1, the markup...no idea
Code: (find) [Select]
function template_results()
{
global $context, $settings, $options, $txt, $scripturl, $message;
Code: (add after) [Select]
	// Tell the user which terms were dropped from the search.
	if (!empty($context['search_ignored']))
		echo '
	<div id="search_results">
		<div class="cat_bar">
			<h3 class="catbg">
				', $txt['generic_warning'], '
			</h3>
		</div>
		<span class="upperframe"><span></span></span>
		<div class="roundframe">
			<p class="noticebox">', $txt['search_warning_ignored_word' . (count($context['search_ignored']) == 1 ? '' : 's')], ': ', implode(', ', $context['search_ignored']), '</p>
		</div>
		<span class="lowerframe"><span></span></span>
	</div><br />';

Code: (find) [Select]
if (!empty($context['search_errors']))
echo '
<p class="errorbox">', implode('<br />', $context['search_errors']['messages']), '</p>';
Code: (add before) [Select]
	// Show the ignored terms above any search errors.
	if (!empty($context['search_ignored']))
		echo '
	<p class="noticebox">', $txt['search_warning_ignored_word' . (count($context['search_ignored']) == 1 ? '' : 's')], ': ', implode(', ', $context['search_ignored']), '</p>';

Search.english.php
Code: (find) [Select]
$txt['search_adjust_query'] = 'Adjust Search Parameters';
Code: (add after) [Select]
$txt['search_warning_ignored_word'] = 'This term has been ignored in your search';
$txt['search_warning_ignored_words'] = 'These terms have been ignored in your search';
Title: Re: 2 character minimum in search
Post by: Elmacik on June 29, 2012, 02:04:20 AM
How can we say that search engines are not clever enough to match "doesn't" to "does not"? They can do it pretty well, and much more besides. I say that instead of throwing a warning in the user's face, doing what other search engines do and dropping those chars is better when possible.
Title: Re: 2 character minimum in search
Post by: Kindred on June 29, 2012, 09:19:37 AM
ummmm....   elmacik,

When one of our forums has resources like Google to catalog the searches and run against "exact match", then go for it...   until then, a single search like that on my forum would likely kill the server. As for matching doesn't to does not...   that would, of course depend upon cataloging and matching EVERY abbreviation/contraction with the non-abbreviation/contraction.


So, which would you rather have
-- a search that tells you "sorry, that won't work"
-- a search which drops characters from your search, without telling you and thus brings up a set of results which may not actually match your intended search.
Title: Re: 2 character minimum in search
Post by: Thantos on June 29, 2012, 10:36:46 AM
Quote from: Kindred on June 29, 2012, 09:19:37 AM
ummmm....   elmacik,

When one of our forums has resources like Google to catalog the searches and run against "exact match", then go for it...   until then, a single search like that on my forum would likely kill the server. As for matching doesn't to does not...   that would, of course depend upon cataloging and matching EVERY abbreviation/contraction with the non-abbreviation/contraction.


So, which would you rather have
-- a search that tells you "sorry, that won't work"
-- a search which drops characters from your search, without telling you and thus brings up a set of results which may not actually match your intended search.

How about c) A search which drops the characters, runs the search, tells you it dropped the characters, and displays the results from the search it could do.
Title: Re: 2 character minimum in search
Post by: CircleDock on June 29, 2012, 11:14:27 AM
Quote from: Kindred on June 29, 2012, 09:19:37 AM
When one of our forums has resources like Google to catalog the searches and run against "exact match", then go for it...   until then, a single search like that on my forum would likely kill the server. As for matching doesn't to does not...   that would, of course depend upon cataloging and matching EVERY abbreviation/contraction with the non-abbreviation/contraction.


So, which would you rather have
-- a search that tells you "sorry, that won't work"
-- a search which drops characters from your search, without telling you and thus brings up a set of results which may not actually match your intended search.

Actually all it requires is to index "doesn't" as "doesnot" - i.e. replace apostrophes that separate the letters "n" and "t" with the letter "o" ...
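
As a minimal sketch of that rule (expand_nt() is a made-up name, and this is a toy normaliser rather than anything a real indexer ships with):

```php
<?php
// Toy normaliser for the suggestion above: turn a trailing "n't" into "not".
function expand_nt(string $word): string
{
    return preg_replace("/n't\b/i", 'not', $word);
}

foreach (["doesn't", "isn't", "can't", "won't", "we'll"] as $w) {
    echo $w, ' => ', expand_nt($w), "\n";
}
// doesn't => doesnot
// isn't => isnot
// can't => canot
// won't => wonot
// we'll => we'll
```

The regular cases come out consistently, which is all an index needs as long as the query is normalised the same way. The irregular ones ("can't", "won't") and other contractions would need rules of their own.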
Title: Re: 2 character minimum in search
Post by: emanuele on June 29, 2012, 11:22:06 AM
And all the possible (correct and incorrect) grammatical variants of any other word?
Title: Re: 2 character minimum in search
Post by: Kindred on June 29, 2012, 11:48:50 AM
ah, but CircleDock, n't is NOT the only contraction out there...

What about "we'll" - does that need to be cataloged as "we will" or maybe "we shall". How about "we're" or "you're"?
And then you get into possessives or (mis)uses of the apostrophe

What about British versus American spelling?
Should armor also return armour?
Title: Re: 2 character minimum in search
Post by: Elmacik on June 29, 2012, 12:32:51 PM
Well Kindred, I wasn't actually talking about SMF search only, but about search logic itself, because Arantor said that search engines are not clever enough to match "doesn't" to "does not". They can match a whole lot of words and phrases you can't even imagine, all at once. But of course I agree with what you say: none of this is efficient to implement in a forum search. Nevertheless, I can still say that dropping the chars that need to be ignored is better, and it won't necessarily return a useless set of results, because "a brown kitten" will match "brown kitten" and it still makes sense without the "a".
Title: Re: 2 character minimum in search
Post by: Arantor on June 29, 2012, 10:31:24 PM
Oh, so you're going to split hairs. None of the search engines supported by SMF do this. Google does not do it reliably either, but what it does do is not match it based on it actually considering them as misspellings and comparing to a known lexicon.
Title: Re: 2 character minimum in search
Post by: Elmacik on June 30, 2012, 08:45:08 AM
Ah yes, "reliable" is really very relative. Simple matchings are done very well and most searches return reasonable suggestions for your string. Yeah, you can say that it's not perfect and I'd agree with that; but it's quite unfair and absurd to say "not clever enough to match doesn't to does not".
Title: Re: 2 character minimum in search
Post by: Arantor on June 30, 2012, 05:52:12 PM
It's not unfair at all. Google does not consistently or reliably have the two being the same thing at all.

For example I just did a search on 'it doesn't make sense' without the extra quotes, the first result back matches 'doesn't make sense' at one point in the page but it doesn't match against 'does not make sense' earlier in the same page.

I cannot find a search variation where 'doesn't' gives me the same results (even vaguely) as 'does not', and the same goes for the other contractions.

If you have a search setup where you have precise control over it, you can define your own (e.g. with Sphinx) so that you define that 'does not' and 'doesn't' are treated the same, but then you're doing it manually, knowing the specific cases related to your body of text to be tokenised and processed.
Title: Re: 2 character minimum in search
Post by: Elmacik on July 01, 2012, 06:45:01 AM
So you disprove your own statement; search engine systems do not really need to have a human brain to do simple matchings. As you say, you can even set it up yourself with a custom configuration, not to mention what greater search technologies can do. The real statement should be "not doing it does not necessarily mean not being able to", and not "not clever enough to match". And I still say "brown kitten" makes pretty good sense as the search for "a brown kitten".
Title: Re: 2 character minimum in search
Post by: Arantor on July 01, 2012, 10:06:33 AM
No, far from disproving my own statement, I backed it up with further trying to prove it: search engine systems still cannot do it reliably without humans intervening and doing it manually!

You're actually arguing to agree with what I'm saying: Google and most other systems will quite happily match 'brown kitten' with 'a brown kitten' because they are MORE THAN HAPPY to disregard the 'a' for being too short!

And Google frequently does, in my experience. However, please note that Google runs on a distributed network into the millions of computers, as opposed to a single server that any of us are running a forum on, and that we can't necessarily make it as 'smart' as Google given the limited resources we have...
Title: Re: 2 character minimum in search
Post by: Elmacik on July 01, 2012, 12:12:42 PM
C'mon, dropping words like "a" and "an" really doesn't require scientific ultrasonic supreme high-end servers like Google's. That's my $0.02. Other than that, yeah, we can say it's nothing without human interference.