2 character minimum in search

Started by GunSlingerBoy, June 23, 2012, 07:34:29 PM


Arantor

Holder of controversial views, all of which my own.


emanuele

I wonder if I'm special somehow...



Take a peek at what I'm doing! ;D




Do you need support in Italian?

Help us help you: explain your problem clearly. No, "it doesn't work" is not an explanation!!
1) What you do,
2) what you expect,
3) what you get.

kat

Oh, you're special, all right. ;)

The difference is that with search engines, when you use the "exact phrase" option and include the word "a", it ignores that letter, if it appears as part of a word, rather than as a word unto itself, doesn't it?

Arantor

What is highlighted is not what is searched ;)



Arantor

Quote: it ignores that letter, if it appears as part of a word, rather than as a word unto itself, doesn't it?

Yes, that's the problem. It's too short to be a word on its own as far as search engines are generally concerned. The usual rule (certainly Sphinx and MySQL FULLTEXT default) is 3 letters.

But let me just drop another complexity into the mix: '

Is "doesn't" a word? Should it be searchable as such? If that's true, that would imply we should include the ' as part of a word, but then that screws up searching on forums that have code, e.g. $txt['string'] = 'string' - 'string' is not the same as string then...

And that's before you realise that SMF doesn't store the ' as a ' but as an HTML entity (an & code).
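
To make the apostrophe problem concrete, here is a minimal Python sketch. It is not SMF's code: the two tokenizers and the `tokenize` helper are hypothetical, and it assumes the stored form of ' is the numeric entity `&#039;`. It shows that keeping the apostrophe helps with "doesn't" but turns 'string' in code into a different token from string.

```python
import html
import re

# Two hypothetical tokenizers, for illustration only (not SMF's actual code).
WORD_ONLY = re.compile(r"[A-Za-z0-9]+")              # apostrophe splits words
WORD_WITH_APOSTROPHE = re.compile(r"[A-Za-z0-9']+")  # apostrophe is part of a word


def tokenize(text, keep_apostrophe):
    """Decode HTML entities first, then split the text into lowercase tokens."""
    decoded = html.unescape(text)  # assumed storage format: &#039; becomes '
    pattern = WORD_WITH_APOSTROPHE if keep_apostrophe else WORD_ONLY
    return [token.lower() for token in pattern.findall(decoded)]


prose = "It doesn&#039;t work"
code = "$txt[&#039;string&#039;] = &#039;string&#039;"

print(tokenize(prose, keep_apostrophe=True))   # ['it', "doesn't", 'work']
print(tokenize(prose, keep_apostrophe=False))  # ['it', 'doesn', 't', 'work']
print(tokenize(code, keep_apostrophe=True))    # ['txt', "'string'", "'string'"]
print(tokenize(code, keep_apostrophe=False))   # ['txt', 'string', 'string']
```

Either choice breaks one of the two cases, which is the trade-off described above.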


kat

I'd've imagined that a search motor would see the word "Doesn't", realise that it's a contraction of "Does not" and say "OK, I'll search both, in-situ, and put up the relevant results, along with a bit of pr0n, coz that's what they really wanna see".

At least, I'd want it to display something, rather than saying "Can't do that. Single letter words don't exist, sucker, so go do it all, again, so I can spit-out another error, just to piss you off".

Arantor

Oh no, no, no. Most search engines just are not that clever and would be thoroughly confused with such.

Search theory is still very young.


kat

Sounds like we need a good coder to sort this one out, then... ;)

Maybe, if we can refine it and get it to work well, we could flog it to Google. ;)

emanuele

Quote from: Arantor on June 25, 2012, 09:08:33 AM
What is highlighted is not what is searched ;)
Right, what I clicked on is the link I posted in my previous post:
Quote from: emanuele on June 25, 2012, 08:09:52 AM
K@ have you tried using the double quotes around the exact phrase?
http://www.simplemachines.org/community/index.php?action=search2;search="A brown kitten"
that would be the same as typing:
"A brown kitten" in the search box.
That is in fact the "exact phrase" thing you were asking about.



kat

Yeah, I know. The major engines do it that way.

What I'm saying is: "Could it be done so that ONLY that exact phrase would appear and nothing else at all?"

Seems not, as things stand. :(

Arantor

Quote: That is in fact the "exact phrase" thing you were asking about.

Yes, but it still disregards the 'a' part of that. It still only searches on 'brown kitten', even if it highlights 'a brown kitten', and after thousands of support posts with Sphinx, I learned the hard way that having exact matching on single character words is actually a recipe for disaster.


emanuele

Let's go back for a second to the original request:
Quote from: GunSlingerBoy on June 23, 2012, 07:34:29 PM
as an admin and a regular user on several SMF's this annoys me when i have to omit characters like "a" and numbers and such when they are critical to finding the posts and or threads
He is not annoyed by the fact that "a" is not searched, he is annoyed by the fact that he gets the error "Each word must be at least two characters long".

So even silently dropping the single-character "words" and searching on the remaining ones would probably be an acceptable solution for him. Of course, searching for a e i o u would still return an error, but that would probably be less irritating.
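
A rough Python sketch of that idea, not SMF's actual code: the function name `split_search_terms` is made up, and the two-character minimum is assumed from the error text quoted above. It splits the query into the words to search on and the words that were dropped, and only raises an error when nothing searchable is left.

```python
MIN_WORD_LENGTH = 2  # assumed from the error text "at least two characters long"


def split_search_terms(query, min_length=MIN_WORD_LENGTH):
    """Split a query into (kept, dropped) words by length.

    Raises ValueError when no searchable word is left, e.g. for "a e i o u".
    """
    words = query.split()
    kept = [word for word in words if len(word) >= min_length]
    dropped = [word for word in words if len(word) < min_length]
    if not kept:
        raise ValueError("No searchable words left in the query")
    return kept, dropped


kept, dropped = split_search_terms("a brown kitten")
print(kept)     # ['brown', 'kitten']
print(dropped)  # ['a']
```

Returning the dropped words as well leaves it up to the caller whether to stay silent or to tell the user what was ignored.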



Arantor

Therein lies the question: would it really be less irritating?

Would it really be less irritating to search for something you *know* is there and find that it can't find it?


emanuele

Disclaimer:
1) I know nothing about searches; everything I'm saying is just based on the material in this topic. ;)
2) it's too hot here and my brain is crashing every two seconds... ::)

Now we are forcing the user to remove what SMF cannot use to do a search; silently hiding it would allow him to at least get a result.
Is it the result he was expecting? Well, considering the relevance of an "a" in a search, the probability that it is is rather high, I think. Additionally: if he wants to do the search he has to remove the "a", and he will anyway get only the result SMF is able to provide him. Nothing more, nothing less.
Is that the result he was expecting? It will certainly be the same result he would get with SMF silently dropping the single characters, I imagine.
By contrast, if he encloses the search string in double quotes, SMF searches for it without telling the user anything (so from his perspective it is searching for the entire string).

So, given that the "a" makes no difference whether it is present or not, I'd expect the search to return the same result either way (of course, my experience here is basically nil).

So, instead of saying "before you can search for anything you have to remove all the things SMF doesn't want to search" (I know it's not that, but from the user's perspective it is), wouldn't it be much friendlier to say: "We have searched without considering the "a" because searching for such a piece of information would be meaningless anyway"? (That, AFAIR, is what Google used to do a long time ago, but it has since stopped.)

It would be one less click and one less edit for the user searching for something (an improvement, I'd say), and the result would be the same anyway.

Am I completely off track?



Arantor

Quote: Now we are forcing the user to remove what SMF cannot use to do a search; silently hiding it would allow him to at least get a result.

You get a result, maybe. It probably won't be the result expected, but that's dependent on a lot of things like search method too.

Quote: Additionally: if he wants to do the search he has to remove the "a", and he will anyway get only the result SMF is able to provide him

Again it depends on the search method.

Quote: Am I completely off track?

I don't think you're completely off track but at the same time I'm not convinced that it's actually meaningful to silently allow that search too. Especially as the exact behaviour is dependent on the search engine used, which will not be consistent.

Here's the thing. Certain engines will treat that search string differently. Some will silently ignore 'a brown kitten' and parse it as 'brown kitten', even for phrase matching. Some will attempt to match it as 'a brown kitten' but the index won't have the a in it (certain versions of MySQL FULLTEXT) and so just return nothing at all.

Some will treat it as a literal and try to match it literally, but if there happened to be an extra space in there, all bets are off anyway when phrase matching because of everything else going on.
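
As a toy model only (this is not how Sphinx or MySQL FULLTEXT are implemented, and the three-letter minimum and all names here are assumptions for illustration), the first two behaviours can be sketched in Python. The index never stores short words, so what happens depends on whether the query side strips them too.

```python
MIN_LEN = 3  # assumed minimum indexed word length, for this toy model only


def index_tokens(text):
    """Tokens as stored in the toy index: short words are never indexed."""
    return [w for w in text.lower().split() if len(w) >= MIN_LEN]


def contains_phrase(tokens, phrase):
    """True if `phrase` occurs in `tokens` as a run of adjacent items."""
    n = len(phrase)
    return n > 0 and any(
        tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)
    )


def search(posts, query, strip_short_words):
    """Phrase search over the toy index, with or without query-side stripping."""
    phrase = query.lower().split()
    if strip_short_words:
        phrase = [w for w in phrase if len(w) >= MIN_LEN]
    return [post for post in posts if contains_phrase(index_tokens(post), phrase)]


posts = ["I saw a brown kitten today", "The brown dog chased a kitten"]

# Engine that silently turns 'a brown kitten' into 'brown kitten':
print(search(posts, "a brown kitten", strip_short_words=True))
# ['I saw a brown kitten today']

# Engine that keeps the 'a' in the query although the index never stored it:
print(search(posts, "a brown kitten", strip_short_words=False))
# []
```

The same query against the same posts gives one hit or none, purely because of how the engine treats the short word.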

I honestly believe that if you change the behaviour of the front end to silently accept one-character searches, you're unfairly giving an expectation that it'll be matched in phrases etc. when it won't.

The correct compromise as far as I'm concerned would be to drop the single character for searching purposes, and tell the user that's what you've done, as you've outlined. However, that raises its own problems with respect to hitting the server unnecessarily hard because in all likelihood the user will then proceed to reword their query anyway - and as it happens in neither case would the OP actually get the result he wants...


emanuele

Quote from: Arantor on June 25, 2012, 11:17:28 AM
in neither case would the OP actually get the result he wants...
That's life... :P



MrPhil

Something interesting...

  • a brown kitten gives the error message that words must be at least two characters
  • "a brown kitten" matches the letter 'a' anywhere, plus 'brown', plus 'kitten' (the phrase brown kitten is matched, even if it doesn't have an a in front)
  • "i brown kitten" matches the letter 'i' anywhere, plus 'brown', but does not match 'kitten'
  • "q brown kitten" matches the letter 'q' anywhere, plus 'brown' and 'kitten'
I'm guessing that in #3, since the 'i' in kitten was already matched, that word is 'taken' and the 'kitten' pattern is not applied against it.

It makes sense to ignore noise words (shorter than some minimum), but it would be nice to simply tell the user that short words are being ignored, rather than giving them a dope slap and telling them to try again. Using a quoted phrase, my expectation would be not that it be broken up into a list of words and individually matched (including the 'a'), but that the entire phrase be matched. Why else would I add quotation marks, except to indicate that I want the whole thing taken as an intact unit? BTW, that's how Google does it. (They do exclude punctuation within the target text, so possibly internally they are splitting it up, but they only return cases where the subterms are all there, adjacent, and in the correct order.) I would say that SMF's search is broken because it breaks up a quoted phrase into individual words (including short words) and matches them individually, rather than checking whether they are all present in the correct order.
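
To illustrate the difference I mean, here is a small Python sketch. It is only my guess at the two strategies, not SMF's or Google's real code, and the function names are made up. The loose version matches whole words anywhere in the post (simpler than the letter-anywhere matching I observed above); the strict version requires the words to be adjacent and in order, with punctuation excluded.

```python
import re


def words(text):
    """Lowercase word tokens with punctuation excluded."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all_words(post, query):
    """Loose match: every query word appears somewhere in the post."""
    post_words = set(words(post))
    return all(word in post_words for word in words(query))


def matches_phrase(post, query):
    """Strict match: the query words appear adjacent and in the same order."""
    post_words, phrase = words(post), words(query)
    span = len(phrase)
    return span > 0 and any(
        post_words[start:start + span] == phrase
        for start in range(len(post_words) - span + 1)
    )


post = "A kitten sat on a brown rug."
print(matches_all_words(post, "a brown kitten"))  # True
print(matches_phrase(post, "a brown kitten"))     # False
print(matches_phrase("Look, a brown kitten!", "a brown kitten"))  # True
```

With quotation marks, I would expect the behaviour of `matches_phrase`, not `matches_all_words`.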

Arantor

How, exactly, did you validate the above? Did you do it here or on a base SMF installation?

Here uses Sphinx, with some slightly atypical configuration therein.


MrPhil

