2 character minimum in search

Started by GunSlingerBoy, June 23, 2012, 07:34:29 PM

GunSlingerBoy

As an admin and a regular user on several SMF forums, it annoys me when I have to omit characters like "a", numbers and such when they are critical to finding the posts and/or threads.

Arantor

They're not critical. In almost every case such really common words are automatically excluded, even when SMF doesn't have complete control over it.

Kindred

That request is just plain silly....

How is "a" critical in a search?

Searching for "treatise petunias" will bring up the post with the title or contents: "A treatise on petunias".

Searching for just "a" would bring up nearly every post....

Even Google excludes words like "a" and "an"....
Слaва
Украинi

Please do not PM, IM or Email me with support questions.  You will get better and faster responses in the support boards.  Thank you.

"Loki is not evil, although he is certainly not a force for good. Loki is... complicated."

MrPhil

A word that is a single letter or even two letters is usually considered just "noise" and should be ignored. If it's something important to look for as part of a phrase, put it in quotation marks: "i have to omit".

kat

Quote from: Kindred on June 23, 2012, 11:47:17 PM
That request is just plain silly....

Funnily enough, I agree with GunSlingerBoy, totally.

If you're searching for certain pieces of code, as I sometimes do, here, it's a pain in the arse. :P

Arantor

Even then most systems actually omit it anyway. Certainly Sphinx does by default (min length 3), MySQL FULLTEXT does (min length 3), and IIRC even a SMF custom index has a minimum length of 3 as well.

I can't imagine a piece of code that would meaningfully be found with a couple of characters either...

kat

For a coder, it's probably easier, because you'll know which bits can be omitted, without it affecting the result.

Us mere plebs don't have that advantage. :P

Kindred

I re-state.....

Quote from: Kindred on June 23, 2012, 11:47:17 PM
How is "a" critical in a search?

Searching for "treatise petunias" will bring up the post with the title or contents: "A treatise on petunias".

Searching for just "a" would bring up nearly every post....

Even Google excludes words like "a" and "an"....

kat

You misunderstand me... (Not surprisingly) :P

What I mean, is when you get a quoted error message, you often get single characters. Especially if that message contains ">", "/", "a=" and the like.

If you copy that error and paste it into "Search", you get that bloody annoying error and a non-coder won't know which characters to omit.

I get around it, myself, by using the "Site search" thingy, on Google (Or whatever).

Does that make more sense?

Arantor

Most of those characters won't be searchable anyway... all the search backends discard non word characters to some degree, some more than others.
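
As a rough illustration of the point above, here is a minimal sketch of what is left of a pasted error message once non-word characters are thrown away. The function name `searchable_terms` and the sample message are made up for this example, and the regular expression is only an assumption about what a "word character" is; the real backends each have their own rules.

```python
import re

def searchable_terms(query):
    """Keep only runs of word characters; punctuation and symbols are discarded.

    Assumption: a "word character" is whatever Python's \\w matches.
    """
    return re.findall(r"\w+", query.lower())

# Hypothetical pasted error text, not a real SMF message.
pasted = 'Unexpected ">" in a= near /index.php'
terms = searchable_terms(pasted)
print(terms)
```

The ">", "=" and "/" never reach the search at all; only the letter and digit runs survive, and the short ones among those are the next thing to be dropped.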

kat

Yeah, I realise that. But, if you perform such a search, it would be good for it to just get on with it and not respond with that damned annoying message.

Can you see where I'm coming from?

Arantor

I can, except that experience suggests it won't pan out like that anyway.

What would be the point of not throwing the error, only to return with no responses, or worse, so many irrelevant responses?

Human time is not cheap, and the error is invariably saving you from having many, many, many pointless results to wade through.

kat

Well, you have more knowledge about how these things work than I do, obviously.

I just saw the original post and thought "Yeah! That pisses me off, too! I wonder if something couldn't be done, to improve that!" and that was that. Would, perhaps, an option to "Search exact term/phrase" work better for that kinda thing, assuming the single/double character restriction was circumvented?

(As you know, I know diddly-squat about all this coding stuff. I'm just kinda tossing things around, to see if such things could work and are practicable.)

If, as Kindred seems to be implying, I'm talking total bollox (He may well be 100% correct, obviously), because it just ain't possible, I'd be fine, with that, as long as what you're saying's impossible, is what I actually mean to be asking for. (I'm not the best at expressing such things, as you may have noticed)

Arantor

Well, it isn't exactly possible, that's sort of the point.

Of the 4 methods of searching available for SMF, 2 of them will forcibly remove non word characters entirely and will strip words of one or two letters as well, and dealing with this is not something SMF can do; we're talking reconfiguration on a nasty technical level. (And one of these methods also comes with a very large list of words it will ignore anyway; things like 'the' are auto-excluded even though they're longer than 2 letters.)

Of the other 2, that's a bit more fixable, but there are much greater issues at stake here.

Humour me. Do a search here on 'sp'. That's SimplePortal, right? Or not, as a search will quickly tell you. And therein lies the problem: possible or not, the 2 character minimum is there to prevent you getting flooded with many many many more results than you actually want or need.

As far as phrase matching goes, you're still sort of stuffed because the search backends aren't smart enough to cope with that properly (except Sphinx, and *maybe* MySQL FULLTEXT though that's way too slow to be usable anyway), so it still wouldn't solve your problem anyway without a total rewrite of the search system.

kat

I can see your point. Especially about the "SP" thing.

But, with the facility to "Search exact phrase", would that, particularly with error message searches, obviate that?

If it could be done, so that "Exact term/phrase" means exactly that and nothing else, at all, would that be a possibility?

If I search for "A brown kitten" and expected just that and NOT "brown kitten", could that happen?

emanuele



K@ have you tried using the double quotes around the exact phrase?
http://www.simplemachines.org/community/index.php?action=search2;search="A brown kitten"

Arantor

That's the point, it won't at all.

IIRC, with the exception of unindexed, the 'a' in 'a brown kitten' won't even be matched in the first place, so it won't get into the pass of tokenisation to match against phrases.

OK, search engine theory time. This is, broadly, how all of the search backends (except unindexed) work.

You start off going through all the posts, looking for sequences of words, a word being defined as multiple word-characters joined together. You store each of the words you find and the position in which you found them. Thus, you have your index - a list of words, a list of the positions they were in and the posts that contain them.

When you perform a search, you do the same thing: you break the search query up into groups of word-characters, before looking those up in your big list of words.

Matching words on their own, then, is simply a case of looking them up in the index and seeing which posts match the most words together.

Matching phrases is much the same except that you're looking for words in the same post and in contiguous positions.
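
The steps above can be sketched in a few lines of Python. This is a toy illustration of a positional index, not SMF's actual code: the names `tokenize`, `build_index`, `match_words` and `match_phrase` and the sample posts are all invented here, and it deliberately does not drop short words yet.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")  # assumption: a word is a run of \w characters

def tokenize(text):
    """Split text into lowercase runs of word characters."""
    return [w.lower() for w in WORD.findall(text)]

def build_index(posts):
    """Map each word to {post_id: [positions where it occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for post_id, body in posts.items():
        for pos, word in enumerate(tokenize(body)):
            index[word][post_id].append(pos)
    return index

def match_words(index, query):
    """Rank posts by how many distinct query words they contain."""
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for post_id in index.get(word, {}):
            scores[post_id] += 1
    return sorted(scores, key=lambda p: (-scores[p], p))

def match_phrase(index, query):
    """Return posts where the query words sit in contiguous positions."""
    words = tokenize(query)
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for post_id, starts in index[words[0]].items():
        for start in starts:
            # Word i of the phrase must be found at position start + i.
            if all(start + i in index[w].get(post_id, ())
                   for i, w in enumerate(words)):
                hits.append(post_id)
                break
    return sorted(hits)

posts = {
    1: "A treatise on petunias",
    2: "My brown kitten chased a petunia",
    3: "The kitten is brown",
}
index = build_index(posts)
words_result = match_words(index, "brown kitten")
phrase_result = match_phrase(index, "brown kitten")
print(words_result, phrase_result)
```

Posts 2 and 3 both contain the two words, but only post 2 has them next to each other, so only post 2 matches as a phrase.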


This is how, ultimately one would match 'brown kitten'. But the 'a' will have been dropped long before it makes it into the index. Meaning that even if you could ignore the two-character requirement, there is no way to actually find it in the index, because it's not there.

In essence, when you search for 'a brown kitten', it's internally being parsed to 'brown kitten' before it gets thrown at the index.
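
A minimal sketch of that parsing step, assuming a minimum word length of 3 purely for illustration (the constant `MIN_WORD_LEN` and the function `index_terms` are hypothetical names, not taken from any backend):

```python
import re

MIN_WORD_LEN = 3  # assumed threshold, for illustration only

def index_terms(text, min_len=MIN_WORD_LEN):
    """Split into word-character runs, then drop anything shorter than min_len."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if len(w) >= min_len]

# The same function is applied to posts when indexing and to the query.
post_terms = index_terms("A brown kitten sat on the mat")
query_terms = index_terms("a brown kitten")
print(post_terms)
print(query_terms)
```

The 'a' is gone from both sides, so there is nothing in the index for it to match against even if the query were allowed to keep it.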

Now, you can relax the rules around how indexes are built, sure, you can alter the criteria for what a word is and what 'word characters' are, how many letters etc. However, you need to go round and rebuild the index (which for any sizeable forum is a LARGE task), assuming it's even possible (MySQL FULLTEXT, I'm looking at you) and assuming that the word you're searching for is even going to be matched ('a' will be excluded by MySQL FULLTEXT even if you change the minimum word length there, because of its commonality).

If you do decide to relax these restrictions, you will - guaranteed - have more white noise coming back in searches, for a vague and possible improvement to phrase searching. Even though it's still broken by the various mashing that goes on to deal with escaped characters, accented characters, inline bbc etc...

kat

Quote from: emanuele on June 25, 2012, 08:09:52 AM
K@ have you tried using the double quotes around the exact phrase?
http://www.simplemachines.org/community/index.php?action=search2;search="A brown kitten"

I have, yeah. It still abhors single characters, if I remember rightly. :(

Thanks, for that, Pete. I'd assumed that that's how they worked, kinda.

What I was wondering, probably in an "Ideal world" scenario, is whether it could be made, so that if you selected "Exact phrase", the indexer thingy would ignore other settings and look just for that exact phrase, regardless of what that phrase contained, and nothing else.

Or, am I living in Cloud Cuckoo Land?

Arantor

That's broadly how it works in the unindexed scenario, which doesn't have such an index but just goes through, brute force, looking for things. But as you can imagine, it's *slow*.
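
For comparison, a brute-force scan of the kind described above might look roughly like this. It is only a sketch: `brute_force_search` and the sample posts are invented for the example, and a plain case-insensitive substring test is an assumption rather than how the unindexed search is actually written.

```python
def brute_force_search(posts, phrase):
    """Scan every post body for the phrase; no index, so cost grows with forum size."""
    needle = phrase.lower()
    return [post_id for post_id, body in posts.items()
            if needle in body.lower()]

posts = {
    1: "A brown kitten",
    2: "The brown kitten",
    3: "Brown kittens",
}
result = brute_force_search(posts, "A brown kitten")
print(result)
```

Nothing is thrown away here, so the 'a' counts, but every search has to read every post.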

Which is why you build indexes that can be searched much more quickly. But the search system and the indexer have to work the same way otherwise the effort of building the index is wasted.
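
A small sketch of why the two sides have to agree, using a made-up word index and a minimum length passed in as a parameter (the names `terms`, `build_index` and `search_all` are hypothetical). The index is built with a minimum length of 3, then queried once with the same rule and once with a relaxed one:

```python
import re

def terms(text, min_len):
    """Word-character runs of at least min_len characters."""
    return [w for w in re.findall(r"\w+", text.lower()) if len(w) >= min_len]

def build_index(posts, min_len):
    """Map each kept word to the set of posts containing it."""
    index = {}
    for post_id, body in posts.items():
        for word in terms(body, min_len):
            index.setdefault(word, set()).add(post_id)
    return index

def search_all(index, query, min_len):
    """Posts that contain every term of the query."""
    sets = [index.get(w, set()) for w in terms(query, min_len)]
    return sorted(set.intersection(*sets)) if sets else []

posts = {1: "A brown kitten", 2: "Brown kitten photos"}
index = build_index(posts, min_len=3)

same_rules = search_all(index, "a brown kitten", min_len=3)
mismatched = search_all(index, "a brown kitten", min_len=1)
print(same_rules, mismatched)
```

With matching rules the query finds both posts. With the relaxed query rule it asks the index for 'a', which was never stored, and comes back with nothing.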

kat

Ah, yeah. The old speed consideration.

Lots to search through, here, I s'pose.

I'll have to stick with the ol' site search, then, ay? :)
