What's the difference between a full text search index and a custom search index?

Started by brynn, July 12, 2014, 10:28:19 PM

brynn

Hi Friends,
I've seen this info (http://wiki.simplemachines.org/smf/SMF2.0:Search_(admin)) in the online manual, but am looking for any further explanation.  Is there any other documentation about setting up the forum search?

Otherwise, what is the difference between the full text search index and a custom search index?  Are there certain circumstances where one would be preferable to the other?

Thanks for your help  :)

Branko.

Administration Center>Search>Search Method>Why create a search index? and read the explanation  ;D
Strong people don't put others down, they lift them up.
A clever person solves a problem. A wise person avoids it.

brynn

Thanks Branko.

I had read that too, in addition to the manual.  All it really tells me is that a custom index is bigger and has better performance than full text. 

But it doesn't really tell me much.  What makes a custom index different?  Is it because I can add search terms to it myself?  Or maybe it has more configurable options than a full text index or no index?

Does making a search index add a new database?


Arantor

Well, a fulltext index is one applied by MySQL itself to the table. There are constraints around when that can be used (table type, size of body field), but it's self-maintaining, largely automatic.

A custom index, on the other hand, is maintained solely by SMF code. SMF does all the hard work and in theory it can be tuned while a fulltext index can't be tuned unless you can tweak MySQL's own parameters.

Now, a custom index will generally be faster than the fulltext index; however, the fulltext index is better when you're dealing with matching similar words or partial words (something that generally can't be handled properly by SMF's index).

A custom index requires new tables, not a new database.

Which would I recommend? Honestly, if it were entirely up to me, I wouldn't use either. I'd use a separate search indexer like Sphinx or ElasticSearch, but there are some headaches around implementing one due to design limitations in SMF, which are improved in 2.1.

ApplianceJunk

I have "No Index" at this time.

My Fulltext index option says...

Quote
Index: cannot be created because the max message length is above 65,535 or table type is not MyISAM

My stats say I have 74969 Posts. Is max message length the same thing as number of posts?

In any case, I think I will go with the "do nothing" option at this time, unless someone changes my mind.

Thanks Arantor, or whoever you are? ;)

Arantor

No, the max message length is the setting you've configured for the maximum length of a single message.

No index is fine if searching isn't a huge deal.

Technically, I'm still Arantor; these days I use the interrobang as my display name, mostly to limit people sending me PMs ;)

ApplianceJunk

Oh, I completely misunderstood max message length. Seems to make sense now. Thanks!

I bookmarked your profile long ago as I find your posts very educational, entertaining and enjoyable to read. :D

Thanks!

mashby

I prefer Arantor, although now I am understanding the symbol. Reminds me of purple rain, although you are not a musician per se.
Always be a little kinder than necessary.
- James M. Barrie

Arantor

Actually the interrobang - ? combined with ! - was indicated on another forum I frequent as being the sarcasm symbol. It seemed apt ;)

brynn

Well the word "custom" leaves one with the impression that a custom index can be "customized".  But if I understand !?, one would need to be able to edit the SMF code to customize it.  So for most people who aren't programmers, it sounds like that would not really be a viable option.  Is that a correct understanding?

If my forum ever takes off, searching will become much more important.  At the moment, it's still very small.  But thank you so much for taking the time to explain it for me.  I'll also be interested to look into the 3rd party apps you mentioned.

Actually, I've suspected that there must be other options for search, ever since I registered here.  Because here we have the ability to choose where to search (attached screenie).  Maybe that has something to do with my Tiny Portal installation, I'm not sure.  But that would be a nice feature to have.  Do you (or anyone) know how to get that?  Is there a mod I missed?

One last question about searching.  My forum is for support for Inkscape (open source vector graphics editor) (like GIMP except for vector graphics).  One of Inkscape's tools is commonly called a Pen.  But if I search for "pen", I get results for "pencil", "open", "happen", "depend", etc.  Is there any way to have it search strictly for "pen" by itself?

Thanks again   :)

Arantor

The use of 'custom' indicates that it isn't a standard one, it's SMF's own. And it can be tweaked, to balance size vs speed, an option MySQL itself wouldn't offer you.

What you're pointing to is a mod, it's a modified version of the Search Focus Dropdown mod.

Match whole word... that's voodoo in any of the search systems since invariably that's not what people want...

brynn

Oh, I see.  It looks like I had found that mod, but it throws me an error when I try to install it.  I haven't learned how to manually install yet -- but  hopefully soon.

I'm not sure what you're implying with "voodoo"....maybe "mysterious"?  But I get the same results, whether I have "Match whole words only" checked or not.

Arantor

Oh, it's mysterious. It's sort of like magic except less understandable.

You see, computers have no benefit of context or meaning. To them, text is just data. There's no concept of words in computing, simply that certain groups of numbers together with other numbers in the middle are 'units' of content.

Now, all the search engines (though to a lesser degree SMF custom index) have the concept of stem matching (or prefix matching, it's got a bunch of names, though stemming is much more intricate), whereby you're not really matching complete words, but partial words to find variant spellings or similar words. Walking vs walker vs walks, for example. A computer still has no idea of the relevance, but given enough logic and data, it can make some degree of guess.
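To make the "walking vs walker vs walks" idea concrete, here is a toy suffix-stripping sketch in Python. Everything in it is an assumption for illustration: the suffix list and the `naive_stem` function are made up, and it is not how SMF's custom index or MySQL fulltext actually work. Real stemmers are far more intricate.

```python
# Hypothetical, deliberately tiny suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 leading characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["walking", "walker", "walks", "paper"]:
    print(w, "->", naive_stem(w))
```

The first three words all reduce to "walk", which is the desired grouping. "paper" reduces to "pap", which is nonsense, and that kind of false match is why a rule this simple is not enough.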

Matching exact words in such cases is... complicated... because it has to try to weight exact matching vs similar for various degrees of similar, and the entire process is more than a little magical (and so badly understood even by people who've spent time in the field)

brynn

Ok -- as my mom would say "clear as mud"!

Thank you so much for taking the time to answer my questions.  Now I have a much better idea about searching, and how I want to set it up in my site.

All best   :D

Arantor

Well, just to clarify one thing you didn't ask, but would be prudent for you to understand, since it will answer one unasked question.

What does 'no index' searching actually do? By comparison what does an index actually do? This may go some way to explaining why it is voodoo.

No index means brute force. It just goes through message by message, looking for the sequence of letters you asked for. Since it has no understanding of words or anything, it's simply looking for any instance that matches. For example, 'pen' will match 'pencil', 'depend' and so on, simply because they contain those three letters. It's not fast and it's subject to wild inaccuracies for that sort of reason; while one might infer pencil from pen, depend is almost certainly a result you don't want.
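As a rough sketch of the brute-force behaviour described above (Python, with made-up sample posts and a hypothetical `brute_force_search` function; SMF itself does this in the database, not like this):

```python
def brute_force_search(posts, term):
    """Return ids of posts whose body contains `term` as a raw substring."""
    needle = term.lower()
    # Every post is scanned in full; there is no notion of "word" here.
    return [post_id for post_id, body in posts.items() if needle in body.lower()]


# Made-up sample data for illustration.
posts = {
    1: "Use the Pen tool to draw a path.",
    2: "The pencil tool is easier for freehand.",
    3: "Results depend on your settings.",
    4: "Nothing relevant here.",
}

print(brute_force_search(posts, "pen"))  # [1, 2, 3]
```

Posts 2 and 3 come back only because "pencil" and "depend" happen to contain the letters p-e-n.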

So to combat this, we build indexes. An index, effectively, is told what things are words (sequences of characters that are allowed to appear in words, where there are at least x such characters, usually 4) and it simply builds a list of what words are in what posts. If you already know what there is, you can simplify the searching process dramatically under the right circumstances.
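A minimal sketch of that "list of what words are in what posts" idea, in Python. The `build_index` and `lookup` names, the letters-and-digits definition of a word, and the sample posts are all assumptions for illustration, not SMF's actual implementation:

```python
import re
from collections import defaultdict


def build_index(posts, min_len=4):
    """Map each word of at least `min_len` characters to the ids of posts containing it."""
    index = defaultdict(set)
    for post_id, body in posts.items():
        # Assumption: a "word" is a run of ASCII letters or digits.
        for word in re.findall(r"[a-z0-9]+", body.lower()):
            if len(word) >= min_len:
                index[word].add(post_id)
    return index


def lookup(index, word):
    """Return the sorted ids of posts containing exactly this word."""
    return sorted(index.get(word.lower(), set()))


# Made-up sample data for illustration.
posts = {
    1: "Use the Pen tool to draw a path.",
    2: "The pencil tool is easier for freehand.",
    3: "Results depend on your settings.",
}

index = build_index(posts)
print(lookup(index, "depend"))                        # [3]
print(lookup(index, "pen"))                           # [] (shorter than min_len)
print(lookup(build_index(posts, min_len=3), "pen"))   # [1]
```

A lookup is now a single dictionary access instead of a scan of every post, and "pen" no longer drags in "depend". The flip side is that a minimum length of 4 makes three-letter words unsearchable through the index.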

But this is where stemming gets insane. Most setups implement stemming as simple prefixing, as in if you had a post with 'pencil' in it, the index would probably end up containing 'pen', 'penc', 'penci' and 'pencil' as related matches. But it's a hugely complex task.
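The "simple prefixing" version can be sketched like this in Python. The `prefixes` and `build_prefix_index` helpers and the minimum prefix length of 3 are assumptions chosen to reproduce the 'pencil' example, not a description of any particular engine:

```python
import re
from collections import defaultdict


def prefixes(word, min_len=3):
    """All leading slices of `word` that are at least `min_len` characters long."""
    return [word[:n] for n in range(min_len, len(word) + 1)]


def build_prefix_index(posts, min_len=3):
    """Index every prefix of every word, so 'pen' also finds 'pencil'."""
    index = defaultdict(set)
    for post_id, body in posts.items():
        for word in re.findall(r"[a-z0-9]+", body.lower()):
            for prefix in prefixes(word, min_len):
                index[prefix].add(post_id)
    return index


# Made-up sample data for illustration.
posts = {
    1: "Use the Pen tool.",
    2: "The pencil tool is easier.",
    3: "Results depend on settings.",
}

index = build_prefix_index(posts)
print(prefixes("pencil"))    # ['pen', 'penc', 'penci', 'pencil']
print(sorted(index["pen"]))  # [1, 2]
```

Here 'pen' finds 'pencil' but not 'depend', because only the start of each word is indexed. The cost is a much larger index, since every word is stored several times over.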

For example, is ' (apostrophe) a word character or not? On the one hand, you want to match "don't" or "wouldn't", but you don't want to exclude a match on the word example when 'example' (in single quotes) is in the text. This is where it starts getting into real voodoo, because there have been attempts to fix this stuff and make a sane guess as to what was intended, but second-guessing people is invariably bad.
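The apostrophe dilemma is easy to see with two toy tokenizers in Python. Both regular expressions and the sample sentence are assumptions for illustration; neither is what SMF or MySQL uses:

```python
import re

text = "I don't like the 'example' here"


def tokenize_with_apostrophe(s):
    """Treat the apostrophe as a word character."""
    return re.findall(r"[a-z']+", s.lower())


def tokenize_without_apostrophe(s):
    """Treat the apostrophe as a separator."""
    return re.findall(r"[a-z]+", s.lower())


print(tokenize_with_apostrophe(text))
# ['i', "don't", 'like', 'the', "'example'", 'here']
print(tokenize_without_apostrophe(text))
# ['i', 'don', 't', 'like', 'the', 'example', 'here']
```

With the first rule "don't" survives, but the quoted word is stored as "'example'" and a search for plain "example" misses it. With the second rule "example" is found, but "don't" is split into "don" and "t". Either choice breaks one of the two cases.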

margarett

If you're going to drive, don't drink. If you're going to drink... CALL ME!!!! :D

Quote
Over 90% of all computer problems can be traced back to the interface between the keyboard and the chair

Biology Forums

In terms of performance (speed of search and database calls), which one is recommended, Fulltext index or Custom index? (and what's the difference between the three options here)...

Arantor

I already explained the difference between no-index and having an index.

Fulltext is probably 'better' in terms of relevance, custom was always historically marginally faster but less useful, and custom is certainly larger in the database compared to a fulltext index.

Number of database calls is an irrelevant metric and should be disregarded for any sane discussion on this topic.

If performance is your only criterion, custom. However, note that 1) it's less useful in practice than fulltext, leading to 2) people doing more searches with it, leading to 3) is performance your only, or merely primary, criterion?

I could go into a long diatribe about how performance should not be your only factor in making such decisions and also as to how the mechanics of each might play a part; this is not even the same between 1.1.x and 2.0, let alone the same between different sites. There are also entire books that could be written to cover topics beyond merely this question, like whether you start involving analysers, stemmers and various things (btw, neither fulltext nor custom implements these, though at least with fulltext it would be possible under very specific situations).

Biology Forums

What's mainly on my mind is, will I still be able to make exact search queries with quotation marks on either setting?

Arantor

What are you calling "exact search queries"? Because I guarantee what you call that is not what either index type will call it.

Biology Forums

"can I search something like this" for example, and can I search something like this will be found, word for word.

Arantor

No, you should take it as the fact it is, that it doesn't work exactly how you envisage it under any circumstances.
