Rethinking search

Started by Arantor, April 09, 2022, 09:10:29 AM

Arantor

I think it's time to talk about rethinking search.

2.1 does sort of nibble at the edges of what should be done, but ultimately the whole thing needs a rework. The rules have changed since 2006, and the needs have changed too.


First up, getting rid of the custom index. The value of it is simply not what it once was. The lack of stemming support is also painful for usability, and its performance advantage over MySQL's own fulltext no longer holds in any case I can find.

Secondly, making it actually pluggable - bits of the code for the custom index and the fulltext index are intermashed throughout. Basically, it should be a proper API where the code simply says 'here's a new message, board 1, topic 2, message 3, its content is XYZ' and the search backend figures out how to put that into the index. Similarly the search frontend should then solely query the API.
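
A rough sketch of what such an API could look like, written in Python rather than SMF's PHP purely for brevity. Every name here (`SearchBackend`, `index_message`, `InMemoryBackend`) is hypothetical and not part of SMF; the in-memory backend only stands in for a real index.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from collections import defaultdict


class SearchBackend(ABC):
    """Hypothetical interface: the forum code only ever talks to this."""

    @abstractmethod
    def index_message(self, board_id: int, topic_id: int, message_id: int, content: str) -> None:
        """Add or update one message in the index."""

    @abstractmethod
    def remove_message(self, message_id: int) -> None:
        """Drop one message from the index."""

    @abstractmethod
    def search(self, query: str) -> list[int]:
        """Return ids of messages containing every query term."""


class InMemoryBackend(SearchBackend):
    """Toy backend; a fulltext, Sphinx or ES backend would implement the same methods."""

    def __init__(self) -> None:
        self._postings: dict[str, set[int]] = defaultdict(set)  # word -> message ids
        self._words: dict[int, set[str]] = {}                   # message id -> its words

    def index_message(self, board_id, topic_id, message_id, content):
        self.remove_message(message_id)  # re-indexing replaces the old entry
        words = set(content.lower().split())
        self._words[message_id] = words
        for word in words:
            self._postings[word].add(message_id)

    def remove_message(self, message_id):
        for word in self._words.pop(message_id, set()):
            self._postings[word].discard(message_id)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(hits)


backend = InMemoryBackend()
backend.index_message(board_id=1, topic_id=2, message_id=3, content="its content is XYZ")
print(backend.search("xyz"))
```

The point of the split is that the posting code and the search page never need to know which backend is active.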

Thirdly, supporting other content types. The search index is currently forum only, but making it know about different kinds of content wouldn't be particularly hard to do. Define content types (such as message), plus a scope (topic) and a permission boundary (board), and you can build a system that can cope with searching all the things from all the places. LevGal would totally have used this if it had been an option.
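
To make the content type / scope / permission boundary idea concrete, here is a minimal Python sketch. The `IndexedItem` record, the `search_items` function and the `gallery_item` type are all assumptions made up for illustration; nothing like this exists in SMF or LevGal today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedItem:
    """One searchable thing, whatever kind of content it is (hypothetical shape)."""
    content_type: str   # e.g. "message"
    content_id: int
    scope: tuple        # grouping for results, e.g. ("topic", 2)
    boundary: tuple     # where permissions are checked, e.g. ("board", 1)
    text: str


def search_items(items, query, allowed_boundaries):
    """Return items matching every query term that the user is allowed to see."""
    terms = query.lower().split()
    results = []
    for item in items:
        if item.boundary not in allowed_boundaries:
            continue  # permission check happens before any text matching
        words = item.text.lower().split()
        if terms and all(term in words for term in terms):
            results.append(item)
    return results


items = [
    IndexedItem("message", 3, ("topic", 2), ("board", 1), "Rethinking search for the forum"),
    IndexedItem("message", 9, ("topic", 5), ("board", 4), "Private staff search notes"),
    IndexedItem("gallery_item", 12, ("album", 7), ("album", 7), "Search screenshots"),
]
# A user who can see board 1 and album 7, but not board 4.
visible = search_items(items, "search", {("board", 1), ("album", 7)})
print([(i.content_type, i.content_id) for i in visible])
```

Because the boundary is just data on the record, a mod can register its own content type without the search code knowing anything about it.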

Fourth, the search index shouldn't just accept the bare content - ever. This is surprisingly ineffective in SMF in general if you're not careful, and renders a lot of things very unhelpful. What it should do is parse the bbcode, remove things that are quotes of other messages (where they can be reliably identified as such), flatten out the preserved spacing and index the content that's left. This also means, incidentally, that there's no bootstrapping of the index on top of the messages table, so no fudging around with column types to support fulltext on older MySQL versions. (Side, side note: just make it a mediumtext already)
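
A minimal Python sketch of that clean-up step. It uses regular expressions as a stand-in for a real bbcode parser, which is a simplification: the function name `to_index_text` is hypothetical, and the tag pattern will also eat any bracketed text that merely looks like a tag.

```python
import re

# Matches an innermost [quote ...]...[/quote] pair (one with no nested [quote inside).
_INNER_QUOTE = re.compile(
    r"\[quote[^\]]*\](?:(?!\[quote)[\s\S])*?\[/quote\]", re.IGNORECASE
)
# Matches any remaining opening or closing tag such as [b], [/b] or [url=...].
_ANY_TAG = re.compile(r"\[/?[a-zA-Z][^\]\n]*\]")


def to_index_text(raw: str) -> str:
    """Reduce a raw post to the text worth indexing."""
    text = raw
    # Remove quotes from the inside out so nested quotes are handled.
    while True:
        text, count = _INNER_QUOTE.subn(" ", text)
        if count == 0:
            break
    # Unbalanced quotes are left alone above; only their tags are stripped here.
    text = _ANY_TAG.sub(" ", text)
    # Flatten preserved spacing (runs of spaces, tabs, newlines) to single spaces.
    return " ".join(text.split())


post = "[quote author=Arantor]outer [quote]inner[/quote] more[/quote]\n\nI [b]agree[/b]   with\tthis."
print(to_index_text(post))
```

Only well-formed quote pairs are dropped, which matches the "where they can be reliably identified" caveat: a dangling `[quote]` keeps its text in the index.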

Fifth, the search system should support its own index by default (with none of this 'no index' nonsense), plus ideally ElasticSearch and maybe Sphinx; ES won the war even if Sphinx still wins on performance, just because ES is easier to throw things into.

Last, make the index rebuild a background task.


Most people won't see a significant alteration to their life, other than the folks who never had a search index before now getting more space usage, and search improves for everyone. On top of that the folks who do the modding thing should be able to hook into the search system meaning that content just becomes more discoverable by design.

Other than the effort of implementation I see no downside to any of this.

spiros

I once did make a recommendation on ES (used by Xenforo among others). You replied there too!

I would also like to see something like search autocomplete out of the box (for example with a separate index just for topic subjects). It is the way search is done nowadays. Bugo has written a nice mod to provide similar topics autocomplete when creating new topics, this could be enhanced so as to run on standard search as well as support Sphinx.

Arantor

Search autocomplete is actually massively hard to do. It only works well against certain kinds of corpus, and unless you're doing it just against titles (which isn't particularly useful - not even in 'similar topics' contexts), it's hard to do *well*.

The only reason Google does it even passably well is because they're basing it fairly heavily on what people search for, and can use the volume of search queries to improve that index - something the rest of us simply won't have.

spiros

The idea is using it just against titles :) I think it is still much better and faster at retrieving certain kinds of information than not having it there.

Arantor

I think you'd want that to be a configurable option; there are use cases this would work really well in - I think your site would *definitely* benefit. But on this site, for example, it would actually be counterproductive.

spiros

Well, certainly an option. Just imagine it this way: you get 5-10 suggestions based on your search, if you see something that is immediately helpful you go directly to that topic, failing that, you just hit Enter to get the full search results. I see that as helpful without disrupting the traditional search workflow. It can actually save time.

To give you an example, if I want to search here how to increase subject length I may get a lot of irrelevant posts with the traditional search, with autocomplete (searching in topics only), I can find and access what I want much faster.
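
As a sketch of the titles-only idea, here is a small Python example that suggests topics whose subject contains a word starting with each term typed so far. The function name `suggest_topics`, the sample subjects and the prefix-matching rule are all assumptions for illustration; a real implementation would query a dedicated subject index rather than loop over a dict.

```python
def suggest_topics(subjects, query, limit=10):
    """Return up to `limit` (topic_id, subject) pairs matching the partial query.

    A subject matches when every query term is a prefix of one of its words.
    """
    terms = query.lower().split()
    if not terms:
        return []
    matches = []
    for topic_id, subject in subjects.items():
        words = subject.lower().split()
        if all(any(word.startswith(term) for word in words) for term in terms):
            matches.append((topic_id, subject))
            if len(matches) >= limit:
                break
    return matches


# Hypothetical topic subjects keyed by topic id.
subjects = {
    1: "How to increase subject length",
    2: "Help with a mod",
    3: "Subject line too short",
}
print(suggest_topics(subjects, "subj len"))
```

If none of the suggestions help, the user presses Enter and falls through to the normal search, so the traditional workflow is untouched.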

Arantor

It *can*, but for most people it *doesn't tend to*; I've observed this in practice in other contexts. It relies too much on people putting in good titles, which in a lot of communities just isn't what people do.

spiros

Yes, you are bound to get a lot of "help with a..." topics. Still, I find it better than nothing since it is an extra functionality and not the whole functionality.

Sesquipedalian

Quote from: Arantor on April 09, 2022, 09:10:29 AM
I think it's time to talk about rethinking search.

[snip...]

All of these suggestions make sense to me.
I promise you nothing.

spiros

Well, at the very least 2.1 does away with the dreaded "Each word in your search query must be at least two characters long" error...

Col

Is there any reasonable option for adding ElasticSearch or Sphinx to an SMF 2.1 forum?

Arantor

I assume the Sphinx connector was updated because I assume it's still in use here.

There are no connectors for ES at this time and honestly, so much more work needs to happen under the hood to make it in any way worthwhile.

Col

Quote from: Arantor on February 21, 2023, 10:34:42 AM
I assume the Sphinx connector was updated because I assume it's still in use here.

There are no connectors for ES at this time and honestly, so much more work needs to happen under the hood to make it in any way worthwhile.

Is that an SMF mod, Arantor? If it is, I cannot find it when I search there.

Arantor


Col


spiros

Quote from: Col on February 21, 2023, 10:32:22 AM
Is there any reasonable option for adding ElasticSearch or Sphinx to an SMF 2.1 forum?

You may not have access (Big Forum Discussion), but it was brought up in 2013 https://www.simplemachines.org/community/index.php?topic=494134.0

Given the speed and functionality of Sphinx (or, even better, Manticore), ElasticSearch would be slower and overkill.

Col

Quote from: spiros on February 23, 2023, 02:56:49 AM
Quote from: Col on February 21, 2023, 10:32:22 AM
Is there any reasonable option for adding ElasticSearch or Sphinx to an SMF 2.1 forum?

You may not have access (Big Forum Discussion), but it was brought up in 2013 https://www.simplemachines.org/community/index.php?topic=494134.0

Given the speed and functionality of Sphinx (or, even better, Manticore), ElasticSearch would be slower and overkill.

I have access to the Big Board area. I find the Sphinx/Manticore solution interesting. But if I understand correctly, the search database must be totally rebuilt to update it. Yes? How long does that take for several million posts? I assume this would require significant maintenance downtime for a forum with several million posts. So this is probably a non-starter for me. Hopefully I've misunderstood how the search database is updated.

Arantor

That's kind of one of the problems: while Sphinx has supported real-time index updates for years, the connector doesn't. Even if SphinxQL supports it (which is the reason in the Sphinx bridge for why it's listed as not supporting it), the connector doesn't.

Solutions for this were documented in the Sphinx docs whereby you build the main index, then have a delta index so that you have the bulk of content in the main index which you update infrequently, and then have recent stuff covered by the delta index.
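
The main-plus-delta arrangement can be illustrated with a toy in-memory model in Python. This is not how Sphinx or the SMF connector is implemented; the class `MainPlusDelta` and its methods are hypothetical, and the sketch only covers new messages, not edits or deletions of old ones.

```python
class MainPlusDelta:
    """Toy model: a big main index rebuilt rarely, plus a small delta rebuilt often."""

    def __init__(self):
        self.main = {}        # message_id -> text, covers everything up to main_max_id
        self.delta = {}       # message_id -> text, covers only newer messages
        self.main_max_id = 0  # highest message id contained in the main index

    def rebuild_main(self, messages):
        """Expensive full rebuild over the whole message table."""
        self.main = dict(messages)
        self.main_max_id = max(self.main, default=0)
        self.delta = {}

    def rebuild_delta(self, messages):
        """Cheap rebuild: only messages newer than the last main rebuild."""
        self.delta = {
            mid: text for mid, text in messages.items() if mid > self.main_max_id
        }

    def search(self, term):
        """Query both indexes and merge the results."""
        term = term.lower()
        hits = [
            mid
            for index in (self.main, self.delta)
            for mid, text in index.items()
            if term in text.lower().split()
        ]
        return sorted(hits)


messages = {1: "sphinx index", 2: "delta index"}
idx = MainPlusDelta()
idx.rebuild_main(messages)            # infrequent, slow on millions of posts
messages[3] = "manticore index"       # a new post arrives
idx.rebuild_delta(messages)           # frequent, touches only message 3
print(idx.search("index"))
```

The delta stays small because each main rebuild resets it, which is why rerunning it frequently is cheap.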

Col

Quote from: Arantor on February 23, 2023, 05:52:19 PM
That's kind of one of the problems: while Sphinx has supported real-time index updates for years, the connector doesn't. Even if SphinxQL supports it (which is the reason in the Sphinx bridge for why it's listed as not supporting it), the connector doesn't.

Solutions for this were documented in the Sphinx docs whereby you build the main index, then have a delta index so that you have the bulk of content in the main index which you update infrequently, and then have recent stuff covered by the delta index.

So, if I understand you correctly, the delta index can be rerun quite quickly (minimal interruption). And then, once in a while, a total rebuild of the search database using the whole post table. Yeah, I can see that being more workable, but it remains messy and disruptive (even if to a lesser degree).

It would be great for SMF to have an integrated solution for this. Is it on the SMF 2.2 (3.0) wish list? Not that I would expect a new version of SMF for a very long time.

Arantor

Yes, the intent for the delta index was to rerun it every 15 minutes. It is all in the docs.

Quote from: Col on February 23, 2023, 06:09:39 PM
Is it on the SMF 2.2 (3.0) wish list?

I have no idea. I've been asking for the roadmap for almost a year. There was even a period I was prepared to actually contribute code, but since I have no idea what's actually in the roadmap (I'm told there is one), I've sort of given up waiting to learn what is worth actually contributing.
