Rethinking search

Started by Arantor, April 09, 2022, 09:10:29 AM

Arantor

I think it's time to talk about rethinking search.

2.1 does sort of nibble at the edges of what should be done, but ultimately the whole thing needs a rework. The rules have changed since 2006, and the needs have changed too.


First up, getting rid of the custom index. The value of it is simply not what it once was. The lack of stemming support is also painful for usability, and its supposed performance advantage over MySQL's own fulltext no longer holds in any case I can find.

Secondly, making it actually pluggable - bits of the code for the custom index and the fulltext index are intermashed throughout. Basically, it should be a proper API where the code simply says 'here's a new message, board 1, topic 2, message 3, its content is XYZ' and the search backend figures out how to put that into the index. Similarly the search frontend should then solely query the API.
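A minimal sketch of what such a backend API could look like, written in Python purely for illustration (SMF itself is PHP). Every name here (`Message`, `SearchBackend`, `InMemoryBackend`) is hypothetical and not part of SMF; the in-memory backend only shows that the calling code never needs to know how the index is stored.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    board_id: int
    topic_id: int
    message_id: int
    content: str


class SearchBackend(ABC):
    """Hypothetical interface: the forum only ever talks to these methods."""

    @abstractmethod
    def index_message(self, message: Message) -> None: ...

    @abstractmethod
    def remove_message(self, message_id: int) -> None: ...

    @abstractmethod
    def search(self, query: str) -> list[int]: ...


class InMemoryBackend(SearchBackend):
    """Toy backend: word -> set of message ids. A real one might wrap
    MySQL fulltext, Sphinx or ElasticSearch behind the same methods."""

    def __init__(self) -> None:
        self._postings: dict[str, set[int]] = {}

    def index_message(self, message: Message) -> None:
        # Re-indexing an edited message first drops its old entries.
        self.remove_message(message.message_id)
        for word in message.content.lower().split():
            self._postings.setdefault(word, set()).add(message.message_id)

    def remove_message(self, message_id: int) -> None:
        for ids in self._postings.values():
            ids.discard(message_id)

    def search(self, query: str) -> list[int]:
        words = query.lower().split()
        if not words:
            return []
        # A message matches only if it contains every query word.
        hits = set.intersection(*(self._postings.get(w, set()) for w in words))
        return sorted(hits)
```

The posting code would call `index_message(...)` and the search page would call `search(...)`, so swapping the backend would not touch either caller.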

Thirdly, supporting other content types. The search index is currently forum only, but making it know about different kinds of content wouldn't be particularly hard to do. Define content types (such as message), plus a scope (topic) and a permission boundary (board), and you can build a system that can cope with searching all the things from all the places. LevGal would totally have used this if it had been an option.
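As a rough illustration of the content type / scope / permission boundary idea, here is a small Python sketch. The field names, the `"board:1"` style identifiers and the `gallery_item` type are all assumptions made up for this example, not anything SMF or LevGal defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDocument:
    content_type: str  # e.g. "message"; a gallery mod might register "gallery_item"
    content_id: int
    scope: str         # grouping unit, e.g. "topic:2"
    boundary: str      # permission boundary, e.g. "board:1"
    text: str


def search_visible(documents, query, allowed_boundaries, content_types=None):
    """Return documents matching the query that the user is allowed to see.

    allowed_boundaries: set of boundary ids the current user can access.
    content_types: optional set limiting the search to some content types.
    """
    needle = query.lower()
    return [
        doc
        for doc in documents
        if doc.boundary in allowed_boundaries
        and (content_types is None or doc.content_type in content_types)
        and needle in doc.text.lower()
    ]
```

The point of the design is that the permission check lives in one place and works the same way for every content type.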

Fourth, the search index shouldn't just accept the bare content - ever. This is surprisingly ineffective in SMF in general if you're not careful, and renders a lot of things very unhelpful. What it should do is parse the bbcode, remove things that are quotes of other messages (where they can be reliably identified as such), flatten out the preserved spacing and index the content that's left. This also means, incidentally, that there's no bootstrapping of the index on top of the messages table, so no fudging around of column type to support fulltext on older MySQL versions. (Side, side note: just make it a mediumtext already)
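A rough Python sketch of that clean-up step, under stated assumptions: it uses regular expressions rather than a real bbcode parser, treats anything inside `[quote ...]...[/quote]` as quoted material, and simply drops all other tags while keeping their inner text. The function name `text_for_index` is made up for this example.

```python
import re

# Matches an innermost quote block: its body contains no further opening quote tag.
_INNER_QUOTE = re.compile(
    r"\[quote\b[^\]]*\](?:(?!\[quote\b).)*?\[/quote\]",
    re.IGNORECASE | re.DOTALL,
)
# Matches any remaining bbcode-style tag such as [b], [/b] or [url=...].
_ANY_TAG = re.compile(r"\[/?[a-z][a-z0-9_]*(?:[= ][^\]]*)?\]", re.IGNORECASE)


def text_for_index(bbcode: str) -> str:
    """Strip quoted material and tags, then collapse whitespace."""
    text = bbcode
    # Remove quotes from the inside out so nested quotes are handled.
    while True:
        stripped = _INNER_QUOTE.sub(" ", text)
        if stripped == text:
            break
        text = stripped
    text = _ANY_TAG.sub(" ", text)
    # Flatten preserved spacing and line breaks into single spaces.
    return " ".join(text.split())
```

For instance, a post that quotes another message and then adds a sentence of its own would be indexed as just that sentence.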

Fifth, the search system should support its own index by default (with none of this 'no index' nonsense), plus ideally ElasticSearch and maybe Sphinx; ES won the war even if Sphinx still wins on performance, just because ES is easier to throw things into.

Last, make the index rebuild a background task.


Most people won't see a significant alteration to their life, other than the folks who never had a search index before, who will now use more disk space, and search improves for everyone. On top of that, the folks who do the modding thing should be able to hook into the search system, meaning that content just becomes more discoverable by design.

Other than the effort of implementation I see no downside to any of this.

spiros

I once did make a recommendation on ES (used by Xenforo among others). You replied there too!

I would also like to see something like search autocomplete out of the box (for example with a separate index just for topic subjects). It is the way search is done nowadays. Bugo has written a nice mod to provide similar topics autocomplete when creating new topics; this could be enhanced so as to run on standard search as well as support Sphinx.

Arantor

Search autocomplete is actually massively hard to do. It only works well against certain kinds of corpus, and unless you're doing it just against titles (which isn't particularly useful - not even in 'similar topics' contexts), it's hard to do *well*.

The only reason Google does it even passably well is because they're basing it fairly heavily on what people search for, and can use the volume of search queries to improve that index - something the rest of us simply won't have.

spiros

The idea is using it just against titles :) I think it is still much better and faster at retrieving certain kinds of information than not having it there.

Arantor

I think you'd want that to be a configurable option. There are use cases this would work really well in - I think your site would *definitely* benefit. But on this site, for example, it would actually be counterproductive.

spiros

Well, certainly an option. Just imagine it this way: you get 5-10 suggestions based on your search. If you see something that is immediately helpful, you go directly to that topic; failing that, you just hit Enter to get the full search results. I see that as helpful without disrupting the traditional search workflow. It can actually save time.

To give you an example, if I want to search here for how to increase subject length, I may get a lot of irrelevant posts with the traditional search; with autocomplete (searching in topic titles only), I can find and access what I want much faster.
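A minimal sketch of title-only suggestions as described above, in Python. The function name, the word-prefix matching rule and the default limit are assumptions for illustration; a real implementation would query a dedicated subject index rather than loop over a list.

```python
def suggest_titles(titles, typed, limit=5):
    """Return up to `limit` titles in which every typed term is the
    prefix of some word in the title (case-insensitive)."""
    terms = typed.lower().split()
    if not terms:
        return []
    matches = []
    for title in titles:
        words = title.lower().split()
        if all(any(word.startswith(term) for word in words) for term in terms):
            matches.append(title)
            if len(matches) == limit:
                break
    return matches
```

If none of the suggestions help, the user presses Enter and gets the normal full search, so this sits in front of the existing workflow rather than replacing it.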

Arantor

It *can*, but for most people it *doesn't tend to*; I've observed this in practice in other contexts. It relies too much on people putting in good titles, which in a lot of communities just isn't what people do.

spiros

Yes, you are bound to get a lot of "help with a..." topics. Still, I find it better than nothing since it is an extra functionality and not the whole functionality.

Sesquipedalian

Quote from: Arantor on April 09, 2022, 09:10:29 AM
I think it's time to talk about rethinking search.

[snip...]

All of these suggestions make sense to me.
I promise you nothing.

Sesqu... Sesqui... what?
Sesquipedalian, the best word in the English language.

spiros

Well, at the very least 2.1 does away with the dreaded "Each word in your search query must be at least two characters long"...

Col

Is there any reasonable option for adding ElasticSearch or Sphinx to an SMF 2.1 forum?

Arantor

I assume the Sphinx connector was updated because I assume it's still in use here.

There are no connectors for ES at this time and honestly, so much more work needs to happen under the hood to make it in any way worthwhile.

Col

Quote from: Arantor on February 21, 2023, 10:34:42 AM
I assume the Sphinx connector was updated because I assume it's still in use here.

There are no connectors for ES at this time and honestly, so much more work needs to happen under the hood to make it in any way worthwhile.

Is that an SMF mod, Arantor? If it is, I cannot find it when I search there.

Arantor


Col


spiros

Quote from: Col on February 21, 2023, 10:32:22 AM
Is there any reasonable option for adding ElasticSearch or Sphinx to an SMF 2.1 forum?

You may not have access (Big Forum Discussion), but it was brought up in 2013 https://www.simplemachines.org/community/index.php?topic=494134.0

Given the speed and functionality of Sphinx (or, even better, Manticore) ElasticSearch would be a slower overkill.

Col

Quote from: spiros on February 23, 2023, 02:56:49 AM
Quote from: Col on February 21, 2023, 10:32:22 AM
Is there any reasonable option for adding ElasticSearch or Sphinx to an SMF 2.1 forum?

You may not have access (Big Forum Discussion), but it was brought up in 2013 https://www.simplemachines.org/community/index.php?topic=494134.0

Given the speed and functionality of Sphinx (or, even better, Manticore) ElasticSearch would be a slower overkill.

I have access to the Big Board area. I find the Sphinx/Manticore solution interesting. But if I understand correctly, the search database must be totally rebuilt to update it. Yes? How long does that take for several million posts? I assume this would require significant maintenance downtime for a forum with several million posts. So this is probably a non-starter for me. Hopefully I've misunderstood how the search database is updated.

Arantor

That's kind of one of the problems: while Sphinx has supported real time index updates for years, the connector doesn't. Even if SphinxQL supports it (which is the reason in the Sphinx bridge for why it's listed as not supporting it), the connector doesn't.

Solutions for this were documented in the Sphinx docs whereby you build the main index, then have a delta index so that you have the bulk of content in the main index which you update infrequently, and then have recent stuff covered by the delta index.
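The main-plus-delta idea can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not Sphinx itself: posts are a dict of id to text, "indexes" are plain dicts, and the class and method names are made up for the example.

```python
class MainPlusDelta:
    """Toy model: a big main index rebuilt rarely, plus a small delta
    index covering posts newer than the last main rebuild."""

    def __init__(self):
        self.main = {}
        self.delta = {}
        self.max_indexed_id = 0  # highest post id covered by the main index

    def rebuild_main(self, posts):
        # Expensive: covers everything that exists right now.
        self.max_indexed_id = max(posts, default=0)
        self.main = dict(posts)
        self.delta = {}

    def rebuild_delta(self, posts):
        # Cheap: only posts added since the last main rebuild.
        self.delta = {
            pid: text for pid, text in posts.items() if pid > self.max_indexed_id
        }

    def search(self, word):
        # Queries look in both indexes and merge the results.
        needle = word.lower()
        hits = [
            pid
            for index in (self.main, self.delta)
            for pid, text in index.items()
            if needle in text.lower()
        ]
        return sorted(hits)
```

One limitation of this simple id-based split, at least in this sketch, is that edits to posts already in the main index are not picked up until the next main rebuild.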

Col

Quote from: Arantor on February 23, 2023, 05:52:19 PM
That's kind of one of the problems: while Sphinx has supported real time index updates for years, the connector doesn't. Even if SphinxQL supports it (which is the reason in the Sphinx bridge for why it's listed as not supporting it), the connector doesn't.

Solutions for this were documented in the Sphinx docs whereby you build the main index, then have a delta index so that you have the bulk of content in the main index which you update infrequently, and then have recent stuff covered by the delta index.

So, if I understand you correctly, the delta index can be rerun quite quickly (minimal interruption). And then, once in a while, a total rebuild of the search database using the whole post table. Yeah, I can see that being more workable, but it remains messy and disruptive (even if to a lesser degree).

It would be great for SMF to have an integrated solution for this. Is it on the SMF 2.2 (3.0) wish list? Not that I would expect a new version of SMF for a very long time.

Arantor

Yes, the intent for the delta index was to rerun it every 15 minutes. It is all in the docs.

Quote from: Col on February 23, 2023, 06:09:39 PM
Is it on the SMF 2.2 (3.0) wish list?

I have no idea. I've been asking for the roadmap for almost a year. There was even a period I was prepared to actually contribute code, but since I have no idea what's actually in the roadmap (I'm told there is one), I've sort of given up waiting to learn what is worth actually contributing.

spiros

Quote from: Col on February 23, 2023, 06:09:39 PM
So, if I understand you correctly, the delta index can be rerun quite quickly (minimal interruption). And then, once in a while, a total rebuild of the search database using the whole post table. Yeah, I can see that being more workable, but it remains messy and disruptive (even if to a lesser degree).

I have Delta run every 2 mins and main index once a day. Works just fine for me.

Col

Quote from: spiros on February 24, 2023, 04:30:24 AM
Quote from: Col on February 23, 2023, 06:09:39 PM
So, if I understand you correctly, the delta index can be rerun quite quickly (minimal interruption). And then, once in a while, a total rebuild of the search database using the whole post table. Yeah, I can see that being more workable, but it remains messy and disruptive (even if to a lesser degree).

I have Delta run every 2 mins and main index once a day. Works just fine for me.

How many forum posts in total, and how many posts per day? How long does the total rebuild take when there are a few million posts to reindex?

Arantor

Note that search isn't down while the re-index happens: you do the reindex, then switch the indexes over. At least that's how it was when I tidied up the Sphinx docs on that subject (I was a docs contributor in 2008 or so...)

Col

I thought the delta index was the difference between the main index (of old posts) and new posts? My reading of your posts is that there are two full indexes, and they alternate? Yes?

How is the re-indexing triggered?

Arantor

Usually by a cron job, but it doesn't impact search because search keeps using the old index until the indexer is done and sends a message to search to switch over to the new index (and remove the old one).

They haven't updated the docs for 3.x yet but the same docs in 2.x should be relevant: http://sphinxsearch.com/docs/manual-2.3.2.html#delta-updates
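The "keep serving the old index until the new one is ready, then switch" behaviour described above can be illustrated with a small Python sketch. It is a generic build-then-swap pattern under assumed names (`SwappableIndex`, `reindex`), not how Sphinx implements it internally.

```python
import threading


class SwappableIndex:
    """Searches always read the live index; a rebuild happens on the
    side and replaces the live index only once it is complete."""

    def __init__(self, posts):
        self._lock = threading.Lock()
        self._live = self._build(posts)

    @staticmethod
    def _build(posts):
        # word -> set of post ids
        index = {}
        for pid, text in posts.items():
            for word in set(text.lower().split()):
                index.setdefault(word, set()).add(pid)
        return index

    def search(self, word):
        with self._lock:
            live = self._live  # grab whichever index is current
        return sorted(live.get(word.lower(), ()))

    def reindex(self, posts):
        fresh = self._build(posts)  # slow part; searches still use the old index
        with self._lock:
            self._live = fresh      # quick swap; the old index is then discarded
```

A cron job (or any scheduler) would call `reindex(...)`; searches running at the same time are never blocked by the slow build, only briefly by the swap.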

Col

Thanks, Arantor. That seems much more workable than I feared. I'll have a read through that and see if it is something I can set up. Though, by the looks of it, if I cannot, I should be able to find someone to install and set it up, and it should take care of itself.
