Default search, Solr, importance of full-text search?
-
All the cool kids seem to be using Elasticsearch nowadays. Did you try something like:
https://github.com/q8888620002/nodebb-plugin-elasticsearch
But I agree with you on the importance of a properly functioning forum/site search.
-
We tried Solr and didn't find it any better than the default. They were both quite bad. So we settled on the default because it doesn't break from time to time like Solr did.
-
I maintain the Solr plugin, and I use "maintain" loosely, because it's been working so far, but for all intents and purposes we do prefer dbsearch over solr.
Can you share any specific complaints about dbsearch? If there are obvious deficiencies, then we can address them, but if your complaint is "it is not as good as Google", then I'm afraid I have some bad news for you...
-
@julian said in Default search, Solr, importance of full-text search?:
I maintain the Solr plugin, and I use "maintain" loosely, because it's been working so far, but for all intents and purposes we do prefer dbsearch over solr.
Can you share any specific complaints about dbsearch? If there are obvious deficiencies, then we can address them, but if your complaint is "it is not as good as Google", then I'm afraid I have some bad news for you...
I intentionally kept away from complaining about the forum search itself, because I know enough about full-text search to be sure it's not a side project that you do in passing. Just have a look at Solr and how much work has been put into it; in my experience, it provides high-quality search results. No need for Google.
But I have now enabled dbsearch again for https://forums.bitfire.at/category/4/davdroid. Let's look at some actual search queries (taken from the logs):
- problem with gmx – dbsearch shows results for "problem with", while the more important part is "gmx". There are 533 results, and none of those on the first page even contain "gmx". There are many postings that actually contain "gmx" and should be shown instead.
- google calendar – in this case, the first result is good, but the others are completely unrelated and other, better-matching results are not shown.
- http 405 – the first result is "http not supported" and doesn't even contain "405". The other results are unrelated. Better threads like https://forums.bitfire.at/topic/977/no-addressbooks-or-calendars-were-found-baikal and https://forums.bitfire.at/topic/1153/cannot-get-service-discovery-working-synology-and-baikal/4, which contain "http" and "405", are not shown.
- xiaomi no synchronization – only one result on the first page contains the most important keyword in this query, "xiaomi". The other results are completely unrelated. The results for only xiaomi are very different and better, but that's not what has been searched.
(Update: We have now switched to Solr again, the search results are now Solr results)
etc. etc. …
As said above, this is not a complaint about how bad dbsearch is, but I'm just asking whether it wouldn't be better to outsource the very complex matter of full-text search to a specialized engine (like Solr, which in my experience is able to provide much better results) and support that officially.
-
@baris we're using mongodb:
{ "db": "nodebb", "collections": 4, "objects": 398367, "avgObjSize": 198.98472263013753, "dataSize": 79268947, "storageSize": 51212288, "numExtents": 0, "indexes": 12, "indexSize": 35684352, "ok": 1, "mem": { "bits": 64, "resident": "0.485", "virtual": "0.753", "supported": true, "mapped": "0.000", "mappedWithJournal": 0 }, "collectionData": [ { "name": "nodebb.sessions", "count": 63393, "size": 15730504, "avgObjSize": 248, "storageSize": 13611008, "totalIndexSize": 4464640, "indexSizes": { "_id_": 3674112, "expires_1": 790528 } }, { "name": "nodebb.searchpost", "count": 8553, "size": 6685401, "avgObjSize": 781, "storageSize": 6701056, "totalIndexSize": 14737408, "indexSizes": { "_id_": 176128, "content_text_uid_1_cid_1": 14401536, "id_1": 159744 } }, { "name": "nodebb.objects", "count": 324940, "size": 56687082, "avgObjSize": 174, "storageSize": 30662656, "totalIndexSize": 16007168, "indexSizes": { "_id_": 4759552, "_key_1_value_-1": 5832704, "expireAt_1": 1290240, "_key_1_score_-1": 4124672 } }, { "name": "nodebb.searchtopic", "count": 1481, "size": 165960, "avgObjSize": 112, "storageSize": 237568, "totalIndexSize": 475136, "indexSizes": { "_id_": 53248, "id_1": 53248, "content_text_uid_1_cid_1": 368640 } } ], "network": { "bytesIn": 12537857266, "bytesOut": 31283903688, "numRequests": 50159150 } }
-
@rfc2822 The fact that Solr itself is subjectively (maybe objectively) better at search is actually the primary motivator for me creating the plugin in the first place.
However, the big downside is that Solr is a beast... it requires more resources than NodeBB itself does, and runs on Java+Tomcat, which I have next to no experience debugging.
Are you sure you are sorting results by relevancy, and not by, say... post time?
-
@julian said in Default search, Solr, importance of full-text search?:
@rfc2822 The fact that Solr itself is subjectively (maybe objectively) better at search is actually the primary motivator for me creating the plugin in the first place.
However, the big downside is that Solr is a beast... it requires more resources than NodeBB itself does, and runs on Java+Tomcat, which I have next to no experience debugging.
This is the main reason why I really tried dbsearch for a long time before I decided to give Solr a try. I didn't want it, had to set up an extra VM, learn the config file syntax etc… but in the end, I had search results which were good and I could recommend that users use the search function. Then I noticed that new postings were not indexed, and then I began to wonder how other people manage search, because it couldn't be such a rare requirement to have working full-text search?
Are you sure you are sorting results by relevancy, and not by, say... post time?
I have just clicked on the default search icon on top of the page and entered the queries, as most people do. I have linked the queries in my previous posting for reference, just have a look. The (default) URL parameters seem to be "in=titlesposts&sortBy=relevance&sortDirection=desc&showAs=posts", so yes, these results should be sorted by descending relevance.
-
Yeah I think part of the problem is if you search for http 403 it shows matches for http or 403, while searching for "http 403" only returns one result.
Also, searching in Titles and posts puts topic matches at the top, so searching for just posts or just titles might lead to better results.
-
But a normal user would expect a post with both http and 403 to rank much higher than a post with only one of those terms. Actually I would only expect results with both terms.
It would also be nice if age was taken into account when calculating relevancy. I'm currently only searching with sort set to 'last reply time'. The state of the world in 2014 isn't that relevant to me if there are newer search results.
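To make that expectation concrete, here is a minimal sketch in plain Python. It is not how dbsearch, MongoDB or Solr actually score results; the function names, the squared-coverage weighting and the one-year half-life are all assumptions chosen for illustration. A post that contains every query term scores far above one that contains only some, and older posts are gradually discounted.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, post_text, age_days, half_life_days=365.0):
    """Hypothetical relevance score: term coverage, discounted by post age."""
    terms = set(tokenize(query))
    if not terms:
        return 0.0
    words = set(tokenize(post_text))
    # Fraction of distinct query terms that occur in the post.
    coverage = len(terms & words) / len(terms)
    # The score halves every half_life_days (an arbitrary assumption).
    recency = 0.5 ** (age_days / half_life_days)
    # Squaring the coverage pushes partial matches well below full matches.
    return coverage ** 2 * recency

# Toy data: (post text, age in days). Made up for this example.
posts = [
    ("http not supported on this server", 30),
    ("server answers http 405 on PROPFIND", 400),
    ("error 405 after update", 10),
]

ranked = sorted(posts, key=lambda p: score("http 405", p[0], p[1]), reverse=True)
for text, age in ranked:
    print(round(score("http 405", text, age), 3), text)
```

With these toy posts, the 400-day-old post that contains both terms still ranks first, ahead of the two recent posts that each match only one term. How strongly age should count is a tuning decision: with this half-life, a full match older than about two years would drop below a brand-new half match.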
-
Hah well that's a whole other conversation as to how you prefer to define "relevancy".
In some contexts, time and date are significant... in other contexts, maybe not.
That said, if Solr was working fine for you, and it wasn't indexing, then there was something wrong with the configuration... it should automatically index new posts, just like dbsearch does.
-
@julian said in Default search, Solr, importance of full-text search?:
That said, if Solr was working fine for you, and it wasn't indexing, then there was something wrong with the configuration... it should automatically index new posts, just like dbsearch does.
Well it didn't, although it said the opposite (I even tried to turn off the setting to be sure).
So… is there any chance that full-text search will be rethought in NodeBB and my suggestions will be taken into consideration? Shall I open this topic elsewhere (issues, mailing list, …)?
-
@bartvb I have just tried Elasticsearch. The default Elasticsearch plugin included by NodeBB doesn't even allow (re-)indexing all posts … which means manual importing. No …
The question is: shall I try to improve the Elasticsearch plugin, creating the 10th unmaintained fork which then works on a single instance (mine) until the next major NodeBB update, while overall search is still horrible for 99% of NodeBB users? That's what I wanted to avoid. But it seems like interest in working full-text search is quite low.
-
@rfc2822 My guess is that the NodeBB project gains the most when nodebb-plugin-dbsearch is improved. I did a quick check of the MongoDB docs and it seems like there is quite a bit of room for improvement when it comes to the implementation of MongoDB fulltext search in NodeBB.
Performance is not really the issue here; I'm guessing that normal MongoDB operations will come crashing down before MongoDB gives up on full-text search on really big boards. Which leaves search quality.
As far as I can tell things would already be quite a bit better if by default all searches used an AND operator between all terms and if a term like 'site.com' weren't broken up into 'site' and 'com'. And while you're at it, it would be nice if relevance took age into account.
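As a rough illustration of the first two suggestions, the sketch below uses plain Python rather than MongoDB or dbsearch code. The regular expression and the function names are assumptions made up for this example. The tokenizer keeps a term like 'site.com' as a single token, and the match function applies an implicit AND across all query terms.

```python
import re

# Letters/digits, optionally joined by inner dots or hyphens, form one token,
# so "site.com" stays whole while a trailing full stop is dropped.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[.\-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and return its tokens."""
    return TOKEN_RE.findall(text.lower())

def matches_all(query, post_text):
    """Implicit AND: every query term must occur in the post."""
    words = set(tokenize(post_text))
    return all(term in words for term in tokenize(query))

print(tokenize("Login fails on site.com."))
print(matches_all("site.com login", "Login fails on site.com."))
print(matches_all("site.com login", "The site is down, check the com port"))
print(matches_all("http 405", "http not supported"))
```

The second post contains both 'site' and 'com' as separate words, but because 'site.com' is treated as one term it is not a match. The trade-off is that a search for just 'site' would no longer find posts that only mention 'site.com', unless both forms are indexed.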
-
For reference: https://github.com/julianlam/nodebb-plugin-solr/issues/34
-
@bartvb said in Default search, Solr, importance of full-text search?:
@rfc2822 My guess is that the NodeBB project gains the most when nodebb-plugin-dbsearch is improved. I did a quick check of the MongoDB docs and it seems like there is quite a bit of room for improvement when it comes to the implementation of MongoDB fulltext search in NodeBB.
My experience with database full-text indices is bad. MySQL has a full-text index, too, and I have used it for several projects. It's OK, but the results are not as good as those from specialized search engines such as Solr. What the default NodeBB Mongo-based full-text search produces confirms my prejudices. I also remember what happens when other projects (CMSs like Typo3) try to build their own full-text search. It's just horrible and in the end, you still have to use a public search engine with site:.... to find what you're looking for.
Good full-text searching is an extremely complicated task. Why not leave it to specialized projects? I just don't understand why every project needs its own unusable low-quality full-text search instead of
- making use of public search engines (in the end, this is what most people do – redirect to Google), and
- making use of already available, specialized open-source search engines like Solr and Elastic Search if this is not an option?
Of course, the MongoDB full-text index could still be available, but not as the only officially supported and advertised solution.
In my opinion, a side project of a database ("oh yes, full-text index would be cool, let's add it quickly") can never provide acceptable results because the topic is too complex. I also guess that the people who have developed Solr are not idiots who like to waste their time for nothing. (Ok, this applies to MongoDB people, too, but maybe the MongoDB full-text index is for use cases where high-quality full-text search is not as important as for a forum.)
-
The implementation here is a "good enough" solution for small boards, where dbsearch is more than adequate.
While I admire your quest for a "good" search engine for NodeBB, one simply won't exist natively here; a third-party solution must be utilised. We will try our best to tweak the dbsearch results (e.g. implicit AND being one change we ought to adopt), but I personally am hoping for someone to develop an amazing search engine for Node.js so I can use the library as a module.
However, that may well be unfeasible for a variety of reasons, Node.js not being the right tool being just one of those.
-
@rfc2822 said in Default search, Solr, importance of full-text search?:
Good full-text searching is an extremely complicated task.
This is so true. I do believe you can approach Google-quality search results with a specialized search engine (because within a forum, you have a lot more context on what is important to your users), but it takes some science, metadata, and lots of tweaks to get good results. Good full text search isn't really just full text, it's a combination of a ton of other data, which is why Google is so good at it. The whitepaper on Google search is a taste of the complexity that goes into how they produce such good results.
-
@bri Searching a manageable number of postings (which should contain real text content, mostly without markup and all about a certain topic) should still be easier than indexing the whole Internet (and ranking an endless number of pages which are highly "optimized" with SEO to be shown first etc.). Also, things like PageRank are not really applicable to a forum because a posting doesn't have to be linked often to be valuable.
Just had an idea: with Solr, even attachments (of postings) could be indexed.
-
@rfc2822 said in Default search, Solr, importance of full-text search?:
a posting doesn't have to be linked often to be valuable
Well, yes, I was using that as an example. I didn't mean you need to literally use page rank in a forum search engine, that doesn't make sense.