Many excellent ideas here, thanks.
Answering the points raised, in reverse order:
# Depending on Solr
Yes, I'm indeed very happy to have removed this dependency. Not least,
from a product perspective, it forced us to address security issues in
its various web/servlet components: strict QA checks happen on
anything we need to ship, but these components are not relevant to how
we expect people to use our framework, so they were just a parasite on
our time, which can otherwise be dedicated to more interesting aspects.
If some complex machinery from Solr is needed to take full advantage
of DisMax even outside the scope of a Solr server, ultimately we
should propose patches to Apache Lucene to move it into a more
suitable lucene-query.jar, but we can certainly start playing with it
by reimplementing or copying a couple of classes.
# Providing DSL support for DisjunctionMaxQuery
Yes, I agree it's very interesting and not necessarily coupled to MLT:
it has its own issue, HSEARCH-665, and I didn't mean to suggest that
MLT requires DisMax, sorry for the confusion. Let's treat HSEARCH-665
independently: it is not a blocker for MLT.
Guillaume: it sounds like you have solid experience with this feature.
If you are still considering coaching an intern on such a subject,
note that the Hibernate project participates in GSOC [1], so we could
get a smart student who is paid for the work. It's a bit late, but we
still have time to suggest subjects for this year: if you or anyone
else is interested in being a mentor, please get in touch with me.
I don't think implementing just DisMax support has enough meat to
keep a good student busy for months, but it could be one aspect of
a slightly more complex goal.
# Bringing MLT home
I still suspect that a DisMax approach would provide a better scoring
model, but this is an implementation detail we should iterate on in a
second phase.
Essentially, taking the example of "albino elephants", I agree with the
behaviour you described, but I think there are some additional aspects
to consider when you evaluate how a partial match "albino" scores
against a full match "albino elephant" in a single field rather than
split across fields, or how "albino" could score less in field A than
in field B, so that even swapping the positions of terms on different
fields could produce a less valuable match.
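As a rough illustration of the kind of difference I mean, here is a
toy sketch in plain Java. It does not use Lucene at all: the per-field
scores are made-up numbers and the class and method names are
hypothetical. It only compares a naive sum of per-field scores with a
DisMax-style combination, where the best field score is taken and the
remaining fields contribute only through a tie-breaker multiplier
(the formula DisjunctionMaxQuery documents).

```java
import java.util.Arrays;

// Toy model only: no Lucene involved, all scores below are invented.
class DisMaxSketch {

    // Naive combination: add up whatever each field scored.
    static double sum(double[] fieldScores) {
        return Arrays.stream(fieldScores).sum();
    }

    // DisMax-style combination: the best field score, plus
    // tieBreaker times the sum of the scores of the other fields.
    static double disMax(double[] fieldScores, double tieBreaker) {
        double max = Arrays.stream(fieldScores).max().orElse(0.0);
        return max + tieBreaker * (sum(fieldScores) - max);
    }

    public static void main(String[] args) {
        // Assumed scores for {field A, field B}.
        // doc1: "albino elephant" fully matched in field A only.
        double[] doc1 = {0.8, 0.0};
        // doc2: "albino" in field A, "elephant" in field B, each a weaker hit.
        double[] doc2 = {0.45, 0.45};
        double tieBreaker = 0.1; // arbitrary value for the illustration

        // With the naive sum the split match (doc2) ranks first;
        // with the DisMax-style score the single-field match (doc1) does.
        System.out.println("sum:    doc1=" + sum(doc1) + " doc2=" + sum(doc2));
        System.out.println("dismax: doc1=" + disMax(doc1, tieBreaker)
                + " doc2=" + disMax(doc2, tieBreaker));
    }
}
```

With a tie-breaker of 0 only the best field counts, and with 1 this
degenerates into the plain sum, so the multiplier is the knob we would
want to experiment with on real data.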
This is probably better explained with an example on a larger data set,
but alas I won't be able to craft one soon. Still, it's not a blocker
at all, as in this first phase I think we should 1) have a working
solution and 2) focus on API effectiveness. Performance and a
sophisticated scoring system will necessarily have to follow: I'm
unpacking a large data set to play with, and I'm pretty sure we'll
have plenty of follow-up improvements.
Emmanuel: if you can address the TODOs in the pull request I'd merge
it; if you don't have time for that, could we work on top of your
commits?
-- Sanne
[1] https://community.jboss.org/wiki/GSOC13Ideas#jive_content_id_Hibernate