[hibernate-dev] [Search] DisjunctionMaxQuery and MoreLikeThis

Thu Feb 20 04:19:19 EST 2014

I have been thinking about our initial idea to use DisjunctionMaxQuery
(aka DisMax) with MoreLikeThis instead of the Boolean query we have
today.

## Definition and landscape

Amongst a set of subqueries under a SHOULD clause, DisMax scores a
matching document with the score of its highest-scoring subquery
(instead of adding up the scores of all subqueries).
A concrete use case is as follows. If the query is "albino elephant",
this ensures that "albino" matching one field and "elephant" matching
another gets a higher score than "albino" matching both fields.

Each term (albino and elephant) has a DisMax query where the subqueries
are a term query for each targeted field. Then both DisMax queries are
joined with a regular boolean query.

In pseudo HSearch query DSL it would look like:

    .bool()
      .should(
        .dismax()
          .should(
            .keyword().onField("title").matching("Albinos")
          )
          .should(
            .keyword().onField("description").matching("Albinos")
          )
      )
      .should(
        .dismax()
          .should(
            .keyword().onField("title").matching("Elephant")
          )
          .should(
            .keyword().onField("description").matching("Elephant")
          )
      )

## More Like This (aka MLT)

Our More Like This algorithm does the following:

- look for the term vectors of a document i
- for each field contained in document i (or a subset)
  - find the most popular terms in the field f of document i
  - build a boolean query with the most popular terms on field f
- combine these boolean queries per field into a bigger boolean query
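
The steps above can be sketched in plain Java as follows. This is only an illustration: it assumes "most popular" means highest raw term frequency (the real selection may weigh terms differently), it represents term vectors as simple maps, and it renders each clause as a `field:term` string instead of a real query. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerFieldTopTerms {

    // Returns the n most frequent terms of one field, most frequent first;
    // ties are broken alphabetically so the result is deterministic.
    static List<String> topTerms(Map<String, Integer> termFrequencies, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(termFrequencies.entrySet());
        entries.sort((a, b) -> {
            int byFrequency = Integer.compare(b.getValue(), a.getValue());
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    // For each field, lists the "field:term" clauses of that field's boolean query.
    // The map of all fields stands in for the bigger, combined boolean query.
    static Map<String, List<String>> clausesPerField(
            Map<String, Map<String, Integer>> termVectors, int n) {
        Map<String, List<String>> clauses = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> field : termVectors.entrySet()) {
            List<String> fieldClauses = new ArrayList<>();
            for (String term : topTerms(field.getValue(), n)) {
                fieldClauses.add(field.getKey() + ":" + term);
            }
            clauses.put(field.getKey(), fieldClauses);
        }
        return clauses;
    }
}
```

Note that a term is only ever paired with the field it was found in, so two fields can end up with completely different terms.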

The original Lucene More Like This algorithm is a bit different in the
sense that it does not look for popular terms *per field* but rather
looks for all-star popular terms across document i and then builds a
boolean query with the most popular terms for each field.
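
For contrast, here is a rough sketch of that document-wide variant, following only the description above and not the actual Lucene code. As an assumption for the example, popularity is the term frequency summed over all fields, and clauses are again rendered as `field:term` strings; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocumentWideTopTerms {

    // Picks the n most frequent terms of the whole document, then emits one
    // "field:term" clause per field for each of them.
    static List<String> clauses(Map<String, Map<String, Integer>> termVectors, int n) {
        // Merge the per-field frequencies into document-wide totals.
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> fieldTerms : termVectors.values()) {
            for (Map.Entry<String, Integer> e : fieldTerms.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // Sort by total frequency, highest first; ties broken alphabetically.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byFrequency = Integer.compare(b.getValue(), a.getValue());
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
        });
        // Every selected term is searched in every field, even one it never appeared in.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            for (String field : termVectors.keySet()) {
                result.add(field + ":" + entries.get(i).getKey());
            }
        }
        return result;
    }
}
```

Here the same set of terms is looked for across all fields, which is the situation the albino elephant example relies on.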

## More Like This and DisMax

With our MLT approach, terms between fields are not necessarily
shared. In fact they are only looked for if they belong to the field f
of document i in the first place.
I don't see how DisMax would be of any use for us as we don't have a
common set of terms that we look for across several fields. At least not
to solve the now famous albinos elephant problem.

We could use DisMax for the final top boolean query. The effect would be
that documents are scored up to the highest lookalike-factor of their
best field as opposed to the cumulative lookalike-ness of all fields.
Is that desirable? It does not look like it. I would naturally use boost
factors between fields to express their respective importance but still
want to find matching documents across all fields.

Thoughts?

## DisMax and our current keyword matching

It would make some sense I think to offer DisMax for our current keyword
matching queries.

    .keyword().onFields("title", "description").matching("Albinos Elephant")

In this case **and assuming the same analyzer for both fields**, we
could use DisMax to essentially do

    .bool()
      .should(
        .dismax()
          .should( keyword().onField("title").matching("Albinos") )
          .should( keyword().onField("description").matching("Albinos") )
      )
      .should(
        .dismax()
          .should( keyword().onField("title").matching("Elephant") )
          .should( keyword().onField("description").matching("Elephant") )
      )

I am not sure what we would call that effect:

- .favorMultipleKeywordMatching()
- .decreaseCrossFieldKeywordImportanceBy(90%) // for the curious, this number is 1 - DisMax tieBreakerMultiplier; 100% is what I have described above
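
To illustrate how such a percentage would map to the tie breaker, here is a small plain-Java sketch. The scoring formula is the usual DisMax one (best sub-score plus the tie breaker times the sum of the other sub-scores); the class and method names are hypothetical and only mirror the option name proposed above.

```java
public class CrossFieldImportance {

    // Converts the "decrease by" fraction (0.9 for 90%) into a DisMax tie breaker.
    static double tieBreakerFor(double decrease) {
        return 1.0 - decrease;
    }

    // DisMax score: the best sub-score plus tieBreaker times the remaining sub-scores.
    static double disMax(double tieBreaker, double... scores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : scores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // A term scoring 2.0 on title and 1.0 on description.
        System.out.println(disMax(tieBreakerFor(1.0), 2.0, 1.0)); // 2.0: pure max, 100% decrease
        System.out.println(disMax(tieBreakerFor(0.5), 2.0, 1.0)); // 2.5: halfway
        System.out.println(disMax(tieBreakerFor(0.0), 2.0, 1.0)); // 3.0: same as a plain sum
    }
}
```

A decrease of 100% gives the pure DisMax behaviour, and a decrease of 0% degenerates into the sum a boolean query would compute.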

## DisMax as top level DSL feature

Should we add .dismax() like we did .bool()?
I am hard pressed to find a use case.

Emmanuel