Design: HSEARCH-1032 MassIndexer with a live update mechanism

Friday, 12 July 2013

Current priorities on Search are:
 - Infinispan IndexManager -> me
 - Metadata API -> Hardy
 - Multitenancy (aka dynamic Sharding) -> me + Emmanuel + Dimitrios

Those are all important as they represent hard requirements for other
projects, but I'd also like to consider at least the basic design for
how the MassIndexer could operate in "update mode": a highly requested
mode in which it re-synchronizes the index with the database without
first wiping out the index. Wiping it creates a time window during
which the application returns incomplete query results.

# Reminder on current design:
 1- deletes the current index
 2- scrolls on all entities and uses ADD index operations to add them all again

There are two basic approaches on the table (other ideas welcome):
  - #A Use UPDATE index operations instead, skipping the initial delete
  - #B Rebuild the index in a secondary directory, then switch

Let's explore them:

#A Use UPDATE index operations instead, skipping the initial delete

## what
Technically an UPDATE operation is - in Lucene terms - an atomic
(delete+add). The benefit is that each query sees either the previous
document or the updated one: the doc can never be skipped, because the
changes cannot be flushed between the delete and the add operation.
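
To illustrate the visibility argument, here is a minimal toy model in
plain Java. It is not Lucene code: ToyIndex and its methods are
hypothetical names, and "flush" simply stands for the moment at which
pending changes become visible to queries.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model, NOT Lucene: queries only see the last flushed snapshot.
class ToyIndex {
    private final Map<String, String> pending = new HashMap<>();
    private Map<String, String> visible = new HashMap<>();

    void add(String id, String doc) { pending.put(id, doc); }

    void delete(String id) { pending.remove(id); }

    // UPDATE = delete + add, with no chance to flush in between.
    void update(String id, String doc) {
        pending.remove(id);
        pending.put(id, doc);
    }

    // Publish all pending changes to queries.
    void flush() { visible = new HashMap<>(pending); }

    // What a query would see for this id, or null if nothing matches.
    String search(String id) { return visible.get(id); }
}
```

With separate delete and add calls a flush can land between the two,
and a query then finds nothing for that id; with update() a query sees
either the old or the new document.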

## performance
The reason the current design deletes all elements at the start of the
process is that this is a very efficient operation: it targets a
single term (the class name field) or in some cases targets the whole
index, in which case it just needs to delete all segment files.
A delete operation on a per-document basis, instead of per class,
very likely needs a deletion on multiple terms (which is not efficient
at all as it needs IO to seek across multiple disk positions), and of
course the worst point is that it triggers a delete operation for each
and every entity. To compare, a single ADD doesn't need any disk seek
as we can pack multiple operations in one - until the buffer is full -
but any single delete requires N disk seeks (N is not directly the
number of fields but is proportional to it).
Based on this, and on experience benchmarking the #index() method,
I'm expecting the UPDATE strategy to be approximately a thousand times
slower than the current MassIndexer implementation. Considering that
for some it takes a couple of hours, going to 2000 hours is maybe not
an option :-) (that's nearly 3 months)
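
The back-of-the-envelope arithmetic can be spelled out as below. The
1000x factor is only the rough expectation stated above, not a
measurement, and UpdateCostEstimate is a hypothetical helper name.

```java
// Rough estimate only: the slowdown factor is an expectation, not a benchmark.
class UpdateCostEstimate {
    // Hours needed if the UPDATE strategy is `slowdown` times slower.
    static double updateHours(double currentHours, double slowdown) {
        return currentHours * slowdown;
    }

    // Convert hours to days of wall-clock time.
    static double hoursToDays(double hours) {
        return hours / 24.0;
    }
}
```

For a 2 hour run and a factor of 1000 this gives 2000 hours, which is
a bit more than 83 days.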

## left over entries
Another problem is that if we only scroll on all entities from the
database, we never delete the documents in the index whose entity no
longer exists in the database.
So we would need a final phase in which we run the inverse iteration:
for each element in the index, verify if there is a match in the
database; sounds like an ugly lot of queries, even if we batch it in
verification blocks.
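
A minimal sketch of what such a batched verification phase could look
like, in plain Java. StaleEntryFinder and the existingInDb callback are
hypothetical: the callback stands for one database query per batch
(for example a "WHERE id IN (...)" lookup) returning the ids that
still exist.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Sketch only: finds ids present in the index but gone from the database.
class StaleEntryFinder {
    static <T> List<T> findStale(List<T> indexedIds, int batchSize,
                                 Function<List<T>, Set<T>> existingInDb) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> stale = new ArrayList<>();
        for (int from = 0; from < indexedIds.size(); from += batchSize) {
            int to = Math.min(from + batchSize, indexedIds.size());
            List<T> batch = indexedIds.subList(from, to);
            // One database round trip per verification block.
            Set<T> found = existingInDb.apply(batch);
            for (T id : batch) {
                if (!found.contains(id)) {
                    stale.add(id); // candidate for a delete from the index
                }
            }
        }
        return stale;
    }
}
```

Even with batching this still needs one query per block, so the number
of queries grows linearly with the size of the index.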

Bottom line: looks messy.

#B Rebuild the index in a secondary directory, then switch

## performance
No big concerns, but we assume there is enough space for at least four
times the size of the index (because we normally need twice the size
of an index to be able to compact it, and we have two indexes to
manage).

## design
The good part is that we can reuse most of the existing MassIndexer,
but transactional changes (those applied by the application during a
reindexing) need to be directed to both indexes: applied to the one
in use until the rebuild is complete, so that queries stay
consistent, and also enqueued for the one being built, so that they
don't get lost in case they apply to documents which have already been
indexed.
The queue handling is tricky, because in that case further additions
actually need to be updates, unless we can keep them on hold in a
buffer to be applied on the pristine index; that could take quite some
memory, depending on the amount of changes flying in during the
mass indexing. If the queue grows beyond reason we'll need to either
apply backpressure on the transactions, offload to disk, or change to
an update strategy for the remaining mass indexing process. None of
these are desirable, but I guess people could tune to make this
condition unlikely.
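
To make the routing idea concrete, here is a minimal single-threaded
sketch in plain Java. RebuildRouter and all its members are
hypothetical names, the two indexes are modelled as simple maps, and
the queue is unbounded: none of the backpressure or offloading
concerns are handled.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch only: no concurrency control, no queue size limit.
class RebuildRouter {
    private Map<String, String> active = new HashMap<>();  // serves queries
    private Map<String, String> rebuild = new HashMap<>(); // being built
    private final Queue<String[]> pending = new ArrayDeque<>();
    private boolean rebuilding = false;

    void startRebuild() {
        rebuild = new HashMap<>();
        pending.clear();
        rebuilding = true;
    }

    // Transactional change from the application; a null doc means delete.
    void applyChange(String id, String doc) {
        apply(active, id, doc);                  // keep queries consistent
        if (rebuilding) {
            pending.add(new String[] {id, doc}); // hold for the new index
        }
    }

    // Called by the mass indexer: a plain ADD on the pristine index.
    void massIndexerAdd(String id, String doc) {
        rebuild.put(id, doc);
    }

    // Replay the held changes as updates, then switch indexes.
    void finishRebuild() {
        String[] change;
        while ((change = pending.poll()) != null) {
            apply(rebuild, change[0], change[1]);
        }
        active = rebuild;
        rebuild = new HashMap<>();
        rebuilding = false;
    }

    String search(String id) { return active.get(id); }

    private static void apply(Map<String, String> index, String id, String doc) {
        if (doc == null) {
            index.remove(id);
        } else {
            index.put(id, doc);
        }
    }
}
```

Replaying the held changes after the scroll is what turns them into
updates: they overwrite whatever the mass indexer wrote for the same
id, including entries it read before the change was applied.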

## SPI changes
With this design we need to be able to:
 - dynamically instantiate a second Directory in a different path
 - switch to delegate writes to both directories / one directory
 - control from where Readers are opened
 - make sure closed Readers go back to the original pool where they
come from as their reference source could have been changed
 - be able to switch (permanently) to a different active index
 - destroy old index

I'm afraid each of these can affect our SPIs; likely at least
IndexManager. I hope we can have all the logic in "behind the scenes"
code which drives the same SPIs as of today but I'd need a POC to
verify this.

## Directory index path
If we switch from one Directory to another - thinking about the
FSDirectory - we're either violating the path configuration options
from the user or we need to move the new index into the configured
position when done. If the above sounds a bit complex, I'm actually
more concerned about implementing such an atomic move on the
filesystem.
I guess we could agree that if the user configured an index to be in -
say - "/var/lucene/persons" we could store the indexes in
"/var/lucene/persons/index-a" and "/var/lucene/persons/index-b",
alternating in a similar way to the FSMasterDirectoryProvider, but
that takes away some control over the index position and is not
backwards compatible. Would this be acceptable?
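
As a minimal sketch of the alternating layout, assuming a small marker
file (here named "current", my own invention) records which of the two
subdirectories is active. AlternatingIndexLayout is a hypothetical
class, not an existing Hibernate Search one; only the marker file is
moved, never the index itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch only: the "current" marker file is an assumption of this example.
class AlternatingIndexLayout {
    private final Path base; // e.g. the user-configured /var/lucene/persons

    AlternatingIndexLayout(Path base) { this.base = base; }

    // Directory currently serving queries; defaults to index-a.
    Path active() throws IOException {
        Path marker = base.resolve("current");
        if (!Files.exists(marker)) {
            return base.resolve("index-a");
        }
        String name = new String(Files.readAllBytes(marker),
                                 StandardCharsets.UTF_8).trim();
        return base.resolve(name);
    }

    // Directory in which the next rebuild would happen.
    Path inactive() throws IOException {
        boolean aIsActive = active().getFileName().toString().equals("index-a");
        return base.resolve(aIsActive ? "index-b" : "index-a");
    }

    // Flip the marker: write a temp file, then rename it over the marker.
    // Whether an atomic move replaces an existing target is implementation
    // specific according to the Files.move documentation.
    void switchActive() throws IOException {
        Path next = inactive();
        Files.createDirectories(base);
        Path tmp = base.resolve("current.tmp");
        Files.write(tmp, next.getFileName().toString()
                             .getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, base.resolve("current"),
                   StandardCopyOption.ATOMIC_MOVE);
    }
}
```

This sidesteps the atomic move of a whole index directory, at the cost
described above: the index no longer lives exactly at the configured
path.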

# Timeline
This might need to be moved to 5.0 because of the various backwards
compatibility concerns - ideally, if some community users feel like
participating, we could share some early code in experimental branches
and work together.

Comments and better ideas welcome :)
Sanne
