[hibernate-dev] Re: [Search] making updates to the indexes concurrently

Thu Nov 20 18:58:53 EST 2008

Sorry, I wrote something stupid:
"3)use a size equal to the number of DirectoryProviders (this is the
optimal value, could be the default and be overridden by a parameter)"

is not true: the number of DirectoryProviders is not related to the
optimal pool size. Please discard this option.
I think the best option would be to have a separate
Executor for each DirectoryProvider (DP): otherwise a slowly
reacting index could block correct operation of the others,
as many queues could pile up targeting the same DP, exhausting
the threads, which would all stay locked, so the pool would
collapse to a single-threaded model.

This makes me think we should create one executor
per DP, each one using just one thread: an additional benefit is
that no locking would be needed, so we can remove all barriers
in the backend (unless batch mode enables concurrent usage of the IndexWriter).
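
A minimal sketch of the one-executor-per-DP idea, assuming plain
java.util.concurrent. The class name PerProviderExecutors and the String
keys standing in for DirectoryProvider instances are hypothetical, and this
is not the Hibernate Search implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: one single-threaded executor per index (DP). */
class PerProviderExecutors {

    // Keyed by an illustrative provider name; real code would key on the
    // DirectoryProvider instance itself.
    private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

    /**
     * Queues work for one index. All work for the same index runs in
     * submission order on that index's single thread, so a slow index
     * only delays its own queue.
     */
    Future<?> submit(String providerName, Runnable indexWork) {
        ExecutorService executor = executors.computeIfAbsent(
                providerName, name -> Executors.newSingleThreadExecutor());
        return executor.submit(indexWork);
    }

    /**
     * Stops accepting new work and waits for every queue to drain.
     * The timeout applies to each executor in turn, not to the total.
     */
    boolean shutdownAndAwait(long timeout, TimeUnit unit) throws InterruptedException {
        for (ExecutorService executor : executors.values()) {
            executor.shutdown();
        }
        boolean allTerminated = true;
        for (ExecutorService executor : executors.values()) {
            allTerminated &= executor.awaitTermination(timeout, unit);
        }
        return allTerminated;
    }
}
```

Because each index is only ever touched by its own thread, the work passed
to submit() needs no lock around that index's writer, as long as nothing
else (such as a shared IndexWriter in batch mode) uses it concurrently.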

-- Sanne

2008/11/20 Sanne Grinovero <sanne.grinovero at gmail.com>:
> Hello,
> because of HSEARCH-268 (optimize indexes in parallel), but also for
> other purposes, I need to define a new ThreadPool in Hibernate
> Search's Lucene backend.
> The final effect will actually be that all changes to indexes are
> going to be performed in parallel (on different indexes).
> I consider this a major improvement, and it is currently easy to
> implement, iff we solve the following problems.
>
> The question is how to size the pool properly, and how the
> parallel workers should interact, especially regarding commit failures
> and rollbacks:
>
> about the size
> =========
> I've considered some options:
> 1) "steal" the configuration setting from BatchedQueueingProcessor,
> making that implementation single-threaded,
> and reusing the parameter internally in the Lucene backend only (JMS
> doesn't need it AFAIK).
> I'm afraid this could break the configuration parsing of custom-made
> backends.
>
> 2) add a new parameter to the environment
>
> 3) use a size equal to the number of DirectoryProviders (this is the
> optimal value, could be the default and be overridden by a parameter).
>
> 4) change the contract of BackendQueueProcessorFactory: instead of
> returning one Runnable it returns a list of Runnables,
> so it's possible to use the existing Executor.
> This needs some consideration about how different Runnables have to
> "join the same TX"; the JMS implementation could return just one
> Runnable, so no worry about that.
>
> about transactions
> ============
> As you know Search is not using a two phase commit between DB and
> Index, but Emmanuel has a very cool vision about that: we could add
> that later.
> The problem is: what to do if a Lucene index update fails (e.g. index
> A is corrupted)?
> Should we cancel the tasks that are going to make changes to the other
> indexes, B and C?
> That would be possible, but I don't think you would like that: after
> all, the database changes are committed already, so I should actually
> make a "best effort" to update all indexes which are still working correctly.
>
> Another option would be to make the changes to all indexes, and then
> IndexWriter.commit() them all after they are all done.
> This is the opposite of the previous example, and also more complex to
> implement.
> I personally don't like this, but would like to hear more voices as it
> is an important matter.
>
> I think Search should work on a "best effort" basis for the next
> release: update all indexes it is able to.
> In a future one we could add an option to make it "two phase"
> (optionally) by playing with the new
> Lucene commit() capabilities, but this would only make sense if you
> actually wanted to roll back
> the database changes in case of an index failure.
>
> sharing IndexWriter in batch mode
> =====================
> This is not needed for HSEARCH-268 (optimize indexes in parallel) but
> is needed to get a major boost in indexing performance.
> Currently the IndexWriter lifecycle is coupled to the operations done
> in a transaction; (also Emmanuel reminded me
> we need to release the file lock ASAP as a supported configuration is
> to use two Search instances sharing the same FS-based index).
> We already have the concept of "batch operation" and "transactional
> operation"; the only difference is currently about
> which tuning settings are applied to the IndexWriter.
> My idea is to extend the semantics of "batch mode" to mean a state
> which globally affects the way IndexWriters
> are acquired and released: when in batch mode, the IndexWriter is not
> closed at the end of each work queue, and the locks are not used:
> the IndexWriter could be shared across different threads. This is not
> transactionally safe of course, but that's why this is called
> "batch mode" as opposed to "transactional mode": nobody would expect
> transactional behaviour.
> Care should be taken to revert the status to "transactional mode"
> and close the IndexWriter at the end, but this API
> would let me reindex the database using the "parallel
> scrollableresults" in the most efficient way, and nicely integrated.
> This isn't as complicated to implement as it is to explain ;-)
>
> Sanne
>