Sorry, I wrote something stupid:
"3) use a size equal to the number of DirectoryProviders (this is the
optimal value; it could be the default and be overridden by a parameter)"
is not true: this has nothing to do with the optimal value. Please discard
that option.
I think the best option would be to have a separate
Executor for each DirectoryProvider: otherwise a slowly reacting index
could block the correct operation of the others, as many queues could
pile up targeting the same DP and exhaust the threads, which would all
stay blocked and collapse the pool to a single-threaded model.
This makes me think we should create one executor
per DP, each one using just one thread: an additional benefit is
that no locking would be needed, so we could remove all barriers
in the backend (unless batch mode enables concurrent usage of the IndexWriter).
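To give an idea of what I mean, here is a very rough sketch of the
one-thread-per-DP idea (all class and method names below are invented,
it's just an illustration, not a proposed implementation):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerDirectoryProviderDispatcher {

    // one dedicated single-thread executor per DirectoryProvider (keyed by the provider instance)
    private final ConcurrentHashMap<Object, ExecutorService> executors =
            new ConcurrentHashMap<Object, ExecutorService>();

    // work targeting one DP is serialized on its own thread (so no locking is needed),
    // while work targeting different DPs proceeds in parallel
    public void submit(Object directoryProvider, Runnable indexWork) {
        executorFor(directoryProvider).submit(indexWork);
    }

    private ExecutorService executorFor(Object directoryProvider) {
        ExecutorService executor = executors.get(directoryProvider);
        if (executor == null) {
            ExecutorService fresh = Executors.newSingleThreadExecutor();
            executor = executors.putIfAbsent(directoryProvider, fresh);
            if (executor == null) {
                executor = fresh;   // we won the race, our executor is registered
            }
            else {
                fresh.shutdown();   // another thread registered one first
            }
        }
        return executor;
    }

    public void shutdown() {
        for (ExecutorService executor : executors.values()) {
            executor.shutdown();
        }
    }
}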
-- Sanne
2008/11/20 Sanne Grinovero <sanne.grinovero(a)gmail.com>:
Hello,
because of HSEARCH-268 (optimize indexes in parallel), but also for
other purposes, I need to define a new ThreadPool in Hibernate
Search's Lucene backend.
The net effect will be that all changes to the indexes are
performed in parallel (on different indexes).
I consider this a major improvement, and it is currently easy to
implement, provided we solve the following problems.
The questions are how to size it properly and how the
parallel workers should interact, especially regarding commit failures and
rollbacks:
about the size
=========
I've considered some options:
1) "steal" the configuration setting from BatchedQueueingProcessor,
transforming that implementation in singlethreaded,
and reusing the parameter internally to the Lucene backend only (JMS
doesn't need it AFAIK).
I'm afraid this could break custom made backends configuration parsing.
2) add a new parameter to the environment.
3) use a size equal to the number of DirectoryProviders (this is the
optimal value; it could be the default and be overridden by a parameter).
4) change the contract of BackendQueueProcessorFactory: instead of
returning one Runnable it returns a list of Runnables,
so it's possible to use the existing Executor.
This needs some thought about how the different Runnables should
"join the same TX"; the JMS implementation could return just one
Runnable, so no worry there.
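Just to make option 4 concrete, a hypothetical sketch of the changed
contract (signatures are approximated from memory, not the exact current API):

import java.util.List;
import java.util.Properties;

import org.hibernate.search.backend.LuceneWork;

// Sketch only: the shapes here are an approximation for discussion.
public interface BackendQueueProcessorFactory {

    void initialize(Properties props);

    // current contract (roughly): Runnable getProcessor(List<LuceneWork> queue)
    // proposed: one Runnable per DirectoryProvider touched by the queue, so the
    // existing Executor can schedule them independently; the JMS backend could
    // simply return a singleton list.
    List<Runnable> getProcessors(List<LuceneWork> queue);
}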
about transactions
============
As you know, Search is not using a two-phase commit between the DB and
the index, but Emmanuel has a very cool vision about that: we could add
it later.
The problem is: what do we do if a Lucene index update fails (e.g. index
A is corrupted)?
Should we cancel the tasks about to make changes to the other indexes, B and C?
That would be possible, but I don't think you would like that: after
all, the database changes are already committed, so we should actually
make a "best effort" to update all indexes which are still working correctly.
Another option would be to make the changes to all indexes, and then
IndexWriter.commit() them all once they are all done.
This is the opposite of the previous approach, and also more complex to
implement.
I personally don't like it, but would like to hear more voices as this
is an important matter.
I think Search should work on a "best effort" basis for the next
release: update all the indexes it is able to.
In a future release we could optionally add a "two phase" mode
by playing with the new
Lucene commit() capabilities, but this would only make sense if you
actually wanted to roll back
the database changes in case of an index failure.
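For the record, the "two phase" idea could be sketched roughly like this
on top of Lucene 2.4's IndexWriter.prepareCommit()/commit()/rollback()
(just an illustration of the concept, not a proposed API):

import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexWriter;

public class TwoPhaseIndexCommit {

    public void commitAll(List<IndexWriter> writers) throws IOException {
        // phase one: prepare every index; if any prepare fails, roll back all of them
        try {
            for (IndexWriter writer : writers) {
                writer.prepareCommit();
            }
        }
        catch (IOException prepareFailure) {
            for (IndexWriter writer : writers) {
                try {
                    writer.rollback();   // note: rollback() also closes the writer
                }
                catch (IOException ignored) {
                    // best effort cleanup
                }
            }
            throw prepareFailure;
        }
        // phase two: all prepares succeeded, make the changes visible everywhere
        for (IndexWriter writer : writers) {
            writer.commit();
        }
    }
}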
sharing IndexWriter in batch mode
=====================
This is not needed for HSEARCH-268 (optimize indexes in parallel) but
is needed to get a major boost in indexing performance.
Currently the IndexWriter lifecycle is coupled to the operations done
in a transaction (also, Emmanuel reminded me
we need to release the file lock ASAP, as a supported configuration is
to use two Search instances sharing the same FS-based index).
We already have the concepts of "batch operation" and "transactional
operation"; currently the only difference is
which tuning settings are applied to the IndexWriter.
My idea is to extend the semantics of "batch mode" to mean a state
which globally affects the way IndexWriters
are acquired and released: when in batch mode, the IndexWriter is not
closed at the end of each work queue, and the locks are not used:
the IndexWriter could be shared across different threads. This is not
transactionally safe of course, but that's why it is called
"batch mode" as opposed to "transactional mode": nobody would expect
transactional behaviour.
Care should be taken to revert the status to "transactional mode"
and close the IndexWriter at the end, but this API
would let me reindex the database using the "parallel
ScrollableResults" in the most efficient way, and nicely integrated.
This isn't as complicated to implement as it is to explain ;-)
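A very rough sketch of what I have in mind for the writer lifecycle
(every name below is invented, it's only meant to show the idea):

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;

public class WriterLifecycleSketch {

    private volatile boolean batchMode = false;
    private IndexWriter sharedWriter; // kept open for the whole batch

    public synchronized void startBatch(IndexWriter writer) {
        batchMode = true;
        sharedWriter = writer;
    }

    // batch mode: reuse the shared writer and skip per-queue locking;
    // transactional mode: open a fresh writer for this work queue
    public IndexWriter acquire(IndexWriterFactory factory) throws IOException {
        return batchMode ? sharedWriter : factory.open();
    }

    // transactional mode releases the FS lock ASAP; batch mode keeps the writer open
    public void release(IndexWriter writer) throws IOException {
        if (!batchMode) {
            writer.close();
        }
    }

    public synchronized void endBatch() throws IOException {
        batchMode = false;
        if (sharedWriter != null) {
            sharedWriter.close();  // revert to transactional mode and release the lock
            sharedWriter = null;
        }
    }

    // hypothetical hook for however the backend normally opens its writers
    public interface IndexWriterFactory {
        IndexWriter open() throws IOException;
    }
}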
Sanne