Hello,
because of HSEARCH-268 (optimize indexes in parallel), but also for
other purposes, I need to define a new ThreadPool in Hibernate
Search's Lucene backend.
The net effect will be that all changes to indexes are
performed in parallel (across different indexes).
I consider this a major improvement, and it is currently easy to
implement, provided we solve the following problems.
The questions are how to size the pool properly, and how the
parallel workers should interact, especially regarding commit failures and
rollbacks:
about the size
=========
I've considered some options:
1) "steal" the configuration setting from BatchedQueueingProcessor,
making that implementation single-threaded,
and reuse the parameter inside the Lucene backend only (JMS
doesn't need it AFAIK).
I'm afraid this could break configuration parsing in custom-made backends.
2) add a new parameter to the environment
3) use a size equal to the number of DirectoryProviders (this is the
optimal value; it could be the default and be overridden by a parameter).
4) change the contract of BackendQueueProcessorFactory: instead of
returning one Runnable it returns a list of Runnables,
so it's possible to use the existing Executor.
This needs some thought about how the different Runnables would
"join the same TX"; the JMS implementation could return just one
Runnable, so no worries there.
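
To make options 2 and 3 concrete, here is a minimal sketch, not a
proposal for the final code. It assumes the number of DirectoryProviders
is known when the backend starts; the property name and the class name
are invented for this example and do not exist in Search today.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch of options 2 and 3: one thread per index by default, overridable. */
final class BackendPoolSizing {

    // Hypothetical property name, invented for this example only.
    static final String POOL_SIZE_PROPERTY = "hibernate.search.lucene.pool_size";

    /** Returns the configured size, or the number of DirectoryProviders when unset. */
    static int resolvePoolSize(int directoryProviderCount, Properties props) {
        String configured = props.getProperty(POOL_SIZE_PROPERTY);
        if (configured == null || configured.trim().isEmpty()) {
            // Option 3: default to one worker per index, never fewer than one.
            return Math.max(1, directoryProviderCount);
        }
        // Option 2: an explicit parameter overrides the default.
        int size = Integer.parseInt(configured.trim());
        if (size < 1) {
            throw new IllegalArgumentException(POOL_SIZE_PROPERTY + " must be >= 1, was " + size);
        }
        return size;
    }

    /** Builds a fixed-size pool using the resolved size. */
    static ExecutorService newBackendExecutor(int directoryProviderCount, Properties props) {
        return Executors.newFixedThreadPool(resolvePoolSize(directoryProviderCount, props));
    }
}
```

With three DirectoryProviders and no property set this gives a pool of
three threads; setting the property to 8 gives eight.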
about transactions
============
As you know, Search does not use a two-phase commit between the DB and
the index, but Emmanuel has a very cool vision for that; we could add
it later.
The problem is: what should we do if a Lucene index update fails (e.g. index
A is corrupted)?
Should we cancel the tasks that are about to change the other indexes, B and C?
That would be possible, but I don't think you would like it: after
all, the database changes are already committed, so I should actually
make a "best effort" to update all indexes which are still working correctly.
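
A rough sketch of what "best effort" could look like, assuming one
Runnable per index submitted to a shared executor. The class name and
the use of plain index names as keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Best effort: apply each index's work independently and report what failed. */
final class BestEffortIndexUpdater {

    /**
     * Runs one task per index (keyed by a hypothetical index name) and returns
     * the names of the indexes whose task failed. A failure on one index never
     * cancels the work queued for the others.
     */
    static List<String> applyAll(Map<String, Runnable> workPerIndex, ExecutorService executor) {
        Map<String, Future<?>> pending = new LinkedHashMap<>();
        for (Map.Entry<String, Runnable> e : workPerIndex.entrySet()) {
            pending.put(e.getKey(), executor.submit(e.getValue()));
        }
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Future<?>> e : pending.entrySet()) {
            try {
                e.getValue().get(); // wait for this index only
            } catch (ExecutionException ex) {
                failed.add(e.getKey()); // remember the failure, keep going
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                failed.add(e.getKey()); // outcome unknown, report it as failed
            }
        }
        return failed;
    }
}
```

If the task for index A throws, the tasks for B and C still run, and the
caller gets back a list containing only "A" to log or handle.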
Another option would be to apply the changes to all indexes, and
call IndexWriter.commit() on each of them only once all the changes are done.
This is the opposite of the previous approach, and also more complex to
implement.
I personally don't like this, but would like to hear more voices as it
is an important matter.
I think Search should work on a "best effort" basis for the next
release: update all the indexes it is able to.
In a future release we could add an option to make it "two phase"
by playing with the new Lucene commit() capabilities, but this would
only make sense if you actually wanted to roll back
the database changes in case of an index failure.
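
For reference, the "two phase" variant could be sketched roughly as
below. The TwoPhaseIndex interface is a hypothetical stand-in that only
borrows the names of IndexWriter's prepareCommit(), commit() and
rollback(); how this would be wired to the database transaction is left
open here.

```java
import java.util.List;

/** Hypothetical stand-in for the three IndexWriter calls the sketch relies on. */
interface TwoPhaseIndex {
    void prepareCommit() throws Exception;
    void commit() throws Exception;
    void rollback() throws Exception;
}

final class AllOrNothingCommit {

    /**
     * Prepares every index first; commits only if all prepares succeeded,
     * otherwise rolls all of them back. Returns true when committed.
     */
    static boolean commitAll(List<TwoPhaseIndex> indexes) {
        try {
            for (TwoPhaseIndex index : indexes) {
                index.prepareCommit(); // phase one: may fail, nothing is visible yet
            }
        } catch (Exception prepareFailure) {
            for (TwoPhaseIndex index : indexes) {
                try {
                    index.rollback();
                } catch (Exception ignored) {
                    // nothing more we can do for this index
                }
            }
            return false; // the caller could now roll back the database as well
        }
        for (TwoPhaseIndex index : indexes) {
            try {
                index.commit(); // phase two
            } catch (Exception e) {
                // A failure here can leave the indexes inconsistent with each other.
                throw new IllegalStateException("commit failed after a successful prepare", e);
            }
        }
        return true;
    }
}
```

Note that even this sketch cannot fully rule out a failure in the second
phase, which is part of why it is more complex than best effort.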
sharing IndexWriter in batch mode
=====================
This is not needed for HSEARCH-268 (optimize indexes in parallel) but
is needed to get a major boost in indexing performance.
Currently the IndexWriter lifecycle is coupled to the operations done
in a transaction (Emmanuel also reminded me that
we need to release the file lock ASAP, as a supported configuration is
to use two Search instances sharing the same FS-based index).
We already have the concepts of "batch operation" and "transactional
operation"; currently the only difference between them is
which tuning settings are applied to the IndexWriter.
My idea is to extend the semantics of "batch mode" to mean a state
which globally affects the way IndexWriters
are acquired and released: when in batch mode, the IndexWriter is not
closed at the end of each work queue, and the locks are not used:
the IndexWriter could be shared across different threads. This is not
transactionally safe of course, but that's why it is called
"batch mode" as opposed to "transactional mode": nobody would expect
transactional behaviour.
Care should be taken to revert the state to "transactional mode"
and close the IndexWriter at the end, but this API
would let me reindex the database using the "parallel
scrollableresults" in the most efficient way, and nicely integrated.
This isn't as complicated to implement as it is to explain;-)
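
To show the acquire/release idea for batch mode, here is a simplified
sketch. The holder class, its method names and the writer factory are
all hypothetical, and it assumes the mode is only switched while no work
is in flight.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Sketch: how acquire/release of a writer could differ between the two modes. */
final class SharedWriterHolder<W extends Closeable> {

    private final Supplier<W> writerFactory; // hypothetical: opens the index writer
    private final ReentrantLock txLock = new ReentrantLock();
    private W writer;          // guarded by "this"
    private boolean batchMode; // guarded by "this"

    SharedWriterHolder(Supplier<W> writerFactory) {
        this.writerFactory = writerFactory;
    }

    /** Transactional mode: exclusive access. Batch mode: every thread gets the same writer. */
    W acquire() {
        boolean batch;
        synchronized (this) {
            batch = batchMode;
        }
        if (!batch) {
            txLock.lock(); // one work queue at a time
        }
        synchronized (this) {
            if (writer == null) {
                writer = writerFactory.get();
            }
            return writer;
        }
    }

    /** Transactional mode closes the writer here; batch mode leaves it open. */
    void release() {
        if (txLock.isHeldByCurrentThread()) {
            try {
                closeWriter();
            } finally {
                txLock.unlock();
            }
        }
    }

    /** Leaving batch mode closes the shared writer so the index is released. */
    synchronized void setBatchMode(boolean enabled) {
        if (batchMode && !enabled) {
            closeWriter();
        }
        batchMode = enabled;
    }

    private synchronized void closeWriter() {
        if (writer == null) {
            return;
        }
        W toClose = writer;
        writer = null;
        try {
            toClose.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A reindexing job would call setBatchMode(true), let its worker threads
acquire() and release() freely, and finish with setBatchMode(false).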
Sanne