[hibernate-dev] [Search] making updates to the indexes concurrently

Fri Nov 21 08:46:14 EST 2008

Answers inline:

2008/11/21 Hardy Ferentschik <hibernate at ferentschik.de>:
> On Thu, 20 Nov 2008 21:14:16 +0100, Sanne Grinovero
> <sanne.grinovero at gmail.com> wrote:
>
>> because of HSEARCH-268 (optimize indexes in parallel), but also for
>> other purposes, I need to define a new ThreadPool in Hibernate
>> Search's Lucene backend.
>> The final effect will actually be that all changes to indexes are
>> going to be performed in parallel (on different indexes).
>
> So really, we should rename the Jira task to "apply index modifications
> in parallel", unless we are first tackling the optimize problem only.
Yes, I should rename it; isolating the optimization alone would be more
difficult than actually doing all the work in parallel.
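
As a rough illustration of "all changes are applied in parallel, on different
indexes", here is a minimal sketch using a plain java.util.concurrent pool.
The class and method names (ParallelIndexUpdates, applyAll) and the use of
strings as stand-ins for index changes are hypothetical; this is not the
Hibernate Search backend code, only the general shape of the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names are Hibernate Search API.
class ParallelIndexUpdates {

    // Submits one task per index, so different indexes are updated
    // concurrently while the changes for a single index stay in order.
    static Map<String, List<String>> applyAll(Map<String, List<String>> workPerIndex)
            throws InterruptedException {
        Map<String, List<String>> applied = new ConcurrentHashMap<>();
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, workPerIndex.size()));
        try {
            for (Map.Entry<String, List<String>> entry : workPerIndex.entrySet()) {
                pool.execute(() -> {
                    List<String> done = new ArrayList<>();
                    for (String change : entry.getValue()) {
                        done.add(change); // stand-in for the real index write
                    }
                    applied.put(entry.getKey(), done);
                });
            }
        } finally {
            pool.shutdown(); // no new tasks; already submitted ones still run
        }
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return applied;
    }
}
```

With this shape, an optimize call would be just another change queued for its
index, which is why isolating optimization alone buys little.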

>
>> about the size
>> =========
>> I've considered some options:
>> 1) "steal" the configuration setting from BatchedQueueingProcessor,
>> making that implementation single-threaded,
>> and reusing the parameter internally in the Lucene backend only (JMS
>> doesn't need it AFAIK).
>> I'm afraid this could break the configuration parsing of custom-made
>> backends.
>>
>> 2) add a new parameter to the environment
>
> Will this change alter the way the work execution is configured? For
> example, if you set hibernate.search.worker.execution=async, does this
> mean that first of all the work queue itself gets processed by
> hibernate.search.worker.thread_pool.size number of threads, and then
> there will be another thread (from another pool? same pool?) per
> directory provider applying the actual changes?
> If so we need two settings, I believe, since the default size of these
> two different thread pools will be different, right?
Yes, you got the point; this is why I'm writing here.
Actually the size of the second pool, as mentioned in the other email,
should be fixed, so we don't need to add a new parameter. This is good
for the book ;-)
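
To make the "one setting, two pools" point concrete, here is a small sketch.
The property name hibernate.search.worker.thread_pool.size comes from this
thread; the default of 1 and the reading of "fixed" as "one thread per
directory provider" are assumptions made for the example, and WorkerPools is
a hypothetical class.

```java
import java.util.Properties;

// Hypothetical sketch: WorkerPools is not a Hibernate Search class.
class WorkerPools {

    static final String QUEUE_POOL_SIZE = "hibernate.search.worker.thread_pool.size";

    // Size of the pool processing the work queue; user-configurable.
    // Assumption: fall back to 1 when the property is not set.
    static int queuePoolSize(Properties cfg) {
        return Integer.parseInt(cfg.getProperty(QUEUE_POOL_SIZE, "1").trim());
    }

    // Size of the pool applying changes to the indexes; not configurable.
    // Assumption: "fixed" is read here as one thread per directory provider.
    static int backendPoolSize(int directoryProviders) {
        return Math.max(1, directoryProviders);
    }
}
```

Since the second size is derived rather than read from the environment, the
existing configuration keys stay as they are.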

>
>> about transactions
>> ============
>> As you know Search is not using a two phase commit between DB and
>> Index, but Emmanuel has a very cool vision about that: we could add
>> that later.
>> The problem is: what to do if a Lucene index update fails (e.g. index
>> A is corrupted),
>> should we cancel the tasks going to make changes to the other indexes, B
>> and C?
>> That would be possible, but I don't think that you like that: after
>> all the database changes are committed already, so I should actually
>> make a "best effort" to update all indexes which are still working
>> correctly.
>
> I think a best effort approach is the best for now. I assume we still
> throw an Exception if the index operation fails, right? In this case the
> user will have to reindex one way or the other anyway.
Well, an indexing-time failure would kill a thread in the pool and get
logged, of course, but it would not "kill" the committing transaction.
These are the same semantics as the current async mode, so my guess is
that it could stay that way, at least for this release.
In sync mode, though, the behaviour would be different: should I catch
the exception and pass it back to the committing thread?
This means exception handling currently differs between sync and async
too; probably nobody is bothered by this detail?
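
One possible shape for "best effort, but report back in sync mode" is
sketched below: every index task runs to completion regardless of the others,
and the caller gets the list of indexes that failed. BestEffortUpdates and
runAll are hypothetical names, and using Future.get() to carry the failure
back to the waiting thread is just one option, not a description of the
current code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: not the Hibernate Search implementation.
class BestEffortUpdates {

    // Runs one task per index and waits for all of them. A failure on one
    // index does not cancel the others; failed index names are returned.
    static List<String> runAll(Map<String, Callable<Void>> tasks)
            throws InterruptedException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        Map<String, Future<Void>> futures = new LinkedHashMap<>();
        List<String> failed = new ArrayList<>();
        try {
            for (Map.Entry<String, Callable<Void>> e : tasks.entrySet()) {
                futures.put(e.getKey(), pool.submit(e.getValue()));
            }
            for (Map.Entry<String, Future<Void>> e : futures.entrySet()) {
                try {
                    e.getValue().get(); // rethrows the task's exception, wrapped
                } catch (ExecutionException ex) {
                    failed.add(e.getKey()); // real code would also log ex.getCause()
                }
            }
        } finally {
            pool.shutdown();
        }
        return failed;
    }
}
```

In sync mode the committing thread could throw if the returned list is not
empty; in async mode nobody waits on the futures, so logging is all that is
left.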
>

> Will we be able to get this change in for the GA release?

Yes, the proof-of-concept code is working here; I need to do some more
tests and, of course, get a "go" from this list.

>
> --Hardy
>

thanks for helping,
Sanne