[hibernate-dev] Re: Search: backend refactoring

Sat Sep 13 12:29:37 EDT 2008

during some tests for the refactoring I was doing I've found HSEARCH-263
which is currently blocking me;
could you give me some directions about how to solve it best?

Sanne

2008/9/8 Sanne Grinovero <sanne.grinovero at gmail.com>:
> 2008/9/8 Emmanuel Bernard <emmanuel at hibernate.org>:
>>
>>
>> On  Sep 7, 2008, at 05:41, Sanne Grinovero wrote:
>>
>>> The short question:
>>>        may I add some methods to the implementations of LuceneWork?
>>>        I'm refactoring the backends and it would help, but there
>>>        is a warning there in the javadoc about not changing it freely.
>>>
>>>        Sanne
>>
>> The short answer is no, I don't think it should be needed. LuceneWork should
>> be the minimal contract needed when sending info across the wire. What
>> additional info do you need to forward?
>
> I'll try keeping the serialized fields the same, I don't currently
> need more information
> across the wire but the code is helped a lot if I may add some
> methods, no state change
> so I suppose there should be no problem.
>
>>
>>>
>>>
>>> The same question, a bit more verbose:
>>>
>>> Hi,
>>> I've been puzzling about several optimization in Search I would like
>>> to implement,
>>> but am needing to do some refactoring in the
>>> org.hibernate.search.backend package.
>>> (mostly done actually, but needing your ideas)
>>>
>>> Most changes affect "lucene" implementation, but the code would be
>>> greatly simplified,
>>> more readable and (performing better too IMHO) if I'm permitted to change
>>> the
>>> current implementations of LuceneWork; however there's a big warning there
>>> about a requirement to be backwards compatible with the serialized form.
>>> (btw OptimizeLuceneWork is missing the "magic serialization number")
>>
>> optimize does not cross the wire
>
> didn't see that, thanks. good idea actually.
>
>>>
>>> I would like to add them some methods, and a single field which could
>>> actually
>>> be transient so I could attempt to maintain the compatibility.
>>> Additionally I've been thinking that iff you like to keep the LuceneWork
>>> as
>>> a very simple transport and prefer to not add methods, it would be nicer
>>> to
>>> have just one class and have the AddLuceneWork/DeleteLuceneWork/... to
>>> differentiate
>>> by a field (using org.hibernate.search.backend.WorkType ?)
>>
>> I am open to this approach. I initially created subclasses because the
>> necessary data was different between works.
>
> Nice, but will keep the option as last resort.
> I'm currently happy adding polymorhic methods, without state changes.
>
>>
>>>
>>> to mark the different type of work; so I could add
>>> the methods I'm needing to the enum.
>>> Also I could see some use of having an UpdateLuceneWork too, so that it is
>>> the backend implementation's business to decide if he wants to split it in
>>> a
>>> delete+insert or do something more clever:
>>> the receive order of messages would be less critical and some clever
>>> optimizations
>>> could be applied by the backend by reordering received Work(s) or
>>> repackaging
>>> several queues in one.
>>
>> Why would the order of message be less critical? Not sure what you mean by
>> critical as it's contained in a given work load.
>
> I mean the order inside the same workload is less critical, currently
> the updates
> are split in two, then removes are done first.. that part.
> I would like to use a "reorder engine" to tweak some stuff (is working very nice
> in alpha code locally); if I know it's an update the reorder engine is simpler,
> otherwise it needs to detect operations on same entity.
>
>>>
>>> What I've done already:
>>> a)early division in different queues, basing on affected
>>> DirectoryProviders
>>>
>>> b)refactoring/simplification of Workspace, no longer needed to keep track
>>> of
>>> state for different DP as there is only one in the context.
>>>
>>> c)shorter Lock times: no threads ever need more than one Lock;
>>> work is sorted by DP, each lock is released before acquiring the next one.
>>> (deadlockFreeQueue is removed as not needed anymore)
>>> before if we needed lock on DP's A,B,C the time of acquisition looked
>>> like:
>>> Alock *********
>>> Block    ******
>>> CLock       ***
>>> now it is more like
>>> Alock ***
>>> Block    ***
>>> Clock       ***
>>> And my goal is to make this possible, in separate threads when async:
>>> Alock ***
>>> Block ***
>>> Clock ***
>>> (not implemented yet: will need a new backend, but I'm preparing the
>>> common
>>> stuff to make this possible)
>>>
>>> d)The QueueProcessor can ask the Work about if they need an indexwriter,
>>> indexreader or have any preference about one for when there is
>>> possibility to make a choice (when we open both a reader and writer
>>> anyway because of strict requirement of other Work in the same queue).
>>
>> I partly follow you (a delete can be done by a writer in some situations)
>> but I don't quite understand why the work should describe that. What do you
>> gain?
>
> just abstraction, the delete work has a method with some logic to give
> the correct
> answer, the add work just returns a constant value "I need the indexwriter".
> The engine checks all work and then has to accomodate all needs,
> and could avoid to open an IR or an IW.
>
>>>
>>>
>>> e)basing on d), DeleteLuceneWork is able to run either on reader or writer
>>> (when it's possible to do so, depending on (the number of different
>>> classes using the same DP) == 1); In this last case the work is able to
>>> tell it "prefers" to be executed on an IndexWriter, but will be able
>>> to do it's task with an IndexReader too (or the opposite?)
>>
>> when would you need to still use the IR approach in that case?
>
> for example if I have more work to do on an IR, but no other work
> needing a IW; in this case I prefer to run the deletion by using an IR.
> I am thinking to put overloaded methods to perform the work
> doit( IR, other needed stuff);
> doit( IW, other..);
> where the AddWork could throw an unsupportedOperation on the wrong one,
> the DeleteWork able to accomplish both (in some situations),
> and in the case of IR can do it (internal choice) in two different ways
> (depending on configuration) selecting the fastest non-broken option.
>
>>>
>>>
>>> f)"batch mode" is currently set on all DP if only one Work is of type
>>> batch,
>>> the division of Workspace per DP does not need this any more and batch
>>> mode can be set independently.
>>
>> good to have the flexibility but I am not sure we will ever need that. This
>> case should not happen unless you merge queues from different transactions.
>
> agree, not needed. It is more like a side-effect of the context
> separation; I was thinking in
> changing the code to achieve current behavior but thought it could be a welcomed
> side-effect so it staid there.
>
>>>
>>> Another goal I have with this design is the possibility to aggregate
>>> different committed queues in one, having the possibility to
>>> optimize away work (insert then delete => noop) considering the original
>>> order,
>>
>> hum total ordering is hard (on multi VM) and this case (insert then delete)
>> is probably very uncommon. (though it could happen if you execute the work
>> of a whole day at once ; but then you face memory issues to order queues).
>
> Well I didn't implement it, just that after this work it would be
> easier to do this
> type of optimizations; there are a lot more you can think about.
> (as detecting "update, update" on same entity, I suppose this actually happens
> more frequently).
>
>>
>>> but also call the strategy optimization again
>>> to reorder the newly created work for best efficiency.
>>> The final effect would be to obtain the same behavior of
>>> my custom batch indexer, but optimizing not only indexing from scratch
>>> but any type of load.
>>> I hope to not scare you, the resulting code is quite simple and I
>>> think there are actually less LOC than the current trunk has;
>>> I've not prepared any special case Test, I just run all existing ones.
>>
>> let's try and chat on IM around that.
>
> nice; I'm trying already and have organized my work this way:
>  * some refactoring of existing engine to cleanup code
>  * test it still is all working
>  * build an experimental QueueProcessor in a separate package
>
>>
>>>
>>>
>>> kind regards,
>>> Sanne
>>
>>
>