[hibernate-dev] Re: Search: backend refactoring

Emmanuel Bernard emmanuel at hibernate.org
Mon Sep 8 10:26:26 EDT 2008



On Sep 7, 2008, at 05:41, Sanne Grinovero wrote:

> The short question:
> 	may I add some methods to the implementations of LuceneWork?
> 	I'm refactoring the backends and it would help, but there is
> 	a warning in the javadoc about not changing it freely.
>
> 	Sanne

The short answer is no, I don't think it should be needed. LuceneWork  
should be the minimal contract needed when sending info across the  
wire. What additional info do you need to forward?

>
>
> The same question, a bit more verbose:
>
> Hi,
> I've been puzzling over several optimizations in Search that I would
> like to implement, but I need to do some refactoring in the
> org.hibernate.search.backend package.
> (mostly done actually, but I'd like your ideas)
>
> Most changes affect the "lucene" implementation, but the code would be
> greatly simplified, more readable, and (IMHO) better performing if I'm
> permitted to change the current implementations of LuceneWork;
> however there's a big warning there about a requirement to stay
> backwards compatible with the serialized form.
> (btw OptimizeLuceneWork is missing the "magic serialization number")

Optimize does not cross the wire, so serialization compatibility is
not a concern there.
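For context, the "magic serialization number" is Java's serialVersionUID: declaring it explicitly pins the serialized form, so refactorings that don't change the wire format stay compatible. A minimal illustration, using an invented stand-in class rather than the real OptimizeLuceneWork:

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Hypothetical, simplified stand-in for a serializable work object.
// Declaring serialVersionUID explicitly means the JVM will not derive one
// from the class shape, so adding methods does not break deserialization.
public class OptimizeWorkSketch implements Serializable {

    private static final long serialVersionUID = 1L;

    private final Class<?> entityClass;

    public OptimizeWorkSketch(Class<?> entityClass) {
        this.entityClass = entityClass;
    }

    public Class<?> getEntityClass() {
        return entityClass;
    }

    public static long declaredUid() {
        // ObjectStreamClass reports the UID the JVM will use when serializing.
        return ObjectStreamClass.lookup(OptimizeWorkSketch.class).getSerialVersionUID();
    }
}
```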

>
>
> I would like to add some methods to them, and a single field which
> could actually be transient, so I could attempt to maintain
> compatibility.
> Additionally, I've been thinking that if you'd like to keep LuceneWork
> as a very simple transport and prefer not to add methods, it would be
> nicer to have just one class, and have AddLuceneWork/DeleteLuceneWork/...
> differentiated by a field (using org.hibernate.search.backend.WorkType?)

I am open to this approach. I initially created subclasses because the
necessary data differed between work types.

>
> to mark the different types of work; that way I could add the methods
> I need to the enum.
> I could also see some use in having an UpdateLuceneWork, so that it
> becomes the backend implementation's business to decide whether to
> split it into a delete+insert or do something more clever:
> the receive order of messages would be less critical, and some clever
> optimizations could be applied by the backend by reordering received
> Work(s) or repackaging several queues into one.

Why would the order of messages be less critical? I'm not sure what
you mean by critical, as ordering is contained within a given work load.
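The single-class alternative, including the proposed UpdateLuceneWork behavior, could look roughly like this. It is a minimal standalone sketch, not the real Hibernate Search API: the class and method names (GenericLuceneWork, toBackendOperations) are invented for illustration, and only WorkType echoes the existing org.hibernate.search.backend.WorkType.

```java
import java.io.Serializable;
import java.util.List;

// Minimal sketch of the "one class + type field" idea, with an UPDATE type
// that the backend may split into DELETE + ADD itself. All names here are
// invented for illustration; this is not the Hibernate Search API.
public class GenericLuceneWork implements Serializable {

    private static final long serialVersionUID = 1L;

    public enum WorkType { ADD, DELETE, UPDATE, OPTIMIZE }

    private final WorkType type;
    private final Serializable entityId;   // null for OPTIMIZE
    private final String entityName;

    public GenericLuceneWork(WorkType type, Serializable entityId, String entityName) {
        this.type = type;
        this.entityId = entityId;
        this.entityName = entityName;
    }

    public WorkType getType() { return type; }
    public Serializable getEntityId() { return entityId; }
    public String getEntityName() { return entityName; }

    // A naive backend strategy: rewrite UPDATE as DELETE followed by ADD on
    // the same entity; a cleverer backend could keep the UPDATE intact and
    // handle it in one pass.
    public List<GenericLuceneWork> toBackendOperations() {
        if (type == WorkType.UPDATE) {
            return List.of(new GenericLuceneWork(WorkType.DELETE, entityId, entityName),
                           new GenericLuceneWork(WorkType.ADD, entityId, entityName));
        }
        return List.of(this);
    }
}
```

The point of keeping UPDATE as a first-class type is that the split becomes a backend decision rather than something baked into the transport.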

>
>
> What I've done already:
> a) early division into different queues, based on the affected
> DirectoryProviders
>
> b) refactoring/simplification of Workspace: it no longer needs to
> keep track of state for different DPs, as there is only one in its
> context.
>
> c) shorter lock times: no thread ever needs more than one lock;
> work is sorted by DP, and each lock is released before acquiring the
> next one (deadlockFreeQueue is removed, as it is no longer needed).
> Before, if we needed locks on DPs A, B, and C, the acquisition
> timeline looked like:
> Alock *********
> Block    ******
> CLock       ***
> now it is more like
> Alock ***
> Block    ***
> Clock       ***
> And my goal is to make this possible, in separate threads when async:
> Alock ***
> Block ***
> Clock ***
> (not implemented yet: it will need a new backend, but I'm preparing
> the common stuff to make this possible)
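The one-lock-at-a-time scheme in c) can be sketched with plain java.util.concurrent locks. The class and method names are invented for illustration, and a String stands in for a DirectoryProvider; the real code works against Lucene directories.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of scheme c): work is grouped per DirectoryProvider (here just a
// String name), groups are processed in a single deterministic sort order,
// and each lock is released before the next one is acquired, so no thread
// ever holds two locks and lock ordering tricks become unnecessary.
public class PerDirectoryQueueProcessor {

    private final Map<String, ReentrantLock> locks = new TreeMap<>();

    public ReentrantLock lockFor(String directory) {
        return locks.computeIfAbsent(directory, d -> new ReentrantLock());
    }

    /** Runs the work for each directory, holding only one lock at a time. */
    public List<String> process(Map<String, Runnable> workPerDirectory) {
        List<String> order = new ArrayList<>(workPerDirectory.keySet());
        order.sort(Comparator.naturalOrder());    // global order replaces deadlockFreeQueue
        for (String dir : order) {
            ReentrantLock lock = lockFor(dir);
            lock.lock();
            try {
                workPerDirectory.get(dir).run();
            } finally {
                lock.unlock();                    // released before touching the next directory
            }
        }
        return order;
    }
}
```

Because every lock is dropped before the next is taken, moving each directory's work to its own thread (the fully parallel timeline above) does not change the locking invariants.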
>
> d) The QueueProcessor can ask each Work whether it needs an
> IndexWriter or an IndexReader, or has a preference between them, for
> when there is a possibility to make a choice (when we open both a
> reader and a writer anyway because of the strict requirements of
> other Work in the same queue).

I partly follow you (a delete can be done by a writer in some  
situations) but I don't quite understand why the work should describe  
that. What do you gain?
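What the work could gain by describing this: the queue processor can decide up front whether a reader must be opened at all. A minimal sketch, with invented enum and method names:

```java
// Sketch of d): each work advertises whether it requires an IndexWriter,
// requires an IndexReader, or merely prefers the writer, so the queue
// processor opens only the resources the whole queue really needs.
// Names are illustrative, not the Hibernate Search API.
public class ResourcePlanner {

    public enum Need { REQUIRES_WRITER, REQUIRES_READER, PREFERS_WRITER }

    public static boolean queueNeedsReader(Iterable<Need> needs) {
        for (Need n : needs) {
            if (n == Need.REQUIRES_READER) {
                return true;     // at least one work cannot run on the writer
            }
        }
        return false;            // every work can run on the writer alone
    }
}
```

With this, a queue of deletes that can all run on the writer never pays for opening a reader.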

>
>
> e) building on d), DeleteLuceneWork is able to run on either a reader
> or a writer (when it's possible to do so, depending on whether (the
> number of different classes using the same DP) == 1); in this last
> case the work is able to say it "prefers" to be executed on an
> IndexWriter, but will be able to do its task with an IndexReader too
> (or the opposite?)

When would you still need to use the IndexReader approach in that case?

>
>
> f)"batch mode" is currently set on all DP if only one Work is of  
> type batch,
> the division of Workspace per DP does not need this any more and batch
> mode can be set independently.

Good to have the flexibility, but I am not sure we will ever need it.
This case should not happen unless you merge queues from different
transactions.

>
>
> Another goal I have with this design is the possibility to aggregate
> different committed queues into one, with the possibility to optimize
> away work (insert then delete => noop) while respecting the original
> order,

Hum, total ordering is hard (on multiple VMs), and this case (insert
then delete) is probably very uncommon (though it could happen if you
execute the work of a whole day at once; but then you face memory
issues when ordering the queues).

> but also to call the optimization strategy again to reorder the newly
> created work for best efficiency.
> The final effect would be to obtain the same behavior as my custom
> batch indexer, but optimizing not only indexing from scratch but any
> type of load.
> I hope this doesn't scare you; the resulting code is quite simple,
> and I think there are actually fewer LOC than the current trunk has.
> I haven't prepared any special-case tests, I just ran all the
> existing ones.

Let's try to chat about that on IM.
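The "insert then delete => noop" aggregation over merged queues could be sketched as below. It assumes an ADD seen within the merged window is a true insert (the entity was not in the index before), so a later DELETE cancels it entirely; all names are invented, and this is a simplified model rather than the Hibernate Search implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Sketch: several committed queues are merged in original order, then an ADD
// followed later by a DELETE for the same entity collapses to nothing, while
// the relative order of all surviving work is preserved.
public class QueueCompactor {

    public enum Op { ADD, DELETE }

    public static final class Work {
        public final Op op;
        public final long entityId;
        public Work(Op op, long entityId) { this.op = op; this.entityId = entityId; }
    }

    public static List<Work> compact(List<Work> merged) {
        List<Work> result = new ArrayList<>();
        Map<Long, Integer> lastAddIndex = new HashMap<>();  // entity id -> position of its pending ADD
        for (Work w : merged) {
            Integer addIdx = lastAddIndex.remove(w.entityId);
            if (w.op == Op.DELETE && addIdx != null) {
                result.set(addIdx, null);   // cancel the earlier ADD and drop this DELETE: noop
            } else {
                if (w.op == Op.ADD) {
                    lastAddIndex.put(w.entityId, result.size());
                }
                result.add(w);
            }
        }
        result.removeIf(Objects::isNull);   // squeeze out the cancelled slots
        return result;
    }
}
```

The memory concern raised above shows up here directly: the whole merged window has to be materialized before compaction can run.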

>
>
> kind regards,
> Sanne



