[hibernate-dev] Search: backend refactoring

Sanne Grinovero sanne.grinovero at gmail.com
Sun Sep 7 05:41:13 EDT 2008


The short question:
	may I add some methods to the implementations of LuceneWork?
	I'm refactoring the backends and it would help, but there
	is a warning in the javadoc about not changing them freely.

	Sanne

The same question, a bit more verbose:

Hi,
I've been puzzling over several optimizations in Search I would like
to implement,
but I need to do some refactoring in the
org.hibernate.search.backend package.
(mostly done already, actually, but I need your opinions)

Most changes affect the "lucene" backend implementation, but the code would be
greatly simplified, more readable and, IMHO, better performing if I'm permitted
to change the current implementations of LuceneWork; however, there's a big
warning there about a requirement to stay backwards compatible with the
serialized form.
(BTW, OptimizeLuceneWork is missing the "magic serialization number", i.e. the
serialVersionUID.)

I would like to add some methods to them, plus a single field which could
actually be transient, so I could attempt to maintain serialization compatibility.
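
Just to illustrate the Java serialization rules I'm relying on (this is not the
real LuceneWork, only a made-up example): adding methods never changes the
serialized form, a transient field is skipped entirely, and keeping the
serialVersionUID pinned keeps old streams readable:

import java.io.Serializable;

// Illustration only, not the real class.
public class ExampleWork implements Serializable {

    private static final long serialVersionUID = 1L; // keep the existing id

    private final String entityId;        // part of the serialized form

    // never written to the stream, so it cannot break compatibility;
    // it just has to be recomputed or defaulted after deserialization
    private transient boolean preferIndexWriter;

    public ExampleWork(String entityId) {
        this.entityId = entityId;
    }

    public String getEntityId() {
        return entityId;
    }

    // new methods are always safe to add
    public boolean prefersIndexWriter() {
        return preferIndexWriter;
    }
}
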
Additionally, I've been thinking that if you prefer to keep LuceneWork as a
very simple transport object and not add methods to it, it would be nicer to
have just one class and differentiate AddLuceneWork/DeleteLuceneWork/... by a
field (using org.hibernate.search.backend.WorkType?) marking the type of work;
then I could add the methods I need to the enum.
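
Something along these lines is what I have in mind; all names and the
per-type return values are invented, it's only meant to show the shape:

// Hypothetical shape: one transport class, per-type behaviour on the enum.
enum WorkKind {
    ADD      { boolean needsIndexWriter() { return true; } },
    DELETE   { boolean needsIndexWriter() { return false; } },
    OPTIMIZE { boolean needsIndexWriter() { return true; } };

    // the kind of per-type logic I'd like to attach somewhere
    abstract boolean needsIndexWriter();
}

final class SingleLuceneWork implements java.io.Serializable {

    private static final long serialVersionUID = 1L;

    private final WorkKind kind;               // replaces the subclass hierarchy
    private final java.io.Serializable id;
    private final Class<?> entityClass;

    SingleLuceneWork(WorkKind kind, java.io.Serializable id, Class<?> entityClass) {
        this.kind = kind;
        this.id = id;
        this.entityClass = entityClass;
    }

    WorkKind getKind() { return kind; }
    java.io.Serializable getId() { return id; }
    Class<?> getEntityClass() { return entityClass; }
}
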
I could also see some use for an UpdateLuceneWork, so that it becomes the
backend implementation's business to decide whether it wants to split it into
a delete+insert or do something more clever:
the receive order of messages would be less critical, and the backend could
apply some clever optimizations by reordering the received Works or by
repackaging several queues into one.
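
A rough sketch of the idea (UpdateWork/AddWork/DeleteWork are placeholders,
not existing classes): a backend that cannot update in place falls back to
delete+add, a smarter one keeps the update intact and is free to reorder or
merge it later:

import java.util.ArrayList;
import java.util.List;

final class UpdateAwareQueueProcessor {

    List<Object> expand(List<Object> queue, boolean backendSupportsInPlaceUpdate) {
        List<Object> result = new ArrayList<Object>();
        for (Object work : queue) {
            if (work instanceof UpdateWork && !backendSupportsInPlaceUpdate) {
                UpdateWork u = (UpdateWork) work;
                // naive fallback: split into delete + add, preserving order
                result.add(new DeleteWork(u.id));
                result.add(new AddWork(u.id));
            }
            else {
                // a smarter backend keeps the update as-is and can reorder
                // it later without worrying about delete/add ordering
                result.add(work);
            }
        }
        return result;
    }

    static class UpdateWork { final Object id; UpdateWork(Object id) { this.id = id; } }
    static class AddWork    { final Object id; AddWork(Object id)    { this.id = id; } }
    static class DeleteWork { final Object id; DeleteWork(Object id) { this.id = id; } }
}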

What I've done already:
a) early division into different queues, based on the affected DirectoryProviders.
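
Roughly this, with placeholder generic types; how a Work maps to its
DirectoryProvider is elided here (in reality it comes from the entity's
index mapping), only the grouping step is shown:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface ProviderResolver<P, W> {
    P providerFor(W work);
}

final class QueueSplitter {

    static <P, W> Map<P, List<W>> splitByProvider(List<W> queue,
                                                  ProviderResolver<P, W> resolver) {
        Map<P, List<W>> perProvider = new HashMap<P, List<W>>();
        for (W work : queue) {
            P provider = resolver.providerFor(work);
            List<W> bucket = perProvider.get(provider);
            if (bucket == null) {
                bucket = new ArrayList<W>();
                perProvider.put(provider, bucket);
            }
            bucket.add(work);  // original order preserved within each provider's queue
        }
        return perProvider;
    }
}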

b) refactoring/simplification of Workspace: it no longer needs to keep track of
state for different DPs, as there is only one DP in its context.

c) shorter lock times: no thread ever needs more than one lock;
work is sorted by DP, and each lock is released before acquiring the next one
(deadlockFreeQueue is removed, as it is not needed anymore).
Before, if we needed locks on DPs A, B and C, the acquisition times looked like:
lock A *********
lock B    ******
lock C       ***
Now it is more like:
lock A ***
lock B    ***
lock C       ***
And my goal is to make this possible, running in separate threads when async:
lock A ***
lock B ***
lock C ***
(not implemented yet: it will need a new backend, but I'm preparing the common
stuff to make this possible)
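
The locking pattern, sketched with placeholder types (only the
acquire/release discipline matters here: a single lock is held at a time and
released before the next provider is touched, so no deadlock-avoidance
ordering is needed):

import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.Lock;

interface WorkExecutor<W> {
    void execute(W work);
}

final class PerProviderApplier {

    static <P, W> void applyAll(Map<P, List<W>> perProvider,
                                Map<P, Lock> locks,
                                WorkExecutor<W> executor) {
        for (Map.Entry<P, List<W>> entry : perProvider.entrySet()) {
            Lock lock = locks.get(entry.getKey());
            lock.lock();
            try {
                for (W work : entry.getValue()) {
                    executor.execute(work);
                }
            }
            finally {
                lock.unlock();  // released before touching the next provider
            }
        }
    }
}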

d) The QueueProcessor can ask each Work whether it needs an IndexWriter or an
IndexReader, or whether it merely has a preference for one of them, for the
cases where there is a possibility to make a choice (i.e. when we open both a
reader and a writer anyway because of the strict requirements of other Work in
the same queue).
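
The interface I have in mind looks more or less like this (names invented,
nothing of this exists yet):

interface IndexInteraction {

    enum Need { WRITER, READER, EITHER }

    Need requires();   // hard requirement
    Need prefers();    // tie-breaker when both are open anyway
}

final class QueueNeeds {

    // the QueueProcessor would only open an IndexWriter when some Work
    // strictly requires it; preferences alone don't force it open
    static boolean queueRequiresWriter(Iterable<? extends IndexInteraction> queue) {
        for (IndexInteraction work : queue) {
            if (work.requires() == IndexInteraction.Need.WRITER) {
                return true;
            }
        }
        return false;
    }
}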

e) building on d), DeleteLuceneWork is able to run on either a reader or a
writer (when it's possible to do so, i.e. when only one class is using the
same DP); in that case the work is able to tell it "prefers" to be executed
on an IndexWriter, but can do its task with an IndexReader too (or is it the
opposite?).

f)"batch mode" is currently set on all DP if only one Work is of type batch,
the division of Workspace per DP does not need this any more and batch
mode can be set independently.

Another goal I have with this design is the possibility to aggregate different
committed queues into one: this makes it possible to optimize away work
(an insert followed by a delete => noop) while respecting the original order,
and also to run the strategy optimization again to reorder the newly created
work for best efficiency.
The final effect would be to obtain the same behavior as my custom batch
indexer, but optimizing not only indexing from scratch but any type of load.
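
A naive sketch of the compaction step, with placeholder classes; real code
would also have to care about duplicate adds and about documents that already
existed in the index before the batch:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class QueueCompactor {

    enum Kind { ADD, DELETE }

    static final class WorkItem {
        final Kind kind;
        final Object entityId;
        WorkItem(Kind kind, Object entityId) {
            this.kind = kind;
            this.entityId = entityId;
        }
    }

    List<WorkItem> compact(List<WorkItem> mergedQueue) {
        List<WorkItem> result = new ArrayList<WorkItem>();
        Map<Object, Integer> pendingAdd = new HashMap<Object, Integer>();
        for (WorkItem work : mergedQueue) {
            if (work.kind == Kind.ADD) {
                pendingAdd.put(work.entityId, result.size());
                result.add(work);
            }
            else { // DELETE
                Integer addPosition = pendingAdd.remove(work.entityId);
                if (addPosition != null) {
                    result.set(addPosition, null); // cancel the earlier ADD...
                    // ...and skip this DELETE: the pair is a no-op
                }
                else {
                    result.add(work);
                }
            }
        }
        result.removeAll(Collections.singleton(null)); // drop cancelled slots
        return result;
    }
}
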
I hope this doesn't scare you: the resulting code is quite simple, and I think
there are actually fewer LOC than current trunk has. I haven't prepared any
special-case tests yet; I just ran all the existing ones.

kind regards,
Sanne


