2008/9/8 Emmanuel Bernard <emmanuel(a)hibernate.org>:
On Sep 7, 2008, at 05:41, Sanne Grinovero wrote:
> The short question:
> may I add some methods to the implementations of LuceneWork?
> I'm refactoring the backends and it would help, but there
> is a warning there in the javadoc about not changing it freely.
>
> Sanne
The short answer is no, I don't think it should be needed. LuceneWork should
be the minimal contract needed when sending info across the wire. What
additional info do you need to forward?
I'll try keeping the serialized fields the same, I don't currently
need more information
across the wire but the code is helped a lot if I may add some
methods, no state change
so I suppose there should be no problem.
>
>
> The same question, a bit more verbose:
>
> Hi,
> I've been puzzling about several optimization in Search I would like
> to implement,
> but am needing to do some refactoring in the
> org.hibernate.search.backend package.
> (mostly done actually, but needing your ideas)
>
> Most changes affect "lucene" implementation, but the code would be
> greatly simplified,
> more readable and (performing better too IMHO) if I'm permitted to change
> the
> current implementations of LuceneWork; however there's a big warning there
> about a requirement to be backwards compatible with the serialized form.
> (btw OptimizeLuceneWork is missing the "magic serialization number")
optimize does not cross the wire
didn't see that, thanks. good idea actually.
>
> I would like to add them some methods, and a single field which could
> actually
> be transient so I could attempt to maintain the compatibility.
> Additionally I've been thinking that iff you like to keep the LuceneWork
> as
> a very simple transport and prefer to not add methods, it would be nicer
> to
> have just one class and have the AddLuceneWork/DeleteLuceneWork/... to
> differentiate
> by a field (using org.hibernate.search.backend.WorkType ?)
I am open to this approach. I initially created subclasses because the
necessary data was different between works.
Nice, but will keep the option as last resort.
I'm currently happy adding polymorhic methods, without state changes.
>
> to mark the different type of work; so I could add
> the methods I'm needing to the enum.
> Also I could see some use of having an UpdateLuceneWork too, so that it is
> the backend implementation's business to decide if he wants to split it in
> a
> delete+insert or do something more clever:
> the receive order of messages would be less critical and some clever
> optimizations
> could be applied by the backend by reordering received Work(s) or
> repackaging
> several queues in one.
Why would the order of message be less critical? Not sure what you mean by
critical as it's contained in a given work load.
I mean the order inside the same workload is less critical, currently
the updates
are split in two, then removes are done first.. that part.
I would like to use a "reorder engine" to tweak some stuff (is working very
nice
in alpha code locally); if I know it's an update the reorder engine is simpler,
otherwise it needs to detect operations on same entity.
>
> What I've done already:
> a)early division in different queues, basing on affected
> DirectoryProviders
>
> b)refactoring/simplification of Workspace, no longer needed to keep track
> of
> state for different DP as there is only one in the context.
>
> c)shorter Lock times: no threads ever need more than one Lock;
> work is sorted by DP, each lock is released before acquiring the next one.
> (deadlockFreeQueue is removed as not needed anymore)
> before if we needed lock on DP's A,B,C the time of acquisition looked
> like:
> Alock *********
> Block ******
> CLock ***
> now it is more like
> Alock ***
> Block ***
> Clock ***
> And my goal is to make this possible, in separate threads when async:
> Alock ***
> Block ***
> Clock ***
> (not implemented yet: will need a new backend, but I'm preparing the
> common
> stuff to make this possible)
>
> d)The QueueProcessor can ask the Work about if they need an indexwriter,
> indexreader or have any preference about one for when there is
> possibility to make a choice (when we open both a reader and writer
> anyway because of strict requirement of other Work in the same queue).
I partly follow you (a delete can be done by a writer in some situations)
but I don't quite understand why the work should describe that. What do you
gain?
just abstraction, the delete work has a method with some logic to give
the correct
answer, the add work just returns a constant value "I need the indexwriter".
The engine checks all work and then has to accomodate all needs,
and could avoid to open an IR or an IW.
>
>
> e)basing on d), DeleteLuceneWork is able to run either on reader or writer
> (when it's possible to do so, depending on (the number of different
> classes using the same DP) == 1); In this last case the work is able to
> tell it "prefers" to be executed on an IndexWriter, but will be able
> to do it's task with an IndexReader too (or the opposite?)
when would you need to still use the IR approach in that case?
for example if I have more work to do on an IR, but no other work
needing a IW; in this case I prefer to run the deletion by using an IR.
I am thinking to put overloaded methods to perform the work
doit( IR, other needed stuff);
doit( IW, other..);
where the AddWork could throw an unsupportedOperation on the wrong one,
the DeleteWork able to accomplish both (in some situations),
and in the case of IR can do it (internal choice) in two different ways
(depending on configuration) selecting the fastest non-broken option.
>
>
> f)"batch mode" is currently set on all DP if only one Work is of type
> batch,
> the division of Workspace per DP does not need this any more and batch
> mode can be set independently.
good to have the flexibility but I am not sure we will ever need that. This
case should not happen unless you merge queues from different transactions.
agree, not needed. It is more like a side-effect of the context
separation; I was thinking in
changing the code to achieve current behavior but thought it could be a welcomed
side-effect so it staid there.
>
> Another goal I have with this design is the possibility to aggregate
> different committed queues in one, having the possibility to
> optimize away work (insert then delete => noop) considering the original
> order,
hum total ordering is hard (on multi VM) and this case (insert then delete)
is probably very uncommon. (though it could happen if you execute the work
of a whole day at once ; but then you face memory issues to order queues).
Well I didn't implement it, just that after this work it would be
easier to do this
type of optimizations; there are a lot more you can think about.
(as detecting "update, update" on same entity, I suppose this actually happens
more frequently).
> but also call the strategy optimization again
> to reorder the newly created work for best efficiency.
> The final effect would be to obtain the same behavior of
> my custom batch indexer, but optimizing not only indexing from scratch
> but any type of load.
> I hope to not scare you, the resulting code is quite simple and I
> think there are actually less LOC than the current trunk has;
> I've not prepared any special case Test, I just run all existing ones.
let's try and chat on IM around that.
nice; I'm trying already and have organized my work this way:
* some refactoring of existing engine to cleanup code
* test it still is all working
* build an experimental QueueProcessor in a separate package
>
>
> kind regards,
> Sanne