[infinispan-dev] Feedback on Infinispan patch

Tue Sep 22 18:27:22 EDT 2009

Hi Łukasz,
I can't tell whether you are referring to real exceptions: have you
actually seen a stacktrace or verified a problem, or are you
considering it as a possibility?
I don't think the initial design is wrong, because IndexReaders are
designed to be used concurrently with an IndexWriter making changes on
the filesystem.
Additionally, Infinispan is transactional and defaults to
READ_COMMITTED; you could change this to REPEATABLE_READ to freeze
the world from an IndexReader's point of view, but this shouldn't be
needed (and is actually harmful: more below).
The IndexReader has its own transaction, or could be sharing one with
the writer when both run in the same thread, but in that case there's no
concurrency. So assuming the IR is running in its own transaction, it
will not be affected by any change.

It's legal with Lucene to use commit() more than once on the same
IndexWriter (think of it as a "flush"); obviously it's not legal to
commit an Infinispan (or bound CMT) transaction more than once, so if
you end up using Infinispan.commit at each IW.commit, make sure you
also start a new Infinispan transaction to correctly wrap any further
changes.
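
To illustrate the point above, here is a minimal sketch in plain Java,
with no Lucene or Infinispan dependency. OneShotTx and TxPerCommitWriter
are hypothetical names made up for this example; OneShotTx only models
the rule that a transaction can be committed once. With a real cache the
begin and commit calls would presumably go through the JTA
TransactionManager instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a cache transaction: it may be committed only once. */
final class OneShotTx {
    private boolean committed;

    void commit() {
        if (committed) {
            throw new IllegalStateException("transaction already committed");
        }
        committed = true;
    }

    boolean isCommitted() {
        return committed;
    }
}

/** Hypothetical writer wrapper: each commit() ends the current transaction and opens a new one. */
final class TxPerCommitWriter {
    private final List<OneShotTx> history = new ArrayList<>();
    private OneShotTx current = begin();

    private OneShotTx begin() {
        OneShotTx tx = new OneShotTx();
        history.add(tx);
        return tx;
    }

    /** Plays the role of IndexWriter.commit(): may be called repeatedly, like a flush. */
    void commit() {
        current.commit();  // make the changes written so far visible
        current = begin(); // wrap whatever the writer does next
    }

    /** Plays the role of IndexWriter.close(): commits pending work, opens nothing new. */
    void close() {
        current.commit();
        current = null;
    }

    int transactionsStarted() {
        return history.size();
    }

    long transactionsCommitted() {
        return history.stream().filter(OneShotTx::isCommitted).count();
    }

    public static void main(String[] args) {
        TxPerCommitWriter writer = new TxPerCommitWriter();
        writer.commit(); // first flush
        writer.commit(); // second flush, legal because a fresh transaction was opened
        writer.close();
        System.out.println(writer.transactionsStarted() + " started, "
                + writer.transactionsCommitted() + " committed");
    }
}
```

Without the begin() call inside commit(), the second writer.commit()
would hit the already committed transaction and fail, which is the
situation described above.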

Thinking about how "normal" FS-based Lucene works: an IndexWriter
may make any change while, concurrently, any IndexReader is used
for queries. This works because the general contract of an IW is to not
change existing segments but to add new ones (unless it is forced to merge
segments or optimize, but again the IR is designed to work in this
case).

Looking into org.apache.lucene.index.MultiSegmentReader, line 53,
there is one remote possibility of a problem: the reader needs to load
more than one resource atomically (I mean all coming from the same
index version). I don't know if it's possible to do this with
Infinispan (I believe not?). As you can see, the chances of this
happening are minimal, but if it did happen the reader would just blow
up and there would be no solution for it. Additionally, when using
Hibernate Search it can't happen, as the index is only opened once at
the framework's startup (see SharingBufferReaderProvider), before any
change to the index is possible (and after that it uses reopen(),
which is not affected).
Theoretically it could happen when a new node joins an existing
cluster, but that is very unlikely and we could implement a retry strategy.
The SharingBufferReaderProvider is also the reason to avoid
REPEATABLE_READ: the IndexReader needs to be able to read a fresh copy
of SegmentInfos; you might use this isolation level with the simpler
NotSharedReaderProvider.
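
A retry strategy like the one mentioned above could look roughly like
this. It is a minimal sketch in plain Java: IoAction, RetryingOpen and
openWithRetry are hypothetical names, and the assumption is that a
reader opened across two index versions fails with an IOException (for
example a missing file), so that simply opening again is enough.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

/** Hypothetical callback for "open the index reader"; not a Lucene type. */
interface IoAction<T> {
    T run() throws IOException;
}

final class RetryingOpen {
    private RetryingOpen() {
    }

    /**
     * Runs the action until it succeeds or maxAttempts is reached.
     * Only IOException is retried; the last one is rethrown on exhaustion.
     */
    static <T> T openWithRetry(IoAction<T> open, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return open.run();
            } catch (IOException e) {
                // Assumed cause: resources were loaded from two different
                // index versions; a fresh attempt should see a consistent one.
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        String result = openWithRetry(() -> {
            if (calls[0]++ == 0) {
                throw new FileNotFoundException("simulated: file from an older index version");
            }
            return "opened";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempt(s)");
    }
}
```

In real code the lambda would wrap whatever call opens the IndexReader
on the Infinispan-backed Directory; a short pause between attempts
might also be worth adding.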

Of course this is theory; in practice, if you have a failing test, please share it!

Sanne

2009/9/22 Łukasz Moreń <lukasz.moren at gmail.com>:
> I need to provide the same lifecycle for the IndexWriter as for the Infinispan tx:
> IW is created: tx is started; IW is committed: tx is committed. This ensures
> that the IndexReader doesn't read old data from the directory.
> The Infinispan transaction can be started when the IW acquires the lock, but
> committing it on IW lock release, as is done so far, causes a problem:
>
> index writer close {
>   index writer commit(); // changes are visible to IndexReaders
>
>        // IndexReader starts reading here, i.e. tries to access file "A"
>
>   index writer lockRelease(); // changes in the Infinispan directory are
>                               // committed, file "A" was removed, the
>                               // IndexReader cannot find it and crashes
> }
>
> I think the Infinispan tx has to be committed just before the IW commit, and the
> problem is where to put this in the code.
>
> On 22 September 2009 18:24, Emmanuel Bernard
> <emmanuel at hibernate.org> wrote:
>>
>> Can you explain in more detail what is going on?
>> Aside from that, Workspace has been Sanne's baby lately, so he will be the
>> best to see what design will work in HSearch. That being said, I don't like
>> the idea of subclassing / overriding very much. In my experience, it has
>> led to more bad and unmaintainable code than anything else.
>> On 22 sept. 09, at 02:16, Łukasz Moreń wrote:
>>
>> Hi,
>>
>> Thanks for the explanation.
>> Maybe it's better that I concentrate on the first release and postpone
>> distributed writing.
>>
>> There is already a LockStrategy that uses Infinispan. Using it, I was
>> wrapping the changes made by the IndexWriter in an Infinispan transaction,
>> for performance reasons:
>> on lock acquisition the transaction was started, on lock release it was
>> committed. However, committing the Ispn transaction on lock release is not
>> a good idea, since the IndexWriter calls index commit before the lock is
>> released (and the ispn transaction is committed).
>> I was thinking of overriding the Workspace class and its getIndexWriter
>> (start infinispan tx) and commitIndexWriter (commit tx) methods to wrap the
>> IndexWriter lifecycle, but this needs a few other changes. Any other ideas?
>>
>> Cheers,
>> Lukasz
>>
>> 2009/9/21 Sanne Grinovero <sanne.grinovero at gmail.com>
>>>
>>> Hi Łukasz,
>>> your concerns are justified: the way the IndexWriter tries to
>>> acquire the lock will bring some trouble. As far as I remember, we
>>> decided in this first release to avoid multiple writer nodes for
>>> this reason (is that written in your docs?)
>>>
>>> Actually it shouldn't be very hard to do, as the LockStrategy is
>>> pluggable (see changes from HSEARCH-345),
>>> and you could implement one delegating to an Infinispan eager lock on
>>> some key, just as the default LockStrategy takes a file lock in the
>>> index directory.
>>>
>>> Maybe it's simpler to support this distributed writing instead of
>>> sending the queue to some single (elected) node? It would be cool, as
>>> the Document Analysis effort would be distributed, but I have no idea
>>> whether this would be more or less efficient than a single node
>>> writing; it could cause some huge data transfers over the wire during
>>> segment merging (basically fetching the whole index data at each node
>>> performing a segment merge). Maybe you'll need to play with the
>>> IndexWriter settings:
>>> http://docs.jboss.org/hibernate/stable/search/reference/en/html_single/#lucene-indexing-performance
>>> You'll probably need to find the sweet spot for "merge_factor".
>>> I just noticed that MergePolicy is now re-implementable, but I hope
>>> that won't be needed.
>>>
>>> Sanne
>>>
>>> 2009/9/21 Łukasz Moreń <lukasz.moren at gmail.com>:
>>> > Hi,
>>> >
>>> > I'm wondering whether it is reasonable to have multiple threads/nodes
>>> > that modify indexes in a Lucene Directory based on Infinispan. Let's
>>> > assume that two nodes try to update the index at the same time. The
>>> > first one creates an IndexWriter and obtains the write lock. There is
>>> > a high probability that the second node throws a
>>> > LockObtainFailedException (as only one IndexWriter is allowed on a
>>> > single index) and the index is not modified. How should this be
>>> > handled? Should there always be only one node that makes changes in
>>> > the index?
>>> >
>>> > Cheers,
>>> > Lukasz
>>> >
>>> > On 15 September 2009 01:39, Łukasz Moreń
>>> > <lukasz.moren at gmail.com> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> Using JMeter, I wanted to check that the Infinispan dir does not crash
>>> >> under heavy load in "real" use, and to compare its performance with
>>> >> none/other directories.
>>> >> However, a problem appeared when multiple IndexWriters try to modify
>>> >> the index (test InfinispanDirectoryTest): random deadlocks and Lucene
>>> >> exceptions.
>>> >> An IndexWriter tries to access files in the index that were removed
>>> >> before. I'm looking into it, but don't have a good idea yet.
>>> >>
>>> >> Concerning the last part, I think a similar thing is done in
>>> >> InfinispanDirectoryProviderTest. Many threads are making changes and
>>> >> searching (not checking if the db is in sync with the index).
>>> >> When the threads finish their work, I check with a Lucene query whether
>>> >> the index contains as many results as expected. Maybe you meant
>>> >> something else?
>>> >> It would be good to run each node in a different VM.
>>> >>
>>> >>> Great ! Looking forward to it. What state are things in at the moment
>>> >>> if I want to play around with it ?
>>> >>
>>> >> It should work with one master (updates the index) and one or many
>>> >> slave nodes (send changes to the master). I tried with one master and
>>> >> one slave (both with the jms and jgroups backends) and it worked ok.
>>> >> It still fails if multiple nodes want to modify the index.
>>> >>
>>> >> I've attached a patch with the current version.
>>> >>
>>> >> Cheers,
>>> >> Łukasz
>>> >>
>>> >> 2009/9/13 Michael Neale <michael.neale at gmail.com>
>>> >>>
>>> >>> Great ! Looking forward to it. What state are things in at the moment
>>> >>> if I want to play around with it ?
>>> >>>
>>> >>> Sent from my phone.
>>> >>>
>>> >>> On 13/09/2009, at 7:26 PM, Sanne Grinovero
>>> >>> <sanne.grinovero at gmail.com>
>>> >>> wrote:
>>> >>>
>>> >>> > 2009/9/12 Michael Neale <michael.neale at gmail.com>:
>>> >>> >> That does sound pretty cool. Would be nice if the Lucene indexes
>>> >>> >> could scale along with how people will want to use Infinispan.
>>> >>> >> Probably worth playing with.
>>> >>> >
>>> >>> > Sure, this is the goal of Łukasz's work. We know Compass has
>>> >>> > some good Directories, but we're building our own as one based
>>> >>> > on Infinispan is not yet available.
>>> >>> >
>>> >>> >>
>>> >>> >> Sent from my phone.
>>> >>> >>
>>> >>> >> On 13/09/2009, at 8:37 AM, Jeff Ramsdale <jeff.ramsdale at gmail.com>
>>> >>> >> wrote:
>>> >>> >>
>>> >>> >>> I'm afraid I haven't followed the Infinispan-Lucene
>>> >>> >>> implementation
>>> >>> >>> closely, but have you looked at the Compass Project?
>>> >>> >>> (http://www.compass-project.org/overview.html) It provides a
>>> >>> >>> simplified interface to Lucene (optional) as well as Directory
>>> >>> >>> implementations built on Terracotta, Gigaspaces and Coherence.
>>> >>> >>> The
>>> >>> >>> latter, in particular, might be a useful guide for the Infinispan
>>> >>> >>> implementation. I believe it's mature enough to have solved many
>>> >>> >>> of
>>> >>> >>> the most difficult problems of implementing Directory on a
>>> >>> >>> distributed
>>> >>> >>> Map.
>>> >>> >>>
>>> >>> >>> If someone has any experience with Compass (particularly its
>>> >>> >>> Directory implementations) I'd be interested in hearing about
>>> >>> >>> it...
>>> >>> >>> It's Apache 2.0 licensed, btw.
>>> >>> >>>
>>> >>> >>> -jeff
>>> >>> >>> _______________________________________________
>>> >>> >>> infinispan-dev mailing list
>>> >>> >>> infinispan-dev at lists.jboss.org
>>> >>> >>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>> >>> >>
>>> >>> >
>>> >>>