On 7 October 2013 10:44, Dan Berindei <dan.berindei(a)gmail.com> wrote:
On Mon, Oct 7, 2013 at 2:30 AM, Sanne Grinovero <sanne(a)infinispan.org>
wrote:
>
> On 6 October 2013 00:01, Pedro Ruivo <pedro(a)infinispan.org> wrote:
> > Hi Sanne.
> >
> > Thanks for your comments. please see inline...
> >
> > Cheers,
> > Pedro
> >
> >
> > On 10/05/2013 09:15 PM, Sanne Grinovero wrote:
> >>
> >> Hi Pedro,
> >> looks like you're diving in some good fun :-)
> >> BTW please keep the dev discussions on the mailing list, adding it.
> >>
> >> inline :
> >>
> >> On 4 October 2013 22:01, Pedro Ruivo <pedro(a)infinispan.org> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Sanne I need your expertise in here. I'm afraid that the problem is
in
> >>> FileListOperations :(
> >>> I think the FileListOperations implementation needs a transactional
> >>> cache
> >>> with strong consistency...
> >>>
> >>> I'm 99% sure that it is originating the java.lang.AssertionError:
file
> >>> XPTO
> >>> does not exist. I find out that we have multiple threads adding and
> >>> removing
> >>> files from the list. The scenario in [1] we see 2 threads loading the
> >>> key
> >>> from the cache loader and one thread adds a file and other removes.
> >>> the
> >>> thread that removes is the last one to commit and the file list is
> >>> updated
> >>> to an old state. When it tries to updat an index, I got the assertion
> >>> error.
> >>
> >>
> >> Nice, looks like you're on something.
> >> I've never seen specifically an AssertionError, looks like you have a
> >> new test. Could you share it?
> >
> >
> > yes of course:
> >
> >
https://github.com/pruivo/infinispan/blob/a4483d08b92d301350823c7fd42725c...
> >
> > so far, only the tests with eviction are failing...
>
> Some of the failures you're seeing are caused by internal "assert"
> keyword in Lucene's code, which have the purpose of verifying the
> "filesystem" is going to be synched properly.
> These assertions don't apply when using our storage: we don't need
> this synch to happen: in fact if it weren't because of the assertions
> the whole method would be a no-op as it finally delegates all logic to
> a method in the Infinispan Directory which is a no-op.
>
> In other words, these are misleading failures and we'd need to avoid
> the TestNG "feature" of enabling assertions in this case.
>
> Still, even if the stacktrace is misleading, I agree on your diagnosis
> below.
>
> Could you reproduce the problem without involving also the Query
> framework?
> I'd hope that such a test could be independent and live solely in the
> lucene-directory module; in practice if you can only reproduce it with
> the query module it makes me suspicious that we're actually debugging
> a race condition in the initialization of the two services: a race
> between the query initialization thread needing to check the index
> state (so potentially triggering a load from cachestore), and the
> thread performing the cachestore preload.
> (I see your test also fails without preload, but just wondering if
> that might be an additional complexity).
>
> >>
> >> Let's step back a second and consider the Cache usage from the point
> >> of view of FileListOperations.
> >> Note that even if you have two threads writing at the same time, as
> >> long as they are on the same node they will be adding/removing
> >> elements from the same instance of a ConcurrentHashMap.
> >> Since it's the same instance, it doesn't matter which thread will
do
> >> the put operation as last: it will push the correct state.
> >> (there is an assumptions here, but we can forget about those for the
> >> sake of this debugging: same node -> fine as there is an external
> >> lock, no other node is allowed to write at the same time)
> >>
> >
> > 100% agreed with you but with cache store, we no longer ensure that 2
> > (or
> > more) threads are pointing to the same instance of Concurrent Hash Set.
> >
> > With eviction, the entries are removed from in-memory container and
> > persisted in the cache store. The scenario I've described, 2 threads are
> > trying to add/remove a file and the file list does not exist in-memory.
> > So,
> > each thread will read from cache store and deserialize the byte array.
> > In
> > the end, each thread will have a pointer for different instances of
> > ConcurrentHashSet but with the same elements. And when this happens, we
> > lost
> > one of the operation.
>
> I'm seeing more than a couple of different smelly behaviors interacting
> here:
>
> 1## The single instance ConcurrentHashSet
> I guess this could be re-thought as it's making some strong
> assumptions, but considering this service can't be transactional I'd
> rather explore other solutions first as I think the following changes
> should be good enough.
>
> 2## Eviction not picking the right entry
> This single key is literally read for each and every performed query,
> and all writes as well. Each write, will write on this key.
> Even with eviction being enabled on the cache, I would never expect
> this key to be actually evicted!
>
> # Action 1: Open an issue to investigate the eviction choice: the
> strategy seems to be making a very poor job (or maybe it's just that
> maxEntries(10) is too low and makes LIRS degenerate into insane
> choices).
That well may be the case: the default concurrency level is 32, so with a
capacity of 10 you will have just 1 entry per segment. If there is more than
one entry in the same segment and they are both "hot", they will take turns
being evicted.
The only thing we could change here is to make LIRS/LRU work for the entire
container at once, but right now I don't think they're scalable enough.
thanks Dan, great insight.
No I don't think we need to improve anything to make an unrealistic test work;
should we validate against an unreasonably low maxEntries value?
At least log a warning?
On the LIRS investigation proposal, I assume we already have tests which verify
the choice of which elements to be evicted is reasonable?
> # Action 2: I think that for now we could disallow usage of
eviction
> on the metadata cache. I didn't have tests using it, as I wouldn't
> recommended such a configuration as these entries are very hot and
> very small: viable to make it an illegal option?
>
> 3## The CacheLoader loading the same entry multiple times in parallel
> Kudos for finding out that there are situations in which we deal with
> multiple different instances of ConcurrentHashSet! Still, I think that
> Infinispan core is terribly wrong in this case:
> from the client code POV a new CHS is created with a put-if-absent
> atomic operation, and I will assume there that core will check/load
> any enabled cachestore as well.
> To handle multiple GET operations in parallel, or even in parallel
> with preload or the client's put-if-absent operation, I would *not*
> expect Infinispan core to storm the CacheStore implementation with
> multiple load operations on the same put: a lock should be hold on the
> potential key during such a load operation.
>
> If your observation is right, this could also be one of the reasons
> for so many complaints on the CacheStore performance: under these
> circumstances - which I'd guesstimate are quite common - we're doing
> lots of unnecessary IO, potentially stressing the slower storage
> devices. This could even have dramatic effects if there are frequent
> requests for entries for which the returned value is a null.
>
> # Action 3: Investigate and open a JIRA about the missing locking on
> CacheStore load operations.
>
> If this where resolves, we would still have a guarantee of single
> instance, am I correct?
I don't think so.
Consider this scenario:
1. thread 1: set1 = cache.get(key)
2. eviction thread: evict/passivate
3. thread 2: set2 = cache.get(key)
We don't know if a thread is still holding a reference to a value during
eviction, so we can't guarantee that there will always be a single instance
of that value.
OTOH, if you need this guarantee, and you don't really need eviction,
wouldn't it be simpler to use a plain ConcurrentHashMap instead?
I wish I could, but the CHM instance needs to be shared with other
Directory instances
running on the same Cache so it needs to be retrieved at least once
from the Cache, and I need
any new change to update the copy being stored in a CacheStore, so
that in case of crash we
can reload the correct state.
Your description is indeed an example in which this wouldn't work; the
Directory code is
not holding a reference to it for longer than the method scope but
indeed that might be too long :)
Let's disallow eviction, that should be good enough.
Alternatively we could register the CHM in an internal service to
guarantee unique in-memory instance
across Directory instances sharing the same cache.. but I don't think
there is need for eviction on the
metadata.
Sanne
>
> >
> > Also, the problem is easily reproduce when you enable the storeAsBinary
> > for
> > values because each cache.get will deserialize the byte array and create
> > different instances.
>
> That's another option that I've never used on a cache meant to store a
> Lucene index as it doesn't make sense for the purpose.
> But you're right that's an assumption I haven't thought of:
>
> # Action 4: What do you think of promoting that as an illegal
> configuration? created ISPN-3593
>
> >
> > That's why I think we would need a transaction.
>
> -1 the Directory can not drive a transaction as we hope some day in
> the future to include changes to the index in a user transaction.
>
> Cheers,
> Sanne
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev(a)lists.jboss.org
>
https://lists.jboss.org/mailman/listinfo/infinispan-dev
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev