[infinispan-dev] Need help
Pedro Ruivo
pedro at infinispan.org
Mon Oct 7 06:16:14 EDT 2013
On 10/07/2013 10:44 AM, Dan Berindei wrote:
>
>
>
> On Mon, Oct 7, 2013 at 2:30 AM, Sanne Grinovero <sanne at infinispan.org> wrote:
>
> > On 6 October 2013 00:01, Pedro Ruivo <pedro at infinispan.org> wrote:
> > Hi Sanne.
> >
> > Thanks for your comments. please see inline...
> >
> > Cheers,
> > Pedro
> >
> >
> > On 10/05/2013 09:15 PM, Sanne Grinovero wrote:
> >>
> >> Hi Pedro,
> >> looks like you're diving into some good fun :-)
> >> BTW please keep the dev discussions on the mailing list, adding it.
> >>
> >> inline :
> >>
> >> On 4 October 2013 22:01, Pedro Ruivo <pedro at infinispan.org> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Sanne, I need your expertise here. I'm afraid that the problem is in
> >>> FileListOperations :(
> >>> I think the FileListOperations implementation needs a transactional
> >>> cache with strong consistency...
> >>>
> >>> I'm 99% sure that it is the origin of the java.lang.AssertionError:
> >>> file XPTO does not exist. I found out that we have multiple threads
> >>> adding and removing files from the list. In the scenario in [1] we
> >>> see 2 threads loading the key from the cache loader; one thread adds
> >>> a file and the other removes one. The thread that removes is the last
> >>> one to commit, so the file list is updated to an old state. When it
> >>> tries to update an index, I get the assertion error.
> >>
> >>
> >> Nice, looks like you're onto something.
> >> I've never specifically seen an AssertionError; looks like you have
> >> a new test. Could you share it?
> >
> >
> > yes of course:
> >
> > https://github.com/pruivo/infinispan/blob/a4483d08b92d301350823c7fd42725c339a65c7b/query/src/test/java/org/infinispan/query/cacheloaders/CacheStoreTest.java
> >
> > so far, only the tests with eviction are failing...
>
> Some of the failures you're seeing are caused by internal "assert"
> statements in Lucene's code, which have the purpose of verifying that
> the "filesystem" is going to be synched properly.
> These assertions don't apply when using our storage: we don't need
> this synch to happen; in fact, if it weren't for the assertions,
> the whole method would be a no-op, as it ultimately delegates all logic
> to a method in the Infinispan Directory which is a no-op.
>
> In other words, these are misleading failures and we'd need to avoid
> the TestNG "feature" of enabling assertions in this case.
>
> Still, even if the stacktrace is misleading, I agree on your
> diagnosis below.
>
> Could you reproduce the problem without also involving the Query
> framework?
> I'd hope that such a test could be independent and live solely in the
> lucene-directory module; if in practice you can only reproduce it with
> the query module, it makes me suspect that we're actually debugging
> a race condition in the initialization of the two services: a race
> between the query initialization thread needing to check the index
> state (so potentially triggering a load from the cachestore), and the
> thread performing the cachestore preload.
> (I see your test also fails without preload, but I'm just wondering
> whether that might be an additional complexity.)
>
> >>
> >> Let's step back a second and consider the Cache usage from the point
> >> of view of FileListOperations.
> >> Note that even if you have two threads writing at the same time, as
> >> long as they are on the same node they will be adding/removing
> >> elements from the same instance of a ConcurrentHashMap.
> >> Since it's the same instance, it doesn't matter which thread does
> >> the put operation last: it will push the correct state.
> >> (there are assumptions here, but we can forget about them for the
> >> sake of this debugging: same node -> fine as there is an external
> >> lock, no other node is allowed to write at the same time)
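To illustrate the single-node case above, only as a rough sketch (the
cache/key names are made up and a Collections-backed set stands in for
the ConcurrentHashSet; this is not the actual FileListOperations code):

    // One shared set instance: whichever thread's put() lands last, the
    // stored value still reflects both the add and the remove.
    Set<String> fileList =
        Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
    cache.put("fileList", fileList);

    // thread A:
    fileList.add("segment_2.cfs");
    cache.put("fileList", fileList);

    // thread B, concurrently:
    fileList.remove("segment_1.cfs");
    cache.put("fileList", fileList);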
> >>
> >
> > 100% agreed with you, but with a cache store we no longer ensure that
> > 2 (or more) threads are pointing to the same instance of
> > ConcurrentHashSet.
> >
> > With eviction, the entries are removed from the in-memory container
> > and persisted in the cache store. In the scenario I've described, 2
> > threads are trying to add/remove a file and the file list does not
> > exist in-memory. So, each thread will read from the cache store and
> > deserialize the byte array. In the end, each thread will have a
> > pointer to a different instance of ConcurrentHashSet but with the
> > same elements. And when this happens, we lose one of the operations.
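To make the lost update explicit, roughly (hypothetical key and file
names, not the real directory code; "cache" is the metadata cache with
eviction enabled):

    // The file-list key was evicted, so each get() loads it from the
    // store and deserializes its own private copy of the set.
    Set<String> copyA = cache.get("fileList");   // thread A: fresh instance
    Set<String> copyB = cache.get("fileList");   // thread B: another instance

    copyA.add("segment_2.cfs");                  // thread A adds a new file
    cache.put("fileList", copyA);

    copyB.remove("segment_1.cfs");               // thread B removes an old one
    cache.put("fileList", copyB);                // last write wins: copyB never
                                                 // saw segment_2.cfs, so the add
                                                 // is lost and a later reader
                                                 // hits "file ... does not exist"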
>
> I'm seeing more than a couple of different smelly behaviors
> interacting here:
>
> 1## The single instance ConcurrentHashSet
> I guess this could be re-thought, as it's making some strong
> assumptions, but considering this service can't be transactional I'd
> rather explore other solutions first, as I think the following changes
> should be good enough.
>
> 2## Eviction not picking the right entry
> This single key is literally read for each and every query that is
> performed, and for all writes as well: each write writes to this key.
> Even with eviction enabled on the cache, I would never expect
> this key to actually be evicted!
>
> # Action 1: Open an issue to investigate the eviction choice: the
> strategy seems to be doing a very poor job (or maybe it's just that
> maxEntries(10) is too low and makes LIRS degenerate into insane
> choices).
>
>
> That may well be the case: the default concurrency level is 32, so with
> a capacity of 10 you will have just 1 entry per segment. If there is
> more than one entry in the same segment and they are both "hot", they
> will take turns being evicted.
>
> The only thing we could change here is to make LIRS/LRU work for the
> entire container at once, but right now I don't think they're scalable
> enough.
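To make the arithmetic concrete (sketch only, assuming the 6.x-era
configuration builder; maxEntries(10) and LIRS are the values already
mentioned above):

    import org.infinispan.configuration.cache.ConfigurationBuilder;
    import org.infinispan.eviction.EvictionStrategy;

    // 10 entries spread over the default 32 segments leaves less than one
    // entry of headroom per segment, so two hot keys hashing to the same
    // segment keep evicting each other.
    ConfigurationBuilder builder = new ConfigurationBuilder();
    builder.eviction()
           .strategy(EvictionStrategy.LIRS)
           .maxEntries(10);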
>
>
> # Action 2: I think that for now we could disallow usage of eviction
> on the metadata cache. I didn't have tests using it, as I wouldn't
> recommend such a configuration since these entries are very hot and
> very small: is it viable to make it an illegal option?
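Until that is enforced, the safe setup is simply to leave eviction off
on the metadata cache, along these lines (sketch; "LuceneIndexesMetadata"
is, if I recall correctly, the default metadata cache name):

    // Metadata entries are tiny and extremely hot: never evict them.
    ConfigurationBuilder metadata = new ConfigurationBuilder();
    metadata.eviction().strategy(EvictionStrategy.NONE);
    cacheManager.defineConfiguration("LuceneIndexesMetadata", metadata.build());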
>
> 3## The CacheLoader loading the same entry multiple times in parallel
> Kudos for finding out that there are situations in which we deal with
> multiple different instances of ConcurrentHashSet! Still, I think that
> Infinispan core is terribly wrong in this case:
> from the client code POV a new CHS is created with a put-if-absent
> atomic operation, and I would assume that core will check/load
> any enabled cachestore as well.
> To handle multiple GET operations in parallel, or even in parallel
> with preload or the client's put-if-absent operation, I would *not*
> expect Infinispan core to storm the CacheStore implementation with
> multiple load operations for the same key: a lock should be held on
> that key during such a load operation.
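Roughly the kind of guard being described, as a sketch (hypothetical
names, not the real loader code; Store stands in for the CacheStore SPI):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    interface Store<K, V> { V load(K key); }   // stand-in for the CacheStore SPI

    // At most one store load in flight per key, so concurrent gets never
    // end up deserializing their own private copy of the same value.
    class GuardedLoader<K, V> {
        private final ConcurrentMap<K, V> memory = new ConcurrentHashMap<K, V>();
        private final ConcurrentMap<K, Object> locks = new ConcurrentHashMap<K, Object>();
        private final Store<K, V> store;

        GuardedLoader(Store<K, V> store) { this.store = store; }

        V get(K key) {
            V value = memory.get(key);
            if (value != null) return value;
            Object newLock = new Object();
            Object lock = locks.putIfAbsent(key, newLock);
            if (lock == null) lock = newLock;
            synchronized (lock) {
                value = memory.get(key);        // re-check under the per-key lock
                if (value == null) {
                    value = store.load(key);    // only one thread loads this key
                    if (value != null) memory.put(key, value);
                }
                return value;
            }
        }
    }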
>
> If your observation is right, this could also be one of the reasons
> for so many complaints on the CacheStore performance: under these
> circumstances - which I'd guesstimate are quite common - we're doing
> lots of unnecessary IO, potentially stressing the slower storage
> devices. This could even have dramatic effects if there are frequent
> requests for entries for which the returned value is a null.
>
> # Action 3: Investigate and open a JIRA about the missing locking on
> CacheStore load operations.
>
> If this were resolved, we would still have a guarantee of a single
> instance, am I correct?
>
>
> I don't think so.
>
> Consider this scenario:
>
> 1. thread 1: set1 = cache.get(key)
> 2. eviction thread: evict/passivate
> 3. thread 2: set2 = cache.get(key)
>
> We don't know if a thread is still holding a reference to a value during
> eviction, so we can't guarantee that there will always be a single
> instance of that value.
>
> OTOH, if you need this guarantee, and you don't really need eviction,
> wouldn't it be simpler to use a plain ConcurrentHashMap instead?
I don't think so, because it needs to be replicated. The directory only
has one node writing to that key and the remaining nodes are read-only
nodes. They still need to have access to that list. Correct me if I'm
wrong, Sanne.
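To be concrete about the setup I mean (sketch only, assuming the 6.x-era
configuration API):

    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.ConfigurationBuilder;

    // A replicated metadata cache: the single writing node updates the
    // file list and every other node can still read it locally. A plain
    // ConcurrentHashMap on the writer would leave the other nodes with
    // nothing to read.
    ConfigurationBuilder metadata = new ConfigurationBuilder();
    metadata.clustering().cacheMode(CacheMode.REPL_SYNC);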
>
>
>
> >
> > Also, the problem is easily reproduced when you enable storeAsBinary
> > for values, because each cache.get will deserialize the byte array
> > and create different instances.
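For reference, the reproducer only needs something along these lines
(sketch, assuming the storeAsBinary builder of that era):

    // With values stored as binary, every cache.get() deserializes a fresh
    // ConcurrentHashSet from the stored byte[], even without eviction.
    ConfigurationBuilder builder = new ConfigurationBuilder();
    builder.storeAsBinary().enable().storeValuesAsBinary(true);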
>
> That's another option that I've never used on a cache meant to store a
> Lucene index as it doesn't make sense for the purpose.
> But you're right, that's an assumption I hadn't thought of:
>
> # Action 4: What do you think of promoting that to an illegal
> configuration? Created ISPN-3593.
>
> >
> > That's why I think we would need a transaction.
>
> -1, the Directory cannot drive a transaction, as we hope some day in
> the future to include changes to the index in a user transaction.
>
> Cheers,
> Sanne
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>