[infinispan-dev] Need help
Pedro Ruivo
pedro at infinispan.org
Mon Oct 7 06:16:14 EDT 2013
On 10/07/2013 10:44 AM, Dan Berindei wrote:
>
>
>
> On Mon, Oct 7, 2013 at 2:30 AM, Sanne Grinovero <sanne at infinispan.org> wrote:
>
> > On 6 October 2013 00:01, Pedro Ruivo <pedro at infinispan.org> wrote:
> > Hi Sanne.
> >
> > Thanks for your comments. please see inline...
> >
> > Cheers,
> > Pedro
> >
> >
> > On 10/05/2013 09:15 PM, Sanne Grinovero wrote:
> >>
> >> Hi Pedro,
> >> looks like you're diving into some good fun :-)
> >> BTW please keep the dev discussions on the mailing list, adding it.
> >>
> >> inline :
> >>
> >> On 4 October 2013 22:01, Pedro Ruivo <pedro at infinispan.org> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Sanne, I need your expertise here. I'm afraid that the problem is in
> >>> FileListOperations :(
> >>> I think the FileListOperations implementation needs a transactional
> >>> cache with strong consistency...
> >>>
> >>> I'm 99% sure that it is the origin of the java.lang.AssertionError:
> >>> file XPTO does not exist. I found out that we have multiple threads
> >>> adding and removing files from the list. In the scenario in [1] we
> >>> see 2 threads loading the key from the cache loader; one thread adds
> >>> a file and the other removes one. The thread that removes is the last
> >>> one to commit, so the file list is updated to an old state. When it
> >>> tries to update an index, I get the assertion error.
> >>
> >>
> >> Nice, looks like you're onto something.
> >> I've never specifically seen an AssertionError; looks like you have
> >> a new test. Could you share it?
> >
> >
> > yes of course:
> >
> > https://github.com/pruivo/infinispan/blob/a4483d08b92d301350823c7fd42725c339a65c7b/query/src/test/java/org/infinispan/query/cacheloaders/CacheStoreTest.java
> >
> > so far, only the tests with eviction are failing...
>
> Some of the failures you're seeing are caused by internal "assert"
> statements in Lucene's code, which have the purpose of verifying that
> the "filesystem" is going to be synched properly.
> These assertions don't apply when using our storage: we don't need
> this synch to happen; in fact, if it weren't for the assertions,
> the whole method would be a no-op, as it ultimately delegates all logic
> to a method in the Infinispan Directory which is a no-op.
>
> In other words, these are misleading failures and we'd need to avoid
> the TestNG "feature" of enabling assertions in this case.
>
> Still, even if the stacktrace is misleading, I agree on your
> diagnosis below.
>
> Could you reproduce the problem without also involving the Query
> framework?
> I'd hope that such a test could be independent and live solely in the
> lucene-directory module; if in practice you can only reproduce it with
> the query module, it makes me suspect that we're actually debugging
> a race condition in the initialization of the two services: a race
> between the query initialization thread needing to check the index
> state (so potentially triggering a load from the cachestore), and the
> thread performing the cachestore preload.
> (I see your test also fails without preload, but I'm just wondering
> whether that might be an additional complexity.)
>
> >>
> >> Let's step back a second and consider the Cache usage from the point
> >> of view of FileListOperations.
> >> Note that even if you have two threads writing at the same time, as
> >> long as they are on the same node they will be adding/removing
> >> elements from the same instance of a ConcurrentHashMap.
> >> Since it's the same instance, it doesn't matter which thread does
> >> the put operation last: it will push the correct state.
> >> (there are assumptions here, but we can forget about them for the
> >> sake of this debugging: same node -> fine as there is an external
> >> lock, no other node is allowed to write at the same time)
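To illustrate the single-node case above, only as a rough sketch (the
cache/key names are made up and a Collections-backed set stands in for
the ConcurrentHashSet; this is not the actual FileListOperations code):

    // One shared set instance: whichever thread's put() lands last, the
    // stored value still reflects both the add and the remove.
    Set<String> fileList =
        Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
    cache.put("fileList", fileList);

    // thread A:
    fileList.add("segment_2.cfs");
    cache.put("fileList", fileList);

    // thread B, concurrently:
    fileList.remove("segment_1.cfs");
    cache.put("fileList", fileList);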
> >>
> >
> > 100% agreed with you, but with a cache store we no longer ensure that
> > 2 (or more) threads are pointing to the same instance of
> > ConcurrentHashSet.
> >
> > With eviction, the entries are removed from the in-memory container
> > and persisted in the cache store. In the scenario I've described, 2
> > threads are trying to add/remove a file and the file list does not
> > exist in-memory. So, each thread will read from the cache store and
> > deserialize the byte array. In the end, each thread will have a
> > pointer to a different instance of ConcurrentHashSet but with the
> > same elements. And when this happens, we lose one of the operations.
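To make the lost update explicit, roughly (hypothetical key and file
names, not the real directory code; "cache" is the metadata cache with
eviction enabled):

    // The file-list key was evicted, so each get() loads it from the
    // store and deserializes its own private copy of the set.
    Set<String> copyA = cache.get("fileList");   // thread A: fresh instance
    Set<String> copyB = cache.get("fileList");   // thread B: another instance

    copyA.add("segment_2.cfs");                  // thread A adds a new file
    cache.put("fileList", copyA);

    copyB.remove("segment_1.cfs");               // thread B removes an old one
    cache.put("fileList", copyB);                // last write wins: copyB never
                                                 // saw segment_2.cfs, so the add
                                                 // is lost and a later reader
                                                 // hits "file ... does not exist"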
>
> I'm seeing more than a couple of different smelly behaviors
> interacting here:
>
> 1## The single instance ConcurrentHashSet
> I guess this could be re-thought, as it's making some strong
> assumptions, but considering this service can't be transactional I'd
> rather explore other solutions first, as I think the following changes
> should be good enough.
>
> 2## Eviction not picking the right entry
> This single key is literally read for each and every query that is
> performed, and for all writes as well: each write writes to this key.
> Even with eviction enabled on the cache, I would never expect
> this key to actually be evicted!
>
> # Action 1: Open an issue to investigate the eviction choice: the
> strategy seems to be doing a very poor job (or maybe it's just that
> maxEntries(10) is too low and makes LIRS degenerate into insane
> choices).
>
>
> That may well be the case: the default concurrency level is 32, so with
> a capacity of 10 you will have just 1 entry per segment. If there is
> more than one entry in the same segment and they are both "hot", they
> will take turns being evicted.
>
> The only thing we could change here is to make LIRS/LRU work for the
> entire container at once, but right now I don't think they're scalable
> enough.
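To make the arithmetic concrete (sketch only, assuming the 6.x-era
configuration builder; maxEntries(10) and LIRS are the values already
mentioned above):

    import org.infinispan.configuration.cache.ConfigurationBuilder;
    import org.infinispan.eviction.EvictionStrategy;

    // 10 entries spread over the default 32 segments leaves less than one
    // entry of headroom per segment, so two hot keys hashing to the same
    // segment keep evicting each other.
    ConfigurationBuilder builder = new ConfigurationBuilder();
    builder.eviction()
           .strategy(EvictionStrategy.LIRS)
           .maxEntries(10);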
>
>
> # Action 2: I think that for now we could disallow usage of eviction
> on the metadata cache. I didn't have tests using it, as I wouldn't
> recommend such a configuration since these entries are very hot and
> very small: is it viable to make it an illegal option?
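Until that is enforced, the safe setup is simply to leave eviction off
on the metadata cache, along these lines (sketch; "LuceneIndexesMetadata"
is, if I recall correctly, the default metadata cache name):

    // Metadata entries are tiny and extremely hot: never evict them.
    ConfigurationBuilder metadata = new ConfigurationBuilder();
    metadata.eviction().strategy(EvictionStrategy.NONE);
    cacheManager.defineConfiguration("LuceneIndexesMetadata", metadata.build());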
>
> 3## The CacheLoader loading the same entry multiple times in parallel
> Kudos for finding out that there are situations in which we deal with
> multiple different instances of ConcurrentHashSet! Still, I think that
> Infinispan core is terribly wrong in this case:
> from the client code POV a new CHS is created with a put-if-absent
> atomic operation, and I would assume that core will check/load
> any enabled cachestore as well.
> To handle multiple GET operations in parallel, or even in parallel
> with preload or the client's put-if-absent operation, I would *not*
> expect Infinispan core to storm the CacheStore implementation with
> multiple load operations for the same key: a lock should be held on
> that key during such a load operation.
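Roughly the kind of guard being described, as a sketch (hypothetical
names, not the real loader code; Store stands in for the CacheStore SPI):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    interface Store<K, V> { V load(K key); }   // stand-in for the CacheStore SPI

    // At most one store load in flight per key, so concurrent gets never
    // end up deserializing their own private copy of the same value.
    class GuardedLoader<K, V> {
        private final ConcurrentMap<K, V> memory = new ConcurrentHashMap<K, V>();
        private final ConcurrentMap<K, Object> locks = new ConcurrentHashMap<K, Object>();
        private final Store<K, V> store;

        GuardedLoader(Store<K, V> store) { this.store = store; }

        V get(K key) {
            V value = memory.get(key);
            if (value != null) return value;
            Object newLock = new Object();
            Object lock = locks.putIfAbsent(key, newLock);
            if (lock == null) lock = newLock;
            synchronized (lock) {
                value = memory.get(key);        // re-check under the per-key lock
                if (value == null) {
                    value = store.load(key);    // only one thread loads this key
                    if (value != null) memory.put(key, value);
                }
                return value;
            }
        }
    }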
>
> If your observation is right, this could also be one of the reasons
> for so many complaints on the CacheStore performance: under these
> circumstances - which I'd guesstimate are quite common - we're doing
> lots of unnecessary IO, potentially stressing the slower storage
> devices. This could even have dramatic effects if there are frequent
> requests for entries for which the returned value is a null.
>
> # Action 3: Investigate and open a JIRA about the missing locking on
> CacheStore load operations.
>
> If this were resolved, we would still have a guarantee of a single
> instance, am I correct?
>
>
> I don't think so.
>
> Consider this scenario:
>
> 1. thread 1: set1 = cache.get(key)
> 2. eviction thread: evict/passivate
> 3. thread 2: set2 = cache.get(key)
>
> We don't know if a thread is still holding a reference to a value during
> eviction, so we can't guarantee that there will always be a single
> instance of that value.
>
> OTOH, if you need this guarantee, and you don't really need eviction,
> wouldn't it be simpler to use a plain ConcurrentHashMap instead?
I don't think so, because it needs to be replicated. The directory only
has one node writing to that key and the remaining nodes are read-only
nodes. They still need to have access to that list. Correct me if I'm
wrong, Sanne.
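To be concrete about the setup I mean (sketch only, assuming the 6.x-era
configuration API):

    import org.infinispan.configuration.cache.CacheMode;
    import org.infinispan.configuration.cache.ConfigurationBuilder;

    // A replicated metadata cache: the single writing node updates the
    // file list and every other node can still read it locally. A plain
    // ConcurrentHashMap on the writer would leave the other nodes with
    // nothing to read.
    ConfigurationBuilder metadata = new ConfigurationBuilder();
    metadata.clustering().cacheMode(CacheMode.REPL_SYNC);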
>
>
>
> >
> > Also, the problem is easily reproduced when you enable storeAsBinary
> > for values, because each cache.get will deserialize the byte array
> > and create different instances.
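For reference, the reproducer only needs something along these lines
(sketch, assuming the storeAsBinary builder of that era):

    // With values stored as binary, every cache.get() deserializes a fresh
    // ConcurrentHashSet from the stored byte[], even without eviction.
    ConfigurationBuilder builder = new ConfigurationBuilder();
    builder.storeAsBinary().enable().storeValuesAsBinary(true);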
>
> That's another option that I've never used on a cache meant to store a
> Lucene index as it doesn't make sense for the purpose.
> But you're right, that's an assumption I hadn't thought of:
>
> # Action 4: What do you think of promoting that to an illegal
> configuration? Created ISPN-3593.
>
> >
> > That's why I think we would need a transaction.
>
> -1, the Directory cannot drive a transaction, as we hope some day in
> the future to include changes to the index in a user transaction.
>
> Cheers,
> Sanne
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>