[infinispan-dev] Stale locks: stress test for the ReplaceCommand

Tue Oct 30 10:00:51 EDT 2012

Hello,
I just pushed the branch ReplaceOperationStressTest to my GitHub
repository; this started as a test to verify the correctness of the
Cache#replace(Object, Object, Object) operation,
but it wouldn't work because of lock timeouts under high load.

I initially assumed that I was hitting the same issue Manik was
working on; still, I refactored the test to keep the stress point on
the concurrent writes, but to use a cyclic barrier to give some fairness
and enough time to each thread between test triggers.

So now this test is doing, in very simplified pseudo code:

for ( each CacheMode ) {
   for ( many thousands of iterations ) {
       1# many threads wait for each other
       2# each thread picks a cache instance (from different
          CacheManagers connected to each other, unless it's LOCAL)
       3# each thread attempts a valid replace() operation on the chosen cache
       4# each thread waits until every other thread is done with
          the replace, then we run state checks
   }
}

With this pattern we "shoot" all threads at the replace()
operation at the same time and then wait, so I know for sure that
contention on the key is not going to last longer than the time needed
to perform the single operation, and each thread gets plenty of
fair time to acquire the lock.
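
The barrier pattern described above can be sketched in plain Java as
follows. This is only a minimal sketch, not the actual
ReplaceOperationStressTest: it assumes a java.util.concurrent
ConcurrentHashMap as a stand-in for the Infinispan cache (so there are
no CacheManagers and no cache modes), and the class name
ReplaceBarrierSketch, the "v<N>" value scheme and the thread and
iteration counts are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a ConcurrentHashMap stands in for the clustered cache.
class ReplaceBarrierSketch {

    static final String KEY = "thisIsTheKeyForConcurrentAccess";

    /** Returns the number of iterations in which the state check failed. */
    static int run(int threads, int iterations) throws Exception {
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        cache.put(KEY, "v0");

        AtomicInteger winners = new AtomicInteger();
        AtomicInteger brokenIterations = new AtomicInteger();

        // Step 1: all threads wait for each other before the replace().
        CyclicBarrier start = new CyclicBarrier(threads);
        // Step 4: all threads wait again; the barrier action runs once per
        // iteration, after every replace() has completed, and does the check.
        CyclicBarrier done = new CyclicBarrier(threads, () -> {
            // Exactly one thread may win a conditional replace per iteration.
            if (winners.getAndSet(0) != 1) {
                brokenIterations.incrementAndGet();
            }
        });

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(() -> {
                    for (int i = 0; i < iterations; i++) {
                        start.await();
                        // Step 3: a valid replace; the expected old value is
                        // the one left behind by the previous iteration.
                        if (cache.replace(KEY, "v" + i, "v" + (i + 1))) {
                            winners.incrementAndGet();
                        }
                        done.await();
                    }
                    return null; // makes the lambda a Callable
                }));
            }
            for (Future<?> f : futures) {
                f.get(60, TimeUnit.SECONDS); // surfaces worker exceptions
            }
        } finally {
            pool.shutdownNow();
        }
        return brokenIterations.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("broken iterations: " + run(9, 1000));
    }
}
```

Against a correct map implementation every iteration should see
exactly one successful replace(), so run() is expected to return 0; in
the real test the same check is what flags a broken replace().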

Now the bad news: not only does this prove that the replace()
operation is equally broken in every Cache Mode, but the test also often
fails because some of the threads throw:

org.infinispan.util.concurrent.TimeoutException: Unable to acquire
lock after [10 seconds] on key [thisIsTheKeyForConcurrentAccess] for
requestor [Thread[OOB-4,ISPN,ReplaceOperationStressTest-NodeO-1000,5,Thread
Pools]]! Lock held by [Thread[pool-35-thread-2,5,main]]
	at org.infinispan.util.concurrent.locks.LockManagerImpl.lock(LockManagerImpl.java:217)
	at org.infinispan.util.concurrent.locks.LockManagerImpl.acquireLockNoCheck(LockManagerImpl.java:200)
	at org.infinispan.interceptors.locking.AbstractLockingInterceptor.lockKey(AbstractLockingInterceptor.java:115)
	at org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor.visitReplaceCommand(NonTransactionalLockingInterceptor.java:118)
	at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:66)

and I have no other explanation than that locks aren't always released.

I'm not running too many threads: I'm currently using 9 threads
picking among 5 clustered CacheManagers, but it fails with 2 too. It
doesn't take many cycles to fail either: in some cluster
modes it often fails at the first loop iteration (which initially
misled me into thinking some modes worked fine; that was just my test
not being safe enough).

Funnily enough, while I was writing this a run just failed even in
single-thread mode: in one iteration the test spotted that the lock
wasn't cleaned up; this was REPL_SYNC+TX. I don't think the CacheMode
was relevant; more likely this failure is quite rare and the number of
iterations isn't high enough to certify correctness of all other
modes. Still, it's annoying that apparently it's not even deterministic
with a single thread.

Anyone available to help me out? And please have a look at my test; I
might be making some mistake.

Cheers,
Sanne