Stale locks: stress test for the ReplaceCommand
by Sanne Grinovero
Hello,
I just pushed branch ReplaceOperationStressTest to my github
repository; this started initially as a way to verify the correctness
of the Cache#replace(Object, Object, Object) operation,
but it wouldn't work because of lock timeouts under high load.
I initially assumed I was hitting the same issue Manik was
working on; still, I refactored the test to keep the stress point on
the concurrent writes, but use a CyclicBarrier to give each thread
some fairness and enough time between each test trigger.
So now this test does, in very simplified pseudo code:

for ( each CacheMode ) {
   for ( many thousands of iterations ) {
      1# all threads wait for each other
      2# each thread picks a cache instance (from different
         CacheManagers connected to each other, unless it's LOCAL)
      3# each thread attempts a valid replace() operation on the chosen cache
      4# each thread waits until every other thread is done with the
         replace, then we run the state checks
   }
}
With this pattern we "shoot" all threads at the replace()
operation at the same time and then wait, so I know for sure that
contention on the key won't last longer than the time needed
to perform the single operation, and each thread then gets plenty of
fair time to acquire the lock.
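In Java terms the barrier scheduling boils down to roughly the
following (a minimal sketch only: the class name, the iteration count
and the doReplace() placeholder are mine, not the actual test code):

import java.util.concurrent.Callable;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReplaceBarrierSketch {

   public static void main(String[] args) throws Exception {
      final int threads = 9;
      final CyclicBarrier barrier = new CyclicBarrier(threads);
      ExecutorService pool = Executors.newFixedThreadPool(threads);

      for (int t = 0; t < threads; t++) {
         pool.submit(new Callable<Void>() {
            public Void call() throws Exception {
               for (int i = 0; i < 1000; i++) {
                  barrier.await();  // 1# all threads start the replace() together
                  doReplace();      // 2#/3# pick a cache, attempt a valid replace()
                  barrier.await();  // 4# everyone is done before the state checks run
               }
               return null;
            }
         });
      }
      pool.shutdown();
   }

   // Placeholder for cache.replace(key, expectedValue, newValue) on the chosen cache
   static void doReplace() { }
}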
Now the bad news: not only is this proving that the replace()
operation is equally broken in every Cache Mode, but it also often
fails because some of the threads throw:
org.infinispan.util.concurrent.TimeoutException: Unable to acquire
lock after [10 seconds] on key [thisIsTheKeyForConcurrentAccess] for
requestor [Thread[OOB-4,ISPN,ReplaceOperationStressTest-NodeO-1000,5,Thread
Pools]]! Lock held by [Thread[pool-35-thread-2,5,main]]
at org.infinispan.util.concurrent.locks.LockManagerImpl.lock(LockManagerImpl.java:217)
at org.infinispan.util.concurrent.locks.LockManagerImpl.acquireLockNoCheck(LockManagerImpl.java:200)
at org.infinispan.interceptors.locking.AbstractLockingInterceptor.lockKey(AbstractLockingInterceptor.java:115)
at org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor.visitReplaceCommand(NonTransactionalLockingInterceptor.java:118)
at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:66)
and I have no other explanation than that locks aren't always released.
I'm not running too many threads: I'm currently using 9 threads
picking among 5 clustered CacheManagers, but it fails with 2 as well; it
doesn't take many cycles to fail either: in some cluster
modes it often fails at the first loop iteration (which initially
misled me into thinking some modes worked fine, when in fact my test
just wasn't safe enough).
Funnily enough, while writing this it just failed a run even in
single-thread mode: in one iteration it was detected that the lock wasn't
cleaned up; this was REPL_SYNC+TX. I don't think the CacheMode was
relevant, rather that this failure is quite unlikely and the number of
iterations isn't high enough to certify the correctness of all the other
modes; still, it's annoying that apparently it's not even deterministic
with a single thread.
Anyone available to help me out? And please have a look at my test, I
might be making some mistake.
Cheers,
Sanne
Transaction table cleanup
by Vladimir Blagojevic
Hey guys,
I'm investigating why EmbeddedCacheManager#cacheRemove hiccups
DistributedTwoNodesMapReduceTest. As you might recall, at the end of a
MapReduceTask there is an EmbeddedCacheManager#cacheRemove call to remove
the intermediate caches across the cluster. Very often, almost every test
run, the cache remove is blocked by ongoing transactions that
have not completed - more specifically, the TransactionTable shows a
pending remote transaction. This prevents the cache stop call, which in
turn causes a timeout on the cache remove and thus fails the test.
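To confirm it I'm checking the transaction table directly, roughly
like this (just a sketch; the API names are from memory, so treat them
as an assumption):

import org.infinispan.Cache;
import org.infinispan.test.TestingUtil;
import org.infinispan.transaction.TransactionTable;

public class PendingTxCheck {

   // After the MapReduceTask finishes, the remote tx count on the node
   // holding the intermediate cache should be back at zero; if it isn't,
   // the subsequent cache stop/remove will time out.
   static void assertNoPendingRemoteTxs(Cache<?, ?> cache) {
      TransactionTable txTable = TestingUtil.extractComponent(cache, TransactionTable.class);
      assert txTable.getRemoteTxCount() == 0 : "pending remote transactions";
   }
}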
MapReduce uses txs in MapReduceManagerImpl#combine - it might well be
that there is something wrong there, or it could be that somehow
TxCompletionNotificationCommand is not cleaning up the remote txs. Either
way I would appreciate some help here - Mircea?
Regards,
Vladimir
Testsuite: hanging TestNG, CDI broken
by Sanne Grinovero
Hello all,
besides the usual failures, I also experienced occasional hangs
while running the testsuite; in some cases I found the following stack
trace, which suggests a TestNG bug:
"pool-3-thread-14" prio=10 tid=0x00007f0d84632000 nid=0x1ce5 runnable
[0x00007f0d58a36000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.put(HashMap.java:374)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:320)
at org.testng.SuiteRunner.access$000(SuiteRunner.java:34)
at org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:351)
at org.testng.internal.thread.ThreadUtil$CountDownLatchedRunnable.run(ThreadUtil.java:147)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Even across multiple dumps that thread is still in the same loop,
and with a single CPU stuck at 100% I'm guessing the HashMap was
being used in a way that isn't safe under concurrency; we're using the
first minor version of TestNG that ever supported parallel testsuite
invocation, so it might not be very solid.
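For what it's worth, a plain HashMap mutated from multiple threads can
corrupt its bucket chains, and on Java 6/7 a resize race can leave a
cycle that makes put() spin forever at 100% CPU; a minimal sketch of
that general hazard (nothing TestNG-specific, and being a race it only
hangs sometimes):

import java.util.HashMap;
import java.util.Map;

public class HashMapRaceSketch {

   // Deliberately not thread-safe: concurrent put() calls race on resize.
   static final Map<Integer, Integer> map = new HashMap<Integer, Integer>();

   public static void main(String[] args) throws InterruptedException {
      Runnable writer = new Runnable() {
         public void run() {
            for (int i = 0; i < 1000000; i++) {
               map.put(i, i);   // unsynchronized concurrent mutation
            }
         }
      };
      Thread t1 = new Thread(writer);
      Thread t2 = new Thread(writer);
      t1.start();
      t2.start();
      t1.join();
      t2.join();
      // If it does hang, a thread dump shows a writer RUNNABLE inside
      // HashMap.put(), just like the TestNG stack above.
      System.out.println("size = " + map.size());
   }
}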
Not sure why, but upgrading TestNG from 5.14.10 to 6.7 seems to
resolve the problem.
Now I wish I could send a pull request, but even when skipping the
core testsuite (which always fails for me, even in non-parallel mode)
many other modules are broken both with and without my patches, so I'm
dropping my experiments: I won't send any pull requests if the tests
can't back my changes up.
As an example, the CDI integration reports:
Tests run: 247, Failures: 102, Errors: 0, Skipped: 143
...which means 2 tests are fine.
Cheers,
Sanne
cce on invocation context
by Ales Justin
I'm constantly seeing this CCE while running CapeDwarf cluster tests:
(running 5.2.Beta2 with my iterator offset patch)
17:43:10,175 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (OOB-18,null) ISPN000136: Execution error: java.lang.ClassCastException: org.infinispan.context.impl.NonTxInvocationContext cannot be cast to org.infinispan.context.impl.TxInvocationContext
at org.infinispan.interceptors.locking.PessimisticLockingInterceptor.visitPutKeyValueCommand(PessimisticLockingInterceptor.java:114)
at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:132)
at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:63)
at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
at org.infinispan.interceptors.TxInterceptor.enlistWriteAndInvokeNext(TxInterceptor.java:212)
at org.infinispan.interceptors.TxInterceptor.visitPutKeyValueCommand(TxInterceptor.java:150)
at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
at org.infinispan.statetransfer.StateTransferInterceptor.handleTopologyAffectedCommand(StateTransferInterceptor.java:207)
at org.infinispan.statetransfer.StateTransferInterceptor.handleWriteCommand(StateTransferInterceptor.java:191)
at org.infinispan.statetransfer.StateTransferInterceptor.visitPutKeyValueCommand(StateTransferInterceptor.java:136)
at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
at org.infinispan.interceptors.CacheMgmtInterceptor.visitPutKeyValueCommand(CacheMgmtInterceptor.java:127)
at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:129)
at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:93)
at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:63)
at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:347)
at org.infinispan.statetransfer.StateConsumerImpl.doApplyState(StateConsumerImpl.java:306)
at org.infinispan.statetransfer.StateConsumerImpl.applyState(StateConsumerImpl.java:264)
at org.infinispan.statetransfer.StateResponseCommand.perform(StateResponseCommand.java:86)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:95)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithWaitForBlocks(InboundInvocationHandlerImpl.java:110)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:82)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommandFromLocalCluster(CommandAwareRpcDispatcher.java:244)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:217)
at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:483)
at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:390)
at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:248)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:604)
at org.jgroups.blocks.mux.MuxUpHandler.up(MuxUpHandler.java:130)
at org.jgroups.JChannel.up(JChannel.java:670)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1020)
at org.jgroups.protocols.RSVP.up(RSVP.java:172)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:896)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:244)
at org.jgroups.protocols.UNICAST2.handleDataReceived(UNICAST2.java:736)
at org.jgroups.protocols.UNICAST2.up(UNICAST2.java:414)
at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:645)
at org.jgroups.protocols.BARRIER.up(BARRIER.java:102)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:143)
at org.jgroups.protocols.FD.up(FD.java:273)
at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)
at org.jgroups.protocols.MERGE2.up(MERGE2.java:205)
at org.jgroups.protocols.Discovery.up(Discovery.java:359)
at org.jgroups.protocols.TP$ProtocolAdapter.up(TP.java:2646)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1293)
at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1856)
at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1829)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [classes.jar:1.6.0_37]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [classes.jar:1.6.0_37]
at java.lang.Thread.run(Thread.java:680) [classes.jar:1.6.0_37]
broken lazy query iteration
by Ales Justin
After searching for the needle in the haystack, I finally found the problem.
(not to mention the complete lack of tests for this *basic* feature ...)
The problem is with queries that have an offset when you iterate over them -- the offset is never taken into account.
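To make the symptom concrete, this is roughly the usage that goes
wrong (a sketch; I'm quoting the infinispan-query API from memory, so
take the exact calls as an assumption):

import java.util.Iterator;

import org.apache.lucene.search.MatchAllDocsQuery;
import org.infinispan.Cache;
import org.infinispan.query.CacheQuery;
import org.infinispan.query.FetchOptions;
import org.infinispan.query.Search;
import org.infinispan.query.SearchManager;

public class OffsetIterationSketch {

   // Book stands in for any indexed entity; annotations omitted.
   static class Book { }

   static void iterateWithOffset(Cache<?, ?> cache) {
      SearchManager sm = Search.getSearchManager(cache);
      CacheQuery query = sm.getQuery(new MatchAllDocsQuery(), Book.class);
      query.firstResult(2);   // ask to skip the first two hits
      query.maxResults(5);

      // list() honours firstResult(), but lazy iteration starts at hit 0,
      // because neither HS's extract() nor our LazyIterator adds getFirstIndex().
      Iterator<Object> it = query.iterator(new FetchOptions().fetchMode(FetchOptions.FetchMode.LAZY));
      while (it.hasNext()) {
         Object hit = it.next();   // before the fix: hits 0..4 instead of 2..6
      }
   }
}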
There are two possible fixes -- as I see them.
1) In HS:
DocumentExtractorImpl::extract takes "firstIndex" into account:

   public EntityInfo extract(int scoreDocIndex) throws IOException {
      int docId = queryHits.docId( firstIndex + scoreDocIndex );
      Document document = extractDocument( firstIndex + scoreDocIndex );
2) LazyIterator in Infinispan-Query applies the offset:
   protected EntityInfo loadEntityInfo(int index) {
      try {
         return extractor.extract(extractor.getFirstIndex() + index);
---
Since those methods are exposed in DocumentExtractor,
I would guess they were meant to be used by external code,
rather than having this logic inside the extractor itself.
So, I'll go ahead and provide a patch for (2).
-Ales
infinispan 5.2.0.Beta3
by Mircea Markus
Hi,
We have a lot of pull requests pending. Until Beta3 is released, can we please focus on these and slow down new development for now?
I've grouped them as follows. Please feel free to shuffle them around or ask for more feedback if you think it appropriate, but please take ownership of them and make sure they get integrated.
Dan:
ISPN-2373 State transfer does not end because some segments are erroneously reported as unreceived
Lookup optimisation in TransactionTable.getLocalTransaction and cleanup in BaseRpcInterceptor hierarchy
ISPN-2381 Locks are removed even if not successfully unlocked
Adrian:
ISPN-2318 Reimplement a Topology-Aware Consistent Hash
Galder:
ISPN-2429 Cache restart still doesn't work properly for query-enabled caches
JBQA-6819 - Added the ant script which merges and generates jacoco code coverage report file.
ISPN-2412 Allow specifying container and cache when connecting via CLI
Tristan:
[5.1.x] ISPN-2414 Fixes to reduce memory consumption of local caches
ISPN-2414 Fixes to reduce memory consumption of local caches
Mircea:
ISPN-2440 JGroupsTransport.invokeRemotely throws SuspectExceptions even ...
Fix DummyInMemoryCacheStoreConfigurationBuilder#read()
ISPN-2371 The global component registry fails to start components
ISPN-2443 - tests are added for reproducing/verifying the issue.
ISPN-2386 - Test reproducing/verifying the issue with ClassCastingException in case of CacheLoader usage (with storeAsBinary conf).
ISPN-1042 - Enable distributed and Map/Reduce task interruption/cancellation
Vladimir:
ISPN-2409 - Reproduction/verification case for NotSerializableException occurence.
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)
Unit test PR for master
by Vladimir Blagojevic
Hi,
I noticed that we have recently had many PRs integrating unit tests
into master. Would it not make more sense to integrate a unit test with
the actual fix, rather than having it merged directly to master as its
own PR? I'd say leave the unit test attached/referenced in the JIRA, and
the developer/contributor fixing the issue will integrate both the fix
and the unit test in a single PR.
WDYT?
Regards,
Vladimir
Cluster scaling
by Matej Lazar
Hi,
to simplify the scenario, let's say I have a distributed cache with no
copies on a cluster of two nodes.
When one node is stopped (a normal stop, not a failure), is the stop
operation blocked, waiting for the cache data to be transferred to the
other node?
My use case is Infinispan inside JBoss AS (CapeDwarf).
How does stopping a node behave in standalone Infinispan, and how in AS?
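For reference, this is roughly the setup I mean (a sketch, with
numOwners=1 standing in for "no copies"):

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class ScalingScenario {

   public static void main(String[] args) {
      // Clustered cache manager with default transport settings
      GlobalConfiguration global = new GlobalConfigurationBuilder().clusteredDefault().build();

      // Distributed cache with a single owner per entry: stopping a node
      // means its entries exist nowhere else unless they are pushed out first
      ConfigurationBuilder builder = new ConfigurationBuilder();
      builder.clustering().cacheMode(CacheMode.DIST_SYNC).hash().numOwners(1);
      Configuration distOneOwner = builder.build();

      EmbeddedCacheManager manager = new DefaultCacheManager(global, distOneOwner);
      manager.getCache("data").put("key", "value");

      // The question: does this block until the entries owned by this node
      // have been transferred to the remaining node, or are they lost?
      manager.stop();
   }
}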
Thanks,
Matej.