[JBoss JIRA] (ISPN-5459) StateTransferManager.waitForInitialTransferToComplete can fail if the coordinator crashes
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-5459?page=com.atlassian.jira.plugin.... ]
Dan Berindei reassigned ISPN-5459:
----------------------------------
Assignee: William Burns (was: Dan Berindei)
> StateTransferManager.waitForInitialTransferToComplete can fail if the coordinator crashes
> -----------------------------------------------------------------------------------------
>
> Key: ISPN-5459
> URL: https://issues.jboss.org/browse/ISPN-5459
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 7.2.1.Final
> Reporter: Dan Berindei
> Assignee: William Burns
> Priority: Critical
> Labels: testsuite_stability
> Fix For: 8.0.0.Alpha1
>
>
> {{LocalTopologyManagerImpl.isRebalancingEnabled()}} will throw a {{SuspectException}} if the coordinator crashes, preventing the cache from starting up.
> This is causing random failures in {{ClusterListenerDistTxAddListenerTest}}:
> {noformat}
> 22:23:59,439 ERROR (testng-ClusterListenerDistTxAddListenerTest:) [UnitTestTestNGListener] Test testNodeJoiningAndStateNodeDiesWithExistingClusterListener(org.infinispan.notifications.cachelistener.cluster.ClusterListenerDistTxAddListenerTest) failed.
> java.util.concurrent.ExecutionException: org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete() throws java.lang.Exception on object of type StateTransferManagerImpl
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:202)
> at org.infinispan.notifications.cachelistener.cluster.AbstractClusterListenerDistAddListenerTest.testNodeJoiningAndStateNodeDiesWithExistingClusterListener(AbstractClusterListenerDistAddListenerTest.java:254)
> ...
> Caused by: org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete() throws java.lang.Exception on object of type StateTransferManagerImpl
> at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:172)
> at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:869)
> at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:638)
> at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:627)
> at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:530)
> at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:218)
> at org.infinispan.cache.impl.CacheImpl.start(CacheImpl.java:850)
> at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:599)
> at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:554)
> at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:424)
> at org.infinispan.test.MultipleCacheManagersTest.cache(MultipleCacheManagersTest.java:366)
> at org.infinispan.notifications.cachelistener.cluster.AbstractClusterListenerDistAddListenerTest.access$100(AbstractClusterListenerDistAddListenerTest.java:32)
> at org.infinispan.notifications.cachelistener.cluster.AbstractClusterListenerDistAddListenerTest$4.call(AbstractClusterListenerDistAddListenerTest.java:237)
> at org.infinispan.notifications.cachelistener.cluster.AbstractClusterListenerDistAddListenerTest$4.call(AbstractClusterListenerDistAddListenerTest.java:234)
> at org.infinispan.test.AbstractInfinispanTest$LoggingCallable.call(AbstractInfinispanTest.java:422)
> ... 4 more
> Caused by: org.infinispan.remoting.transport.jgroups.SuspectException: Node NodeM-34961 was suspected
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:245)
> at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:566)
> at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinator(LocalTopologyManagerImpl.java:501)
> at org.infinispan.topology.LocalTopologyManagerImpl.isRebalancingEnabled(LocalTopologyManagerImpl.java:445)
> at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:216)
> at sun.reflect.GeneratedMethodAccessor165.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:168)
> ... 18 more
> Caused by: SuspectedException
> at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:414)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:427)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:240)
> ... 26 more
> {noformat}
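> A sketch of one possible mitigation: retry the coordinator RPC when the coordinator is suspected, instead of letting the join fail. The retry helper below is an assumption, not the actual fix; only {{LocalTopologyManager.isRebalancingEnabled()}} and {{SuspectException}} come from the stack trace above.
> {noformat}
> import org.infinispan.remoting.transport.jgroups.SuspectException;
> import org.infinispan.topology.LocalTopologyManager;
>
> final class RebalancingStatusRetry {
>    // Hypothetical helper: retry until a (new) coordinator answers.
>    static boolean isRebalancingEnabled(LocalTopologyManager topologyManager)
>          throws Exception {
>       while (true) {
>          try {
>             return topologyManager.isRebalancingEnabled();
>          } catch (SuspectException suspected) {
>             // The coordinator left mid-RPC; back off briefly until a new
>             // coordinator is elected, then repeat the request.
>             Thread.sleep(100);
>          }
>       }
>    }
> }
> {noformat}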
[JBoss JIRA] (ISPN-5252) Override toString() of org.infinispan.registry.ScopedKey
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-5252?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-5252:
-----------------------------------------------
Sebastian Łaskawiec <slaskawi(a)redhat.com> changed the Status of [bug 1203565|https://bugzilla.redhat.com/show_bug.cgi?id=1203565] from MODIFIED to ON_QA
> Override toString() of org.infinispan.registry.ScopedKey
> --------------------------------------------------------
>
> Key: ISPN-5252
> URL: https://issues.jboss.org/browse/ISPN-5252
> Project: Infinispan
> Issue Type: Feature Request
> Components: Core
> Affects Versions: 7.2.0.Alpha1, 7.1.1.Final
> Reporter: Osamu Nagano
> Assignee: Osamu Nagano
> Fix For: 7.2.0.Beta2, 7.2.0.Final
>
>
> A lock request timed out and the target key was dumped, but only as the default {{toString()}} output of {{ScopedKey}}. This is unfriendly to developers; the wrapped original key should be dumped instead.
> {noformat}
> Caused by: org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key [org.infinispan.registry.ScopedKey@5b6f425] for requestor [GlobalTransaction:<AAA>:1568:remote]! Lock held by [GlobalTransaction:<BBB>:1271:local]
> {noformat}
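> A minimal sketch of the requested override (the field names and generic signature are assumptions; the real {{ScopedKey}} may differ):
> {noformat}
> public class ScopedKey<S, K> {
>    private final S scope;
>    private final K key;
>
>    public ScopedKey(S scope, K key) {
>       this.scope = scope;
>       this.key = key;
>    }
>
>    @Override
>    public String toString() {
>       // Dump the wrapped original key instead of the default
>       // Object.toString() output (ScopedKey@5b6f425 above).
>       return "ScopedKey{scope=" + scope + ", key=" + key + "}";
>    }
> }
> {noformat}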
[JBoss JIRA] (ISPN-4546) Possible stale lock when the primary owner leaves during rebalance
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-4546?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-4546:
-----------------------------------------------
Sebastian Łaskawiec <slaskawi(a)redhat.com> changed the Status of [bug 1163727|https://bugzilla.redhat.com/show_bug.cgi?id=1163727] from MODIFIED to ON_QA
> Possible stale lock when the primary owner leaves during rebalance
> ------------------------------------------------------------------
>
> Key: ISPN-4546
> URL: https://issues.jboss.org/browse/ISPN-4546
> Project: Infinispan
> Issue Type: Bug
> Components: Core, State Transfer
> Affects Versions: 7.0.0.Alpha5, 7.1.1.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 7.2.0.Final
>
>
> Topology T: coordinator = A, owners(k) = [C, D], pending_owners(k) = null
> B sends prepareCommand(tx1, put(k, v)) to C, D
> D adds backup locks and replies
> C acquires lock, ready to send reply to B
> A starts installing topology T+1: owners(k) = [C, D], pending_owners(k) = [C, E]
> A, C and E install topology T+1, B and D do not
> E requests and receives tx data from C, including tx1
> C leaves
> B sees a SuspectException, sends rollbackCommand(tx1) to C, D
> D removes tx1
> C has left, but is ignored
> B reports to the user that the tx has been rolled back
> B and D install topology T+1 (optional)
> A starts installing topology T+2: owners(k) = [D], pending_owners(k) = [E]
> A, B, D, E all install topology T+2
> E requests and receives state from D, but it does not remove tx1
> A starts installing topology T+3: owners(k) = [E], pending_owners(k) = null
> E now has a stale backup lock on k
> It seems very hard to reproduce in production: C would have to leave soon enough that B and D haven't received the T+1 topology yet, but late enough to have already sent its transaction data to E.
> A possible solution would be to catch any SuspectException during prepare/commit/rollback (without ignoring leavers), wait for a new topology, and replicate the command again on the new owners. Obviously, this wouldn't work with asynchronous prepare/commit/rollback.
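> A rough sketch of that suggestion (the interfaces below are hypothetical stand-ins for the real RPC and topology plumbing, and, as noted, this only covers the synchronous case):
> {noformat}
> import org.infinispan.remoting.transport.jgroups.SuspectException;
>
> final class TxCommandRetry {
>    // Hypothetical stand-ins, not real Infinispan APIs.
>    interface TopologyWatcher {
>       void awaitTopologyNewerThan(int topologyId) throws InterruptedException;
>    }
>    interface TxCommand {
>       void replicateToOwners() throws SuspectException;
>    }
>
>    // Replicate prepare/commit/rollback; if an owner is suspected, wait for
>    // the next topology and replay the command on the new owners.
>    static void invokeWithRetry(TxCommand command, TopologyWatcher topology,
>          int topologyId) throws InterruptedException {
>       while (true) {
>          try {
>             command.replicateToOwners();
>             return;
>          } catch (SuspectException suspected) {
>             topology.awaitTopologyNewerThan(topologyId++);
>          }
>       }
>    }
> }
> {noformat}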
[JBoss JIRA] (ISPN-5420) Thread pools are depleted by ClusterTopologyManagerImpl.waitForView() and causing deadlock
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-5420?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-5420:
-----------------------------------------------
Sebastian Łaskawiec <slaskawi(a)redhat.com> changed the Status of [bug 1208429|https://bugzilla.redhat.com/show_bug.cgi?id=1208429] from MODIFIED to ON_QA
> Thread pools are depleted by ClusterTopologyManagerImpl.waitForView() and causing deadlock
> ------------------------------------------------------------------------------------------
>
> Key: ISPN-5420
> URL: https://issues.jboss.org/browse/ISPN-5420
> Project: Infinispan
> Issue Type: Bug
> Components: Core
> Affects Versions: 6.0.2.Final, 7.1.1.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Fix For: 8.0.0.Alpha1
>
>
> The join process was designed around the assumption that a node starts its caches sequentially, so {{ClusterTopologyManagerImpl.waitForView()}} would block at most once for each joining node. However, WildFly actually starts {{2 * Runtime.availableProcessors()}} caches in parallel, and this can be a problem when the machine has a lot of cores and multiple nodes.
> {{ClusterTopologyManagerImpl.handleClusterView()}} only updates the {{viewId}} after it has updated the cache topologies of each cache AND after it has confirmed the availability of all the nodes with a {{POLICY_GET_STATUS}} RPC. This RPC can block, and it's very easy for the remote-executor thread pool on the coordinator to become overloaded with threads like this:
> {noformat}
> "remote-thread-172" daemon prio=10 tid=0x00007f0cc48c0000 nid=0x28ca4 in Object.wait() [0x00007f0c5f25b000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at org.infinispan.topology.ClusterTopologyManagerImpl.waitForView(ClusterTopologyManagerImpl.java:357)
> - locked <0x00000000ff3bd900> (a java.lang.Object)
> at org.infinispan.topology.ClusterTopologyManagerImpl.handleJoin(ClusterTopologyManagerImpl.java:123)
> at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:162)
> at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:144)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:276)
> {noformat}
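> One way to avoid pinning a remote-executor thread would be to replace the blocking wait with a future that completes when the view is installed. A sketch under that assumption (all names below are hypothetical, not the actual Infinispan API):
> {noformat}
> import java.util.Map;
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.ConcurrentHashMap;
>
> final class NonBlockingViewWait {
>    private final Map<Integer, CompletableFuture<Void>> viewFutures =
>          new ConcurrentHashMap<>();
>    private volatile int installedViewId = -1;
>
>    // Called instead of waitForView(): returns a future the join handler
>    // can chain on, so no remote-executor thread blocks in Object.wait().
>    CompletableFuture<Void> whenViewInstalled(int viewId) {
>       CompletableFuture<Void> future =
>             viewFutures.computeIfAbsent(viewId, id -> new CompletableFuture<>());
>       if (viewId <= installedViewId)
>          future.complete(null); // view already arrived; double-complete is a no-op
>       return future;
>    }
>
>    // Called by handleClusterView() after the cache topologies are updated.
>    void viewInstalled(int viewId) {
>       installedViewId = viewId;
>       viewFutures.forEach((id, future) -> {
>          if (id <= viewId)
>             future.complete(null);
>       });
>    }
> }
> {noformat}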
[JBoss JIRA] (ISPN-5479) NPE in RemoteCommandsFactory during removeCache
by Dennis Reed (JIRA)
Dennis Reed created ISPN-5479:
---------------------------------
Summary: NPE in RemoteCommandsFactory during removeCache
Key: ISPN-5479
URL: https://issues.jboss.org/browse/ISPN-5479
Project: Infinispan
Issue Type: Bug
Components: Core
Affects Versions: 6.0.2.Final
Reporter: Dennis Reed
A NullPointerException can occur in RemoteCommandsFactory.fromStream when deserializing a RemoveCacheCommand.
The method does not verify that registry.getNamedComponentRegistry(cacheName) returns a non-null value before using it, so if the cache has already been removed (by a concurrent call) it throws a NullPointerException.
If the cache is not found, it should throw a "cache doesn't exist" (or equivalent) exception instead.
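A sketch of the suggested guard (the helper wrapper is hypothetical; GlobalComponentRegistry.getNamedComponentRegistry() and CacheException are the real types involved):
{noformat}
import org.infinispan.commons.CacheException;
import org.infinispan.factories.ComponentRegistry;
import org.infinispan.factories.GlobalComponentRegistry;

final class RemoveCacheNullCheck {
   // Hypothetical helper showing the null check fromStream should perform.
   static ComponentRegistry requireCacheRegistry(GlobalComponentRegistry registry,
         String cacheName) {
      ComponentRegistry cacheRegistry = registry.getNamedComponentRegistry(cacheName);
      if (cacheRegistry == null) {
         // Cache already removed by a concurrent removeCache() call; fail
         // with a descriptive exception instead of a NullPointerException.
         throw new CacheException("Cache " + cacheName + " does not exist");
      }
      return cacheRegistry;
   }
}
{noformat}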