[JBoss JIRA] (ISPN-3108) Server endpoint compatibility enhancements
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-3108?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-3108:
--------------------------------
Status: Resolved (was: Pull Request Sent)
Resolution: Done
> Server endpoint compatibility enhancements
> ------------------------------------------
>
> Key: ISPN-3108
> URL: https://issues.jboss.org/browse/ISPN-3108
> Project: Infinispan
> Issue Type: Enhancement
> Reporter: Galder Zamarreño
> Assignee: Galder Zamarreño
> Fix For: 5.3.0.CR1, 5.3.0.Final
>
>
> Several TODOs:
> - Transaction's lookedUpEntries, affected and lockedKeys should use equivalent collections
> - -Change equivalent hash set to be based on JSR-166- <- done in https://github.com/infinispan/infinispan/commit/c714c7785a302cb228bc8f293...
> - -Make CollectionFactory a per cache component and contain references to key/value equivalence- <- deleted since it complicates the code and does not really add value
> - figure out a way to have a comparable ByteArrayEquivalence implementation and uncomment the corresponding tests (a possible shape is sketched after this list)
> - -MetadataMortalCacheValue should extend MortalCacheValue, check others- <- That should not be the case: MetadataMortalCacheValue already carries metadata (which contains the lifespan), so extending MortalCacheValue would add lifespan again as a separate reference, i.e. an extra field on every object without any real benefit.
> - EntryFactoryImpl.extractMetadata: "this is invoked from wrapEntryForPut and wrapEntryForReplace. Better make these two methods receive the metadata object directly from the caller, which is aware of these commands being MetadataAwareCommand, in order to avoid the not-so-nice instanceOf."
> - Have `Builder lifespan(long time);` in the Metadata builder interface, and the same for maxIdle.
> - The simple cache.put(k,v) (and all other non-metadata-bearing methods) creates an immutable metadata object that always has the same (default) values for idle and expiry. This could be a single shared object instance instead of creating one per operation (see the sketch after this list).
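For the comparable ByteArrayEquivalence item, the natural candidate is an unsigned lexicographic byte-by-byte comparison. The following is a minimal standalone sketch of that idea, not the actual Infinispan class:

{code}
import java.util.Arrays;

// Standalone sketch (not the actual Infinispan class) of an equivalence
// function for byte[] keys that also supports ordering, using an unsigned
// lexicographic byte-by-byte comparison.
final class ComparableByteArrayEquivalence {

   boolean equals(byte[] a, byte[] b) {
      return Arrays.equals(a, b);
   }

   int hashCode(byte[] a) {
      return Arrays.hashCode(a);
   }

   // Lexicographic, unsigned comparison; on a common prefix the shorter
   // array sorts first.
   int compare(byte[] a, byte[] b) {
      int len = Math.min(a.length, b.length);
      for (int i = 0; i < len; i++) {
         int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
         if (cmp != 0) return cmp;
      }
      return a.length - b.length;
   }
}
{code}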
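The last two items can be illustrated together. Below is a minimal sketch, with hypothetical names rather than the real Infinispan Metadata API, of a builder exposing `Builder lifespan(long time)` / `Builder maxIdle(long time)` and a single shared immutable instance for the default values that a plain cache.put(k, v) would otherwise allocate on every call:

{code}
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified shape of the Metadata/Builder contract; the real
// Infinispan interfaces differ.
interface Metadata {
   long lifespan();   // milliseconds, -1 = immortal
   long maxIdle();    // milliseconds, -1 = no max idle

   interface Builder {
      Builder lifespan(long timeMillis);            // proposed shorthand
      Builder lifespan(long time, TimeUnit unit);
      Builder maxIdle(long timeMillis);             // proposed shorthand
      Builder maxIdle(long time, TimeUnit unit);
      Metadata build();
   }
}

final class DefaultMetadata implements Metadata {
   // One shared immutable instance for the default (-1/-1) values, so a plain
   // cache.put(k, v) does not have to allocate a fresh metadata per operation.
   static final Metadata INSTANCE = new DefaultMetadata();

   private DefaultMetadata() {}

   @Override public long lifespan() { return -1; }
   @Override public long maxIdle() { return -1; }
}
{code}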
[JBoss JIRA] (ISPN-3130) Cancelling an InboundTransferTask should be idempotent
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-3130?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-3130:
--------------------------------
Fix Version/s: 5.2.7.Final
> Cancelling an InboundTransferTask should be idempotent
> ------------------------------------------------------
>
> Key: ISPN-3130
> URL: https://issues.jboss.org/browse/ISPN-3130
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.6.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 5.2.7.Final, 5.3.0.Final
>
>
> When installing a new topology, it's possible for the {{StateConsumerImpl.cancelSegments}} method to fail with an {{IllegalArgumentException}}:
> {code}
> 05:43:56,513 WARN [org.infinispan.topology.CacheTopologyControlCommand] (notification-thread-0) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=default-host/clusterbench, type=CH_UPDATE, sender=perf21/web, joinInfo=null, topologyId=23, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf21/web, perf18/web, perf19/web]}, pendingCH=null, throwable=null, viewId=8}: java.lang.IllegalArgumentException: The task is already cancelled.
> at org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:167)
> at org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:774)
> at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:314)
> at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:195)
> at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:61)
> at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:121)
> at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:202)
> at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:165)
> at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:137)
> at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:540)
> at org.infinispan.topology.ClusterTopologyManagerImpl.broadcastConsistentHashUpdate(ClusterTopologyManagerImpl.java:332)
> at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheStatusAfterMerge(ClusterTopologyManagerImpl.java:319)
> at org.infinispan.topology.ClusterTopologyManagerImpl.handleNewView(ClusterTopologyManagerImpl.java:236)
> at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.handleViewChange(ClusterTopologyManagerImpl.java:579)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.6.0_43]
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [rt.jar:1.6.0_43]
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [rt.jar:1.6.0_43]
> at java.lang.reflect.Method.invoke(Method.java:597) [rt.jar:1.6.0_43]
> at org.infinispan.notifications.AbstractListenerImpl$ListenerInvocation$1.run(AbstractListenerImpl.java:212)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_43]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_43]
> at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_43]
> {code}
> Instead, it should just log a message and ignore the cancel request.
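A minimal sketch of what that log-and-ignore behaviour could look like; the field, method, and logger names below are illustrative, not the actual InboundTransferTask code:

{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

// Illustrative shape only; the real InboundTransferTask differs.
final class InboundTransferTaskSketch {
   private static final Logger log = Logger.getLogger("InboundTransferTask");

   private final Set<Integer> segments = ConcurrentHashMap.newKeySet();
   private volatile boolean cancelled;

   void cancelSegments(Set<Integer> toCancel) {
      if (cancelled) {
         // Idempotent: a repeated (or concurrent) cancel is logged and
         // ignored instead of throwing IllegalArgumentException.
         log.fine("Task already cancelled, ignoring cancel request for " + toCancel);
         return;
      }
      segments.removeAll(toCancel);
      if (segments.isEmpty()) {
         cancelled = true;
      }
   }
}
{code}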
[JBoss JIRA] (ISPN-3129) If the status recovery fails for a cache, it stops recovery for all the caches
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-3129?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-3129:
--------------------------------
Fix Version/s: 5.2.7.Final
> If the status recovery fails for a cache, it stops recovery for all the caches
> ------------------------------------------------------------------------------
>
> Key: ISPN-3129
> URL: https://issues.jboss.org/browse/ISPN-3129
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.6.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 5.2.7.Final, 5.3.0.Final
>
>
> When a node becomes coordinator, it sends a GET_STATUS command to recover the status of all the caches in the cluster. Then it sends a CH_UPDATE command for each cache, to install a new topology.
> However, if the CH_UPDATE command fails for one cache, the coordinator will interrupt the recovery process, and it won't even install the cache topology for the rest of the caches locally. When a new node joins, it will act as if the cache was just started:
> {code}
> 09:05:22,542 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-1693,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=default-host/clusterbench-granular, type=CH_UPDATE, sender=perf03/web, joinInfo=null, topologyId=21, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf04/web, perf01/web, perf02/web]}, pendingCH=null, throwable=null, viewId=7} [sender=perf03/web]
> 09:06:21,484 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-2231,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=perf04/web, joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=7} [sender=perf04/web]
> 09:28:21,060 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-22206,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=default-host/clusterbench-granular, type=CH_UPDATE, sender=perf04/web, joinInfo=null, topologyId=1, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf04/web, perf02/web, perf03/web]}, pendingCH=null, throwable=null, viewId=10} [sender=perf04/web]
> {code}
> This is mitigated by the ISPN-2966 fix (https://github.com/infinispan/infinispan/pull/1742). Since the CH_UPDATE command is now sent asynchronously, it ignores any exceptions on the targets. But on the 5.2.x branch, the CH_UPDATE command is sent synchronously, and it can fail with an exception like this:
> {code}
> 05:43:56,513 WARN [org.infinispan.topology.CacheTopologyControlCommand] (notification-thread-0) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=default-host/clusterbench, type=CH_UPDATE, sender=perf21/web, joinInfo=null, topologyId=23, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf21/web, perf18/web, perf19/web]}, pendingCH=null, throwable=null, viewId=8}: java.lang.IllegalArgumentException: The task is already cancelled.
> at org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:167)
> at org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:774)
> at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:314)
> at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:195)
> at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:61)
> at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:121)
> at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:202)
> at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:165)
> at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:137)
> at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:540)
> at org.infinispan.topology.ClusterTopologyManagerImpl.broadcastConsistentHashUpdate(ClusterTopologyManagerImpl.java:332)
> at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheStatusAfterMerge(ClusterTopologyManagerImpl.java:319)
> at org.infinispan.topology.ClusterTopologyManagerImpl.handleNewView(ClusterTopologyManagerImpl.java:236)
> at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.handleViewChange(ClusterTopologyManagerImpl.java:579)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.6.0_43]
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [rt.jar:1.6.0_43]
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [rt.jar:1.6.0_43]
> at java.lang.reflect.Method.invoke(Method.java:597) [rt.jar:1.6.0_43]
> at org.infinispan.notifications.AbstractListenerImpl$ListenerInvocation$1.run(AbstractListenerImpl.java:212)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_43]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_43]
> at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_43]
> {code}
> If that happens, cluster recovery is interrupted and some caches may end up in a corrupted state where it's impossible for new members to join.
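One way to read the fix is that each cache's topology update needs to be isolated, so a single exception cannot abort the whole recovery. A hedged sketch of such a loop, with hypothetical names rather than the actual ClusterTopologyManagerImpl code:

{code}
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical recovery loop: each cache's topology update is isolated, so a
// failing CH_UPDATE for one cache no longer aborts recovery of the others.
final class ClusterRecoverySketch {
   private static final Logger log = Logger.getLogger("ClusterTopologyManager");

   interface TopologyUpdater {
      void broadcastConsistentHashUpdate(String cacheName) throws Exception;
   }

   void recoverCluster(Map<String, ?> cacheStatuses, TopologyUpdater updater) {
      for (String cacheName : cacheStatuses.keySet()) {
         try {
            updater.broadcastConsistentHashUpdate(cacheName);
         } catch (Exception e) {
            // Log and move on: the remaining caches still get their topology
            // installed, locally and remotely.
            log.log(Level.WARNING, "Failed to install topology for cache " + cacheName, e);
         }
      }
   }
}
{code}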
[JBoss JIRA] (ISPN-3120) StateConsumerImpl can ignore state received during a rebalance
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-3120?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-3120:
--------------------------------
Affects Version/s: (was: 5.2.6.Final)
> StateConsumerImpl can ignore state received during a rebalance
> --------------------------------------------------------------
>
> Key: ISPN-3120
> URL: https://issues.jboss.org/browse/ISPN-3120
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.3.0.Beta1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Labels: 5.2.x
> Fix For: 5.2.7.Final, 5.3.0.CR1, 5.3.0.Final
>
>
> This causes random failures in ConcurrentOverlappingLeaveTest and ConcurrentNonOverlappingLeaveTest.
> 1. Starting with a 4-node cluster: [E, F, G, H] (topology 7).
> 2. F leaves, and E sends a REBALANCE_START command with nodes [E, G, H] (topology 8). Some segments are owned by [H] in the current CH and by [H, G] in the pending CH.
> 3. E reports that it finished receiving state with a REBALANCE_CONFIRM command.
> 4. H leaves, and E sends a CH_UPDATE command with nodes [E, G] (topology 9).
> The segments that were owned by [H] in the previous currentCH are assigned to [E, G] in the new currentCH (otherwise they wouldn't have any owners).
> 5. The StateConsumerImpl on E requests state for the "lost" segments from G.
> 6. G confirms the end of the rebalance as well, and E sends a CH_UPDATE command to end the rebalance (topology 10).
> 7. E sends a REBALANCE_START command to assign all segments for [E, G] (topology 11).
> 8. While the StateConsumerImpl on E is starting the state transfer, it also receives a StateResponseCommand for the lost segments from G.
> 9. Because the structures keeping track of the received state are not properly initialized, E considers it finished receiving state for topology 11.
> 10. E receives a StateResponseCommand from G with actual data, but it ignores it because {{StateConsumerImpl.updatedKeys == null}}.
> {noformat}
> 11:30:39,807 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=7, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=null}
> 11:30:39,810 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=8, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}}
> 11:30:39,817 DEBUG (transport-thread-3,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 8.
> 11:30:39,832 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=9, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,834 DEBUG (transport-thread-4,NodeE:dist) [StateConsumerImpl] Adding inbound state transfer for segments [38, 36, 47, 44, 45] of cache dist
> 11:30:39,853 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=11, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,859 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=9}
> 11:30:39,864 DEBUG (remote-thread-1,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 11.
> 11:30:39,866 TRACE (transport-thread-5,NodeE:dist) [LocalTopologyManagerImpl] Ignoring consistent hash update 10 for cache dist, we have already received a newer topology 11
> 11:30:39,868 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=11}
> 11:30:39,872 TRACE (remote-thread-1,NodeE:dist dist) [EntryWrappingInterceptor] State transfer will not write key/value MagicKey#k3{672f69c9@NodeG-6339}/v3 because it was already updated by somebody else
> 11:30:40,582 ERROR (testng-ConcurrentNonOverlappingLeaveTest:) [UnitTestTestNGListener] Test testTransactional(org.infinispan.distribution.rehash.ConcurrentNonOverlappingLeaveTest) failed.
> java.lang.AssertionError: Fail on owner cache NodeE-51027: dc.get(MagicKey#k3{672f69c9@NodeG-6339}) returned null!
> {noformat}
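Steps 8-10 come down to two missing guards: state responses for an older topology should be rejected, and the received-state tracking structures should be initialized before any response can arrive. A minimal sketch under those assumptions, with illustrative names rather than the actual StateConsumerImpl code:

{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

// Illustrative guards for the race in steps 8-10; names are not the actual
// StateConsumerImpl fields.
final class StateConsumerSketch {
   private static final Logger log = Logger.getLogger("StateConsumer");

   private volatile int topologyId;
   private volatile Set<Object> updatedKeys;   // null when no transfer is in progress

   // Called when a rebalance starts: initialize the tracking structures
   // *before* any state can be requested, so a response never observes
   // updatedKeys == null (step 10).
   void onRebalanceStart(int newTopologyId) {
      updatedKeys = ConcurrentHashMap.newKeySet();
      topologyId = newTopologyId;
   }

   void applyState(int responseTopologyId, Set<Object> keys) {
      if (responseTopologyId < topologyId) {
         // Step 8: a late response for an older topology must not count
         // towards completion of the current rebalance.
         log.fine("Ignoring state response for stale topology " + responseTopologyId);
         return;
      }
      Set<Object> tracker = updatedKeys;
      if (tracker == null) {
         log.fine("No state transfer in progress, ignoring state response");
         return;
      }
      tracker.addAll(keys);
   }
}
{code}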
[JBoss JIRA] (ISPN-3129) If the status recovery fails for a cache, it stops recovery for all the caches
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-3129?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-3129:
-------------------------------
Description:
When a node becomes coordinator, it sends a GET_STATUS command to recover the status of all the caches in the cluster. Then it sends a CH_UPDATE command for each cache, to install a new topology.
However, if the CH_UPDATE command fails for one cache, the coordinator will interrupt the recovery process, and it won't even install the cache topology for the rest of the caches locally. When a new node joins, it will act as if the cache was just started:
{code}
09:05:22,542 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-1693,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=default-host/clusterbench-granular, type=CH_UPDATE, sender=perf03/web, joinInfo=null, topologyId=21, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf04/web, perf01/web, perf02/web]}, pendingCH=null, throwable=null, viewId=7} [sender=perf03/web]
09:06:21,484 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-2231,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=perf04/web, joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=7} [sender=perf04/web]
09:28:21,060 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-22206,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=default-host/clusterbench-granular, type=CH_UPDATE, sender=perf04/web, joinInfo=null, topologyId=1, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf04/web, perf02/web, perf03/web]}, pendingCH=null, throwable=null, viewId=10} [sender=perf04/web]
{code}
This is mitigated by the ISPN-2966 fix (https://github.com/infinispan/infinispan/pull/1742). Since the CH_UPDATE command is now sent asynchronously, it ignores any exceptions on the targets. But on the 5.2.x branch, the CH_UPDATE command is sent synchronously, and it can fail with an exception like this:
{code}
05:43:56,513 WARN [org.infinispan.topology.CacheTopologyControlCommand] (notification-thread-0) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=default-host/clusterbench, type=CH_UPDATE, sender=perf21/web, joinInfo=null, topologyId=23, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf21/web, perf18/web, perf19/web]}, pendingCH=null, throwable=null, viewId=8}: java.lang.IllegalArgumentException: The task is already cancelled.
at org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:167)
at org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:774)
at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:314)
at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:195)
at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:61)
at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:121)
at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:202)
at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:165)
at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:137)
at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:540)
at org.infinispan.topology.ClusterTopologyManagerImpl.broadcastConsistentHashUpdate(ClusterTopologyManagerImpl.java:332)
at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheStatusAfterMerge(ClusterTopologyManagerImpl.java:319)
at org.infinispan.topology.ClusterTopologyManagerImpl.handleNewView(ClusterTopologyManagerImpl.java:236)
at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.handleViewChange(ClusterTopologyManagerImpl.java:579)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.6.0_43]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [rt.jar:1.6.0_43]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [rt.jar:1.6.0_43]
at java.lang.reflect.Method.invoke(Method.java:597) [rt.jar:1.6.0_43]
at org.infinispan.notifications.AbstractListenerImpl$ListenerInvocation$1.run(AbstractListenerImpl.java:212)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_43]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_43]
at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_43]
{code}
If that happens, cluster recovery is interrupted and some caches may end up in a corrupted state where it's impossible for new members to join.
was:
When a node becomes coordinator, it sends a GET_STATUS command to recover the status of all the caches in the cluster. Then it sends a CH_UPDATE command for each cache, to install a new topology.
However, if the CH_UPDATE command fails for one cache, the coordinator will interrupt the recovery process, and it won't even install the cache topology for the rest of the caches locally. When a new node joins, it will act as if the cache was just started:
{code}
09:05:22,542 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-1693,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=default-host/clusterbench-granular, type=CH_UPDATE, sender=perf03/web, joinInfo=null, topologyId=21, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf04/web, perf01/web, perf02/web]}, pendingCH=null, throwable=null, viewId=7} [sender=perf03/web]
09:06:21,484 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-2231,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=perf04/web, joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=7} [sender=perf04/web]
09:28:21,060 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-22206,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=default-host/clusterbench-granular, type=CH_UPDATE, sender=perf04/web, joinInfo=null, topologyId=1, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf04/web, perf02/web, perf03/web]}, pendingCH=null, throwable=null, viewId=10} [sender=perf04/web]
{code}
This is mitigated by the ISPN-2966 fix (https://github.com/infinispan/infinispan/pull/1742). Since the CH_UPDATE command is now sent asynchronously, it ignores any exceptions on the targets. But on the 5.2.x branch, the CH_UPDATE command is sent synchronously, and it can fail with an exception like this:
{code}
05:43:56,513 WARN [org.infinispan.topology.CacheTopologyControlCommand] (notification-thread-0) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=default-host/clusterbench, type=CH_UPDATE, sender=perf21/web, joinInfo=null, topologyId=23, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf21/web, perf18/web, perf19/web]}, pendingCH=null, throwable=null, viewId=8}: java.lang.IllegalArgumentException: The task is already cancelled.
at org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:167)
at org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:774)
at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:314)
at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:195)
at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:61)
at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:121)
at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:202)
at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:165)
at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:137)
at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:540)
at org.infinispan.topology.ClusterTopologyManagerImpl.broadcastConsistentHashUpdate(ClusterTopologyManagerImpl.java:332)
at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheStatusAfterMerge(ClusterTopologyManagerImpl.java:319)
at org.infinispan.topology.ClusterTopologyManagerImpl.handleNewView(ClusterTopologyManagerImpl.java:236)
at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.handleViewChange(ClusterTopologyManagerImpl.java:579)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.6.0_43]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [rt.jar:1.6.0_43]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [rt.jar:1.6.0_43]
at java.lang.reflect.Method.invoke(Method.java:597) [rt.jar:1.6.0_43]
at org.infinispan.notifications.AbstractListenerImpl$ListenerInvocation$1.run(AbstractListenerImpl.java:212)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_43]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_43]
at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_43]
{code}
> If the status recovery fails for a cache, it stops recovery for all the caches
> ------------------------------------------------------------------------------
>
> Key: ISPN-3129
> URL: https://issues.jboss.org/browse/ISPN-3129
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.6.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 5.3.0.Final
>
>
> When a node becomes coordinator, it sends a GET_STATUS command to recover the status of all the caches in the cluster. Then it sends a CH_UPDATE command for each cache, to install a new topology.
> However, if the CH_UPDATE command fails for one cache, the coordinator will interrupt the recovery process, and it won't even install the cache topology for the rest of the caches locally. When a new node joins, it will act as if the cache was just started:
> {code}
> 09:05:22,542 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-1693,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=default-host/clusterbench-granular, type=CH_UPDATE, sender=perf03/web, joinInfo=null, topologyId=21, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf04/web, perf01/web, perf02/web]}, pendingCH=null, throwable=null, viewId=7} [sender=perf03/web]
> 09:06:21,484 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-2231,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=null, type=GET_STATUS, sender=perf04/web, joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=7} [sender=perf04/web]
> 09:28:21,060 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-22206,shared=udp) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=default-host/clusterbench-granular, type=CH_UPDATE, sender=perf04/web, joinInfo=null, topologyId=1, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf04/web, perf02/web, perf03/web]}, pendingCH=null, throwable=null, viewId=10} [sender=perf04/web]
> {code}
> This is mitigated by the ISPN-2966 fix (https://github.com/infinispan/infinispan/pull/1742). Since the CH_UPDATE command is now sent asynchronously, it ignores any exceptions on the targets. But on the 5.2.x branch, the CH_UPDATE command is sent synchronously, and it can fail with an exception like this:
> {code}
> 05:43:56,513 WARN [org.infinispan.topology.CacheTopologyControlCommand] (notification-thread-0) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=default-host/clusterbench, type=CH_UPDATE, sender=perf21/web, joinInfo=null, topologyId=23, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf21/web, perf18/web, perf19/web]}, pendingCH=null, throwable=null, viewId=8}: java.lang.IllegalArgumentException: The task is already cancelled.
> at org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:167)
> at org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:774)
> at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:314)
> at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:195)
> at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:61)
> at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:121)
> at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:202)
> at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:165)
> at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:137)
> at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:540)
> at org.infinispan.topology.ClusterTopologyManagerImpl.broadcastConsistentHashUpdate(ClusterTopologyManagerImpl.java:332)
> at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheStatusAfterMerge(ClusterTopologyManagerImpl.java:319)
> at org.infinispan.topology.ClusterTopologyManagerImpl.handleNewView(ClusterTopologyManagerImpl.java:236)
> at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.handleViewChange(ClusterTopologyManagerImpl.java:579)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.6.0_43]
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [rt.jar:1.6.0_43]
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [rt.jar:1.6.0_43]
> at java.lang.reflect.Method.invoke(Method.java:597) [rt.jar:1.6.0_43]
> at org.infinispan.notifications.AbstractListenerImpl$ListenerInvocation$1.run(AbstractListenerImpl.java:212)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_43]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_43]
> at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_43]
> {code}
> If that happens, cluster recovery is interrupted and some caches may end up in a corrupted state where it's impossible for new members to join.
[JBoss JIRA] (ISPN-3120) StateConsumerImpl can ignore state received during a rebalance
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-3120?page=com.atlassian.jira.plugin.... ]
Mircea Markus commented on ISPN-3120:
-------------------------------------
Integrated on master BTW.
> StateConsumerImpl can ignore state received during a rebalance
> --------------------------------------------------------------
>
> Key: ISPN-3120
> URL: https://issues.jboss.org/browse/ISPN-3120
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.6.Final, 5.3.0.Beta1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Labels: 5.2.x
> Fix For: 5.2.7.Final, 5.3.0.CR1, 5.3.0.Final
>
>
> This causes random failures in ConcurrentOverlappingLeaveTest and ConcurrentNonOverlappingLeaveTest.
> 1. Starting with a 4-node cluster: [E, F, G, H] (topology 7).
> 2. F leaves, and E sends a REBALANCE_START command with nodes [E, G, H] (topology 8). Some segments are owned by [H] in the current CH and by [H, G] in the pending CH.
> 3. E reports that it finished receiving state with a REBALANCE_CONFIRM command.
> 4. H leaves, and E sends a CH_UPDATE command with nodes [E, G] (topology 9).
> The segments that were owned by [H] in the previous currentCH are assigned to [E, G] in the new currentCH (otherwise they wouldn't have any owners).
> 5. The StateConsumerImpl on E requests state for the "lost" segments from G.
> 6. G confirms the end of the rebalance as well, and E sends a CH_UPDATE command to end the rebalance (topology 10).
> 7. E sends a REBALANCE_START command to assign all segments for [E, G] (topology 11).
> 8. While the StateConsumerImpl on E is starting the state transfer, it also receives a StateResponseCommand for the lost segments from G.
> 9. Because the structures keeping track of the received state are not properly initialized, E considers it finished receiving state for topology 11.
> 10. E receives a StateResponseCommand from G with actual data, but it ignores it because {{StateConsumerImpl.updatedKeys == null}}.
> {noformat}
> 11:30:39,807 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=7, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=null}
> 11:30:39,810 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=8, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}}
> 11:30:39,817 DEBUG (transport-thread-3,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 8.
> 11:30:39,832 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=9, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,834 DEBUG (transport-thread-4,NodeE:dist) [StateConsumerImpl] Adding inbound state transfer for segments [38, 36, 47, 44, 45] of cache dist
> 11:30:39,853 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=11, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,859 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=9}
> 11:30:39,864 DEBUG (remote-thread-1,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 11.
> 11:30:39,866 TRACE (transport-thread-5,NodeE:dist) [LocalTopologyManagerImpl] Ignoring consistent hash update 10 for cache dist, we have already received a newer topology 11
> 11:30:39,868 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=11}
> 11:30:39,872 TRACE (remote-thread-1,NodeE:dist dist) [EntryWrappingInterceptor] State transfer will not write key/value MagicKey#k3{672f69c9@NodeG-6339}/v3 because it was already updated by somebody else
> 11:30:40,582 ERROR (testng-ConcurrentNonOverlappingLeaveTest:) [UnitTestTestNGListener] Test testTransactional(org.infinispan.distribution.rehash.ConcurrentNonOverlappingLeaveTest) failed.
> java.lang.AssertionError: Fail on owner cache NodeE-51027: dc.get(MagicKey#k3{672f69c9@NodeG-6339}) returned null!
> {noformat}
[JBoss JIRA] (ISPN-3120) StateConsumerImpl can ignore state received during a rebalance
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-3120?page=com.atlassian.jira.plugin.... ]
Mircea Markus commented on ISPN-3120:
-------------------------------------
[~dan.berindei] cannot be cherry-picked onto 5.2.x. Can you please backport it and then close the JIRA yourself?
> StateConsumerImpl can ignore state received during a rebalance
> --------------------------------------------------------------
>
> Key: ISPN-3120
> URL: https://issues.jboss.org/browse/ISPN-3120
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.6.Final, 5.3.0.Beta1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Labels: 5.2.x
> Fix For: 5.2.7.Final, 5.3.0.CR1, 5.3.0.Final
>
>
> This causes random failures in ConcurrentOverlappingLeaveTest and ConcurrentNonOverlappingLeaveTest.
> 1. Starting with a 4-node cluster: [E, F, G, H] (topology 7).
> 2. F leaves, and E sends a REBALANCE_START command with nodes [E, G, H] (topology 8). Some segments are owned by [H] in the current CH and by [H, G] in the pending CH.
> 3. E reports that it finished receiving state with a REBALANCE_CONFIRM command.
> 4. H leaves, and E sends a CH_UPDATE command with nodes [E, G] (topology 9).
> The segments that were owned by [H] in the previous currentCH are assigned to [E, G] in the new currentCH (otherwise they wouldn't have any owners).
> 5. The StateConsumerImpl on E requests state for the "lost" segments from G.
> 6. G confirms the end of the rebalance as well, and E sends a CH_UPDATE command to end the rebalance (topology 10).
> 7. E sends a REBALANCE_START command to assign all segments for [E, G] (topology 11).
> 8. While the StateConsumerImpl on E is starting the state transfer, it also receives a StateResponseCommand for the lost segments from G.
> 9. Because the structures keeping track of the received state are not properly initialized, E considers it finished receiving state for topology 11.
> 10. E receives a StateResponseCommand from G with actual data, but it ignores it because {{StateConsumerImpl.updatedKeys == null}}.
> {noformat}
> 11:30:39,807 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=7, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=null}
> 11:30:39,810 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=8, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}}
> 11:30:39,817 DEBUG (transport-thread-3,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 8.
> 11:30:39,832 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=9, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,834 DEBUG (transport-thread-4,NodeE:dist) [StateConsumerImpl] Adding inbound state transfer for segments [38, 36, 47, 44, 45] of cache dist
> 11:30:39,853 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=11, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,859 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=9}
> 11:30:39,864 DEBUG (remote-thread-1,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 11.
> 11:30:39,866 TRACE (transport-thread-5,NodeE:dist) [LocalTopologyManagerImpl] Ignoring consistent hash update 10 for cache dist, we have already received a newer topology 11
> 11:30:39,868 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=11}
> 11:30:39,872 TRACE (remote-thread-1,NodeE:dist dist) [EntryWrappingInterceptor] State transfer will not write key/value MagicKey#k3{672f69c9@NodeG-6339}/v3 because it was already updated by somebody else
> 11:30:40,582 ERROR (testng-ConcurrentNonOverlappingLeaveTest:) [UnitTestTestNGListener] Test testTransactional(org.infinispan.distribution.rehash.ConcurrentNonOverlappingLeaveTest) failed.
> java.lang.AssertionError: Fail on owner cache NodeE-51027: dc.get(MagicKey#k3{672f69c9@NodeG-6339}) returned null!
> {noformat}
[JBoss JIRA] (ISPN-3130) Cancelling an InboundTransferTask should be idempotent
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-3130?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-3130:
------------------------------------
Normally the bug isn't very serious - it just produces an exception in the log. But compounded with ISPN-3129, it means that cluster recovery is interrupted and some caches may end up in a corrupted state where it's impossible for new members to join.
> Cancelling an InboundTransferTask should be idempotent
> ------------------------------------------------------
>
> Key: ISPN-3130
> URL: https://issues.jboss.org/browse/ISPN-3130
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.6.Final
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 5.3.0.Final
>
>
> When installing a new topology, it's possible for the {{StateConsumerImpl.cancelSegments}} method to fail with an {{IllegalArgumentException}}:
> {code}
> 05:43:56,513 WARN [org.infinispan.topology.CacheTopologyControlCommand] (notification-thread-0) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=default-host/clusterbench, type=CH_UPDATE, sender=perf21/web, joinInfo=null, topologyId=23, currentCH=DefaultConsistentHash{numSegments=80, numOwners=2, members=[perf21/web, perf18/web, perf19/web]}, pendingCH=null, throwable=null, viewId=8}: java.lang.IllegalArgumentException: The task is already cancelled.
> at org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:167)
> at org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:774)
> at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:314)
> at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:195)
> at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:61)
> at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:121)
> at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:202)
> at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:165)
> at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:137)
> at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:540)
> at org.infinispan.topology.ClusterTopologyManagerImpl.broadcastConsistentHashUpdate(ClusterTopologyManagerImpl.java:332)
> at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheStatusAfterMerge(ClusterTopologyManagerImpl.java:319)
> at org.infinispan.topology.ClusterTopologyManagerImpl.handleNewView(ClusterTopologyManagerImpl.java:236)
> at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.handleViewChange(ClusterTopologyManagerImpl.java:579)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.6.0_43]
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [rt.jar:1.6.0_43]
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [rt.jar:1.6.0_43]
> at java.lang.reflect.Method.invoke(Method.java:597) [rt.jar:1.6.0_43]
> at org.infinispan.notifications.AbstractListenerImpl$ListenerInvocation$1.run(AbstractListenerImpl.java:212)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_43]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_43]
> at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_43]
> {code}
> Instead, it should just log a message and ignore the cancel request.
[JBoss JIRA] (ISPN-3120) StateConsumerImpl can ignore state received during a rebalance
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-3120?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-3120:
--------------------------------
Fix Version/s: 5.2.7.Final
> StateConsumerImpl can ignore state received during a rebalance
> --------------------------------------------------------------
>
> Key: ISPN-3120
> URL: https://issues.jboss.org/browse/ISPN-3120
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.6.Final, 5.3.0.Beta1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Labels: 5.2.x
> Fix For: 5.2.7.Final, 5.3.0.CR1, 5.3.0.Final
>
>
> This causes random failures in ConcurrentOverlappingLeaveTest and ConcurrentNonOverlappingLeaveTest.
> 1. Starting with a 4-node cluster: [E, F, G, H] (topology 7).
> 2. F leaves, and E sends a REBALANCE_START command with nodes [E, G, H] (topology 8). Some segments are owned by [H] in the current CH and by [H, G] in the pending CH.
> 3. E reports that it finished receiving state with a REBALANCE_CONFIRM command.
> 4. H leaves, and E sends a CH_UPDATE command with nodes [E, G] (topology 9).
> The segments that were owned by [H] in the previous currentCH are assigned to [E, G] in the new currentCH (otherwise they wouldn't have any owners).
> 5. The StateConsumerImpl on E requests state for the "lost" segments from G.
> 6. G confirms the end of the rebalance as well, and E sends a CH_UPDATE command to end the rebalance (topology 10).
> 7. E sends a REBALANCE_START command to assign all segments for [E, G] (topology 11).
> 8. While the StateConsumerImpl on E is starting the state transfer, it also receives a StateResponseCommand for the lost segments from G.
> 9. Because the structures keeping track of the received state are not properly initialized, E considers it finished receiving state for topology 11.
> 10. E receives a StateResponseCommand from G with actual data, but it ignores it because {{StateConsumerImpl.updatedKeys == null}}.
> {noformat}
> 11:30:39,807 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=7, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=null}
> 11:30:39,810 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=8, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}}
> 11:30:39,817 DEBUG (transport-thread-3,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 8.
> 11:30:39,832 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=9, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,834 DEBUG (transport-thread-4,NodeE:dist) [StateConsumerImpl] Adding inbound state transfer for segments [38, 36, 47, 44, 45] of cache dist
> 11:30:39,853 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=11, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,859 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=9}
> 11:30:39,864 DEBUG (remote-thread-1,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 11.
> 11:30:39,866 TRACE (transport-thread-5,NodeE:dist) [LocalTopologyManagerImpl] Ignoring consistent hash update 10 for cache dist, we have already received a newer topology 11
> 11:30:39,868 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=11}
> 11:30:39,872 TRACE (remote-thread-1,NodeE:dist dist) [EntryWrappingInterceptor] State transfer will not write key/value MagicKey#k3{672f69c9@NodeG-6339}/v3 because it was already updated by somebody else
> 11:30:40,582 ERROR (testng-ConcurrentNonOverlappingLeaveTest:) [UnitTestTestNGListener] Test testTransactional(org.infinispan.distribution.rehash.ConcurrentNonOverlappingLeaveTest) failed.
> java.lang.AssertionError: Fail on owner cache NodeE-51027: dc.get(MagicKey#k3{672f69c9@NodeG-6339}) returned null!
> {noformat}
[JBoss JIRA] (ISPN-3120) StateConsumerImpl can ignore state received during a rebalance
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-3120?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-3120:
--------------------------------
Labels: 5.2.x (was: )
> StateConsumerImpl can ignore state received during a rebalance
> --------------------------------------------------------------
>
> Key: ISPN-3120
> URL: https://issues.jboss.org/browse/ISPN-3120
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.6.Final, 5.3.0.Beta1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Critical
> Labels: 5.2.x
> Fix For: 5.3.0.CR1, 5.3.0.Final
>
>
> This causes random failures in ConcurrentOverlappingLeaveTest and ConcurrentNonOverlappingLeaveTest.
> 1. Starting with a 4-node cluster: [E, F, G, H] (topology 7).
> 2. F leaves, and E sends a REBALANCE_START command with nodes [E, G, H] (topology 8). Some segments are owned by [H] in the current CH and by [H, G] in the pending CH.
> 3. E reports that it finished receiving state with a REBALANCE_CONFIRM command.
> 4. H leaves, and E sends a CH_UPDATE command with nodes [E, G] (topology 9).
> The segments that were owned by [H] in the previous currentCH are assigned to [E, G] in the new currentCH (otherwise they wouldn't have any owners).
> 5. The StateConsumerImpl on E requests state for the "lost" segments from G.
> 6. G confirms the end of the rebalance as well, and E sends a CH_UPDATE command to end the rebalance (topology 10).
> 7. E sends a REBALANCE_START command to assign all segments for [E, G] (topology 11).
> 8. While the StateConsumerImpl on E is starting the state transfer, it also receives a StateResponseCommand for the lost segments from G.
> 9. Because the structures keeping track of the received state are not properly initialized, E considers it finished receiving state for topology 11.
> 10. E receives a StateResponseCommand from G with actual data, but it ignores it because {{StateConsumerImpl.updatedKeys == null}}.
> {noformat}
> 11:30:39,807 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=7, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=null}
> 11:30:39,810 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=8, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}}
> 11:30:39,817 DEBUG (transport-thread-3,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 8.
> 11:30:39,832 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=9, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,834 DEBUG (transport-thread-4,NodeE:dist) [StateConsumerImpl] Adding inbound state transfer for segments [38, 36, 47, 44, 45] of cache dist
> 11:30:39,853 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=11, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
> 11:30:39,859 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=9}
> 11:30:39,864 DEBUG (remote-thread-1,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 11.
> 11:30:39,866 TRACE (transport-thread-5,NodeE:dist) [LocalTopologyManagerImpl] Ignoring consistent hash update 10 for cache dist, we have already received a newer topology 11
> 11:30:39,868 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=11}
> 11:30:39,872 TRACE (remote-thread-1,NodeE:dist dist) [EntryWrappingInterceptor] State transfer will not write key/value MagicKey#k3{672f69c9@NodeG-6339}/v3 because it was already updated by somebody else
> 11:30:40,582 ERROR (testng-ConcurrentNonOverlappingLeaveTest:) [UnitTestTestNGListener] Test testTransactional(org.infinispan.distribution.rehash.ConcurrentNonOverlappingLeaveTest) failed.
> java.lang.AssertionError: Fail on owner cache NodeE-51027: dc.get(MagicKey#k3{672f69c9@NodeG-6339}) returned null!
> {noformat}