[JBoss JIRA] (ISPN-2510) PrepareCommands should fail on nodes where the cache is not running
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-2510?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-2510:
--------------------------------
Fix Version/s: 5.3.0.Final
(was: 5.2.0.Final)
> PrepareCommands should fail on nodes where the cache is not running
> -------------------------------------------------------------------
>
> Key: ISPN-2510
> URL: https://issues.jboss.org/browse/ISPN-2510
> Project: Infinispan
> Issue Type: Bug
> Components: Distributed Cache, RPC
> Affects Versions: 5.2.0.Beta3
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 5.3.0.Final
>
>
> When the user stops a cache without stopping the cache manager on that node, subsequent PrepareCommands sent to that node will return a {{SuccessfulResponse}}.
> If that node used to be the primary owner of the command's modified key, the originator will proceed with the transaction as if it had acquired a lock on that key. It is thus possible for multiple transactions to believe they hold the same key lock at the same time.
> On the other hand, in replicated caches it is quite possible that a cache is not running on all the cluster nodes, and yet PrepareCommands are broadcast to everyone in parallel. So the solution should not involve sending exceptions (which have huge stack traces), and the originator should be able to ignore failure responses from nodes that were not targeted in the first place.
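The originator-side behaviour described above can be sketched as follows. This is a minimal sketch under assumptions: `PrepareReply`, `PrepareOk`, `CacheNotRunning` and `PrepareReplies.canProceed` are hypothetical names invented for illustration (not Infinispan classes), and node addresses are simplified to strings. A node whose cache is stopped answers with a cheap marker object instead of an exception, and the originator only inspects replies from the nodes it actually targeted.

```java
import java.util.*;

// Hypothetical, simplified reply types for illustration; not Infinispan's actual classes.
interface PrepareReply {}

/** The cache on the target node is running and the prepare succeeded. */
final class PrepareOk implements PrepareReply {
    static final PrepareOk INSTANCE = new PrepareOk();
    private PrepareOk() {}
}

/** Cheap marker (no exception, no stack trace) meaning "cache not running on this node". */
final class CacheNotRunning implements PrepareReply {
    static final CacheNotRunning INSTANCE = new CacheNotRunning();
    private CacheNotRunning() {}
}

final class PrepareReplies {
    /**
     * The prepare may proceed only if every node it actually targeted (e.g. the owners
     * of the modified keys) replied with PrepareOk. Replies from nodes outside the
     * target set, such as CacheNotRunning markers produced by a replicated-cache
     * broadcast, are never looked at.
     */
    static boolean canProceed(Map<String, PrepareReply> replies, Set<String> targets) {
        for (String node : targets) {
            if (!(replies.get(node) instanceof PrepareOk)) {
                return false; // missing reply, or cache stopped on a required owner
            }
        }
        return true;
    }
}
```

With this check, a stopped primary owner can no longer make the originator believe it holds the lock, while `CacheNotRunning` replies from non-owners in a broadcast are simply skipped.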
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
13 years, 3 months
[JBoss JIRA] (ISPN-2503) Re-enable FD_SOCK in the test suite
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-2503?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-2503:
--------------------------------
Priority: Major (was: Critical)
> Re-enable FD_SOCK in the test suite
> -----------------------------------
>
> Key: ISPN-2503
> URL: https://issues.jboss.org/browse/ISPN-2503
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite
> Affects Versions: 5.2.0.Beta3
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 5.2.0.Final
>
>
> Some tests fail randomly with a timeout waiting for a new view after stopping the coordinator:
> {noformat}
> 01:08:11,695 ERROR (testng-CacheClusterJoinTest:) [UnitTestTestNGListener] Test testIsCoordinator(org.infinispan.api.CacheClusterJoinTest) failed.
> java.lang.RuntimeException: Timed out before caches had complete views. Expected 1 members in each view. Views are as follows: [[NodeC-27739, NodeD-5092]]
> at org.infinispan.test.TestingUtil.viewsTimedOut(TestingUtil.java:249)
> at org.infinispan.test.TestingUtil.blockUntilViewsReceived(TestingUtil.java:311)
> at org.infinispan.api.CacheClusterJoinTest.testIsCoordinator(CacheClusterJoinTest.java:87)
> {noformat}
> This happens because the old coordinator tries to install a new view without itself before stopping, but fails:
> {noformat}
> 01:07:21,616 WARN (ViewHandler,ISPN,NodeC-27739:) [GMS] NodeC-27739: failed to collect all ACKs (expected=1) for view [NodeD-5092|2] after 2000ms, missing ACKs from [NodeD-5092]
> {noformat}
> The survivor never received the view installation message, so it didn't install view {{[NodeD-5092|2]}}. Because it had no failure detection, it couldn't tell that the current coordinator was dead, so it never installed a new view.
> It's not clear why the survivor didn't receive the view message at all in the test suite, but since this can clearly happen, we should re-enable FD_SOCK in the test suite.
[JBoss JIRA] (ISPN-2579) StateResponseCommand received after the node is removed from CH causes IllegalArgumentException
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-2579?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-2579:
--------------------------------
Fix Version/s: 5.3.0.Final
(was: 5.2.0.Final)
> StateResponseCommand received after the node is removed from CH causes IllegalArgumentException
> -----------------------------------------------------------------------------------------------
>
> Key: ISPN-2579
> URL: https://issues.jboss.org/browse/ISPN-2579
> Project: Infinispan
> Issue Type: Bug
> Components: State transfer
> Affects Versions: 5.2.0.Beta5
> Reporter: Radim Vansa
> Assignee: Adrian Nistor
> Priority: Minor
> Fix For: 5.3.0.Final
>
>
> When a node requests state transfer (ST) and then receives a CH in which it is not a member, it sends a CANCEL_STATE_TRANSFER request. However, if the StateResponseCommand is already in flight and reaches the node, it causes
> {code}
> java.lang.IllegalArgumentException: Node hyperion947-55285 is not a member
> at org.infinispan.distribution.ch.DefaultConsistentHash.getSegmentsForOwner(DefaultConsistentHash.java:102)
> at org.infinispan.statetransfer.StateConsumerImpl.applyState(StateConsumerImpl.java:272)
> at org.infinispan.statetransfer.StateResponseCommand.perform(StateResponseCommand.java:86)
> {code}
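One possible guard for this race, sketched under assumptions: `SketchConsistentHash` and `SketchStateConsumer` are hypothetical, simplified stand-ins for `DefaultConsistentHash` and `StateConsumerImpl`, and the report does not say how the issue was actually fixed. The idea is to check membership in the current CH before asking for the local node's segments, and to drop a late `StateResponseCommand` instead of throwing.

```java
import java.util.*;

// Hypothetical, simplified stand-ins; not Infinispan's DefaultConsistentHash or StateConsumerImpl.
final class SketchConsistentHash {
    private final Map<String, Set<Integer>> segmentsByOwner;

    SketchConsistentHash(Map<String, Set<Integer>> segmentsByOwner) {
        this.segmentsByOwner = segmentsByOwner;
    }

    boolean isMember(String node) {
        return segmentsByOwner.containsKey(node);
    }

    /** Mirrors the failing call in the report: throws for non-members. */
    Set<Integer> getSegmentsForOwner(String node) {
        Set<Integer> segments = segmentsByOwner.get(node);
        if (segments == null) {
            throw new IllegalArgumentException("Node " + node + " is not a member");
        }
        return segments;
    }
}

final class SketchStateConsumer {
    private final String localAddress;
    private volatile SketchConsistentHash currentCh;

    SketchStateConsumer(String localAddress, SketchConsistentHash initialCh) {
        this.localAddress = localAddress;
        this.currentCh = initialCh;
    }

    void onTopologyUpdate(SketchConsistentHash newCh) {
        currentCh = newCh;
    }

    /** Returns how many received segments were applied; 0 if the late response is dropped. */
    int applyState(Collection<Integer> receivedSegments) {
        SketchConsistentHash ch = currentCh; // read once so the check and the use see the same CH
        if (!ch.isMember(localAddress)) {
            return 0; // we left the CH after requesting state: ignore the in-flight response
        }
        Set<Integer> owned = ch.getSegmentsForOwner(localAddress);
        int applied = 0;
        for (Integer segment : receivedSegments) {
            if (owned.contains(segment)) {
                applied++; // a real consumer would write the segment's entries here
            }
        }
        return applied;
    }
}
```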
[JBoss JIRA] (ISPN-2578) Two PrepareCommands in parallel cause ConcurrentModificationException
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-2578?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-2578:
--------------------------------
Fix Version/s: 5.2.0.CR2
> Two PrepareCommands in parallel cause ConcurrentModificationException
> ---------------------------------------------------------------------
>
> Key: ISPN-2578
> URL: https://issues.jboss.org/browse/ISPN-2578
> Project: Infinispan
> Issue Type: Bug
> Affects Versions: 5.2.0.Beta5
> Reporter: Radim Vansa
> Assignee: Adrian Nistor
> Priority: Critical
> Fix For: 5.2.0.CR2, 5.2.0.Final
>
>
> Situation:
> 1) Node A broadcasts PrepareCommand to nodes B, C
> 2) Node A leaves cluster, causing new topology to be installed
> 3) The command arrives at B and C with a lower topology id than the current one
> 4) Both B and C forward the command to node D
> 5) D executes the two commands in parallel, finds out that A has left, and therefore executes a RollbackCommand
> In {{AbstractTxLockingInterceptor.visitRollbackCommand}} we call {{LockManagerImpl.unlockAll}}, which iterates over the keys and unlocks them. Because these two prepares aren't synchronized on the {{lockedKeys}} set, one may unlock and remove the keys while the other is still iterating over them, causing a {{ConcurrentModificationException}}.
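The race can be illustrated in plain Java. The first method below replays the interleaving deterministically on a single thread: clearing a `HashSet` while a for-each loop is iterating over it makes the fail-fast iterator throw `ConcurrentModificationException` on its next step. `SyncedTxLocks` then shows one possible remedy, assumed here for illustration only: guard the per-transaction key set with a monitor so that two rollbacks of the same transaction cannot iterate and remove at the same time. All class names are hypothetical, not Infinispan's.

```java
import java.util.*;

final class RollbackRace {
    /** Replays the race on one thread: the "second rollback" clears the set mid-iteration. */
    static boolean unsynchronizedInterleavingThrows() {
        Set<Object> lockedKeys = new HashSet<>(Arrays.asList("k1", "k2", "k3"));
        try {
            for (Object key : lockedKeys) {
                lockedKeys.clear(); // what the concurrent unlockAll does to the shared set
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }
}

/** Hypothetical per-transaction lock holder; every access is guarded by the object's monitor. */
final class SyncedTxLocks {
    private final Set<Object> lockedKeys = new HashSet<>();
    private final Set<Object> lockTable; // stand-in for the lock manager's shared state

    SyncedTxLocks(Set<Object> lockTable) {
        this.lockTable = lockTable;
    }

    synchronized void lock(Object key) {
        lockedKeys.add(key);
        lockTable.add(key);
    }

    synchronized void unlockAll() {
        for (Iterator<Object> it = lockedKeys.iterator(); it.hasNext(); ) {
            lockTable.remove(it.next());
            it.remove(); // removal through the iterator does not invalidate it
        }
    }

    synchronized int size() {
        return lockedKeys.size();
    }
}
```

With the monitor in place, the second `unlockAll` for the same transaction simply finds an empty set, so each key is released once.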
[JBoss JIRA] (ISPN-2578) Two PrepareCommands in parallel cause ConcurrentModificationException
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-2578?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-2578:
--------------------------------
Fix Version/s: (was: 5.2.0.Final)
> Two PrepareCommands in parallel cause ConcurrentModificationException
> ---------------------------------------------------------------------
>
> Key: ISPN-2578
> URL: https://issues.jboss.org/browse/ISPN-2578
> Project: Infinispan
> Issue Type: Bug
> Affects Versions: 5.2.0.Beta5
> Reporter: Radim Vansa
> Assignee: Adrian Nistor
> Priority: Critical
> Fix For: 5.2.0.CR2
>
>
> Situation:
> 1) Node A broadcasts PrepareCommand to nodes B, C
> 2) Node A leaves cluster, causing new topology to be installed
> 3) The command arrives at B and C with a lower topology id than the current one
> 4) Both B and C forward the command to node D
> 5) D executes the two commands in parallel, finds out that A has left, and therefore executes a RollbackCommand
> In {{AbstractTxLockingInterceptor.visitRollbackCommand}} we call {{LockManagerImpl.unlockAll}}, which iterates over the keys and unlocks them. Because these two prepares aren't synchronized on the {{lockedKeys}} set, one may unlock and remove the keys while the other is still iterating over them, causing a {{ConcurrentModificationException}}.
[JBoss JIRA] (ISPN-2587) Optimize command forwarding after topology changes
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-2587?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-2587:
--------------------------------
Fix Version/s: 5.3.0.Final
(was: 5.2.0.Final)
> Optimize command forwarding after topology changes
> --------------------------------------------------
>
> Key: ISPN-2587
> URL: https://issues.jboss.org/browse/ISPN-2587
> Project: Infinispan
> Issue Type: Task
> Components: State transfer
> Affects Versions: 5.2.0.Beta5
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 5.3.0.Final
>
>
> When a node receives a command with a topology id lower than its own topology id, it forwards the command to all the owners in the current topology.
> This is especially bad in replicated caches, where all the nodes check whether to forward or not, and after a join we may get {{n * (n-1)}} forwarded commands instead of just {{n}}.
> Most of the time the difference between the current topology id and the command's topology id is <= 1, so we could avoid a lot of the extra forwarding if we kept the previous cache topology and we forwarded the command only to the owners added in the latest topology. Obviously, if the command is older we'd still forward it to all the owners.
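The proposed optimization can be sketched as a small target-selection function. To make the cost concrete: in a replicated cache with n = 4 nodes, forwarding to all owners after a join can produce 4 * 3 = 12 forwarded commands, versus 4 when each node only contacts the newly added owner. This is a sketch under assumptions: `ForwardingTargets.targets` is a hypothetical helper, owners are plain strings, and a command whose topology id is not older than the current one is assumed not to need forwarding.

```java
import java.util.*;

final class ForwardingTargets {
    /**
     * Hypothetical helper returning the owners a stale command should be forwarded to.
     * If the command is exactly one topology behind, only the owners added in the latest
     * topology can have missed it; if it is older, all current owners are targeted.
     */
    static Set<String> targets(int commandTopologyId, int currentTopologyId,
                               Set<String> previousOwners, Set<String> currentOwners) {
        if (commandTopologyId >= currentTopologyId) {
            return Collections.emptySet(); // not stale: no forwarding needed
        }
        if (currentTopologyId - commandTopologyId == 1) {
            Set<String> added = new HashSet<>(currentOwners);
            added.removeAll(previousOwners); // owners new in the latest topology
            return added;
        }
        return new HashSet<>(currentOwners); // older command: forward to everyone, as today
    }
}
```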
[JBoss JIRA] (ISPN-2612) Problem broadcasting CH_UPDATE command
by Mircea Markus (JIRA)
[ https://issues.jboss.org/browse/ISPN-2612?page=com.atlassian.jira.plugin.... ]
Mircea Markus updated ISPN-2612:
--------------------------------
Priority: Blocker (was: Critical)
> Problem broadcasting CH_UPDATE command
> --------------------------------------
>
> Key: ISPN-2612
> URL: https://issues.jboss.org/browse/ISPN-2612
> Project: Infinispan
> Issue Type: Bug
> Components: RPC
> Affects Versions: 5.2.0.Beta5
> Reporter: Michal Linhard
> Assignee: Dan Berindei
> Priority: Blocker
> Fix For: 5.2.0.CR2
>
> Attachments: session-cluster.xml, test.zip
>
>
> Infinispan 5.2.0.Beta5
> JGroups 3.2.4.Final
> Steps to reproduce (I'm using two virtual interfaces test1, test2)
> 1. Start org.jboss.qa.jdg.Test with -Djgroups.udp.bind_addr=test1 -Djava.net.preferIPv4Stack=true
> 2. wait 10 sec
> 3. Start org.jboss.qa.jdg.Test with -Djgroups.udp.bind_addr=test2 -Djava.net.preferIPv4Stack=true
> After 5 seconds there should be this timeout exception:
> {code}
> 19:42:14,146 WARN [org.infinispan.topology.CacheTopologyControlCommand] (OOB-2,mlinhard-work-37329) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=___defaultcache, type=REBALANCE_CONFIRM, sender=mlinhard-work-47337, joinInfo=null, topologyId=1, currentCH=null, pendingCH=null, throwable=null, viewId=1}
> java.util.concurrent.ExecutionException: org.infinispan.CacheException: org.jgroups.TimeoutException: TimeoutException
> at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
> at java.util.concurrent.FutureTask.get(FutureTask.java:91)
> at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:563)
> at org.infinispan.topology.ClusterTopologyManagerImpl.broadcastConsistentHashUpdate(ClusterTopologyManagerImpl.java:349)
> at org.infinispan.topology.ClusterTopologyManagerImpl.handleRebalanceCompleted(ClusterTopologyManagerImpl.java:213)
> at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:160)
> at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:137)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommandFromLocalCluster(CommandAwareRpcDispatcher.java:252)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:219)
> at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:483)
> at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:390)
> at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:248)
> at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:598)
> at org.jgroups.JChannel.up(JChannel.java:703)
> at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1020)
> at org.jgroups.protocols.RSVP.up(RSVP.java:172)
> at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)
> at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
> at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
> at org.jgroups.protocols.pbcast.GMS.up(GMS.java:896)
> at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:244)
> at org.jgroups.protocols.UNICAST2.handleDataReceived(UNICAST2.java:736)
> at org.jgroups.protocols.UNICAST2.up(UNICAST2.java:414)
> at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:606)
> at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:143)
> at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:187)
> at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)
> at org.jgroups.protocols.MERGE2.up(MERGE2.java:205)
> at org.jgroups.protocols.Discovery.up(Discovery.java:359)
> at org.jgroups.protocols.TP.passMessageUp(TP.java:1287)
> at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1850)
> at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1823)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: org.infinispan.CacheException: org.jgroups.TimeoutException: TimeoutException
> at org.infinispan.util.Util.rewrapAsCacheException(Util.java:532)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:152)
> at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:518)
> at org.infinispan.topology.ClusterTopologyManagerImpl$2.call(ClusterTopologyManagerImpl.java:545)
> at org.infinispan.topology.ClusterTopologyManagerImpl$2.call(ClusterTopologyManagerImpl.java:542)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> ... 3 more
> Caused by: org.jgroups.TimeoutException: TimeoutException
> at org.jgroups.util.Promise._getResultWithTimeout(Promise.java:145)
> at org.jgroups.util.Promise.getResultWithTimeout(Promise.java:40)
> at org.jgroups.util.AckCollector.waitForAllAcks(AckCollector.java:93)
> at org.jgroups.protocols.RSVP$Entry.block(RSVP.java:287)
> at org.jgroups.protocols.RSVP.down(RSVP.java:118)
> at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1025)
> at org.jgroups.JChannel.down(JChannel.java:718)
> at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:616)
> at org.jgroups.blocks.RequestCorrelator.sendRequest(RequestCorrelator.java:173)
> at org.jgroups.blocks.GroupRequest.sendRequest(GroupRequest.java:360)
> at org.jgroups.blocks.GroupRequest.sendRequest(GroupRequest.java:103)
> at org.jgroups.blocks.Request.execute(Request.java:83)
> at org.jgroups.blocks.MessageDispatcher.cast(MessageDispatcher.java:335)
> at org.jgroups.blocks.MessageDispatcher.castMessage(MessageDispatcher.java:249)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processCalls(CommandAwareRpcDispatcher.java:330)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:145)
> ... 8 more
> {code}
> Analysis:
> These are the messages sent after view change:
> {code}
> test1 test2
> <--- JOIN ----
> ---- REBALANCE_START --->
> <--- StateRequestCommand ----
> ---- StateResponseCommand --->
> <--- REBALANCE_CONFIRM ----
> ---- CH_UPDATE --->
> {code}
> The last CH_UPDATE message is broadcast and test2 processes it successfully, but test1 keeps waiting because, for some reason, it also expects a response from itself: the local variable {{entry}} in the method {{RSVP.down}}
> (https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/...)
> contained the local address.
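The hang can be illustrated with a minimal ack collector whose expected set excludes the sender. `SketchAckCollector` is a hypothetical class, not JGroups' `AckCollector` or `RSVP`; the report only says that the RSVP entry contained the local address, not how the fix was done, so removing the local address here is an assumption.

```java
import java.util.*;

/**
 * Hypothetical, simplified ack collector. The expected set is built from the destination
 * members minus the local address, so a broadcast never waits for an ack from its own sender.
 */
final class SketchAckCollector {
    private final Set<String> missing;

    SketchAckCollector(Collection<String> members, String localAddress) {
        missing = new HashSet<>(members);
        missing.remove(localAddress); // the step whose absence makes the sender wait on itself
    }

    synchronized void ack(String member) {
        missing.remove(member);
        notifyAll();
    }

    /** Waits up to timeoutMs for all acks; returns false on timeout or interrupt. */
    synchronized boolean waitForAllAcks(long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!missing.isEmpty()) {
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false;
            }
            try {
                wait(remainingMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

In the scenario above, test1 broadcasting to {test1, test2} would then only wait for test2's ack.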
[JBoss JIRA] (ISPN-2697) HotRodServer startup fails when its record cannot be inserted into topology cache
by Bela Ban (JIRA)
[ https://issues.jboss.org/browse/ISPN-2697?page=com.atlassian.jira.plugin.... ]
Bela Ban commented on ISPN-2697:
--------------------------------
Yes. The idea is to increase the interval at which a STABLE gossip is sent as the cluster grows, in order to keep the average STABLE gossip rate for the entire cluster about the same.
If we didn't increase the interval, the gossip rate for the entire cluster would grow with the number of members.
If you have a scenario that requires immediate ack'ed delivery, then tag the message with RSVP.
Infinispan does this in some places.
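The arithmetic behind this comment: if each of n members sends one STABLE gossip per interval T, the cluster as a whole sends n / T gossips per unit of time, so keeping that rate constant means growing T linearly with n. The sketch below assumes exactly that linear scaling; `StableGossipSketch` is hypothetical and is not taken from JGroups' STABLE implementation.

```java
final class StableGossipSketch {
    /** Per-node interval that keeps the cluster-wide rate at one gossip per targetClusterIntervalMs. */
    static long perNodeIntervalMs(long targetClusterIntervalMs, int clusterSize) {
        // n nodes, each gossiping once per interval, give a cluster-wide rate of n / interval;
        // scaling the interval by n cancels the growth.
        return targetClusterIntervalMs * Math.max(1, clusterSize);
    }

    /** Cluster-wide gossips per second when every node uses the given interval. */
    static double clusterGossipsPerSecond(long perNodeIntervalMs, int clusterSize) {
        return clusterSize * 1000.0 / perNodeIntervalMs;
    }
}
```

With a fixed 1000 ms interval, 10 members produce 10 gossips per second; with the scaled 10000 ms interval they produce 1 per second, the same as a single member. The flip side, relevant to the issue below, is that each member waits longer between gossips in a larger cluster.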
> HotRodServer startup fails when its record cannot be inserted into topology cache
> ---------------------------------------------------------------------------------
>
> Key: ISPN-2697
> URL: https://issues.jboss.org/browse/ISPN-2697
> Project: Infinispan
> Issue Type: Bug
> Components: Remote protocols
> Affects Versions: 5.2.0.Beta6
> Reporter: Radim Vansa
> Assignee: Galder Zamarreño
> Priority: Critical
> Fix For: 5.2.0.CR2
>
>
> When the HotRodServer starts, it inserts its record into __hotRodTopologyCache ({{HotRodServer.addSelfToTopologyView(...)}}).
> However, this put can fail very easily: the command is broadcast using the NAKACK2 protocol, so if the message gets lost and no further broadcast message follows it, the message will not be retransmitted and the put operation times out (Replication timeout). This fails the whole HotRodServer startup, all because of one lost UDP message.
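Why a single lost broadcast can stall the put is easier to see with a toy model of NAK-based delivery, where the receiver, not the sender, detects losses. `NakReceiverSketch` is hypothetical and heavily simplified: it only notices a missing sequence number when a higher one from the same sender arrives, and it ignores STABLE digests and any other mechanism NAKACK2 may use to learn about a lost last message.

```java
import java.util.*;

/** Hypothetical, heavily simplified model of receiver-driven (NAK-based) loss detection. */
final class NakReceiverSketch {
    private long highestDelivered = 0; // sender seqnos start at 1
    private final TreeSet<Long> buffered = new TreeSet<>();

    /** Accepts a message; returns the seqnos a retransmission request would ask for. */
    List<Long> receive(long seqno) {
        List<Long> missing = new ArrayList<>();
        if (seqno <= highestDelivered) {
            return missing; // duplicate
        }
        buffered.add(seqno);
        // Deliver every message that is now contiguous with what was already delivered.
        while (buffered.remove(highestDelivered + 1)) {
            highestDelivered++;
        }
        // Any hole below the highest buffered seqno is a detectable gap.
        if (!buffered.isEmpty()) {
            for (long s = highestDelivered + 1; s < buffered.last(); s++) {
                if (!buffered.contains(s)) {
                    missing.add(s);
                }
            }
        }
        return missing;
    }

    long highestDelivered() {
        return highestDelivered;
    }
}
```

If the topology-cache put is message 2 and it is lost, a receiver that has seen message 1 reports nothing missing until message 3 (or some other hint about the sender's highest seqno) arrives, which is the window in which the put can hit its replication timeout.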