[JBoss JIRA] (ISPN-2836) org.jgroups.TimeoutException after invoking MapCombineCommand in Map/Reduce task with 2 nodes
by Alan Field (JIRA)
[ https://issues.jboss.org/browse/ISPN-2836?page=com.atlassian.jira.plugin.... ]
Alan Field commented on ISPN-2836:
----------------------------------
[~pruivo] Done!
> org.jgroups.TimeoutException after invoking MapCombineCommand in Map/Reduce task with 2 nodes
> ---------------------------------------------------------------------------------------------
>
> Key: ISPN-2836
> URL: https://issues.jboss.org/browse/ISPN-2836
> Project: Infinispan
> Issue Type: Bug
> Components: Distributed Execution and Map/Reduce
> Affects Versions: 5.2.1.Final
> Reporter: Alan Field
> Assignee: Pedro Ruivo
> Priority: Blocker
> Labels: onboard
> Fix For: 5.3.0.Final
>
> Attachments: afield-tcp-521-final.txt, benchmark-mapreduce-multifilesize.xml, dist-udp-no-tx.xml, jgroups-udp.xml, udp-edg-perf01.txt, udp-edg-perf02.txt
>
>
> Using RadarGun with two nodes to execute the example WordCount Map/Reduce job against a cache holding ~550 keys with 1MB values produces a thread deadlock. The cache is distributed, with transactions disabled.
> With the TCP transport, the job deadlocks without throwing an exception. Disabling the send queue and setting UNICAST2.conn_expiry_timeout=0 prevents the deadlock, but the job does not complete. The nodes send "are-you-alive" messages back and forth, and I have seen the following exception:
> {noformat}
> 11:44:29,970 ERROR [org.jgroups.protocols.TCP] (OOB-98,default,edg-perf01-1907) failed sending message to edg-perf02-32536 (76 bytes): java.net.SocketException: Socket closed, cause: null
> at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:352)
> at org.radargun.cachewrappers.InfinispanMapReduceWrapper.executeMapReduceTask(InfinispanMapReduceWrapper.java:98)
> at org.radargun.stages.MapReduceStage.executeOnSlave(MapReduceStage.java:74)
> at org.radargun.Slave$2.run(Slave.java:103)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.util.concurrent.ExecutionException: org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending message to edg-perf02-32536
> at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
> at java.util.concurrent.FutureTask.get(FutureTask.java:83)
> at org.infinispan.distexec.mapreduce.MapReduceTask$TaskPart.get(MapReduceTask.java:832)
> at org.infinispan.distexec.mapreduce.MapReduceTask.executeMapPhaseWithLocalReduction(MapReduceTask.java:477)
> at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:350)
> ... 9 more
> Caused by: org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending message to edg-perf02-32536
> at org.infinispan.util.Util.rewrapAsCacheException(Util.java:541)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:186)
> at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:515)
> 11:44:29,978 ERROR [org.jgroups.protocols.TCP] (Timer-3,default,edg-perf01-1907) failed sending message to edg-perf02-32536 (60 bytes): java.net.SocketException: Socket closed, cause: null
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:175)
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:197)
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:254)
> at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:80)
> at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:288)
> ... 5 more
> Caused by: org.jgroups.TimeoutException: timeout sending message to edg-perf02-32536
> at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:390)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:301)
> 11:44:29,979 ERROR [org.jgroups.protocols.TCP] (Timer-4,default,edg-perf01-1907) failed sending message to edg-perf02-32536 (63 bytes): java.net.SocketException: Socket closed, cause: null
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:179)
> ... 11 more
> {noformat}
> With UDP transport, both threads are deadlocked. I will attach thread dumps from runs using TCP and UDP transport.
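A minimal sketch of how the example WordCount job above is typically wired up with the 5.x MapReduceTask API, for reference while reading the stack trace (the mapper/reducer class names are illustrative, not necessarily the ones RadarGun uses):
{code}
import java.util.Iterator;
import java.util.Map;

import org.infinispan.Cache;
import org.infinispan.distexec.mapreduce.Collector;
import org.infinispan.distexec.mapreduce.MapReduceTask;
import org.infinispan.distexec.mapreduce.Mapper;
import org.infinispan.distexec.mapreduce.Reducer;

// Emits one (word, 1) pair for every word in each cached value.
class WordCountMapper implements Mapper<String, String, String, Integer> {
   @Override
   public void map(String key, String value, Collector<String, Integer> c) {
      for (String word : value.split("\\s+"))
         c.emit(word, 1);
   }
}

// Sums the per-word counts emitted by the mappers.
class WordCountReducer implements Reducer<String, Integer> {
   @Override
   public Integer reduce(String word, Iterator<Integer> counts) {
      int sum = 0;
      while (counts.hasNext())
         sum += counts.next();
      return sum;
   }
}

// Given a running distributed Cache<String, String> named "cache", this is
// the MapReduceTask.execute() call that appears at the top of the trace.
Map<String, Integer> wordCounts =
      new MapReduceTask<String, String, String, Integer>(cache)
            .mappedWith(new WordCountMapper())
            .reducedWith(new WordCountReducer())
            .execute();
{code}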
[JBoss JIRA] (ISPN-2836) org.jgroups.TimeoutException after invoking MapCombineCommand in Map/Reduce task with 2 nodes
by Pedro Ruivo (JIRA)
[ https://issues.jboss.org/browse/ISPN-2836?page=com.atlassian.jira.plugin.... ]
Pedro Ruivo commented on ISPN-2836:
-----------------------------------
[~afield] Please mark it in this JIRA. I still have to polish the test cases in the Pull Request :) There is no need to create a new JIRA.
[JBoss JIRA] (ISPN-2836) org.jgroups.TimeoutException after invoking MapCombineCommand in Map/Reduce task with 2 nodes
by Alan Field (JIRA)
[ https://issues.jboss.org/browse/ISPN-2836?page=com.atlassian.jira.plugin.... ]
Alan Field commented on ISPN-2836:
----------------------------------
[~pruivo] Thanks. The docs also need to be updated to describe the new timeout APIs you added, and to indicate that in previous versions, Sync.replTimeout is used to set this timeout. Do you want to mark this issue as "Affects Documentation", or do you want a new JIRA?
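For reference, in 5.2.x this is configured through the synchronous replication timeout on the cache; a minimal sketch (the per-task timeout mentioned in the comment for 5.3 is an assumption based on this issue's fix):
{code}
import java.util.concurrent.TimeUnit;

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

// 5.2.x: Map/Reduce RPCs time out according to Sync.replTimeout.
Configuration cfg = new ConfigurationBuilder()
      .clustering().cacheMode(CacheMode.DIST_SYNC)
      .sync().replTimeout(120, TimeUnit.SECONDS)
      .build();

// 5.3.x (assumption, per the fix for this issue): a timeout settable on the
// MapReduceTask itself, e.g. task.timeout(10, TimeUnit.MINUTES).
{code}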
[JBoss JIRA] (ISPN-3140) JMX operation to suppress state transfer
by Tristan Tarrant (JIRA)
[ https://issues.jboss.org/browse/ISPN-3140?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant commented on ISPN-3140:
---------------------------------------
This needs to have a "Server" counterpart, so that it can be exposed via the Server RHQ plugin (which doesn't use JMX, but DMR).
> JMX operation to suppress state transfer
> ----------------------------------------
>
> Key: ISPN-3140
> URL: https://issues.jboss.org/browse/ISPN-3140
> Project: Infinispan
> Issue Type: Feature Request
> Components: Distributed Cache, State transfer
> Affects Versions: 5.2.6.Final
> Reporter: Manik Surtani
> Assignee: Mircea Markus
> Fix For: 5.3.0.Final
>
>
> This feature request is to expose a JMX operation on each node, to suppress state transfer for a period of time. This flag would be {{false}} by default.
> The use case of this flag would be to ease bringing down (and up) a cluster for maintenance work. A typical workflow would be:
> 1) Shut down application requests to the data grid
> 2) Suppress state transfer on all nodes via JMX
> 3) Bring down all nodes
> 4) Perform maintenance work
> 5) Bring up nodes, one at a time. As each node comes up, disable state transfer for the node via JMX.
> 6) Once all nodes are up, enable state transfer for each node again via JMX
> 7) Allow application requests to reach the grid again.
> The purpose of this is to allow a smooth and fast shutdown and startup, and to remove the risk of OOM errors when bringing a grid down.
> This is a small but useful subset of full manual state transfer as defined in ISPN-1394.
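A sketch of what steps 2/5/6 could look like from a plain JMX client (the ObjectName and attribute name below are assumptions, since the operation is only being proposed here):
{code}
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical names: the final ObjectName/attribute depend on how the
// operation is eventually exposed.
JMXConnector connector = JMXConnectorFactory.connect(
      new JMXServiceURL("service:jmx:rmi:///jndi/rmi://node1:9999/jmxrmi"));
MBeanServerConnection mbs = connector.getMBeanServerConnection();
ObjectName topology = new ObjectName(
      "org.infinispan:type=CacheManager,name=\"DefaultCacheManager\",component=LocalTopologyManager");

// Step 2: suppress state transfer before bringing the nodes down.
mbs.setAttribute(topology, new Attribute("RebalancingEnabled", false));
// ... later, step 6: re-enable it once all nodes are back up.
mbs.setAttribute(topology, new Attribute("RebalancingEnabled", true));
connector.close();
{code}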
[JBoss JIRA] (ISPN-3140) JMX operation to suppress state transfer
by Adrian Nistor (JIRA)
[ https://issues.jboss.org/browse/ISPN-3140?page=com.atlassian.jira.plugin.... ]
Adrian Nistor edited comment on ISPN-3140 at 6/7/13 10:51 AM:
--------------------------------------------------------------
Pasted from Adrian's message ([http://markmail.org/message/ns7aojy7v7su2t7p]):
{quote}
1. Add a JMX writable attribute (or operation?) to ClusterTopologyManager (name it suppressRehashing?) that is false by default but should also be configurable via API or xml. While this attribute is true the ClusterTopologyManager queues all join/leave/exclude (see below) requests and does not execute them on the spot as it would normally happen. [...] When it is set back to false all queued operations (except the ones that cancel each other out) are executed. The setter should be synchronous, so when setting it back to false it does not return until the queue is empty and all rehashing has been processed.
2. We add a JMX operation excludeNodes(list of addresses) to ClusterTopologyManager. [...] This operation removes the node from the topology (almost as if it left) and forces a rebalance. The node is still present in the current CH but not in the pending CH. It's basically disowned by all its data, which is now being transferred to other (not excluded) nodes. At the end of the rebalance the node is removed from the topology for good and can be shut down without losing data. Note that if suppressRehashing==true, excludeNodes(..) just queues the nodes for later removal. We can batch multiple such exclusions and then re-activate the rehashing.
The parts that need to be implemented are written in italic above. Everything else is already there.
excludeNodes is a way of achieving a soft shutdown and should be used only if we care about preserving data in the extreme case where the nodes are the last/single owners. We can just kill the node directly if we do not care about its data.
suppressRehashing is a way of achieving some kind of batching of topology changes. This should speed up state transfer a lot because it avoids a lot of pointless reshuffling of data segments when we have many successive joiners/leavers.
So what happens if the current coordinator dies for whatever reason? The new one will take control and will not have knowledge of the existing rehash queue or the previous status of the suppressRehashing attribute, so it will just get the current cache membership status from all members of the current view and proceed with the rehashing as usual. If the user does not want this he can set a default value of true for suppressRehashing. The admin now has to interact with the new coordinator via JMX. But that's not as bad as the alternative where all the nodes are involved in this JMX scheme :) I think having only the coordinator involved in this is a plus.
{quote}
We're actually going to implement only point 1 now, and point 2 will be a separate issue (or perhaps as a part of ISPN-1394).
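A rough sketch of the queueing semantics from point 1 (the class and method names are illustrative, not the actual ClusterTopologyManager API):
{code}
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative only: queue topology events while rehashing is suppressed and
// drain them synchronously when it is re-enabled.
class RehashSuppressingTopologyManager {
   private final Queue<Runnable> pending = new ArrayDeque<Runnable>();
   private boolean suppressRehashing; // false by default

   synchronized void onTopologyRequest(Runnable rebalance) {
      if (suppressRehashing)
         pending.add(rebalance); // join/leave/exclude queued, no rebalance yet
      else
         rebalance.run();
   }

   // The setter is synchronous: setting it back to false returns only after
   // the queue has been drained and all rehashing has been processed.
   synchronized void setSuppressRehashing(boolean suppress) {
      this.suppressRehashing = suppress;
      if (!suppress) {
         Runnable r;
         while ((r = pending.poll()) != null)
            r.run(); // the batched requests are processed in one burst
      }
   }
}
{code}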
[JBoss JIRA] (ISPN-3140) JMX operation to suppress state transfer
by Adrian Nistor (JIRA)
[ https://issues.jboss.org/browse/ISPN-3140?page=com.atlassian.jira.plugin.... ]
Adrian Nistor commented on ISPN-3140:
-------------------------------------
As pointed out by Dennis Reed (http://markmail.org/message/al7elzaqme5jri22), it makes sense to forward the JMX operation from any member to the coordinator, to make it easier for the user.
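A sketch of that forwarding, using the coordinator information the transport already exposes (applyLocally(...) and forwardToCoordinator(...) are hypothetical helpers; any RPC addressed to the coordinator would do):
{code}
// Illustrative: a non-coordinator member relays the JMX call to the coordinator.
void setSuppressRehashing(boolean suppress) {
   Transport transport = rpcManager.getTransport();
   if (transport.isCoordinator()) {
      applyLocally(suppress); // we are the coordinator, apply directly
   } else {
      // forwardToCoordinator(...) is hypothetical.
      forwardToCoordinator(transport.getCoordinator(), suppress);
   }
}
{code}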
[JBoss JIRA] (ISPN-3169) Define a Query API that will be common for library mode + remote
by Adrian Nistor (JIRA)
[ https://issues.jboss.org/browse/ISPN-3169?page=com.atlassian.jira.plugin.... ]
Adrian Nistor updated ISPN-3169:
--------------------------------
Description:
Some notes on the query API:
- fluent, supporting a predefined set of filters (equality, range, etc). No user-pluggable filters will be supported, as this API will be available both embedded and *remotely* via the Java HR client
- the filters accept the schema (protobuf) primitive types as arguments (int, string, byte, etc.)
- should support paging: start page + pageSize
- projections are not supported
For an overview on the remote query design refer to: https://community.jboss.org/wiki/QueryingDesignInInfinispan
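A sketch of the kind of fluent calls this could translate to (every name below is hypothetical, since the API is only being defined here):
{code}
// Hypothetical shape of the common query API; not a final interface.
Query q = queryFactory.from(Book.class)
      .having("author").eq("Orwell")            // equality filter
      .and().having("year").between(1940, 1950) // range filter
      .startPage(0).pageSize(20)                // paging: start page + pageSize
      .build();
List<Book> results = q.list();
{code}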
> Define a Query API that will be common for library mode + remote
> ----------------------------------------------------------------
>
> Key: ISPN-3169
> URL: https://issues.jboss.org/browse/ISPN-3169
> Project: Infinispan
> Issue Type: Feature Request
> Reporter: Mircea Markus
> Assignee: Adrian Nistor
[JBoss JIRA] (ISPN-2995) FineGrainedAtomicHashMap may not lock all the composite keys in optimistic locking
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-2995?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-2995:
-------------------------------
Status: Resolved (was: Pull Request Sent)
Fix Version/s: 5.3.0.CR2
Resolution: Done
> FineGrainedAtomicHashMap may not lock all the composite keys in optimistic locking
> ----------------------------------------------------------------------------------
>
> Key: ISPN-2995
> URL: https://issues.jboss.org/browse/ISPN-2995
> Project: Infinispan
> Issue Type: Bug
> Components: Fine-grained API
> Affects Versions: 5.3.0.Alpha1
> Reporter: Pedro Ruivo
> Assignee: Pedro Ruivo
> Labels: atomic_map, locking
> Fix For: 5.3.0.CR2, 5.3.0.Final
>
>
> In OptimisticLockingInterceptor, we are collecting the composite keys from ApplyDeltaCommand if the key belongs to the node:
> {code}
> case ApplyDeltaCommand.COMMAND_ID:
>    ApplyDeltaCommand command = (ApplyDeltaCommand) wc;
>    if (cdl.localNodeIsOwner(command.getKey())) {
>       Object[] compositeKeys = command.getCompositeKeys();
>       set.addAll(Arrays.asList(compositeKeys));
>    }
>    break;
> {code}
> However, when acquiring the lock, the node only locks the key if it is the primary owner:
> {code}
> protected final void lockAndRegisterBackupLock(TxInvocationContext ctx, Object key, long lockTimeout, boolean skipLocking) throws InterruptedException {
>    if (cdl.localNodeIsPrimaryOwner(key)) {
>       lockKeyAndCheckOwnership(ctx, key, lockTimeout, skipLocking);
>    } else if (cdl.localNodeIsOwner(key)) {
>       ctx.getCacheTransaction().addBackupLockForKey(key);
>    }
> }
> {code}
> The composite key should always acquire the lock.
> This is probably a bug. Add a unit test to verify that locking works as expected for AtomicHashMap.
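A sketch of such a test, assuming the testsuite's usual MultipleCacheManagersTest helpers (cache(i), tm(i)); the assertion point described in the comments is illustrative:
{code}
// Illustrative test: writes to a FineGrainedAtomicMap inside a transaction
// must end up locking the composite keys on their primary owner, regardless
// of where the map's own key hashes.
public void testCompositeKeysLocked() throws Exception {
   Cache<String, Object> cache = cache(0);
   tm(0).begin();
   FineGrainedAtomicMap<String, String> map =
         AtomicMapLookup.getFineGrainedAtomicMap(cache, "mapKey");
   map.put("k1", "v1");
   map.put("k2", "v2");
   // On commit, the prepare acquires the locks; a lock check here (e.g. via
   // a CommandInterceptor that inspects the LockManager during prepare)
   // should see locks held for the composite keys of "k1" and "k2".
   tm(0).commit();
}
{code}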