[infinispan-issues] [JBoss JIRA] (ISPN-2836) org.jgroups.TimeoutException after invoking MapCombineCommand in Map/Reduce task with 2 nodes
Pedro Ruivo (JIRA)
jira-events at lists.jboss.org
Wed Jun 5 05:43:55 EDT 2013
[ https://issues.jboss.org/browse/ISPN-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779422#comment-12779422 ]
Pedro Ruivo edited comment on ISPN-2836 at 6/5/13 5:42 AM:
-----------------------------------------------------------
It is simple to add a timeout parameter to the MapReduceTasks to wait for remote replies. Note that this timeout is used directly by JGroups so it is not the timeout for the task because the current implementation waits for the remote replies sequentially. Also, it is possible to wait for ever with a timeout value <= 0. The default value will be the replication timeout.
Does it make sense to you? [~afield] [~mgencur]?
was (Author: pruivo):
It is simple to add a timeout parameter to the MapReduceTasks to wait for remote replies. Note that this timeout is used directly by JGroups so it is not the timeout for the task because the current implementation waits for the replies sequentially. Also, it is possible to wait for ever with a timeout value <= 0. The default value is the replication timeout.
Does it make sense to you? [~afield] [~mgencur]?
> org.jgroups.TimeoutException after invoking MapCombineCommand in Map/Reduce task with 2 nodes
> ---------------------------------------------------------------------------------------------
>
> Key: ISPN-2836
> URL: https://issues.jboss.org/browse/ISPN-2836
> Project: Infinispan
> Issue Type: Bug
> Components: Distributed Execution and Map/Reduce
> Affects Versions: 5.2.1.Final
> Reporter: Alan Field
> Assignee: Pedro Ruivo
> Priority: Blocker
> Labels: onboard
> Fix For: 5.3.0.Final
>
> Attachments: afield-tcp-521-final.txt, benchmark-mapreduce-multifilesize.xml, dist-udp-no-tx.xml, jgroups-udp.xml, udp-edg-perf01.txt, udp-edg-perf02.txt
>
>
> Using RadarGun and two nodes to execute the example WordCount Map/Reduce job against a cache with ~550 keys with a value size of 1MB is producing a thread deadlock. The cache is distributed with transactions disabled.
> TCP transport deadlocks without throwing an exception. Disabling the send queue and setting UNICAST2.conn_expiry_timeout=0 prevents the deadlock, but the job does not complete. The nodes send "are-you-alive" messages back and forth, and I have seen the following exception:
> {noformat}
> 11:44:29,970 ERROR [org.jgroups.protocols.TCP] (OOB-98,default,edg-perf01-1907) failed sending message to edg-perf02-32536 (76 bytes): java.net.SocketException: Socket closed, cause: null
> at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:352)
> at org.radargun.cachewrappers.InfinispanMapReduceWrapper.executeMapReduceTask(InfinispanMapReduceWrapper.java:98)
> at org.radargun.stages.MapReduceStage.executeOnSlave(MapReduceStage.java:74)
> at org.radargun.Slave$2.run(Slave.java:103)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.util.concurrent.ExecutionException: org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending message to edg-perf02-32536
> at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
> at java.util.concurrent.FutureTask.get(FutureTask.java:83)
> at org.infinispan.distexec.mapreduce.MapReduceTask$TaskPart.get(MapReduceTask.java:832)
> at org.infinispan.distexec.mapreduce.MapReduceTask.executeMapPhaseWithLocalReduction(MapReduceTask.java:477)
> at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:350)
> ... 9 more
> Caused by: org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending message to edg-perf02-32536
> at org.infinispan.util.Util.rewrapAsCacheException(Util.java:541)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:186)
> at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:515)
> 11:44:29,978 ERROR [org.jgroups.protocols.TCP] (Timer-3,default,edg-perf01-1907) failed sending message to edg-perf02-32536 (60 bytes): java.net.SocketException: Socket closed, cause: null
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:175)
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:197)
> at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:254)
> at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:80)
> at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:288)
> ... 5 more
> Caused by: org.jgroups.TimeoutException: timeout sending message to edg-perf02-32536
> at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:390)
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:301)
> 11:44:29,979 ERROR [org.jgroups.protocols.TCP] (Timer-4,default,edg-perf01-1907) failed sending message to edg-perf02-32536 (63 bytes): java.net.SocketException: Socket closed, cause: null
> at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:179)
> ... 11 more
> {noformat}
> With UDP transport, both threads are deadlocked. I will attach thread dumps from runs using TCP and UDP transport.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
More information about the infinispan-issues
mailing list