[ https://issues.jboss.org/browse/ISPN-2836?page=com.atlassian.jira.plugin.... ]
Alan Field commented on ISPN-2836:
----------------------------------
OK, I am running the job again in Jenkins using the TCP configuration.
(https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/afield-radargun-mapr...)
It is not generating a deadlock, but the Map/Reduce job with two nodes does not complete.
With one node executing the job, I see this sequence of log messages (from
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/afield-radargun-mapr...):
{noformat}
11:09:54,934 INFO [org.radargun.Slave] (pool-1-thread-1) Executing stage: MapReduce
{collatorFqn=null, exitBenchmarkOnSlaveFailure=false,
mapperFqn=org.infinispan.demo.mapreduce.WordCountMapper,
reducerFqn=org.infinispan.demo.mapreduce.WordCountReducer, runOnAllSlaves=false,
slaves=null, useSmartClassLoading=true }
11:09:54,935 INFO [org.radargun.stages.MapReduceStage] (pool-1-thread-1)
--------------------
11:09:54,944 INFO [org.radargun.utils.ClassLoadHelper] (pool-1-thread-1) Creating
newInstance org.infinispan.demo.mapreduce.WordCountMapper with classloader
java.net.URLClassLoader@4bf54c5f
11:09:54,948 INFO [org.radargun.utils.ClassLoadHelper] (pool-1-thread-1) Creating
newInstance org.infinispan.demo.mapreduce.WordCountReducer with classloader
java.net.URLClassLoader@4bf54c5f
11:09:55,188 DEBUG [org.infinispan.distexec.mapreduce.MapReduceTask] (transport-thread-0)
Invoking MapCombineCommand [keys=[], taskId=4d6ddf79-d919-4db6-ae34-20a25a158434] locally
11:11:10,174 DEBUG [org.infinispan.distexec.mapreduce.MapReduceTask] (transport-thread-0)
Invoked MapCombineCommand [keys=[], taskId=4d6ddf79-d919-4db6-ae34-20a25a158434] locally
11:11:47,849 INFO [org.radargun.stages.MapReduceStage] (pool-1-thread-1) MapReduce task
completed in 112.91 seconds
{noformat}
The RadarGun stage starts. It instantiates the Mapper and Reducer classes, then the task
completes after the MapCombineCommand is invoked.
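For reference, the stage is essentially driving the task like this; a minimal sketch, assuming the Infinispan 5.2 MapReduceTask API and the demo WordCountMapper/WordCountReducer classes referenced in the log (the actual InfinispanMapReduceWrapper code may wire this up differently):
{noformat}
// Minimal sketch of executing the WordCount job, assuming the Infinispan 5.2
// MapReduceTask API and the demo mapper/reducer classes named in the log above;
// the real RadarGun wrapper (InfinispanMapReduceWrapper) may differ.
import java.util.Map;

import org.infinispan.Cache;
import org.infinispan.demo.mapreduce.WordCountMapper;
import org.infinispan.demo.mapreduce.WordCountReducer;
import org.infinispan.distexec.mapreduce.MapReduceTask;

public class WordCountExecutionSketch {
   public static Map<String, Integer> run(Cache<String, String> cache) {
      // execute() triggers the "Invoking MapCombineCommand ... locally" and
      // "... on <node>" messages seen in the log, and blocks until the
      // map/combine/reduce phases finish.
      MapReduceTask<String, String, String, Integer> task =
            new MapReduceTask<String, String, String, Integer>(cache);
      return task.mappedWith(new WordCountMapper())
                 .reducedWith(new WordCountReducer())
                 .execute();
   }
}
{noformat}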
When two nodes are executing the Map/Reduce job, an exception occurs after the
MapCombineCommand is invoked on the second node in the cluster. The "MapReduce task
completed" message from the RadarGun stage does not appear.
{noformat}
11:12:45,617 INFO [org.radargun.Slave] (pool-1-thread-1) Executing stage: MapReduce
{collatorFqn=null, exitBenchmarkOnSlaveFailure=false,
mapperFqn=org.infinispan.demo.mapreduce.WordCountMapper,
reducerFqn=org.infinispan.demo.mapreduce.WordCountReducer, runOnAllSlaves=false,
slaves=null, useSmartClassLoading=true }
11:12:45,618 INFO [org.radargun.stages.MapReduceStage] (pool-1-thread-1)
--------------------
11:12:45,618 INFO [org.radargun.utils.ClassLoadHelper] (pool-1-thread-1) Creating
newInstance org.infinispan.demo.mapreduce.WordCountMapper with classloader
java.net.URLClassLoader@4bf54c5f
11:12:45,618 INFO [org.radargun.utils.ClassLoadHelper] (pool-1-thread-1) Creating
newInstance org.infinispan.demo.mapreduce.WordCountReducer with classloader
java.net.URLClassLoader@4bf54c5f
11:12:45,622 DEBUG [org.infinispan.distexec.mapreduce.MapReduceTask] (transport-thread-5)
Invoking MapCombineCommand [keys=[], taskId=262e0a68-ee5c-4534-8458-97340158b129] locally
11:12:45,622 DEBUG [org.infinispan.distexec.mapreduce.MapReduceTask] (pool-1-thread-1)
Invoking MapCombineCommand [keys=[], taskId=262e0a68-ee5c-4534-8458-97340158b129] on
edg-perf02-43110
11:12:45,625 DEBUG [org.infinispan.distexec.mapreduce.MapReduceTask] (pool-1-thread-1)
Invoked MapCombineCommand [keys=[], taskId=262e0a68-ee5c-4534-8458-97340158b129] on
edg-perf02-43110
11:12:46,228 DEBUG [org.jgroups.protocols.FD] (Timer-4,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:12:49,230 DEBUG [org.jgroups.protocols.FD] (Timer-2,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:12:52,232 DEBUG [org.jgroups.protocols.FD] (Timer-2,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:12:55,233 DEBUG [org.jgroups.protocols.FD] (Timer-4,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:12:58,233 DEBUG [org.jgroups.protocols.FD] (Timer-3,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:13:01,235 DEBUG [org.jgroups.protocols.FD] (Timer-2,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:13:04,237 DEBUG [org.jgroups.protocols.FD] (Timer-5,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:13:07,237 DEBUG [org.jgroups.protocols.FD] (Timer-5,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:13:10,238 DEBUG [org.jgroups.protocols.FD] (Timer-5,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:13:13,963 DEBUG [org.jgroups.protocols.FD] (Timer-5,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:13:13,964 DEBUG [org.jgroups.protocols.FD] (Timer-5,default,edg-perf01-54809) heartbeat
missing from edg-perf02-43110 (number=0)
11:13:16,965 DEBUG [org.jgroups.protocols.FD] (Timer-3,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:13:19,966 DEBUG [org.jgroups.protocols.FD] (Timer-4,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:13:22,289 DEBUG [org.infinispan.distexec.mapreduce.MapReduceTask] (transport-thread-5)
Invoked MapCombineCommand [keys=[], taskId=262e0a68-ee5c-4534-8458-97340158b129] locally
11:13:22,966 DEBUG [org.jgroups.protocols.FD] (Timer-4,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:13:25,967 DEBUG [org.jgroups.protocols.FD] (Timer-5,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:13:28,969 DEBUG [org.jgroups.protocols.FD] (Timer-5,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:13:31,970 DEBUG [org.jgroups.protocols.FD] (Timer-4,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
11:15:03,446 DEBUG [org.jgroups.protocols.FD] (Timer-4,default,edg-perf01-54809) sending
are-you-alive msg to edg-perf02-43110 (own address=edg-perf01-54809)
Exception in thread "main" java.util.concurrent.ExecutionException:
org.infinispan.CacheException: java.util.concurrent.ExecutionException:
org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending message to
edg-perf02-43110
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
at java.util.concurrent.FutureTask.get(FutureTask.java:91)
at org.radargun.Slave.startCommunicationWithMaster(Slave.java:124)
at org.radargun.Slave.start(Slave.java:67)
at org.radargun.Slave.main(Slave.java:230)
Caused by: org.infinispan.CacheException: java.util.concurrent.ExecutionException:
org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending message to
edg-perf02-43110
at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:352)
at org.radargun.cachewrappers.InfinispanMapReduceWrapper.executeMapReduceTask(InfinispanMapReduceWrapper.java:98)
at org.radargun.stages.MapReduceStage.executeOnSlave(MapReduceStage.java:74)
at org.radargun.Slave$2.run(Slave.java:103)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: org.infinispan.CacheException:
org.jgroups.TimeoutException: timeout sending message to edg-perf02-43110
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at org.infinispan.distexec.mapreduce.MapReduceTask$TaskPart.get(MapReduceTask.java:829)
at org.infinispan.distexec.mapreduce.MapReduceTask.executeMapPhaseWithLocalReduction(MapReduceTask.java:474)
at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:350)
... 9 more
Caused by: org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending
message to edg-perf02-43110
at org.infinispan.util.Util.rewrapAsCacheException(Util.java:532)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:185)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:515)
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:180)
11:15:03,459 WARN [org.jgroups.protocols.FD] (OOB-90,default,edg-perf01-54809) I was
suspected by edg-perf02-43110; ignoring the SUSPECT message and sending back a
HEARTBEAT_ACK
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:202)
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:259)
at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:82)
at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:293)
... 5 more
Caused by: org.jgroups.TimeoutException: timeout sending message to edg-perf02-43110
at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:390)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:299)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:178)
... 11 more
{noformat}
After this exception occurs, a number of message sends fail, and then a MERGE event and a
VIEW_CHANGE occur. After that point, the two nodes send a stream of
"are-you-alive" messages to each other, and the RadarGun stage never completes;
Jenkins eventually kills the job. I will run this job again using the UDP configuration,
but I expect the behavior to be the same, since the org.jgroups.TimeoutException also
appears above.
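One thing I could also try is raising the synchronous RPC timeout, to rule out the remote MapCombineCommand simply being slower than the RPC timeout (the local invocation alone took over a minute in the single-node run above). A minimal sketch, assuming the Infinispan 5.2 ConfigurationBuilder API; the benchmark's real cache configuration may differ:
{noformat}
// Minimal sketch, assuming the Infinispan 5.2 ConfigurationBuilder API; the
// benchmark's real configuration may differ. Raising the sync replication
// timeout only helps if the remote MapCombineCommand is slow rather than hung.
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class ReplTimeoutSketch {
   public static Configuration withLongerRpcTimeout(Configuration base) {
      return new ConfigurationBuilder()
            .read(base)                                // copy the existing cache configuration
            .clustering().sync().replTimeout(300000L)  // 5 minutes, in milliseconds
            .build();
   }
}
{noformat}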
Thread deadlock in Map/Reduce with 2 nodes
------------------------------------------
Key: ISPN-2836
URL: https://issues.jboss.org/browse/ISPN-2836
Project: Infinispan
Issue Type: Bug
Components: Distributed Execution and Map/Reduce
Affects Versions: 5.2.1.Final
Reporter: Alan Field
Assignee: Vladimir Blagojevic
Attachments: afield-tcp-521-final.txt, udp-edg-perf01.txt, udp-edg-perf02.txt
Using RadarGun and two nodes to execute the example WordCount Map/Reduce job against a
cache with ~550 keys, each with a value size of 1 MB, produces a thread deadlock. The cache
is distributed with transactions disabled.
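For reference, the setup corresponds roughly to the following; a minimal sketch assuming the Infinispan 5.2 programmatic API, with illustrative class, cache, and key names (the benchmark itself is driven by RadarGun and its configuration files):
{noformat}
// Minimal sketch of the setup described above, assuming the Infinispan 5.2
// programmatic API. Class, cache, and key names are illustrative only; the real
// benchmark is configured through RadarGun.
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;
import org.infinispan.transaction.TransactionMode;

public class WordCountSetupSketch {
   public static Cache<String, String> createAndPopulate() {
      GlobalConfiguration global = GlobalConfigurationBuilder.defaultClusteredBuilder().build();
      Configuration cfg = new ConfigurationBuilder()
            .clustering().cacheMode(CacheMode.DIST_SYNC)                      // distributed cache
            .transaction().transactionMode(TransactionMode.NON_TRANSACTIONAL) // transactions disabled
            .build();
      EmbeddedCacheManager cm = new DefaultCacheManager(global, cfg);
      Cache<String, String> cache = cm.getCache();

      // ~550 keys, each with a value of roughly 1 MB of text
      StringBuilder sb = new StringBuilder();
      while (sb.length() < 1024 * 1024) {
         sb.append("lorem ipsum dolor sit amet ");
      }
      String value = sb.toString();
      for (int i = 0; i < 550; i++) {
         cache.put("key-" + i, value);
      }
      return cache;
   }
}
{noformat}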
The TCP transport deadlocks without throwing an exception. Disabling the send queue and
setting UNICAST2.conn_expiry_timeout=0 prevents the deadlock, but the job still does not
complete. The nodes send "are-you-alive" messages back and forth, and I have
seen the following exception:
{noformat}
11:44:29,970 ERROR [org.jgroups.protocols.TCP] (OOB-98,default,edg-perf01-1907) failed
sending message to edg-perf02-32536 (76 bytes): java.net.SocketException: Socket closed,
cause: null
at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:352)
at org.radargun.cachewrappers.InfinispanMapReduceWrapper.executeMapReduceTask(InfinispanMapReduceWrapper.java:98)
at org.radargun.stages.MapReduceStage.executeOnSlave(MapReduceStage.java:74)
at org.radargun.Slave$2.run(Slave.java:103)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: org.infinispan.CacheException:
org.jgroups.TimeoutException: timeout sending message to edg-perf02-32536
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at org.infinispan.distexec.mapreduce.MapReduceTask$TaskPart.get(MapReduceTask.java:832)
at org.infinispan.distexec.mapreduce.MapReduceTask.executeMapPhaseWithLocalReduction(MapReduceTask.java:477)
at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:350)
... 9 more
Caused by: org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending
message to edg-perf02-32536
at org.infinispan.util.Util.rewrapAsCacheException(Util.java:541)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:186)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:515)
11:44:29,978 ERROR [org.jgroups.protocols.TCP] (Timer-3,default,edg-perf01-1907) failed
sending message to edg-perf02-32536 (60 bytes): java.net.SocketException: Socket closed,
cause: null
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:175)
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:197)
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:254)
at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:80)
at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:288)
... 5 more
Caused by: org.jgroups.TimeoutException: timeout sending message to edg-perf02-32536
at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:390)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:301)
11:44:29,979 ERROR [org.jgroups.protocols.TCP] (Timer-4,default,edg-perf01-1907) failed
sending message to edg-perf02-32536 (63 bytes): java.net.SocketException: Socket closed,
cause: null
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:179)
... 11 more
{noformat}
With the UDP transport, both threads are deadlocked. I will attach thread dumps from runs
using the TCP and UDP transports.