[infinispan-issues] [JBoss JIRA] (ISPN-2836) Thread deadlock in Map/Reduce with 2 nodes

Wed Feb 27 10:12:56 EST 2013

    [ https://issues.jboss.org/browse/ISPN-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757218#comment-12757218 ] 

Bela Ban commented on ISPN-2836:
--------------------------------

IRC with Alan:
{quote}
[4:01pm] bela: There's no deadlock: most threads are parked
[4:01pm] bela: and the multicast and unicast receiver are waiting for data to be received
[4:01pm] bela: ah, ok
[4:02pm] bela: This is a *completely idle* system !
[4:02pm] afield: OK, so do you think the issue is somewhere in the Map/Reduce code?
[4:02pm] bela: yes
[4:02pm] afield: Because somehow it isn't thinking the job has completed
[4:03pm] bela: I don't see a main thread ?
[4:03pm] afield: OK, I will see if I can do some remote debugging to see where it is stuck
[4:03pm] bela: I do see gang workers, but there's no useful info there
[4:03pm] bela: ok, cool
[4:03pm] bela: add comments to the case
[4:03pm] bela: I believe the JGroups issue is only happening in TCP mode
[4:03pm] afield: And the TCP configuration *does* show a deadlock?
[4:04pm] afield: OK, sorry typing past each other!
[4:04pm] bela: no, but it shows writers are blocking on a send queue
[4:04pm] bela: which isn't serviced by a reader
[4:04pm] bela: use_send_queues was off ?
[4:04pm] afield: Yes it was
[4:05pm] bela: Sorry, Alan, I don't believe you ! 
[4:06pm] afield: Uh-oh, what is your clue?
[4:06pm] afield: I can look for my config file to verify
[4:06pm] bela: in
[4:06pm] bela: afield-tcp-521-final.txt:
[4:06pm] bela: at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:306)
[4:06pm] bela: 	at org.jgroups.blocks.TCPConnectionMap$TCPConnection$Sender.addToQueue(TCPConnectionMap.java:615)
[4:06pm] bela: 	at org.jgroups.blocks.TCPConnectionMap$TCPConnection.send(TCPConnectionMap.java:451)
[4:06pm] bela: 	at org.jgroups.blocks.TCPConnectionMap.send(TCPConnectionMap.java:174)
[4:07pm] bela: This does use a send queue
[4:07pm] bela: I assume with send queues being disabled, this *should* work
[4:07pm] bela: and also with UDP
[4:07pm] afield: OK, I'll check my config file and rerun
[4:08pm] bela: BTW: the gang workers are all in RUNNABLE states, so they're doing *something*, but I can't see what as there's only 1 line on the trace
[4:08pm] bela: ok, thx
[4:09pm] afield: I see use_send_queues="false" in the config file. That's what I need, right?
[4:09pm] bela: yes, but that's not what *was* used ha ha 
[4:10pm] afield: OK, I'll run again. Thanks
[4:10pm] bela: ok. I'll copy this conv into the case, please update the case
[4:10pm] bela: cheers
{quote}
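
For readers of the case, the situation described in the chat ("writers are blocking on a send queue which isn't serviced by a reader") can be reproduced in isolation. The following is only a minimal sketch, not JGroups code: the class name SendQueueSketch, its helpers, and the queue capacity of 1 are made up for illustration, and it assumes Java 8 or later. It shows a writer parked in LinkedBlockingQueue.put() on a full queue until something finally takes from it, which is the same frame that appears at the top of the trace in afield-tcp-521-final.txt.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration only; none of these names come from JGroups.
class SendQueueSketch {

    /** Starts a daemon writer that put()s into an already full queue. */
    static Thread startBlockedWriter(BlockingQueue<String> sendQueue) {
        Thread writer = new Thread(() -> {
            try {
                sendQueue.put("second message"); // blocks while the queue is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "writer");
        writer.setDaemon(true);
        writer.start();
        return writer;
    }

    /** Polls until the thread reaches the expected state or the timeout expires. */
    static boolean awaitState(Thread t, Thread.State expected, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (t.getState() != expected) {
            if (System.currentTimeMillis() > deadline) {
                return false;
            }
            Thread.sleep(10);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> sendQueue = new LinkedBlockingQueue<>(1);
        sendQueue.put("first message"); // the queue is now full

        Thread writer = startBlockedWriter(sendQueue);
        System.out.println("writer parked: "
                + awaitState(writer, Thread.State.WAITING, 2000));

        sendQueue.take(); // a reader finally services the queue
        writer.join(2000);
        System.out.println("writer finished: " + !writer.isAlive());
    }
}
```

In a thread dump, such a writer shows up as WAITING (parking) inside put(), which is not a deadlock in the strict sense but still never makes progress if no reader drains the queue.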

> Thread deadlock in Map/Reduce with 2 nodes
> ------------------------------------------
>
>                 Key: ISPN-2836
>                 URL: https://issues.jboss.org/browse/ISPN-2836
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Distributed Execution and Map/Reduce
>    Affects Versions: 5.2.1.Final
>            Reporter: Alan Field
>            Assignee: Vladimir Blagojevic
>         Attachments: afield-tcp-521-final.txt, udp-edg-perf01.txt, udp-edg-perf02.txt
>
>
> Using RadarGun and two nodes to execute the example WordCount Map/Reduce job against a cache with ~550 keys with a value size of 1MB is producing a thread deadlock. The cache is distributed with transactions disabled. 
> TCP transport deadlocks without throwing an exception. Disabling the send queue and setting UNICAST2.conn_expiry_timeout=0 prevents the deadlock, but the job does not complete. The nodes send "are-you-alive" messages back and forth, and I have seen the following exception:
> {noformat}
> 11:44:29,970 ERROR [org.jgroups.protocols.TCP] (OOB-98,default,edg-perf01-1907) failed sending message to edg-perf02-32536 (76 bytes): java.net.SocketException: Socket closed, cause: null
>         at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:352)
>         at org.radargun.cachewrappers.InfinispanMapReduceWrapper.executeMapReduceTask(InfinispanMapReduceWrapper.java:98)
>         at org.radargun.stages.MapReduceStage.executeOnSlave(MapReduceStage.java:74)
>         at org.radargun.Slave$2.run(Slave.java:103)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.util.concurrent.ExecutionException: org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending message to edg-perf02-32536
>         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>         at org.infinispan.distexec.mapreduce.MapReduceTask$TaskPart.get(MapReduceTask.java:832)
>         at org.infinispan.distexec.mapreduce.MapReduceTask.executeMapPhaseWithLocalReduction(MapReduceTask.java:477)
>         at org.infinispan.distexec.mapreduce.MapReduceTask.execute(MapReduceTask.java:350)
>         ... 9 more
> Caused by: org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending message to edg-perf02-32536
>         at org.infinispan.util.Util.rewrapAsCacheException(Util.java:541)
>         at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:186)
>         at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:515)
> 11:44:29,978 ERROR [org.jgroups.protocols.TCP] (Timer-3,default,edg-perf01-1907) failed sending message to edg-perf02-32536 (60 bytes): java.net.SocketException: Socket closed, cause: null
>         at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:175)
>         at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:197)
>         at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:254)
>         at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:80)
>         at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:288)
>         ... 5 more
> Caused by: org.jgroups.TimeoutException: timeout sending message to edg-perf02-32536
>         at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:390)
>         at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:301)
> 11:44:29,979 ERROR [org.jgroups.protocols.TCP] (Timer-4,default,edg-perf01-1907) failed sending message to edg-perf02-32536 (63 bytes): java.net.SocketException: Socket closed, cause: null
>         at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:179)
>         ... 11 more
> {noformat}
> With UDP transport, both threads are deadlocked. I will attach thread dumps from runs using TCP and UDP transport.
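
Whether a thread dump shows a real deadlock or just idle, parked threads can also be checked programmatically. The sketch below is an assumption-laden illustration, not part of Infinispan, JGroups or RadarGun: the class DeadlockCheckSketch and its methods are hypothetical, and it assumes Java 8 or later on a JVM that supports synchronizer monitoring. It asks the standard ThreadMXBean for lock cycles before and after deliberately creating one with two daemon threads.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; not taken from the attached thread dumps.
class DeadlockCheckSketch {

    /** Number of threads the JVM reports as being in a lock cycle (0 if none). */
    static int deadlockedThreadCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no cycle exists
        return ids == null ? 0 : ids.length;
    }

    /** Starts two daemon threads that take two monitors in opposite order. */
    static void startDeadlockedPair() throws InterruptedException {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        startLocker(a, b, bothHoldFirst);
        startLocker(b, a, bothHoldFirst);
        bothHoldFirst.await(); // each thread now owns its first monitor
    }

    static void startLocker(Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its monitor
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread owns this monitor
                }
            }
        });
        t.setDaemon(true); // do not keep the JVM alive because of the deadlock
        t.start();
    }

    /** Polls until at least the expected number of deadlocked threads is reported. */
    static boolean awaitDeadlock(int expected, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (deadlockedThreadCount() < expected) {
            if (System.currentTimeMillis() > deadline) {
                return false;
            }
            Thread.sleep(10);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked before: " + deadlockedThreadCount());
        startDeadlockedPair();
        System.out.println("deadlock detected: " + awaitDeadlock(2, 2000));
    }
}
```

Threads that are merely parked waiting for work, or waiting on a socket read, are not reported by findDeadlockedThreads(), so a result of null on a hung node would point toward a stalled job rather than a lock cycle.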
