[jboss-jira] [JBoss JIRA] Commented: (JGRP-1003) deadlock in TP.send - socket write hangs inside a Lock, preventing other threads to do TP.send()

Victor N (JIRA) jira-events at lists.jboss.org
Wed Apr 14 02:34:25 EDT 2010


    [ https://jira.jboss.org/jira/browse/JGRP-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12525549#action_12525549 ] 

Victor N commented on JGRP-1003:
--------------------------------

I compared logs from this problematic node with logs from coordinator and logs from the neighbor sending heartbeats. What I see:

4) the "problematic" log is much bigger in size (270 Mb vs 20 Mb!)
There are many many UNICAST.retransmit() calls in it! Almost endlessly.
Logs from the neighbor confirms this (frequent XMIT_REQ from the problematic node).

6) (maybe this is interesting) from the coordinator's log I can see (we have 11 nodes in our cluster, gate3/node3 is the coordinator):

14/04/2010 06:04:14 UTC| TRACE | org.jgroups.protocols.MERGE2$FindSubgroupsTask.findAndNotify(): Discovery results:
[gate9.mydomain]: MergeView::[gate3.mydomain|6831] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate9.mydomain, gate12.mydomain], subgroups=[[gate3.mydomain|6829] [gate9.mydomain], [gate3.mydomain|6830] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate12.mydomain]]
[gate6.mydomain]: MergeView::[gate3.mydomain|6831] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate9.mydomain, gate12.mydomain], subgroups=[[gate3.mydomain|6829] [gate9.mydomain], [gate3.mydomain|6830] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate12.mydomain]]
[gate7.mydomain]: MergeView::[gate3.mydomain|6831] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate9.mydomain, gate12.mydomain], subgroups=[[gate3.mydomain|6829] [gate9.mydomain], [gate3.mydomain|6830] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate12.mydomain]]
[gate5.mydomain]: MergeView::[gate3.mydomain|6831] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate9.mydomain, gate12.mydomain], subgroups=[[gate3.mydomain|6829] [gate9.mydomain], [gate3.mydomain|6830] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate12.mydomain]]
[gate8.mydomain]: MergeView::[gate3.mydomain|6831] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate9.mydomain, gate12.mydomain], subgroups=[[gate3.mydomain|6829] [gate9.mydomain], [gate3.mydomain|6830] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate12.mydomain]]
[gate11.mydomain]: MergeView::[gate3.mydomain|6831] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate9.mydomain, gate12.mydomain], subgroups=[[gate3.mydomain|6829] [gate9.mydomain], [gate3.mydomain|6830] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate12.mydomain]]
[gate2.mydomain]: MergeView::[gate3.mydomain|6831] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate9.mydomain, gate12.mydomain], subgroups=[[gate3.mydomain|6829] [gate9.mydomain], [gate3.mydomain|6830] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate12.mydomain]]
[gate4.mydomain]: MergeView::[gate3.mydomain|6831] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate9.mydomain, gate12.mydomain], subgroups=[[gate3.mydomain|6829] [gate9.mydomain], [gate3.mydomain|6830] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate12.mydomain]]
[gate12.mydomain]: MergeView::[gate3.mydomain|6831] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate9.mydomain, gate12.mydomain], subgroups=[[gate3.mydomain|6829] [gate9.mydomain], [gate3.mydomain|6830] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate12.mydomain]]
[gate10.mydomain]: MergeView::[gate3.mydomain|6831] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate9.mydomain, gate12.mydomain], subgroups=[[gate3.mydomain|6829] [gate9.mydomain], [gate3.mydomain|6830] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate12.mydomain]]
[gate3.mydomain]: MergeView::[gate3.mydomain|6831] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate9.mydomain, gate12.mydomain], subgroups=[[gate3.mydomain|6829] [gate9.mydomain], [gate3.mydomain|6830] [gate3.mydomain, gate5.mydomain, gate4.mydomain, gate11.mydomain, gate2.mydomain, gate10.mydomain, gate7.mydomain, gate8.mydomain, gate6.mydomain, gate12.mydomain]]

> deadlock in TP.send - socket write hangs inside a Lock, preventing other threads to do TP.send()
> ------------------------------------------------------------------------------------------------
>
>                 Key: JGRP-1003
>                 URL: https://jira.jboss.org/jira/browse/JGRP-1003
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.7
>         Environment: Linux (Debian), kernel 2.6.x
>            Reporter: Victor N
>            Assignee: Bela Ban
>             Fix For: 2.10
>
>         Attachments: j1__.txt, jgroups_s2.txt, jgroups_stacktrace_09feb2010.txt, stacktrace.txt
>
>
> I am using JGroups 2.7.0 GA with the typical protocols stack based on TCP (taken from tcp.xml).
> Sometimes it occurs that socket write operation hangs inside TP.send() -- this is not so untypical for blocking I/O approach! -- after that JGroups is not working at all because most of threads are blocked inside TP.send() waiting for that first thread releasing "out_stream_lock". Below is a fragment from my stack trace.
> 1) the thread that hangs in socket write until forever (maybe due to a broken socket or something else); this thread owns  "out_stream_lock in TP class":
> "Timer-1,name,IP:port" daemon prio=10 tid=0x084e5c00 nid=0x15f0 runnable [0x211c1000..0x211c2040]                                                             
>    java.lang.Thread.State: RUNNABLE                                                                                                                                              
>         at java.net.SocketOutputStream.socketWrite0(Native Method)                                                                                                               
>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)                                                                                                   
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)                                                                                                        
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)                                                                                                
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)                                                                                                     
>         - locked <0x43427300> (a java.io.BufferedOutputStream)                                                                                                                   
>         at java.io.DataOutputStream.flush(DataOutputStream.java:106)                                                                                                             
>         at org.jgroups.blocks.TCPConnectionMap$TCPConnection.doSend(TCPConnectionMap.java:472)                                                                                   
>         at org.jgroups.blocks.TCPConnectionMap$TCPConnection._send(TCPConnectionMap.java:453)                                                                                    
>         at org.jgroups.blocks.TCPConnectionMap$TCPConnection.send(TCPConnectionMap.java:436)                                                                                     
>         at org.jgroups.blocks.TCPConnectionMap$TCPConnection.access$100(TCPConnectionMap.java:341)                                                                               
>         at org.jgroups.blocks.TCPConnectionMap.send(TCPConnectionMap.java:137)                                                                                                   
>         at org.jgroups.protocols.TCP.send(TCP.java:53)                                                                                                                           
>         at org.jgroups.protocols.BasicTCP.sendToSingleMember(BasicTCP.java:141)                                                                                                  
>         at org.jgroups.protocols.TP.doSend(TP.java:1105)                                                                                                                         
>         at org.jgroups.protocols.TP.send(TP.java:1088)                                                                                                                           
>         at org.jgroups.protocols.TP.down(TP.java:907)                                                                                                                            
>         at org.jgroups.protocols.Discovery.down(Discovery.java:363)                                                                                                              
>         at org.jgroups.protocols.MERGE2.down(MERGE2.java:169)                                                                                                                    
>         at org.jgroups.protocols.FD_SOCK.down(FD_SOCK.java:333)                                                                                                                  
>         at org.jgroups.protocols.FD.down(FD.java:327)                                                                                                                            
>         at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:72)                                                                                                     
>         at org.jgroups.protocols.BARRIER.down(BARRIER.java:96)                                                                                                                   
>         at org.jgroups.protocols.pbcast.NAKACK.retransmit(NAKACK.java:1530)                                                                                                      
>         at org.jgroups.protocols.pbcast.NAKACK.retransmit(NAKACK.java:1476)                                                                                                      
>         at org.jgroups.stack.Retransmitter$Task.run(Retransmitter.java:207)                                                                                                      
>         at org.jgroups.util.TimeScheduler$TaskWrapper.run(TimeScheduler.java:218)                                                                                                
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)                                                                                               
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)                                                                                                    
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)                                                                                                              
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)                                                  
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)                                                        
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)                                                                                   
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)                                                                                       
>         at java.lang.Thread.run(Thread.java:619)
>    Locked ownable synchronizers:                                                                                                                                                 
>         - <0x2b5522e0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)                                                                                                  
>         - <0x2b5550f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)                                                                                                  
>         - <0x434273f0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> 2) and there are many threads like this blocked "on out_stream_lock.lock()" call inside TP class:
> "Connection.Receiver [ip:port - ip:port],name,ip:port" prio=10 tid=0x020dc000 nid=0x1fab waiting on condition [0x1365f000..0x1365ffc0]
>    java.lang.Thread.State: WAITING (parking)                                                                                                                                     
>         at sun.misc.Unsafe.park(Native Method)                                                                                                                                   
>         - parking to wait for  <0x2b5550f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)                                                                             
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)                                                                                                     
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)                                                      
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)                                                              
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)                                                                   
>         at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)                                                                                     
>         at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)                                                                                                 
>         at org.jgroups.protocols.TP.send(TP.java:1082)                                                                                                                           
>         at org.jgroups.protocols.TP.down(TP.java:907)                                                                                                                            
>         at org.jgroups.protocols.Discovery.up(Discovery.java:274)                                                                                                                
>         at org.jgroups.protocols.TP.passMessageUp(TP.java:995)                                                                                                                   
>         at org.jgroups.protocols.TP.access$100(TP.java:52)                                                                                                                       
>         at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1451)                                                                                                 
>         at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1427)                                                                                                             
>         at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(ThreadPoolExecutor.java:1738)                                                              
>         at org.jgroups.util.ShutdownRejectedExecutionHandler.rejectedExecution(ShutdownRejectedExecutionHandler.java:34)                                                         
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)                                                                                           
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)                                                                                          
>         at org.jgroups.protocols.TP.dispatchToThreadPool(TP.java:1061)                                                                                                           
>         at org.jgroups.protocols.TP.receive(TP.java:1038)                                                                                                                        
>         at org.jgroups.protocols.BasicTCP.receive(BasicTCP.java:180)                                                                                                             
>         at org.jgroups.blocks.TCPConnectionMap$TCPConnection$ConnectionPeerReceiver.run(TCPConnectionMap.java:553)                                                               
>         at java.lang.Thread.run(Thread.java:619)                                                                                                                                 
>                                                                                                                                                                                  
>    Locked ownable synchronizers:                                                                                                                                                 
>         - None
> (Since all threads from JGroups' ThreadPool are locked, you can see "rejectedExecution" in the last stack trace).
> Maybe we could implement some solution for TCP (blocking I/O)? Maybe it is possible to release the lock before socket write?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        


More information about the jboss-jira mailing list