[jboss-jira] [JBoss JIRA] Reopened: (JGRP-1003) deadlock in TP.send - socket write hangs inside a Lock, preventing other threads to do TP.send()

Thu Nov 26 06:06:31 EST 2009

     [ https://jira.jboss.org/jira/browse/JGRP-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Victor N reopened JGRP-1003:
----------------------------

Bela, the problem is reproduced in 2.8.0 CR6.

A similar environment as it was before: 8 nodes (named node5 ... node 12).
After some time (~2 days) something went wrong on node5 and I have seen that all my threads were blocked in JGroups code and can not send my messages.

My config is based on tcp.xml:

-----------------------------------------------------------------
<config xmlns="urn:org:jgroups"                                                                                                                                                  
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"                                                                                                                    
        xsi:schemaLocation="urn:org:jgroups file:schema/JGroups-2.8.xsd">                                                                                                        
    <TCP bind_addr="${bindAddress}"                                                                                                                                 
                 bind_port="40002"                                                                                                                            
         loopback="true"                                                                                                                                                         
         recv_buf_size="20000000"                                                                                                                                                
         send_buf_size="640000"                                                                                                                                                  
         discard_incompatible_packets="true"                                                                                                                                     
         max_bundle_size="64000"                                                                                                                                                 
         max_bundle_timeout="30"                                                                                                                                                 
         enable_bundling="true"                                                                                                                                                  
         use_send_queues="true"                                                                                                                                                  
         sock_conn_timeout="300"                                                                                                                                                 
         skip_suspected_members="true"                                                                                                                                           

         thread_pool.enabled="true"                                                                                                                                              
         thread_pool.min_threads="1"                                                                                                                                             
         thread_pool.max_threads="25"                                                                                                                                            
         thread_pool.keep_alive_time="5000"                                                                                                                                      
         thread_pool.queue_enabled="false"                                                                                                                                       
         thread_pool.queue_max_size="10000"                                                                                                                                      
         thread_pool.rejection_policy="run"                                                                                                                                      

         oob_thread_pool.enabled="true"                                                                                                                                          
         oob_thread_pool.min_threads="1"                                                                                                                                         
         oob_thread_pool.max_threads="8"                                                                                                                                         
         oob_thread_pool.keep_alive_time="5000"                                                                                                                                  
         oob_thread_pool.queue_enabled="false"                                                                                                                                   
         oob_thread_pool.queue_max_size="1000"                                                                                                                                   
         oob_thread_pool.rejection_policy="run"/>                                                                                                                                

    <TCPPING timeout="3000"                                                                                                                                                      
             initial_hosts="${nodes}"                                                                                                                               
             port_range="1"                                                                                                                                                      
             num_initial_members="3"/>                                                                                                                                           
    <MERGE2  min_interval="10000"                                                                                                                                                
             max_interval="30000"/>                                                                                                                                              
    <FD_SOCK/>                                                                                                                                                                   
    <FD timeout="10000" max_tries="5" />                                                                                                                                         
    <VERIFY_SUSPECT timeout="1500"  />                                                                                                                                           
    <BARRIER />
    <pbcast.NAKACK                                                                                                                                                               
                   use_mcast_xmit="false" gc_lag="0"                                                                                                                             
                   retransmit_timeout="300,600,1200,2400,4800"                                                                                                                   
                   discard_delivered_msgs="true"/>                                                                                                                               
    <UNICAST timeout="300,600,1200" />                                                                                                                                           
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"                                                                                                             
                   max_bytes="400000"/>                                                                                                                                          
    <pbcast.GMS print_local_addr="true" join_timeout="3000"                                                                                                                      

                view_bundling="true"/>                                                                                                                                           
    <FC max_credits="2000000"                                                                                                                                                    
        min_threshold="0.10"/>                                                                                                                                                   
    <FRAG2 frag_size="60000"  />                                                                                                                                                 
    <!-- <pbcast.STREAMING_STATE_TRANSFER/> -->                                                                                                                                  
    <!-- <pbcast.STATE_TRANSFER/> -->                                                                                                                                            
</config>
-----------------------------------------------------------------

Stack-traces from node5 are attached in j1__.txt file ; I changed IP addresses and names a little bit (node5 - 192.168.1.5, node6 - 192.168.1.6 ...).
As you can see there are 7 Connection.Sender and 7 Connection.Receiver threads, this is ok for 8 nodes.

So, about my locked threads... I have 10 threads named "pool-1-thread-XXX", all of them are locked trying to acquire lock in Bundler:

at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
at org.jgroups.protocols.TP$Bundler.send(TP.java:1538)

The only thread which took this lock is "Timer-2,TestCluster,node5" locked here:

at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:254)
at org.jgroups.blocks.TCPConnectionMap$TCPConnection.send(TCPConnectionMap.java:420)

This thread is locked here because Sender's queue is full.
Since this is the only place where the lock is owned, seems that all my threads are waiting for the SAME lock instance, and this means they are working on the same Sender.

If you look at all Connection.Sender's, they are all trying to take from queue:

at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
at org.jgroups.blocks.TCPConnectionMap$TCPConnection$Sender.run(TCPConnectionMap.java:603)

And this stack-trace says that all Senders have empty queues. It is strange: one thread is waiting until the queue is not full but I do not see any full queue!

The only idea I have is that there are different lock instances, and so, different Sender instances - maybe Sender thread terminated somehow and was restarted, but all my threads are working on the old Sender instance. But In do not see any warning about this in my logs at INFO/WARN/ERROR levels.

Bela, do you have any ideas?

> deadlock in TP.send - socket write hangs inside a Lock, preventing other threads to do TP.send()
> ------------------------------------------------------------------------------------------------
>
>                 Key: JGRP-1003
>                 URL: https://jira.jboss.org/jira/browse/JGRP-1003
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.7
>         Environment: Linux (Debian), kernel 2.6.x
>            Reporter: Victor N
>            Assignee: Bela Ban
>             Fix For: 2.8
>
>         Attachments: jgroups_s2.txt, stacktrace.txt
>
>
> I am using JGroups 2.7.0 GA with the typical protocols stack based on TCP (taken from tcp.xml).
> Sometimes it occurs that socket write operation hangs inside TP.send() -- this is not so untypical for blocking I/O approach! -- after that JGroups is not working at all because most of threads are blocked inside TP.send() waiting for that first thread releasing "out_stream_lock". Below is a fragment from my stack trace.
> 1) the thread that hangs in socket write until forever (maybe due to a broken socket or something else); this thread owns  "out_stream_lock in TP class":
> "Timer-1,name,IP:port" daemon prio=10 tid=0x084e5c00 nid=0x15f0 runnable [0x211c1000..0x211c2040]                                                             
>    java.lang.Thread.State: RUNNABLE                                                                                                                                              
>         at java.net.SocketOutputStream.socketWrite0(Native Method)                                                                                                               
>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)                                                                                                   
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)                                                                                                        
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)                                                                                                
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)                                                                                                     
>         - locked <0x43427300> (a java.io.BufferedOutputStream)                                                                                                                   
>         at java.io.DataOutputStream.flush(DataOutputStream.java:106)                                                                                                             
>         at org.jgroups.blocks.TCPConnectionMap$TCPConnection.doSend(TCPConnectionMap.java:472)                                                                                   
>         at org.jgroups.blocks.TCPConnectionMap$TCPConnection._send(TCPConnectionMap.java:453)                                                                                    
>         at org.jgroups.blocks.TCPConnectionMap$TCPConnection.send(TCPConnectionMap.java:436)                                                                                     
>         at org.jgroups.blocks.TCPConnectionMap$TCPConnection.access$100(TCPConnectionMap.java:341)                                                                               
>         at org.jgroups.blocks.TCPConnectionMap.send(TCPConnectionMap.java:137)                                                                                                   
>         at org.jgroups.protocols.TCP.send(TCP.java:53)                                                                                                                           
>         at org.jgroups.protocols.BasicTCP.sendToSingleMember(BasicTCP.java:141)                                                                                                  
>         at org.jgroups.protocols.TP.doSend(TP.java:1105)                                                                                                                         
>         at org.jgroups.protocols.TP.send(TP.java:1088)                                                                                                                           
>         at org.jgroups.protocols.TP.down(TP.java:907)                                                                                                                            
>         at org.jgroups.protocols.Discovery.down(Discovery.java:363)                                                                                                              
>         at org.jgroups.protocols.MERGE2.down(MERGE2.java:169)                                                                                                                    
>         at org.jgroups.protocols.FD_SOCK.down(FD_SOCK.java:333)                                                                                                                  
>         at org.jgroups.protocols.FD.down(FD.java:327)                                                                                                                            
>         at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:72)                                                                                                     
>         at org.jgroups.protocols.BARRIER.down(BARRIER.java:96)                                                                                                                   
>         at org.jgroups.protocols.pbcast.NAKACK.retransmit(NAKACK.java:1530)                                                                                                      
>         at org.jgroups.protocols.pbcast.NAKACK.retransmit(NAKACK.java:1476)                                                                                                      
>         at org.jgroups.stack.Retransmitter$Task.run(Retransmitter.java:207)                                                                                                      
>         at org.jgroups.util.TimeScheduler$TaskWrapper.run(TimeScheduler.java:218)                                                                                                
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)                                                                                               
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)                                                                                                    
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)                                                                                                              
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)                                                  
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)                                                        
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)                                                                                   
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)                                                                                       
>         at java.lang.Thread.run(Thread.java:619)
>    Locked ownable synchronizers:                                                                                                                                                 
>         - <0x2b5522e0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)                                                                                                  
>         - <0x2b5550f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)                                                                                                  
>         - <0x434273f0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> 2) and there are many threads like this blocked "on out_stream_lock.lock()" call inside TP class:
> "Connection.Receiver [ip:port - ip:port],name,ip:port" prio=10 tid=0x020dc000 nid=0x1fab waiting on condition [0x1365f000..0x1365ffc0]
>    java.lang.Thread.State: WAITING (parking)                                                                                                                                     
>         at sun.misc.Unsafe.park(Native Method)                                                                                                                                   
>         - parking to wait for  <0x2b5550f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)                                                                             
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)                                                                                                     
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)                                                      
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)                                                              
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)                                                                   
>         at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)                                                                                     
>         at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)                                                                                                 
>         at org.jgroups.protocols.TP.send(TP.java:1082)                                                                                                                           
>         at org.jgroups.protocols.TP.down(TP.java:907)                                                                                                                            
>         at org.jgroups.protocols.Discovery.up(Discovery.java:274)                                                                                                                
>         at org.jgroups.protocols.TP.passMessageUp(TP.java:995)                                                                                                                   
>         at org.jgroups.protocols.TP.access$100(TP.java:52)                                                                                                                       
>         at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1451)                                                                                                 
>         at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1427)                                                                                                             
>         at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(ThreadPoolExecutor.java:1738)                                                              
>         at org.jgroups.util.ShutdownRejectedExecutionHandler.rejectedExecution(ShutdownRejectedExecutionHandler.java:34)                                                         
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)                                                                                           
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)                                                                                          
>         at org.jgroups.protocols.TP.dispatchToThreadPool(TP.java:1061)                                                                                                           
>         at org.jgroups.protocols.TP.receive(TP.java:1038)                                                                                                                        
>         at org.jgroups.protocols.BasicTCP.receive(BasicTCP.java:180)                                                                                                             
>         at org.jgroups.blocks.TCPConnectionMap$TCPConnection$ConnectionPeerReceiver.run(TCPConnectionMap.java:553)                                                               
>         at java.lang.Thread.run(Thread.java:619)                                                                                                                                 
>                                                                                                                                                                                  
>    Locked ownable synchronizers:                                                                                                                                                 
>         - None
> (Since all threads from JGroups' ThreadPool are locked, you can see "rejectedExecution" in the last stack trace).
> Maybe we could implement some solution for TCP (blocking I/O)? Maybe it is possible to release the lock before socket write?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira