[jboss-jira] [JBoss JIRA] Created: (JGRP-967) Deadlock in FD_SOCK

Tue Apr 28 15:51:46 EDT 2009

Deadlock in FD_SOCK 
--------------------

                 Key: JGRP-967
                 URL: https://jira.jboss.org/jira/browse/JGRP-967
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 2.4.5
         Environment: Deadlock detected on Fedora 8, JDK 1.4/JDK 1.5.
            Reporter: Richard Achmatowicz
            Assignee: Bela Ban
            Priority: Minor

Due to a problem with IPv6 addresses and ServerSocket connections hanging, a deadlock was revealed in FD_SOCK. 

The deadlock shows up, for example, at the end of the test RpcDispatcherAnycastTest, when the channels are being torn down. What follows is my original email exchange with Bela:

Richard: 
I've tracked down the test case failure of RpcDispatcherAnycastTest under IPv6 to a problem with shutting down the protocol stack. The test executes fine, but in the teardown phase, when the test tries to close the three JChannels which have been set up, the first channel closes correctly, but the second channel hangs.

JChannel tries to disconnect from the group before shutting down by:
(i) sending a DISCONNECT event and waiting for a DISCONNECT_OK event via a promise
(ii) sending a STOP_QUEUEING event and waiting for the call to return (i.e. until the event has reached the bottom of the stack)

It then calls ProtocolStack.stopStack() which sends a STOP event down the stack and waits for a STOP_OK event via a promise.
The STOP event is not making its way correctly down the stack.
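
To make the consequence concrete, here is a minimal sketch of the promise pattern described above. It is not JGroups code: SimplePromise and stopStack are hypothetical names, and the timeout is only there so that the example terminates. It shows that if STOP never reaches the bottom of the stack, nothing fulfils the promise and the caller stays blocked for as long as it is willing to wait.

```java
import java.util.concurrent.TimeUnit;

// Illustration only: SimplePromise and stopStack are hypothetical, not JGroups API.
final class StopPromiseSketch {

    /** Minimal one-shot promise: one thread waits, another delivers the result. */
    static final class SimplePromise<T> {
        private T result;
        private boolean hasResult;

        /** Blocks until a result is set or timeoutMs elapses; returns null on timeout. */
        synchronized T getResult(long timeoutMs) throws InterruptedException {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while (!hasResult) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    return null;
                }
                wait(remainingMs);
            }
            return result;
        }

        /** Delivers the result and wakes up any waiting thread. */
        synchronized void setResult(T value) {
            result = value;
            hasResult = true;
            notifyAll();
        }
    }

    /**
     * Simulates stopping a stack. If stopReachesBottom is false, nobody ever
     * delivers STOP_OK, which is the situation the IPv6 trace suggests.
     */
    static String stopStack(boolean stopReachesBottom, long timeoutMs)
            throws InterruptedException {
        SimplePromise<String> stopOk = new SimplePromise<>();
        if (stopReachesBottom) {
            // Stand-in for the bottom protocol answering the STOP event.
            new Thread(() -> stopOk.setResult("STOP_OK")).start();
        }
        String reply = stopOk.getResult(timeoutMs);
        return reply != null ? reply : "TIMED_OUT";
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(stopStack(true, 1000));   // STOP reaches the bottom
        System.out.println(stopStack(false, 200));   // STOP is lost on the way down
    }
}
```

In the sketch the second call gives up after its timeout. A caller that waits on the promise without a timeout would simply hang, which matches what the channel close looks like from the outside.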

Here is a trace with IPv4 (I've added some tracing of my own for the STOP event):
   [junit] JChannel.disconnect(): waiting for DISCONNECT_OK
   [junit] JChannel.disconnect(): got DISCONNECT_OK
   [junit] JChannel.disconnect(): stopping queue
   [junit] FD_SOCK.down called, event = STOP_QUEUEING
   [junit] JChannel.disconnect(): stopped queue
   [junit] JChannel.disconnect(): stopping stack
   [junit] ProtocolStack: Sending STOP event
   [junit] STATE_TRANSFER: STOP event received
   [junit] FRAG2: STOP event received
   [junit] FC: STOP event received
   [junit] GMS: STOP event received
   [junit] VIEW_SYNC: STOP event received
   [junit] pbcast.STABLE: STOP event received
   [junit] UNICAST: STOP event received
   [junit] VERIFY_SUSPECT: STOP event received
   [junit] FD: STOP event received
   [junit] FD_SOCK.down called, event = STOP
   [junit] FD_SOCK: STOP event received
   [junit] MERGE2: STOP event received
   [junit] PING: STOP event received
   [junit] ProtocolStack: Received STOP event
   [junit] JChannel.disconnect(): stopped stack

Here is a bad trace with IPv6:
   [junit] JChannel.disconnect(): waiting for DISCONNECT_OK
   [junit] JChannel.disconnect(): got DISCONNECT_OK
   [junit] JChannel.disconnect(): stopping queue
   [junit] FD_SOCK.down called, event = STOP_QUEUEING
   [junit] JChannel.disconnect(): stopped queue
   [junit] JChannel.disconnect(): stopping stack
   [junit] ProtocolStack: Sending STOP event
   [junit] STATE_TRANSFER: STOP event received
   [junit] FRAG2: STOP event received
   [junit] FC: STOP event received
   [junit] GMS: STOP event received
   [junit] VIEW_SYNC: STOP event received
   [junit] pbcast.STABLE: STOP event received
   [junit] UNICAST: STOP event received
   [junit] VERIFY_SUSPECT: STOP event received
   [junit] FD: STOP event received
   [junit] FD_SOCK.down called, event = MSG
   [junit] FD_SOCK.down called, event = MSG
   [junit] FD_SOCK.down called, event = MSG
   [junit] FD_SOCK.down called, event = MSG

If I remove FD_SOCK from the stack, the tests pass. If I include it, the hang shown above occurs.

I also found that if I turn on the up-handler and down-handler threads in FD_SOCK (up_thread and down_thread), the problem disappears:
...
   <MERGE2 max_interval="30000" down_thread="false" up_thread="false" min_interval="10000"/>
   <FD_SOCK down_thread="true" up_thread="true"/>
   <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
...
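
One possible reading of this observation, which is an assumption and not something confirmed in this exchange: with the handler threads off, an event is delivered to a protocol on the caller's own thread, so a call that blocks inside the protocol also blocks the caller. With a handler thread, the caller only enqueues the event and returns. The sketch below illustrates that difference in general terms. BlockingLayer and callerReturnsWithin are hypothetical names, not JGroups classes.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustration only: none of these names are JGroups classes.
final class HandlerThreadSketch {

    /** Hypothetical layer whose down() blocks until released, standing in for a hung call. */
    static final class BlockingLayer {
        final CountDownLatch release = new CountDownLatch(1);

        void down(String event) throws InterruptedException {
            release.await(); // simulates a call that hangs while handling the event
        }
    }

    /**
     * Sends MSG then STOP from a caller thread and reports whether the caller
     * got control back within timeoutMs.
     */
    static boolean callerReturnsWithin(long timeoutMs, boolean useHandlerThread)
            throws InterruptedException {
        BlockingLayer layer = new BlockingLayer();
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread handler = null;
        if (useHandlerThread) {
            // Handler thread: dequeues events and delivers them to the layer.
            handler = new Thread(() -> {
                try {
                    while (true) {
                        layer.down(queue.take());
                    }
                } catch (InterruptedException e) {
                    // interrupted at the end of the demo: exit quietly
                }
            });
            handler.setDaemon(true);
            handler.start();
        }

        CountDownLatch callerDone = new CountDownLatch(1);
        Thread caller = new Thread(() -> {
            try {
                for (String event : new String[] {"MSG", "STOP"}) {
                    if (useHandlerThread) {
                        queue.put(event);   // enqueue and return immediately
                    } else {
                        layer.down(event);  // runs on the caller's own thread
                    }
                }
                callerDone.countDown();
            } catch (InterruptedException e) {
                // not expected in this demo
            }
        });
        caller.setDaemon(true);
        caller.start();

        boolean returned = callerDone.await(timeoutMs, TimeUnit.MILLISECONDS);
        layer.release.countDown();          // let any blocked thread finish
        if (handler != null) {
            handler.interrupt();
        }
        return returned;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("direct call returned in time: " + callerReturnsWithin(300, false));
        System.out.println("queued call returned in time: " + callerReturnsWithin(300, true));
    }
}
```

In the sketch the direct call does not return within the timeout, while the queued one does. Note that the queued events are still stuck behind the blocked layer. Only the caller is decoupled, so this may explain why the symptom moves or disappears without removing the underlying problem.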

Bela:
Then it must be a locking issue, I'll take a look tomorrow. Or if you find the solution sooner, all the better! :-)
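
For anyone who wants to check the locking theory, a thread dump taken while the test hangs is the usual first step. On JDK 1.5 and later (not on JDK 1.4) the JVM can also be asked directly through java.lang.management.ThreadMXBean. The sketch below is a standalone illustration, not taken from JGroups: it creates an artificial monitor deadlock between two hypothetical threads and then detects it. It would only flag the real hang if that hang is a true monitor cycle, as opposed to, say, a thread stuck in a socket call.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Illustration only: an artificial deadlock, unrelated to the FD_SOCK code itself.
final class DeadlockProbeSketch {

    /** Starts two daemon threads that take two monitors in opposite order. */
    static void startOpposingLockers() {
        Object lockA = new Object();
        Object lockB = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        startLocker("locker-1", lockA, lockB, bothHoldFirst);
        startLocker("locker-2", lockB, lockA, bothHoldFirst);
    }

    private static void startLocker(String name, Object first, Object second,
                                    CountDownLatch ready) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                ready.countDown();
                try {
                    ready.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds this monitor
                }
            }
        }, name);
        t.setDaemon(true); // do not keep the JVM alive because of the demo threads
        t.start();
    }

    /** Polls the JVM for monitor-deadlocked threads; returns how many were found. */
    static int countDeadlockedThreads(long timeoutMs) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = threads.findMonitorDeadlockedThreads(); // null if none
            if (ids != null) {
                return ids.length;
            }
            Thread.sleep(20);
        }
        return 0;
    }

    public static void main(String[] args) throws InterruptedException {
        startOpposingLockers();
        System.out.println("deadlocked threads: " + countDeadlockedThreads(5000));
    }
}
```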
