[jboss-jira] [JBoss JIRA] Commented: (JGRP-967) Deadlock in FD_SOCK

Tue Apr 28 15:55:46 EDT 2009

    [ https://jira.jboss.org/jira/browse/JGRP-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12464857#action_12464857 ] 

Richard Achmatowicz commented on JGRP-967:
------------------------------------------

The thread dump from my machine is attached.

It looks as though:
(i) The FD_SOCK ping thread calls setupPingSocket() and synchronizes on sock_mutex. It then seems to block, waiting for connections, and does not release sock_mutex.
(ii) The main thread calls FD_SOCK.stop() and then FD_SOCK.stopPingerThread(), where it synchronizes on pinger_mutex. It then wants to call tearDownPingSocket(), which has to synchronize on sock_mutex. At this point, it starts waiting.
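
To make the pattern in (i) and (ii) concrete, here is a minimal sketch of it. It is not the FD_SOCK source: the class LockHoldSketch and its methods are hypothetical, a latch stands in for the socket call that never returns, and ReentrantLock replaces synchronized only so that the demo can time out instead of hanging. It is written for a current JDK, not the JDK 1.4/1.5 from the report.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the locking pattern described above; not FD_SOCK itself.
class LockHoldSketch {
    // Named after sock_mutex and pinger_mutex from the report.
    final ReentrantLock sockMutex = new ReentrantLock();
    final ReentrantLock pingerMutex = new ReentrantLock();
    final CountDownLatch blocked = new CountDownLatch(1); // pinger now holds sockMutex
    final CountDownLatch release = new CountDownLatch(1); // assumption: models the blocking socket call

    // (i) The pinger thread takes sockMutex and then blocks without releasing it.
    Thread startPinger() {
        Thread t = new Thread(() -> {
            sockMutex.lock();
            try {
                blocked.countDown();
                release.await(); // simulated blocking call inside setupPingSocket()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                sockMutex.unlock();
            }
        }, "pinger");
        t.setDaemon(true);
        t.start();
        return t;
    }

    // (ii) The stopping thread takes pingerMutex and then needs sockMutex.
    // Returns false if sockMutex could not be obtained within the timeout.
    boolean stop(long timeoutMs) {
        pingerMutex.lock();
        try {
            if (!sockMutex.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                return false;
            }
            sockMutex.unlock();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pingerMutex.unlock();
        }
    }

    // Runs stop() once while the pinger is blocked and once after it has returned.
    static boolean[] demo() {
        LockHoldSketch s = new LockHoldSketch();
        Thread pinger = s.startPinger();
        try {
            s.blocked.await();
            boolean whileBlocked = s.stop(200);
            s.release.countDown();
            pinger.join();
            boolean afterReturn = s.stop(200);
            return new boolean[] { whileBlocked, afterReturn };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

With plain synchronized blocks, as in the report, the first stop() call would wait indefinitely rather than time out, which matches the hang seen during teardown.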

setupPingSocket() should return. It does most of the time but not in all cases.
I've noticed that setupPingSocket() is called before the tests run; it is also called when the tests are shutting down, and this is where the problem occurs.

   [junit] tearing down - 2
   [junit] disconnecting
   [junit] JChannel.disconnect(): creating DISCONNECT event
   [junit] GMS:Received DISCONNECT event
   [junit] FD_SOCK.down called: event = MSG
   [junit] FD_SOCK.down called: event = MSG
   [junit] FD_SOCK.down called: event = STOP_QUEUEING
   [junit] FD_SOCK.down called: event = VIEW_CHANGE
   [junit] FD_SOCK.down called: event = MSG
   [junit] FD_SOCK.down called: event = STOP_QUEUEING
   [junit] FD_SOCK.down called: event = VIEW_CHANGE
   [junit] FD_SOCK.down called: event = MSG
   [junit] FD_SOCK: ping_addr = fe80:0:0:0:215:58ff:fec8:81a8:55961
   [junit] FD_SOCK: ping_addr IP addr = /fe80:0:0:0:215:58ff:fec8:81a8
   [junit] FD_SOCK: ping_addr port = 55961
   [junit] FD_SOCK: calling setupPingSocket()       <------------------------------------------------------------------ How come?
   [junit] FD_SOCK.down called: event = MSG
   [junit] FD_SOCK.down called: event = MSG
   [junit] FD_SOCK: dest = fe80:0:0:0:215:58ff:fec8:81a8:55961
   [junit] FD_SOCK: destIP addr = /fe80:0:0:0:215:58ff:fec8:81a8
   [junit] FD_SOCK: dest port = 55961
   [junit] FD_SOCK.down called: event = MSG
   [junit] FD_SOCK.down called: event = TMP_VIEW
   [junit] FD_SOCK.down called: event = MSG
   [junit] FD_SOCK.down called: event = MSG
   [junit] FD_SOCK.down called: event = MSG
   [junit] FD_SOCK.down called: event = MSG
   [junit] GMS:Sending DISCONNECT_OK event
   [junit] JChannel: Received DISCONNECT_OK - setting promise
   [junit] JChannel: promise set

Hope this gives more clues as to what is going on. Maybe the pinger_thread isn't being terminated early enough? 
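
One possible direction, sketched under assumptions rather than taken from the FD_SOCK code: if the stopping thread closes the socket before it tries to take the lock, the thread blocked in the socket call gets an exception and releases the lock. The class CloseToUnblockSketch is hypothetical, and ServerSocket.accept() on the loopback address stands in for whatever call actually blocks inside setupPingSocket().

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical illustration; the real blocking call in FD_SOCK may differ.
class CloseToUnblockSketch {
    private final Object sockMutex = new Object(); // named after sock_mutex in the report
    private volatile ServerSocket sock;

    // Starts a thread that blocks in accept() while holding sockMutex.
    Thread startBlockingThread() throws IOException {
        sock = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        Thread t = new Thread(() -> {
            synchronized (sockMutex) {
                try {
                    sock.accept().close(); // blocks: nobody connects in this demo
                } catch (IOException expected) {
                    // A close() from another thread makes accept() throw, ending the wait.
                }
            }
        }, "blocked-io");
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Closes the socket first, without the lock, and only then synchronizes.
    void stop() throws IOException {
        sock.close();
        synchronized (sockMutex) {
            // teardown that needs the lock would go here
        }
    }

    // Returns true if the blocked thread terminated after stop().
    static boolean demo() {
        try {
            CloseToUnblockSketch s = new CloseToUnblockSketch();
            Thread t = s.startBlockingThread();
            Thread.sleep(100); // give the thread time to enter accept()
            s.stop();
            t.join(2000);
            return !t.isAlive();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Whether this applies to FD_SOCK depends on where setupPingSocket() really blocks; the thread dump should show that.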

> Deadlock in FD_SOCK 
> --------------------
>
>                 Key: JGRP-967
>                 URL: https://jira.jboss.org/jira/browse/JGRP-967
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.4.5
>         Environment: Deadlock detected on Fedora 8, JDK 1.4/JDK 1.5.
>            Reporter: Richard Achmatowicz
>            Assignee: Bela Ban
>            Priority: Minor
>         Attachments: example0.png, example1.png
>
>
> Due to a problem with IPv6 addresses and ServerSocket connections hanging, a deadlock was revealed in FD_SOCK. 
> The deadlock reveals itself, for example, in the test RpcDispatcherAnycastTest at the end of the test, when the channels are being torn down. What follows are my original emails to Bela:
> Richard: 
> I've tracked down the test case failure of RpcDispatcherAnycastTest under IPv6 to a problem with shutting down the protocol stack. The test executes fine, but in the teardown phase, when the test tries to close the three JChannels which have been set up, the first channel closes correctly, but the second channel hangs.
> JChannel tries to disconnect from the group before shutting down by:
> (i) sending a DISCONNECT event and waiting for a DISCONNECT_OK event via a promise
> (ii) sending a STOP_QUEUEING event and waiting for the call to return (i.e. the event has reached the bottom of the stack)
> It then calls ProtocolStack.stopStack() which sends a STOP event down the stack and waits for a STOP_OK event via a promise.
> The STOP event is not making its way correctly down the stack.
> Here is a trace with IPv4 (I've added some of my own tracing of the STOP event):
>    [junit] JChannel.disconnect(): waiting for DISCONNECT_OK
>    [junit] JChannel.disconnect(): got DISCONNECT_OK
>    [junit] JChannel.disconnect(): stopping queue
>    [junit] FD_SOCK.down called, event = STOP_QUEUEING
>    [junit] JChannel.disconnect(): stopped queue
>    [junit] JChannel.disconnect(): stopping stack
>    [junit] ProtocolStack: Sending STOP event
>    [junit] STATE_TRANSFER: STOP event received
>    [junit] FRAG2: STOP event received
>    [junit] FC: STOP event received
>    [junit] GMS: STOP event received
>    [junit] VIEW_SYNC: STOP event received
>    [junit] pbcast.STABLE: STOP event received
>    [junit] UNICAST: STOP event received
>    [junit] VERIFY_SUSPECT: STOP event received
>    [junit] FD: STOP event received
>    [junit] FD_SOCK.down called, event = STOP
>    [junit] FD_SOCK: STOP event received
>    [junit] MERGE2: STOP event received
>    [junit] PING: STOP event received
>    [junit] ProtocolStack: Received STOP event
>    [junit] JChannel.disconnect(): stopped stack
> Here is a bad trace with IPv6:
>    [junit] JChannel.disconnect(): waiting for DISCONNECT_OK
>    [junit] JChannel.disconnect(): got DISCONNECT_OK
>    [junit] JChannel.disconnect(): stopping queue
>    [junit] FD_SOCK.down called, event = STOP_QUEUEING
>    [junit] JChannel.disconnect(): stopped queue
>    [junit] JChannel.disconnect(): stopping stack
>    [junit] ProtocolStack: Sending STOP event
>    [junit] STATE_TRANSFER: STOP event received
>    [junit] FRAG2: STOP event received
>    [junit] FC: STOP event received
>    [junit] GMS: STOP event received
>    [junit] VIEW_SYNC: STOP event received
>    [junit] pbcast.STABLE: STOP event received
>    [junit] UNICAST: STOP event received
>    [junit] VERIFY_SUSPECT: STOP event received
>    [junit] FD: STOP event received
>    [junit] FD_SOCK.down called, event = MSG
>    [junit] FD_SOCK.down called, event = MSG
>    [junit] FD_SOCK.down called, event = MSG
>    [junit] FD_SOCK.down called, event = MSG
> If I remove FD_SOCK from the stack, the tests pass. If I include it, the hang shown above occurs.
> I also found that if I turn on the uphandler and downhandler threads in FD_SOCK, the problem disappears:
> ...
>    <MERGE2 max_interval="30000" down_thread="false" up_thread="false" min_interval="10000"/>
>    <FD_SOCK down_thread="true" up_thread="true"/>
>    <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
> ...
> Bela:
> Then it must be a locking issue, I'll take a look tomorrow. Or if you find the solution sooner, all the better !  :-)
