[jboss-jira] [JBoss JIRA] Commented: (JGRP-967) Deadlock in FD_SOCK

Richard Achmatowicz (JIRA) jira-events at lists.jboss.org
Tue Apr 28 15:55:46 EDT 2009


    [ https://jira.jboss.org/jira/browse/JGRP-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12464856#action_12464856 ] 

Richard Achmatowicz commented on JGRP-967:
------------------------------------------

Just to clarify what is being described with JCarder (my understanding anyway):

the potential deadlock for the two thread case (nodes in red) is between:
(i) the IncomingPacketHandler thread (in TP) calling FD_SOCK.down() and processing the VIEW_CHANGE event - on that path, a synchronized block is entered and then the lock on pinger_mutex is taken
(ii) the main thread (I guess the test thread coming down from JChannel) calling FD_SOCK.stop(), which calls stopPingerThread(), which synchronizes on pinger_mutex and then calls the synchronized method sendPingTermination()
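
To make the inversion concrete, here is a minimal sketch of the two lock orders (the class and method bodies are hypothetical stand-ins, not the actual FD_SOCK code): path (i) takes the instance monitor and then pinger_mutex, path (ii) takes pinger_mutex and then the instance monitor.

    // Hypothetical sketch only - not the real FD_SOCK code.
    public class LockOrderSketch {
        private final Object pinger_mutex = new Object();

        // Path (i): IncomingPacketHandler -> FD_SOCK.down(VIEW_CHANGE)
        // acquires the instance monitor first, then pinger_mutex.
        public synchronized void handleViewChange() {
            synchronized (pinger_mutex) {
                // restart/interrupt the pinger thread
            }
        }

        // Path (ii): FD_SOCK.stop() -> stopPingerThread()
        // acquires pinger_mutex first, then the instance monitor.
        public void stopPingerThread() {
            synchronized (pinger_mutex) {
                sendPingTermination();  // synchronized method -> instance monitor
            }
        }

        private synchronized void sendPingTermination() {
            // send the termination message to the ping server
        }
    }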

The other case in example1 is a single-thread cycle and won't result in a deadlock, since locks are reentrant in Java - it's listed more as a potential design problem.
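
(A tiny demo of the reentrancy point - a thread re-acquiring a monitor it already holds just proceeds:)

    // Java monitors are reentrant, so a single-thread lock cycle does not deadlock.
    public class ReentrancyDemo {
        public synchronized void outer() {
            inner();  // same thread re-enters the monitor it already holds - no blocking
        }
        public synchronized void inner() { }

        public static void main(String[] args) {
            new ReentrancyDemo().outer();  // completes normally
        }
    }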
So maybe these are red herrings - I removed the synchronization elements and ran the test again, and the problem still occurred - but it may now be occurring for a different reason.
As far as I can tell, FD calls FD_SOCK with the STOP event, but it just 'disappears' - maybe I need to do a thread dump.
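
If it comes to that, something like the following would do as a programmatic dump (illustrative sketch only, needs JDK 1.5+; jstack or kill -QUIT against the test JVM works just as well):

    import java.util.Map;

    // Illustrative only: print every live thread's stack to see where the STOP event stalls.
    public class DumpThreads {
        public static void main(String[] args) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                System.out.println(e.getKey());
                for (StackTraceElement frame : e.getValue()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }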

> Deadlock in FD_SOCK 
> --------------------
>
>                 Key: JGRP-967
>                 URL: https://jira.jboss.org/jira/browse/JGRP-967
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.4.5
>         Environment: Deadlock detected on Fedora 8, JDK 1.4/JDK 1.5.
>            Reporter: Richard Achmatowicz
>            Assignee: Bela Ban
>            Priority: Minor
>         Attachments: example0.png, example1.png
>
>
> Due to a problem with IPv6 addresses and ServerSocket connections hanging, a deadlock was revealed in FD_SOCK. 
> The deadlock reveals itself, for example, at the end of RpcDispatcherAnycastTest, when the channels are being torn down. What follows are my original emails to Bela:
> Richard: 
> I've tracked down the test case failure of RpcDispatcherAnycastTest under IPv6 to a problem with shutting down the protocol stack. The test executes fine, but in the teardown phase, when the test tries to close the three JChannels which have been set up, the first channel closes correctly, but the second channel hangs.
> JChannel tries to disconnect from the group before shutting down by:
> (i) sending a DISCONNECT event and waiting for a DISCONNECT_OK event via a promise
> (ii) sending a STOP_QUEUING event and waiting for the call to return (i.e. it has reached the bottom of the stack)
> It then calls ProtocolStack.stopStack() which sends a STOP event down the stack and waits for a STOP_OK event via a promise.
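> (Just to illustrate that "send an event down, block on a promise until the _OK comes back up" pattern - a rough hypothetical sketch of the idea, not the actual JGroups Promise class:)
>
>     // Sketch only: a minimal promise the channel could block on until the _OK event arrives.
>     class SimplePromise {
>         private Object result;
>         private boolean done;
>
>         synchronized void setResult(Object r) {  // called when e.g. STOP_OK travels back up
>             result = r;
>             done = true;
>             notifyAll();
>         }
>
>         synchronized Object getResult() throws InterruptedException {
>             while (!done) {
>                 wait();  // if the _OK never comes back up, the caller hangs here
>             }
>             return result;
>         }
>     }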
> The STOP event is not making its way correctly down the stack.
> Here is a trace with IPv4 (I've added some tracing of my own for the STOP event):
>    [junit] JChannel.disconnect(): waiting for DISCONNECT_OK
>    [junit] JChannel.disconnect(): got DISCONNECT_OK
>    [junit] JChannel.disconnect(): stopping queue
>    [junit] FD_SOCK.down called, event = STOP_QUEUEING
>    [junit] JChannel.disconnect(): stopped queue
>    [junit] JChannel.disconnect(): stopping stack
>    [junit] ProtocolStack: Sending STOP event
>    [junit] STATE_TRANSFER: STOP event received
>    [junit] FRAG2: STOP event received
>    [junit] FC: STOP event received
>    [junit] GMS: STOP event received
>    [junit] VIEW_SYNC: STOP event received
>    [junit] pbcast.STABLE: STOP event received
>    [junit] UNICAST: STOP event received
>    [junit] VERIFY_SUSPECT: STOP event received
>    [junit] FD: STOP event received
>    [junit] FD_SOCK.down called, event = STOP
>    [junit] FD_SOCK: STOP event received
>    [junit] MERGE2: STOP event received
>    [junit] PING: STOP event received
>    [junit] ProtocolStack: Received STOP event
>    [junit] JChannel.disconnect(): stopped stack
> Here is a bad trace with IPv6:
>    [junit] JChannel.disconnect(): waiting for DISCONNECT_OK
>    [junit] JChannel.disconnect(): got DISCONNECT_OK
>    [junit] JChannel.disconnect(): stopping queue
>    [junit] FD_SOCK.down called, event = STOP_QUEUEING
>    [junit] JChannel.disconnect(): stopped queue
>    [junit] JChannel.disconnect(): stopping stack
>    [junit] ProtocolStack: Sending STOP event
>    [junit] STATE_TRANSFER: STOP event received
>    [junit] FRAG2: STOP event received
>    [junit] FC: STOP event received
>    [junit] GMS: STOP event received
>    [junit] VIEW_SYNC: STOP event received
>    [junit] pbcast.STABLE: STOP event received
>    [junit] UNICAST: STOP event received
>    [junit] VERIFY_SUSPECT: STOP event received
>    [junit] FD: STOP event received
>    [junit] FD_SOCK.down called, event = MSG
>    [junit] FD_SOCK.down called, event = MSG
>    [junit] FD_SOCK.down called, event = MSG
>    [junit] FD_SOCK.down called, event = MSG
> If I remove FD_SOCK from the stack, the tests pass. If I include it, this stuff happens.
> I also found that if I turn on the uphandler and downhandler threads in FD_SOCK, the problem disappears:
> ...
>    <MERGE2 max_interval="30000" down_thread="false" up_thread="false" min_interval="10000"/>
>    <FD_SOCK down_thread="true" up_thread="true"/>
>    <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
> ...
> Bela:
> Then it must be a locking issue, I'll take a look tomorrow. Or if you find the solution sooner, all the better !  :-)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        


