[jboss-jira] [JBoss JIRA] Commented: (JGRP-967) Deadlock in FD_SOCK

Richard Achmatowicz (JIRA) jira-events at lists.jboss.org
Tue Oct 6 17:37:49 EDT 2009


    [ https://jira.jboss.org/jira/browse/JGRP-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12488708#action_12488708 ] 

Richard Achmatowicz commented on JGRP-967:
------------------------------------------

Hi Vladimir,

I did the following to check the deadlock using JCarder:
- download JGroups 2.4.5
- replace FD_SOCK.java with your new version
- add the following jvmarg to the runtests macro so that every test runs with the JCarder Java agent attached:
<jvmarg value="-javaagent:/home/nrla/java/jcarder/jcarder.jar"/>
- run org.jgroups.blocks.RpcDispatcherAnycastTest using the modified test suite and the one-test target:
./build.sh -Dtest=RpcDispatcherAnycastTest one-test

- saw the following output on the console:
     [echo] Using set: one.test
    [junit] Running org.jgroups.blocks.RpcDispatcherAnycastTest
    [junit] Starting JCarder (1.0.1/2) agent
    [junit] Opening for writing: /tmp/barney/JGroups-2.4.5/jcarder_events.db
    [junit] Opening for writing: /tmp/barney/JGroups-2.4.5/jcarder_contexts.db
    [junit] Not instrumenting standard library classes (AWT, Swing, etc.)
    [junit] JCarder agent initialized

    [junit] -------------------------------------------------------
    [junit] GMS: address is 192.168.0.100:60438
    [junit] -------------------------------------------------------

- processed the resulting JCarder output files using the jcarder jar:
$ java -jar /home/nrla/java/jcarder/jcarder.jar 
Opening for reading: /tmp/barney/JGroups-2.4.5/jcarder_contexts.db
Opening for reading: /tmp/barney/JGroups-2.4.5/jcarder_events.db

Loaded from database files:
   Nodes: 1079
   Edges: 1884 (excluding 1402 duplicated)

Cycle analysis result: 
   Cycles:          3
   Edges in cycles: 5
   Nodes in cycles: 4
   Max cycle depth: 2
   Max graph depth: 5

Ignoring 1 almost identical cycle(s).
Writing Graphviz file: /tmp/barney/JGroups-2.4.5/jcarder_result_0.dot
- converted the resulting Graphviz .dot graph files to PNG graphics:
$ dot -T png -o result0.png jcarder_result_0.dot 
- printed the graphs

Unfortunately, I got exactly the same results as before (the graphs posted above). The test passes, but JCarder calculates that a possible deadlock can occur under a different interleaving of threads than the one actually taken.
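
For reference, the pattern JCarder reports is a lock-order inversion along the lines of the sketch below (a minimal, hypothetical Java example; the class and lock names are illustrative, not taken from FD_SOCK):

    // Hypothetical sketch of a lock-order cycle of the kind JCarder flags.
    public class LockOrderInversion {
        private final Object lockA = new Object();
        private final Object lockB = new Object();

        void thread1() {
            synchronized (lockA) {       // T1 takes A ...
                synchronized (lockB) {   // ... then wants B
                }
            }
        }

        void thread2() {
            synchronized (lockB) {       // T2 takes B ...
                synchronized (lockA) {   // ... then wants A
                }
            }
        }
    }

The A->B and B->A acquisitions form a cycle in the lock graph regardless of whether a given run hits the bad interleaving, which is why a test can pass while JCarder still reports a potential deadlock.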


 

> Deadlock in FD_SOCK 
> --------------------
>
>                 Key: JGRP-967
>                 URL: https://jira.jboss.org/jira/browse/JGRP-967
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.4.5
>         Environment: Deadlock detected on Fedora 8, JDK 1.4/JDK 1.5.
>            Reporter: Richard Achmatowicz
>            Assignee: Bela Ban
>            Priority: Minor
>             Fix For: 2.4.7
>
>         Attachments: example0.png, example1.png, FD_SOCK.java, threadDump.FD_SOCK.txt
>
>
> Due to a problem with IPv6 addresses and ServerSocket connections hanging, a deadlock was revealed in FD_SOCK. 
> The deadlock reveals itself, for example, in the test RpcDispatcherAnycastTest at the end of the test, when the channels are being torn down. What follows are my original emails to Bela:
> Richard: 
> I've tracked down the test case failure of RpcDispatcherAnycastTest under IPv6 to a problem with shutting down the protocol stack. The test executes fine, but in the teardown phase, when the test tries to close the three JChannels which have been set up, the first channel closes correctly, but the second channel hangs.
> JChannel tries to disconnect from the group before shutting down by:
> (i) sending a DISCONNECT event and waiting for a DISCONNECT_OK event via a promise
> (ii) sending a STOP_QUEUEING event and waiting for the call to return (i.e., the event has reached the bottom of the stack)
> It then calls ProtocolStack.stopStack() which sends a STOP event down the stack and waits for a STOP_OK event via a promise.
> The STOP event is not making its way correctly down the stack.
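> As a rough sketch of this promise-based handshake (illustrative only; the field names, the timeout, and the surrounding methods are assumptions, not the exact JChannel code, and Promise is org.jgroups.util.Promise):
>
>    // block the disconnecting thread on a promise until DISCONNECT_OK arrives
>    Promise disconnect_promise = new Promise();
>
>    public void disconnect() {
>        down(new Event(Event.DISCONNECT, local_addr));   // send DISCONNECT down the stack
>        disconnect_promise.getResult(timeout);           // wait for DISCONNECT_OK
>    }
>
>    // in the channel's up-handler
>    public void up(Event evt) {
>        if (evt.getType() == Event.DISCONNECT_OK)
>            disconnect_promise.setResult(Boolean.TRUE);
>    }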
> Here is a trace with IPv4 (I've added some tracing of my own for the STOP event):
>    [junit] JChannel.disconnect(): waiting for DISCONNECT_OK
>    [junit] JChannel.disconnect(): got DISCONNECT_OK
>    [junit] JChannel.disconnect(): stopping queue
>    [junit] FD_SOCK.down called, event = STOP_QUEUEING
>    [junit] JChannel.disconnect(): stopped queue
>    [junit] JChannel.disconnect(): stopping stack
>    [junit] ProtocolStack: Sending STOP event
>    [junit] STATE_TRANSFER: STOP event received
>    [junit] FRAG2: STOP event received
>    [junit] FC: STOP event received
>    [junit] GMS: STOP event received
>    [junit] VIEW_SYNC: STOP event received
>    [junit] pbcast.STABLE: STOP event received
>    [junit] UNICAST: STOP event received
>    [junit] VERIFY_SUSPECT: STOP event received
>    [junit] FD: STOP event received
>    [junit] FD_SOCK.down called, event = STOP
>    [junit] FD_SOCK: STOP event received
>    [junit] MERGE2: STOP event received
>    [junit] PING: STOP event received
>    [junit] ProtocolStack: Received STOP event
>    [junit] JChannel.disconnect(): stopped stack
> Here is a bad trace with IPv6:
>    [junit] JChannel.disconnect(): waiting for DISCONNECT_OK
>    [junit] JChannel.disconnect(): got DISCONNECT_OK
>    [junit] JChannel.disconnect(): stopping queue
>    [junit] FD_SOCK.down called, event = STOP_QUEUEING
>    [junit] JChannel.disconnect(): stopped queue
>    [junit] JChannel.disconnect(): stopping stack
>    [junit] ProtocolStack: Sending STOP event
>    [junit] STATE_TRANSFER: STOP event received
>    [junit] FRAG2: STOP event received
>    [junit] FC: STOP event received
>    [junit] GMS: STOP event received
>    [junit] VIEW_SYNC: STOP event received
>    [junit] pbcast.STABLE: STOP event received
>    [junit] UNICAST: STOP event received
>    [junit] VERIFY_SUSPECT: STOP event received
>    [junit] FD: STOP event received
>    [junit] FD_SOCK.down called, event = MSG
>    [junit] FD_SOCK.down called, event = MSG
>    [junit] FD_SOCK.down called, event = MSG
>    [junit] FD_SOCK.down called, event = MSG
> If I remove FD_SOCK from the stack, the tests pass; if I include it, the hang occurs.
> I also found that if I enable the up-handler and down-handler threads in FD_SOCK, the problem disappears:
> ...
>    <MERGE2 max_interval="30000" down_thread="false" up_thread="false" min_interval="10000"/>
>    <FD_SOCK down_thread="true" up_thread="true"/>
>    <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
> ...
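> Roughly, the down-handler mechanism works like the sketch below (a simplified illustration, not the actual Protocol code; handleDownEvent() stands in for the protocol's real event handler). With down_thread="true", down() only enqueues the event and a dedicated thread dequeues and processes it, so the caller never holds FD_SOCK's locks while the event is handled:
>
>    // simplified sketch of a protocol's down-handler queue
>    final java.util.concurrent.BlockingQueue<Event> down_queue =
>        new java.util.concurrent.LinkedBlockingQueue<Event>();
>
>    public void down(Event evt) {
>        down_queue.offer(evt);        // caller returns immediately
>    }
>
>    // body of the dedicated down-handler thread
>    void downHandlerLoop() throws InterruptedException {
>        for (;;) {
>            Event evt = down_queue.take();
>            handleDownEvent(evt);     // STOP, MSG, etc. handled off the caller's thread
>        }
>    }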
> Bela:
> Then it must be a locking issue, I'll take a look tomorrow. Or if you find the solution sooner, all the better !  :-)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        


