[jboss-jira] [JBoss JIRA] Commented: (JGRP-750) Deadlock between GroupRequest and FLUSH during concurrent startup.

Fri May 2 09:15:24 EDT 2008

    [ http://jira.jboss.com/jira/browse/JGRP-750?page=comments#action_12411583 ] 

Vladimir Blagojevic commented on JGRP-750:
------------------------------------------

I found the cause of this problem. When an application thread calls castMessage, the underlying plumbing in MessageDispatcher sends a GroupRequest, which in turn obtains a lock that is held until the response arrives. In MuxChannel#connect we send a synchronous call across the cluster to update the service view. When this request reaches a node that has sent a GroupRequest, it needs to obtain the same lock to update the view in GroupRequest#viewChange. However, it cannot obtain the lock, since it is held by the application thread, which is in turn blocked in FLUSH.down. So the request times out.

So you may ask why this happens in MuxChannel and not in JChannel. Well, in MuxChannel#connect we do two flushes when it is the first service connecting on top of an unconnected "real" channel. The offending castMessage call squeezes itself in between these two flushes.
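
The lock interaction described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not JGroups code: FlushLockSketch, requestLock, flushStopped and viewChangeAcquiresLock are hypothetical names. One thread plays the application thread that holds the request lock while blocked in the flush; the calling thread plays the incoming thread that needs the same lock to deliver the view change and gives up after a timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-ins only: none of these names are JGroups classes.
final class FlushLockSketch {

    /**
     * Returns true if the "incoming" thread obtained the request lock within
     * timeoutMs, false if it timed out while the "application" thread held it.
     */
    static boolean viewChangeAcquiresLock(long timeoutMs) {
        ReentrantLock requestLock = new ReentrantLock();      // assumed: the lock taken by the group request
        CountDownLatch flushStopped = new CountDownLatch(1);  // assumed: the flush being lifted
        CountDownLatch lockHeld = new CountDownLatch(1);

        // Application thread: takes the lock, then blocks "in the flush".
        Thread app = new Thread(() -> {
            requestLock.lock();
            try {
                lockHeld.countDown();
                flushStopped.await(); // blocked while still holding the lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                requestLock.unlock();
            }
        });
        app.start();

        boolean acquired = false;
        try {
            lockHeld.await();
            // Incoming thread: needs the same lock to deliver the view change.
            acquired = requestLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (acquired) {
                requestLock.unlock();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            // Lift the "flush" only after the attempt, so the sketch terminates.
            flushStopped.countDown();
            try {
                app.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return acquired;
    }

    public static void main(String[] args) {
        System.out.println("view change got the lock: " + viewChangeAcquiresLock(200));
    }
}
```

In this sketch the flush is lifted by the caller after the lock attempt fails. In the scenario reported in this issue, nothing lifts it while the incoming threads are stuck, which is why the real request can only time out.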

> Deadlock between GroupRequest and FLUSH during concurrent startup.
> ------------------------------------------------------------------
>
>                 Key: JGRP-750
>                 URL: http://jira.jboss.com/jira/browse/JGRP-750
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.6.3
>         Environment: Debian etch (i386), Java HotSpot(TM) Server VM (build 1.6.0-b105, mixed mode)
>            Reporter: Robert Newson
>         Assigned To: Vladimir Blagojevic
>         Attachments: ConcurrentStartupWithGroupRequestTest.java, ConcurrentStartupWithGroupRequestTest.java, stacktrace.txt
>
>
> We've been having more trouble with concurrent start up and now think
> we've isolated a deadlock between FLUSH and GroupRequest during
> concurrent startup.
> We have four boxes that join a channel and use MessageDispatcher
> immediately after connecting. This frequently blocks indefinitely.
> GroupRequest.execute() obtains a lock, then a subsequent view change
> comes in which does likewise. The upshot is that we can see all
> Incoming threads are blocked for the lock and the only way it can be
> released is for a stop_flush message to occur. With all incoming
> threads blocked, that never happens.
> In the attached unit test, adding the following after the call to connect("A") makes it pass, which implies a deadlock:
> if (j == 0) {
>   Thread.sleep(500);
> }
> Additionally, and this is more speculative, it seems the wait/notify code in pbcast does not account for the spurious wakeup case. I don't know under what circumstances they happen, and I don't believe we're seeing spurious wakes at this time, but it should be fixed at some stage.
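
The spurious wakeup concern raised in the description is conventionally handled by re-checking the awaited condition in a loop around wait(). The following is a generic sketch of that idiom, not the pbcast code: GuardedWaitSketch, awaitDone and markDone are hypothetical names, and whether pbcast actually waits without such a loop is the reporter's observation, not something verified here.

```java
// Hypothetical illustration of a guarded wait; not taken from JGroups.
final class GuardedWaitSketch {
    private final Object monitor = new Object();
    private boolean done; // the condition being waited for, guarded by monitor

    /** Waits until done is set or timeoutMs elapses; returns the final value of done. */
    boolean awaitDone(long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        synchronized (monitor) {
            // Loop: a wakeup without the condition being true simply waits again.
            while (!done) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    break; // timed out; never call wait(0), which waits indefinitely
                }
                try {
                    monitor.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return done;
        }
    }

    /** Sets the condition and wakes all waiters. */
    void markDone() {
        synchronized (monitor) {
            done = true;
            monitor.notifyAll();
        }
    }

    public static void main(String[] args) {
        GuardedWaitSketch sketch = new GuardedWaitSketch();
        new Thread(sketch::markDone).start();
        System.out.println("done: " + sketch.awaitDone(1000));
    }
}
```

Recomputing the remaining time on each pass keeps the overall timeout intact even if the thread is woken several times before the condition holds.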
