[jboss-jira] [JBoss JIRA] Updated: (JGRP-668) Deadlock condition in BARRIER

Mon Jan 21 11:39:22 EST 2008

     [ http://jira.jboss.com/jira/browse/JGRP-668?page=all ]

Gray Watson updated JGRP-668:
-----------------------------

    Attachment: barrier_deadlock.txt

Stack trace showing the 3 classes of threads that are deadlocked.

> Deadlock condition in BARRIER
> -----------------------------
>
>                 Key: JGRP-668
>                 URL: http://jira.jboss.com/jira/browse/JGRP-668
>             Project: JGroups
>          Issue Type: Bug
>            Reporter: Bela Ban
>         Assigned To: Bela Ban
>             Fix For: 2.7
>
>         Attachments: barrier_deadlock.txt
>
>
> Hey Bela et al:
> We've been fighting a lossy network (UDP receive errors) on a cluster of 50
> machines and managed to produce 2 coordinators who refused to MERGE3.  A
> closer examination revealed this stack trace
> (http://www.nabble.com/file/p14991972/barrier_deadlock.txt), which showed
> one thread trying to satisfy a STATE_REQ msg and blocked down in
> BARRIER.closeBarrier() waiting for in_flight_threads to empty, another
> thread trying to service another STATE_REQ and blocked trying to lock the
> state_requesters table up in STATE_TRANSFER.handleStateReq(), and another
> 12 threads blocked waiting for the barrier to open in BARRIER.up().
> We quickly found that we had a deadlock condition in BARRIER; here's the
> patch to fix it: http://www.nabble.com/file/p14991972/BARRIER.java.patch
> However, we cannot see an easy way to fix 2 STATE_REQ messages coming right
> after one another.  They will both enter the in_flight_threads set, and only
> one will come back down to lock the barrier; it will then wait forever for
> the other one to leave in_flight_threads.  If we let the 2nd come down too,
> it may come back up before in_flight_threads is clear, since all it does is
> see that the barrier is closed and return.
> Although we may have fixed part of the deadlock we saw, we are looking into
> switching to the FLUSH protocol instead because of the 2 STATE_REQ issue.
> Just curious about others' feedback on this issue and whether more folks
> are using FLUSH or BARRIER?
> Thanks much for an [otherwise] great code stack.  We are excited to be using
> it in our distributed database system project.
> gray
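
The two-STATE_REQ scenario described above can be sketched in a small,
self-contained model. This is not the JGroups BARRIER source: the class
BarrierSketch, its methods, and the timeout on closeBarrier() are
hypothetical and exist only so the stall can be observed and the demo can
terminate. It also assumes that closeBarrier() ignores the calling thread
when checking the in-flight set.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified model of the scenario; NOT the JGroups BARRIER code.
public class BarrierSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition inFlightChanged = lock.newCondition();
    private final Set<Thread> inFlight = new HashSet<>();
    // Stands in for the lock on the state_requesters table in STATE_TRANSFER.
    final ReentrantLock stateRequesters = new ReentrantLock();

    // A thread passing up through the barrier registers itself as in flight.
    void enter() {
        lock.lock();
        try { inFlight.add(Thread.currentThread()); } finally { lock.unlock(); }
    }

    // A thread returning from the upper layers deregisters itself.
    void leave() {
        lock.lock();
        try {
            inFlight.remove(Thread.currentThread());
            inFlightChanged.signalAll();
        } finally { lock.unlock(); }
    }

    // Waits until no thread other than the caller is in flight.
    // Returns false on timeout (the timeout is an addition for this demo).
    boolean closeBarrier(long timeoutMs) throws InterruptedException {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (othersInFlight()) {
                if (nanos <= 0) return false;
                nanos = inFlightChanged.awaitNanos(nanos);
            }
            return true;
        } finally { lock.unlock(); }
    }

    private boolean othersInFlight() {
        for (Thread t : inFlight) {
            if (t != Thread.currentThread()) return true;
        }
        return false;
    }

    // Returns true if the first requester's closeBarrier() timed out.
    static boolean twoStateRequestsStall(long timeoutMs) throws InterruptedException {
        BarrierSketch b = new BarrierSketch();
        CountDownLatch secondInFlight = new CountDownLatch(1);
        b.enter();                     // first STATE_REQ goes up through the barrier
        b.stateRequesters.lock();      // ...and takes the state_requesters lock
        Thread second = new Thread(() -> {
            b.enter();                 // second STATE_REQ is now in flight too
            secondInFlight.countDown();
            b.stateRequesters.lock();  // blocks: the first requester holds the lock
            b.stateRequesters.unlock();
            b.leave();
        });
        second.start();
        secondInFlight.await();
        // The first requester waits for a thread that is itself waiting on it.
        boolean closed = b.closeBarrier(timeoutMs);
        b.stateRequesters.unlock();    // reached only because of the demo timeout
        second.join();
        b.leave();
        return !closed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("closeBarrier stalled: " + twoStateRequestsStall(200));
    }
}
```

Without the timeout, the first requester would never return from
closeBarrier(), which matches the wait-forever behaviour described in the
report.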
