[jboss-jira] [JBoss JIRA] Commented: (JGRP-668) Deadlock condition in BARRIER

Thursday, 28 February 2008

    [ http://jira.jboss.com/jira/browse/JGRP-668?page=comments#action_12400826 ] 

Bela Ban commented on JGRP-668:
-------------------------------

Applied the patch, thanks.

I also fixed (or so I think) the second issue. The solution was to remove blocked or
waiting threads from in_flight_threads in closeBarrier() (and re-insert them when
closeBarrier() returns). BARRIER was written to keep messages from modifying the state
and digest; blocked or (to a lesser degree) waiting messages don't proceed, and
therefore don't modify the state.
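
To illustrate the idea, here is a minimal sketch, not the actual BARRIER code. The class SimpleBarrier, its methods (enter, exit, openBarrier, closeBarrier with a timeout) and the demo() scenario are all hypothetical names made up for this example. It assumes that "blocked or waiting" can be approximated by any Thread.getState() other than RUNNABLE or NEW. The closer ignores itself, sets non-runnable in-flight threads aside, and re-inserts them when it returns.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

/** Hypothetical, simplified barrier for illustration; not the JGroups BARRIER source. */
public class SimpleBarrier {
    private final Object lock = new Object();
    private boolean closed = false;
    private final Set<Thread> inFlight = new HashSet<>(); // threads past the barrier
    private final Set<Thread> parked = new HashSet<>();   // set aside while closing

    /** Message going up: block while the barrier is closed, then register. */
    public void enter() throws InterruptedException {
        synchronized (lock) {
            while (closed) lock.wait();
            inFlight.add(Thread.currentThread());
        }
    }

    /** Message done: deregister and wake a closer that may be waiting. */
    public void exit() {
        synchronized (lock) {
            inFlight.remove(Thread.currentThread());
            parked.remove(Thread.currentThread());
            lock.notifyAll();
        }
    }

    public void openBarrier() {
        synchronized (lock) { closed = false; lock.notifyAll(); }
    }

    /** Closes the barrier; returns false (and reopens) if in-flight threads do not drain in time. */
    public boolean closeBarrier(long timeoutMs) throws InterruptedException {
        Thread self = Thread.currentThread();
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (lock) {
            closed = true;
            boolean selfWasInFlight = inFlight.remove(self); // never wait for ourselves
            try {
                while (true) {
                    for (Iterator<Thread> it = inFlight.iterator(); it.hasNext();) {
                        Thread t = it.next();
                        Thread.State s = t.getState();
                        if (s != Thread.State.RUNNABLE && s != Thread.State.NEW) {
                            parked.add(t); // blocked or waiting: not modifying state
                            it.remove();
                        }
                    }
                    if (inFlight.isEmpty()) return true;
                    long remaining = deadline - System.currentTimeMillis();
                    if (remaining <= 0) { closed = false; lock.notifyAll(); return false; }
                    lock.wait(Math.min(remaining, 50));
                }
            } finally {
                inFlight.addAll(parked); // re-insert on return
                parked.clear();
                if (selfWasInFlight) inFlight.add(self);
            }
        }
    }

    /** Two request threads are in flight; the second is blocked on a table the first holds. */
    public static boolean demo() {
        SimpleBarrier barrier = new SimpleBarrier();
        Object table = new Object(); // stands in for the state_requesters table
        CountDownLatch entered = new CountDownLatch(1);
        Thread second = new Thread(() -> {
            try {
                barrier.enter();
                entered.countDown();
                synchronized (table) { /* would handle the request here */ }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                barrier.exit();
            }
        });
        try {
            boolean ok;
            synchronized (table) {
                barrier.enter(); // the first thread is in flight as well
                second.start();
                entered.await();
                while (second.getState() != Thread.State.BLOCKED) Thread.sleep(5);
                ok = barrier.closeBarrier(2000);
            }
            barrier.openBarrier();
            barrier.exit();
            second.join();
            return ok;
        } catch (InterruptedException e) {
            return false;
        }
    }
}
```

In demo(), the closing thread holds the table the second thread is blocked on, which mirrors the two STATE_REQ case from the report. Without the "set aside" step, closeBarrier() would wait on the second thread until its timeout expired.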

...
 Deadlock condition in BARRIER
 -----------------------------

                 Key: JGRP-668
                 URL: http://jira.jboss.com/jira/browse/JGRP-668
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 2.6.1
            Reporter: Bela Ban
         Assigned To: Bela Ban
             Fix For: 2.7, 2.6.3

         Attachments: BARRIER.java.patch, barrier_deadlock.txt, jgroups_protocol.xml

 Hey Bela et al:
 We've been fighting a lossy network (UDP receive errors) on a cluster of 50
 machines and managed to produce 2 coordinators who refused to MERGE3.  A
 closer examination revealed this stack trace
 (http://www.nabble.com/file/p14991972/barrier_deadlock.txt), which showed
 three things: one thread trying to satisfy a STATE_REQ msg was blocked down
 in BARRIER.closeBarrier() waiting for the in_flight_threads to empty; another
 thread trying to service another STATE_REQ was blocked trying to lock the
 state_requesters table up in STATE_TRANSFER.handleStateReq(); and another 12
 threads were blocked waiting for the barrier to open in BARRIER.up().
 We quickly found that we had a deadlock condition in BARRIER that was
 problematic; here's the patch to fix this:
 http://www.nabble.com/file/p14991972/BARRIER.java.patch
 However, we cannot see an easy way to fix 2 STATE_REQ messages coming right
 after one another.  They will both enter the in_flight_threads set and only
 one will come back down to lock the barrier, and it will wait forever for the
 other one to leave in_flight_threads.  If we let the 2nd come down too, it may
 come back up before in_flight_threads is clear, since all it does is see that
 the barrier is closed and return.
 Although we may have fixed part of the deadlock we saw, we are looking into
 switching to the FLUSH protocol instead because of the 2 STATE_REQ issue. 
 Just curious as to others' feedback about this issue and whether more folks
 are using FLUSH or BARRIER?
 Thanks much for an [otherwise] great code stack.  We are excited to be using
 it in our distributed database system project.
 gray 
-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
