Bela Ban commented on JGRP-668:
-------------------------------
Applied the patch, thanks.
I also fixed (or so I think) the second issue. The solution was to remove blocked or
waiting threads in closeBarrier() (and re-insert them when closeBarrier() returns).
BARRIER was written to block messages from modifying the state and digest.
Blocked or (to a lesser degree) waiting messages don't proceed, and therefore
don't modify the state.
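
To make the idea concrete, here is a minimal sketch of a barrier whose close operation does not wait for threads that cannot make progress. It is not the JGroups BARRIER source: the class SketchBarrier, its method signatures, the timeout and the polling interval are all made up for illustration. It also simplifies the approach described above by skipping blocked or waiting threads when counting, instead of removing them and re-inserting them afterwards.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical, simplified barrier; NOT the actual JGroups BARRIER source. */
class SketchBarrier {
    private static final long POLL_NANOS = TimeUnit.MILLISECONDS.toNanos(10);
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition changed = lock.newCondition();
    private final Set<Thread> inFlight = new HashSet<>(); // guarded by lock
    private boolean closed;                                // guarded by lock

    /** A message travelling up; 'work' stands in for the protocols above the barrier. */
    void up(Runnable work) {
        Thread me = Thread.currentThread();
        lock.lock();
        try {
            while (closed) changed.awaitUninterruptibly(); // parked here, not yet in flight
            inFlight.add(me);
        } finally {
            lock.unlock();
        }
        try {
            work.run();
        } finally {
            lock.lock();
            try {
                inFlight.remove(me);
                changed.signalAll();
            } finally {
                lock.unlock();
            }
        }
    }

    /** Counts in-flight threads, skipping the caller and (by assumption) BLOCKED/WAITING ones. */
    private int active(Thread caller) {
        int n = 0;
        for (Thread t : inFlight) {
            Thread.State s = t.getState();
            if (t != caller && s != Thread.State.BLOCKED && s != Thread.State.WAITING) n++;
        }
        return n;
    }

    /** Closes the barrier; returns false (and reopens it) on timeout or interrupt. */
    boolean closeBarrier(long timeoutMs) {
        Thread me = Thread.currentThread();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            closed = true;
            while (active(me) > 0) {
                long left = deadline - System.nanoTime();
                if (left <= 0) { closed = false; changed.signalAll(); return false; }
                // Thread-state changes are not signalled, so re-check periodically.
                changed.awaitNanos(Math.min(left, POLL_NANOS));
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            closed = false;
            changed.signalAll();
            return false;
        } finally {
            lock.unlock();
        }
    }

    /** Reopens the barrier and wakes every thread parked in up(). */
    void openBarrier() {
        lock.lock();
        try {
            closed = false;
            changed.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

The caveat is the "to a lesser degree" part: a thread that is only briefly blocked or waiting on something unrelated would also be skipped, and could modify state once it resumes.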
Deadlock condition in BARRIER
-----------------------------
Key: JGRP-668
URL: http://jira.jboss.com/jira/browse/JGRP-668
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6.1
Reporter: Bela Ban
Assigned To: Bela Ban
Fix For: 2.7, 2.6.3
Attachments: BARRIER.java.patch, barrier_deadlock.txt, jgroups_protocol.xml
Hey Bela et al:
We've been fighting a lossy network (UDP receive errors) on a cluster of 50
machines and managed to produce 2 coordinators who refused to MERGE3. A
closer examination revealed this stack trace:
http://www.nabble.com/file/p14991972/barrier_deadlock.txt
It showed one thread trying to satisfy a STATE_REQ msg, blocked down in
BARRIER.closeBarrier() waiting for in_flight_threads to empty; another
thread trying to service another STATE_REQ, blocked trying to lock the
state_requesters table up in STATE_TRANSFER.handleStateReq(); and another 12
threads blocked waiting for the barrier to open in BARRIER.up().
We quickly found that we had a deadlock condition in BARRIER that was
problematic. Here's the patch to fix it:
http://www.nabble.com/file/p14991972/BARRIER.java.patch
However, we cannot see an easy way to handle 2 STATE_REQ messages arriving
one right after the other. They will both enter the in_flight_threads set,
and only one will come back down to lock the barrier; it will then wait
forever for the other one to leave in_flight_threads. If we let the 2nd come
down too, it may come back up before in_flight_threads is clear, since all
it does is see that the barrier is closed and return.
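
The interleaving above can be reproduced with a small stand-alone model. This is only a sketch under assumptions, not JGroups code: TwoStateReqDemo, its fields and its methods are invented names, a counter stands in for in_flight_threads, a plain monitor stands in for the state_requesters lock, and closeBarrier() is given a timeout so the demo ends instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of the two-STATE_REQ stall; names are illustrative, not JGroups code. */
class TwoStateReqDemo {
    private final AtomicInteger inFlight = new AtomicInteger();       // stands in for in_flight_threads
    private final Object stateRequesters = new Object();              // stands in for the state_requesters lock
    private final CountDownLatch bothInFlight = new CountDownLatch(2);

    /** Naive close: wait until every other in-flight thread has left, or give up at the deadline. */
    private boolean closeBarrier(long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (inFlight.get() > 1) {                 // 1 = the caller itself
            if (System.nanoTime() - deadline >= 0) return false;
            Thread.sleep(5);
        }
        return true;
    }

    /** One STATE_REQ: enter the barrier, take the table lock, then try to close the barrier. */
    private boolean handleStateReq(long timeoutMs) throws InterruptedException {
        inFlight.incrementAndGet();
        bothInFlight.countDown();
        bothInFlight.await();                        // force the "one right after the other" case
        try {
            synchronized (stateRequesters) {
                return closeBarrier(timeoutMs);
            }
        } finally {
            inFlight.decrementAndGet();
        }
    }

    /** Runs two concurrent requests; returns how many managed to close the barrier. */
    static int run(long timeoutMs) {
        TwoStateReqDemo demo = new TwoStateReqDemo();
        AtomicInteger closedCount = new AtomicInteger();
        Runnable req = () -> {
            try {
                if (demo.handleStateReq(timeoutMs)) closedCount.incrementAndGet();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread a = new Thread(req);
        Thread b = new Thread(req);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return closedCount.get();
    }

    public static void main(String[] args) {
        System.out.println("requests that closed the barrier: " + run(200));
    }
}
```

Whichever request takes the lock first can only leave closeBarrier() by timing out, because the other request stays in flight until it gets the same lock. With an unbounded wait, as in the stack trace, neither thread would ever proceed.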
Although we may have fixed part of the deadlock we saw, we are looking into
switching to the FLUSH protocol instead because of the 2 STATE_REQ issue.
Just curious about others' feedback on this issue, and whether more folks
are using FLUSH or BARRIER?
Thanks much for an [otherwise] great code stack. We are excited to be using
it in our distributed database system project.
gray