]
Vladimir Blagojevic commented on JGRP-608:
------------------------------------------
Michael,
I was aware of the these problems but as we were in the process of simplifying flush
(JGRP-598) I never got around to verify how concurrent flush works with a revised FLUSH
version. Would you please verify FLUSH.java version 1.75 that I checked in this morning
and see if you still experience these problems.
Vladimir
FLUSH not safe in simultaneous flush situation
----------------------------------------------
Key: JGRP-608
URL:
http://jira.jboss.com/jira/browse/JGRP-608
Project: JGroups
Issue Type: Feature Request
Affects Versions: 2.6
Reporter: Michael Newcomb
Assigned To: Vladimir Blagojevic
Priority: Critical
Fix For: 2.6
I'm running into a few problems when multiple members request a FLUSH at the same
time.
I am still in the process of analyzing the situation, but here are a few problems:
private void handleStartFlush(Message msg, FlushHeader fh) {
byte oldPhase = flushPhase.transitionToFirstPhase();
if(oldPhase == FlushPhase.START_PHASE){
sendBlockUpToChannel();
onStartFlush(msg.getSrc(), fh);
}else if(oldPhase == FlushPhase.FIRST_PHASE){
Address flushRequester = msg.getSrc();
Address coordinator = null;
synchronized(sharedLock){
if(flushCoordinator != null)
coordinator = flushCoordinator;
else
coordinator = flushRequester;
}
After a successful transtion to the first phase, onStartFlush is called. Inside the
method, the flushCoordinator is set. However a simultaneous handleStartFlush from another
member would fail the transitionToFirstPhase() call, but it is possible that the
flushCoordinator might not be set yet and the second handleStartFlush would end up setting
the 'coordinator' to the flushRequestor. This could grant the second START_FLUSH a
FLUSH_OK without rejecting the flush that succeeded in the transitionToFirstPhase().
IMHO the transitionToFirstPhase() should assign the flushCoordinator at that time to
avoid the above situation, in fact, all of the protected parts of the onStartFlush() call
should be made inside the protected section of transitionToFirstPhase().
The above should take care of the
didn't-succeed-in-transitionToFirstPhase-but-became-flushCoordinator-anyways-because-flushCoordinator-wasn't-set-yet
problem.
Now, there is another problem in the way simultaneous START_FLUSHes are resolved. If the
new flushRequestor is < the current flushCoordinator the new requestor will win the
flush fight. This is a great strategy because all the members will get the same result
here.
If A, B, C are members and A < B < C and they all simultaneously request a flush,
then the desired effect is for everyone to decide that A will win the flush fight.
So the following code handles that situation...
private void handleStartFlush(Message msg, FlushHeader fh) {
...
}else if(oldPhase == FlushPhase.FIRST_PHASE){
...
if(flushRequester.compareTo(coordinator) < 0){
rejectFlush(fh.viewID, coordinator);
...
synchronized(sharedLock){
flushCoordinator = flushRequester;
}
}
If the new flushRequetor is < the old flushCoordinator than reject/abort the old
flush. However, the new flushCoordinator is never acknowledged by a onStartFlush (which
sends FLUSH_OK). If we moved all the protected parts of onStartFlush() to
transitionToFirstPhase() then the flush members would never be set properly, so that
situation needs to be resolved as well.
I think this might be the problem I'm seeing because I wait indefinitely for my flush
to succeed.
I will be testing a patch this afternoon and hope to report back.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: