[
http://jira.jboss.com/jira/browse/JGRP-608?page=comments#action_12386560 ]
Michael Newcomb commented on JGRP-608:
--------------------------------------
197 private boolean startFlush(Event evt, int numberOfAttempts, boolean isRetry) {
198 boolean successfulFlush = false;
199 if(!flushInProgress.get() || isRetry){
200 flush_promise.reset();
While reviewing the new code, I came across the above in FLUSH.java.
Does isRetry really outweigh the fact that a flush is in progress? If a flush is in
progress, is there a need to retry starting a flush?
Also, further down, it is possible for a flush to complete sometime after the initial
timeout waiting for the flush to complete and the time the flush is attempted again. In
fact, a flush attempt that timed out, could even complete a flush has just been reset
(line 200).
I think that flush messages need to be tracked by member/viewid/flush# or something. So,
each flush message can be checked against a flush 'id'. Also, I think that having
a flush 'object' tracking all the participants, acknowledged members, and even the
rejected members would also aid in the debugging process.
Anyway, I'm taking the latest code into our lab and will report back soon.
FLUSH not safe in simultaneous flush situation
----------------------------------------------
Key: JGRP-608
URL:
http://jira.jboss.com/jira/browse/JGRP-608
Project: JGroups
Issue Type: Feature Request
Affects Versions: 2.6
Reporter: Michael Newcomb
Assigned To: Vladimir Blagojevic
Priority: Critical
Fix For: 2.6
I'm running into a few problems when multiple members request a FLUSH at the same
time.
I am still in the process of analyzing the situation, but here are a few problems:
private void handleStartFlush(Message msg, FlushHeader fh) {
byte oldPhase = flushPhase.transitionToFirstPhase();
if(oldPhase == FlushPhase.START_PHASE){
sendBlockUpToChannel();
onStartFlush(msg.getSrc(), fh);
}else if(oldPhase == FlushPhase.FIRST_PHASE){
Address flushRequester = msg.getSrc();
Address coordinator = null;
synchronized(sharedLock){
if(flushCoordinator != null)
coordinator = flushCoordinator;
else
coordinator = flushRequester;
}
After a successful transtion to the first phase, onStartFlush is called. Inside the
method, the flushCoordinator is set. However a simultaneous handleStartFlush from another
member would fail the transitionToFirstPhase() call, but it is possible that the
flushCoordinator might not be set yet and the second handleStartFlush would end up setting
the 'coordinator' to the flushRequestor. This could grant the second START_FLUSH a
FLUSH_OK without rejecting the flush that succeeded in the transitionToFirstPhase().
IMHO the transitionToFirstPhase() should assign the flushCoordinator at that time to
avoid the above situation, in fact, all of the protected parts of the onStartFlush() call
should be made inside the protected section of transitionToFirstPhase().
The above should take care of the
didn't-succeed-in-transitionToFirstPhase-but-became-flushCoordinator-anyways-because-flushCoordinator-wasn't-set-yet
problem.
Now, there is another problem in the way simultaneous START_FLUSHes are resolved. If the
new flushRequestor is < the current flushCoordinator the new requestor will win the
flush fight. This is a great strategy because all the members will get the same result
here.
If A, B, C are members and A < B < C and they all simultaneously request a flush,
then the desired effect is for everyone to decide that A will win the flush fight.
So the following code handles that situation...
private void handleStartFlush(Message msg, FlushHeader fh) {
...
}else if(oldPhase == FlushPhase.FIRST_PHASE){
...
if(flushRequester.compareTo(coordinator) < 0){
rejectFlush(fh.viewID, coordinator);
...
synchronized(sharedLock){
flushCoordinator = flushRequester;
}
}
If the new flushRequetor is < the old flushCoordinator than reject/abort the old
flush. However, the new flushCoordinator is never acknowledged by a onStartFlush (which
sends FLUSH_OK). If we moved all the protected parts of onStartFlush() to
transitionToFirstPhase() then the flush members would never be set properly, so that
situation needs to be resolved as well.
I think this might be the problem I'm seeing because I wait indefinitely for my flush
to succeed.
I will be testing a patch this afternoon and hope to report back.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira