[jboss-jira] [JBoss JIRA] Commented: (JGRP-756) FLUSH still needs work

Wed May 21 11:10:07 EDT 2008

    [ http://jira.jboss.com/jira/browse/JGRP-756?page=comments#action_12413549 ] 

Vladimir Blagojevic commented on JGRP-756:
------------------------------------------

Michael, very good catch! I think it is enough to check whether abortFlushCoordinator != proceedFlushCoordinator and to send the reject message only in that case. Thanks for spotting this one!
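
A minimal sketch of that check, for illustration only. It is not the actual FLUSH code: the class FlushDecision and the method decide() are hypothetical names, and addresses are modeled as plain strings. Only abortFlushCoordinator and proceedFlushCoordinator come from the discussion, and the ordering rule follows Michael's description quoted below.

```java
// Sketch only: FlushDecision and decide() are hypothetical, not JGroups API.
final class FlushDecision {
    final String abortFlushCoordinator;   // whose flush is aborted
    final String proceedFlushCoordinator; // whose flush continues
    final boolean sendReject;             // whether a reject message is sent

    FlushDecision(String abort, String proceed, boolean sendReject) {
        this.abortFlushCoordinator = abort;
        this.proceedFlushCoordinator = proceed;
        this.sendReject = sendReject;
    }

    // A member already in a flush led by currentCoordinator receives a new
    // flush request from requester (addresses modeled as strings here).
    static FlushDecision decide(String currentCoordinator, String requester) {
        String abort;
        String proceed;
        if (requester.compareTo(currentCoordinator) < 0) {
            // The lower requester wins: abort the current coordinator's flush.
            abort = currentCoordinator;
            proceed = requester;
        } else {
            // Higher or equal requester: keep the current coordinator.
            abort = requester;
            proceed = currentCoordinator;
        }
        // Proposed guard: never reject the node we are about to proceed with.
        boolean sendReject = !abort.equals(proceed);
        return new FlushDecision(abort, proceed, sendReject);
    }
}
```

With this guard, a repeated flush request from the same coordinator produces no reject, so the FLUSH_COMPLETED that follows is no longer overridden.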

> FLUSH still needs work
> ----------------------
>
>                 Key: JGRP-756
>                 URL: http://jira.jboss.com/jira/browse/JGRP-756
>             Project: JGroups
>          Issue Type: Bug
>            Reporter: Bela Ban
>         Assigned To: Vladimir Blagojevic
>             Fix For: 2.7, 2.6.3
>
>
> [Michael Newcomb]
> Still debugging concurrent starting issues... Now I'm running into a
> problem with FLUSH.
> So, there are 3 current members (A, B, C) and a new one joins (D)...
> 1. coord starts a flush on A,B,C
> 2. coord receives FLUSH_COMPLETED from A,B (misses C)
> 3. coord times out and sleeps a few seconds
> 4. coord starts a new flush on A,B,C
> Here is where the problems start. A,B (and possibly C) are already in a
> FLUSH situation. As far as they are concerned a flush is in progress
> because they sent FLUSH_COMPLETED to the coord.
> So, when they get a new flush, they determine who they are going to
> reject (either the currently flushing coordinator or the flush
> requestor).
> If the flush requestor is less than the current flush coordinator, a
> reject flush is sent to the original flush coordinator and the flush
> proceeds with the flush requestor.
> If the flush requestor is greater than the current flush coordinator, a
> reject flush is sent to the flush requestor and the flush proceeds
> with the original flush coordinator.
> If the flush requestor is the same as the current flush coordinator, it
> behaves as if the flush requestor were greater: a reject flush is sent
> to the current coordinator and then a FLUSH_COMPLETED is sent to him...
> The problem is that the FLUSH_COMPLETED is basically ignored because the
> reject flush sets the promise to FALSE which immediately fails the
> flush. This causes another flush retry which results in the same thing
> again and again until all the retries are exhausted and the overall
> flush fails. Furthermore, the node that rejected the flush is left in
> the exact same state: he thinks he is in a flush and will reject any new
> flush requests by the current flush coordinator!
> Essentially, retrying flushes is a waste of time...
> I think that there are several ways to solve this problem.
> Since the flush is 'restarted' (onStartFlush is called after the reject
> is sent) even when the flush requestor == the current flush coordinator,
> there may be no need to reject the flush when the flush requestor == the
> current flush coordinator. Only send a reject flush if the
> abortFlushCoordinator != proceedFlushCoordinator...
> If that is not sufficient, then when the flush requestor == the current
> flush coordinator, the node that rejects a flush should not 'restart'
> the flush by calling onStartFlush again (only call onStartFlush if
> abortFlushCoordinator != proceedFlushCoordinator). Restarting sets
> the next flush attempt up for failure again and again because nothing
> has changed at the node: he still thinks a flush is ongoing and will
> reject any new flushes from the current flush coordinator.
> Again, these cases are for when the flush requestor is == the current
> flush coordinator. I have yet to attempt concurrent flush attempts by
> different nodes  ;) 
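
The retry loop Michael describes can be modeled with a small simulation. This is a sketch under simplifying assumptions, not JGroups code: there is a single member stuck in a flush led by the coordinator, the coordinator's promise is settled by the first reply it receives, and FlushRetrySketch, memberReplies(), attempt(), attemptsUntilSuccess() and the REJECT_FLUSH label are hypothetical names. Only FLUSH_COMPLETED, abortFlushCoordinator and proceedFlushCoordinator are taken from the report.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a toy model of the retry loop, not the JGroups FLUSH protocol.
final class FlushRetrySketch {

    // Replies sent by a member that already believes a flush led by
    // currentCoordinator is in progress when requester asks for a new flush.
    static List<String> memberReplies(String currentCoordinator, String requester,
                                      boolean guardEnabled) {
        boolean requesterWins = requester.compareTo(currentCoordinator) < 0;
        String abortFlushCoordinator = requesterWins ? currentCoordinator : requester;
        String proceedFlushCoordinator = requesterWins ? requester : currentCoordinator;

        List<String> replies = new ArrayList<>();
        // Without the guard a reject is always sent, even to the node that
        // is about to receive FLUSH_COMPLETED.
        if (!guardEnabled || !abortFlushCoordinator.equals(proceedFlushCoordinator)) {
            replies.add("REJECT_FLUSH->" + abortFlushCoordinator);
        }
        replies.add("FLUSH_COMPLETED->" + proceedFlushCoordinator);
        return replies;
    }

    // One flush attempt by the coordinator: the first reply addressed to it
    // settles the promise (reject means FALSE, FLUSH_COMPLETED means TRUE).
    static boolean attempt(String coordinator, boolean guardEnabled) {
        for (String reply : memberReplies(coordinator, coordinator, guardEnabled)) {
            if (reply.equals("REJECT_FLUSH->" + coordinator)) return false;
            if (reply.equals("FLUSH_COMPLETED->" + coordinator)) return true;
        }
        return false;
    }

    // Returns the 1-based attempt that succeeded, or -1 if all retries failed.
    // The member's state never changes between attempts, as in the report.
    static int attemptsUntilSuccess(String coordinator, int maxRetries,
                                    boolean guardEnabled) {
        for (int i = 1; i <= maxRetries; i++) {
            if (attempt(coordinator, guardEnabled)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(attemptsUntilSuccess("A", 4, false)); // prints -1
        System.out.println(attemptsUntilSuccess("A", 4, true));  // prints 1
    }
}
```

In this model every retry without the guard fails in the same way, because the reject always arrives ahead of FLUSH_COMPLETED and the member's state is unchanged. With the guard the first attempt completes.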
