[jboss-jira] [JBoss JIRA] Updated: (JGRP-1282) Race condition in FLUSH when master leaves cluster

Dennis Reed (JIRA) jira-events at lists.jboss.org
Wed Feb 2 16:57:39 EST 2011


     [ https://issues.jboss.org/browse/JGRP-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Reed updated JGRP-1282:
------------------------------

    Description: 
There's a race condition in FLUSH when the master node is leaving the cluster
that can prevent the master from sending a new view (with a new master) before it leaves.

The FLUSH is started when GMS sends down an Event.SUSPEND.
FLUSH.down calls FLUSH.startFlush, which calls FLUSH.onSuspend.
onSuspend sends a START_FLUSH message down.

In the working case, the local node gets the START_FLUSH first.
FLUSH.up calls FLUSH.handleStartFlush, which calls FLUSH.onStartFlush.
onStartFlush sets the member variable "flushMembers".

Then the other nodes reply to the START_FLUSH with a FLUSH_COMPLETED.
FLUSH.up calls FLUSH.onFlushCompleted.  
onFlushCompleted checks "flushMembers" against the list of replies.  
If they match (and flushMembers is not null), the flush completes.

But in the non-working case, the FLUSH_COMPLETED from the other
nodes is processed before the local START_FLUSH.
In this case, flushMembers has not been set, and onFlushCompleted
does nothing, expecting more replies (which never come).

I believe this is only triggered when the master is leaving,
because the leaving master does not include itself in the FLUSH.  If it were
a flush member, there would also be a FLUSH_COMPLETED reply from itself to
trigger setting flushMembers at some point.
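
The ordering problem described above can be reproduced with a small standalone model.
This is only a sketch, not the JGroups code: the class Coordinator, the method simulate
and the node names A, B and C are made up for illustration, and "match" is assumed to
mean that every flush member has replied. It also assumes that a participating master's
own FLUSH_COMPLETED can only follow its local handling of START_FLUSH.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FlushRaceSketch {

    /** Simplified stand-in for the master-side FLUSH state (hypothetical, not the real class). */
    static class Coordinator {
        private Set<String> flushMembers;                     // null until START_FLUSH is handled locally
        private final Set<String> completed = new HashSet<>();
        private boolean flushDone = false;

        /** Models onStartFlush: records which nodes take part in the flush. */
        void onStartFlush(List<String> participants) {
            flushMembers = new HashSet<>(participants);
        }

        /** Models onFlushCompleted: records a reply, completes only if flushMembers is set and all replied. */
        void onFlushCompleted(String sender) {
            completed.add(sender);
            if (flushMembers != null && completed.containsAll(flushMembers)) {
                flushDone = true;
            }
            // Otherwise nothing happens: the method waits for more replies.
        }

        boolean isFlushDone() {
            return flushDone;
        }
    }

    /**
     * Plays one message ordering on master "A" with other nodes "B" and "C".
     * startFlushFirst: whether the local START_FLUSH is handled before the replies.
     * masterParticipates: whether "A" is itself a flush member (false when it is leaving).
     */
    static boolean simulate(boolean startFlushFirst, boolean masterParticipates) {
        List<String> others = Arrays.asList("B", "C");
        List<String> participants = new ArrayList<>(others);
        if (masterParticipates) {
            participants.add("A");
        }

        Coordinator master = new Coordinator();
        if (startFlushFirst) {
            master.onStartFlush(participants);
        }
        for (String node : others) {
            master.onFlushCompleted(node);
        }
        if (!startFlushFirst) {
            master.onStartFlush(participants);   // arrives too late for the replies above
        }
        if (masterParticipates) {
            master.onFlushCompleted("A");        // assumed to follow the local START_FLUSH
        }
        return master.isFlushDone();
    }

    public static void main(String[] args) {
        System.out.println("leaving master, START_FLUSH first:     " + simulate(true, false));
        System.out.println("leaving master, FLUSH_COMPLETED first: " + simulate(false, false));
        System.out.println("member master, FLUSH_COMPLETED first:  " + simulate(false, true));
    }
}
```

In this model only the second ordering leaves the flush incomplete: all replies are
consumed while flushMembers is still null, and no later message re-runs the check.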

  was:
There's a race condition in FLUSH when the master node is leaving the cluster,
that can cause the master to not send a new view (with a new master) before leaving.

The FLUSH is started when GMS sends down an Event.SUSPEND.
FLUSH.down calls FLUSH.startFlush, which calls FLUSH.onSuspend.
onSuspend sends a START_FLUSH message down.

In the working case, the local node gets the START_FLUSH first.
FLUSH.up calls FLUSH.handleStartFlush, which calls FLUSH.onStartFlush.
onStartFlush sets the member variable "flushMembers".

Then the other nodes reply to the START_FLUSH with a FLUSH_COMPLETED.
FLUSH.up calls FLUSH.onFlushCompleted.  
onFlushCompleted checks "flushMembers" against the list of replies.  
If they match (and flushMembers is not null), the flush completes.

But in the non-working case, the FLUSH_COMPLETED from the other
nodes is processed before the local START_FLUSH.
In this case, flushMembers has not been set, and onFlushCompleted
does nothing, expecting more replies (which never come).



> Race condition in FLUSH when master leaves cluster
> --------------------------------------------------
>
>                 Key: JGRP-1282
>                 URL: https://issues.jboss.org/browse/JGRP-1282
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.6.16
>            Reporter: Dennis Reed
>            Assignee: Bela Ban
