[
https://issues.jboss.org/browse/JGRP-1282?page=com.atlassian.jira.plugin....
]
Dennis Reed updated JGRP-1282:
------------------------------
Description:
There's a race condition in FLUSH when the master node is leaving the cluster
that can cause the master to leave without first sending a new view (with a new master).
The FLUSH is started when GMS sends down an Event.SUSPEND.
FLUSH.down calls FLUSH.startFlush, which calls FLUSH.onSuspend.
onSuspend sends a START_FLUSH message down.
In the working case, the local node gets the START_FLUSH first.
FLUSH.up calls FLUSH.handleStartFlush, which calls FLUSH.onStartFlush.
onStartFlush sets the member variable "flushMembers".
Then the other nodes reply to the START_FLUSH with a FLUSH_COMPLETED.
FLUSH.up calls FLUSH.onFlushCompleted.
onFlushCompleted checks "flushMembers" against the list of replies.
If they match (and flushMembers is not null), the flush completes.
But in the non-working case, the FLUSH_COMPLETED from the other
nodes is processed before the local START_FLUSH.
In this case, flushMembers has not been set, and onFlushCompleted
does nothing, expecting more replies (which never come).
I believe this will only be triggered when the master is leaving,
because the master does not include itself in the FLUSH. If it were a flush
member, there would be a FLUSH_COMPLETED reply from itself to
trigger setting flushMembers at some point.
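The ordering problem described above can be sketched with a small stand-alone model. This is only an illustration, not the JGroups source: the class FlushRaceSketch, the replay helper, the "START" marker and the node names A, B and C are hypothetical, and only flushMembers, onStartFlush and onFlushCompleted are names taken from the description. The third scenario assumes one reading of the last paragraph, namely that a master which is itself a flush member would send its own FLUSH_COMPLETED after handling START_FLUSH, so the check would run again once flushMembers is set.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified, hypothetical model of the FLUSH bookkeeping described above.
// It is NOT the JGroups implementation.
final class FlushRaceSketch {
    private Set<String> flushMembers;                       // stays null until START_FLUSH is handled locally
    private final Set<String> completed = new HashSet<>();  // senders of FLUSH_COMPLETED seen so far
    private boolean flushDone;

    // Local handling of START_FLUSH: record who takes part in the flush.
    void onStartFlush(Set<String> participants) {
        flushMembers = new HashSet<>(participants);
    }

    // Handling of a FLUSH_COMPLETED reply: the flush only completes if
    // flushMembers is already set and every flush member has replied.
    void onFlushCompleted(String sender) {
        completed.add(sender);
        if (flushMembers != null && completed.containsAll(flushMembers)) {
            flushDone = true;
        }
    }

    boolean isFlushDone() {
        return flushDone;
    }

    // Replays events in the given order. "START" stands for the local
    // START_FLUSH; any other string is the sender of a FLUSH_COMPLETED.
    static boolean replay(Set<String> participants, List<String> events) {
        FlushRaceSketch flush = new FlushRaceSketch();
        for (String event : events) {
            if (event.equals("START")) {
                flush.onStartFlush(participants);
            } else {
                flush.onFlushCompleted(event);
            }
        }
        return flush.isFlushDone();
    }

    public static void main(String[] args) {
        // Leaving master A is not a flush member; only B and C are.
        Set<String> withoutMaster = Set.of("B", "C");
        // Working case: local START_FLUSH is processed before the replies.
        System.out.println(replay(withoutMaster, List.of("START", "B", "C")));
        // Non-working case: both replies are processed before START_FLUSH,
        // so no later reply arrives to re-run the check.
        System.out.println(replay(withoutMaster, List.of("B", "C", "START")));
        // Assumed case: master A is a flush member, and its own reply
        // arrives after START_FLUSH and re-runs the check.
        System.out.println(replay(Set.of("A", "B", "C"), List.of("B", "C", "START", "A")));
    }
}
```

In this model the second replay never completes because nothing re-runs the check after flushMembers is set, which matches the stall described above.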
Race condition in FLUSH when master leaves cluster
--------------------------------------------------
Key: JGRP-1282
URL:
https://issues.jboss.org/browse/JGRP-1282
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6.16
Reporter: Dennis Reed
Assignee: Bela Ban