[jboss-jira] [JBoss JIRA] (JGRP-1895) FLUSH_NOT_COMPLETED race condition
Dennis Reed (JIRA)
issues at jboss.org
Mon Oct 27 17:30:35 EDT 2014
[ https://issues.jboss.org/browse/JGRP-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015435#comment-13015435 ]
Dennis Reed commented on JGRP-1895:
-----------------------------------
Specific case where this caused a major outage:
A new node was joining the cluster (JOIN_REQ_WITH_STATE_TRANSFER).
Coordinator started flush.
nodeB was in GC and didn't respond, so the flush timed out.
Coordinator started another flush.
nodeB responded to the original flush with FLUSH_COMPLETED, and the coordinator thought it was a response for the current flush when it was not.
FLUSH completed (when it shouldn't have), and coordinator started a reconcile.
nodeB then responded to the current FLUSH with FLUSH_NOT_COMPLETED, which set flush_promise to false to indicate a failure.
Since the FLUSH actually succeeded, the coordinator did not abort it.
But since it was reported as failing, the new node did not join the cluster, and so never stopped the FLUSH.
FLUSH was then stuck in an incomplete state until the channel was restarted.
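
To make the misattribution concrete, here is a minimal sketch of how a coordinator could discard late responses if every flush carried a round id. This is an illustration only, not the JGroups FLUSH code and not necessarily how this issue gets fixed: the class FlushRoundTracker, its methods, and the round id itself are all hypothetical names introduced for the example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; none of these names exist in JGroups.
class FlushRoundTracker {
    private final Set<String> members;
    private final Set<String> completed = new HashSet<>();
    private long currentRound = 0;   // 0 means no flush has been started yet
    private boolean failed = false;

    FlushRoundTracker(Set<String> members) {
        this.members = new HashSet<>(members);
    }

    /** Starts a new flush round, forgetting all state from the previous one. */
    long startFlush() {
        completed.clear();
        failed = false;
        return ++currentRound;
    }

    /** Records a FLUSH_COMPLETED reply; returns false if it was discarded as stale. */
    boolean onFlushCompleted(String sender, long round) {
        if (round != currentRound || !members.contains(sender)) {
            return false;            // reply belongs to an older flush (or unknown node)
        }
        completed.add(sender);
        return true;
    }

    /** Records a FLUSH_NOT_COMPLETED reply; returns false if it was discarded as stale. */
    boolean onFlushNotCompleted(String sender, long round) {
        if (round != currentRound || !members.contains(sender)) {
            return false;
        }
        failed = true;
        return true;
    }

    /** True only when every member has completed the current round and none rejected it. */
    boolean isComplete() {
        return !failed && completed.containsAll(members);
    }

    boolean hasFailed() {
        return failed;
    }
}
```

Replaying the outage against this sketch: the coordinator calls startFlush() twice (the first round times out while nodeB is in GC). When nodeB's delayed FLUSH_COMPLETED arrives tagged with the first round, onFlushCompleted returns false and the second round stays incomplete, so no reconcile is started. nodeB's later FLUSH_NOT_COMPLETED for the second round is then accepted and the coordinator sees a failure that matches what the joining node is told.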
> FLUSH_NOT_COMPLETED race condition
> ----------------------------------
>
> Key: JGRP-1895
> URL: https://issues.jboss.org/browse/JGRP-1895
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.6.16
> Reporter: Dennis Reed
> Assignee: Dennis Reed
>
> FLUSH_NOT_COMPLETED does not keep track of which FLUSH it was related to. FLUSH_COMPLETED only keeps track of whether the FLUSH was for the current view.
> If these responses are delayed (which can be caused by a long GC pause on the node sending them) until after a new FLUSH has been started, they can be interpreted as responses for the wrong FLUSH.
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)