[jboss-jira] [JBoss JIRA] (WFCORE-996) Race between commit and cancel messages if a domain rollout is cancelled

Brian Stansberry (JIRA) issues at jboss.org
Mon Sep 21 22:13:00 EDT 2015


Brian Stansberry created WFCORE-996:
---------------------------------------

             Summary: Race between commit and cancel messages if a domain rollout is cancelled
                 Key: WFCORE-996
                 URL: https://issues.jboss.org/browse/WFCORE-996
             Project: WildFly Core
          Issue Type: Bug
          Components: Domain Management
    Affects Versions: 2.0.0.CR3
            Reporter: Brian Stansberry


There are situations where the DomainRolloutStepHandler and DomainSlaveHandler may send a cancel message to the remote server or slave HC immediately after sending the tx commit/rollback message. This can happen in particular if a server is hanging after the commit messages are sent by DomainRolloutStepHandler and the user cancels while DomainRolloutStepHandler is waiting for the final response. When execution proceeds to DomainSlaveHandler, it will send the commit messages to the slave HCs, and then, because the op is cancelled, immediately send cancel messages.

The problem is that there's no guarantee of the order in which the recipient will process the messages. Both the commit/rollback message and the cancel message use the ModelControllerProtocol.COMPLETE_TX_REQUEST message, with cancel using the ModelControllerProtocol.PARAM_ROLLBACK param. The effect is that the remote process may process a COMPLETE_TX_REQUEST/PARAM_ROLLBACK representing the cancel *before* it processes a COMPLETE_TX_REQUEST/PARAM_COMMIT representing the tx commit. As a result, the tx will be rolled back on that process, even though the correct instruction is *commit*. The process will therefore be out of sync with the domain.
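
To make the ordering problem concrete, here is a minimal sketch of a recipient that applies COMPLETE_TX_REQUEST params in arrival order. It is not WildFly code: the class name, the byte values and the "first completion message wins" rule are assumptions made for illustration only.

```java
// Hypothetical model of the recipient side; not the real handler.
final class CompleteTxRaceSketch {
    // Placeholder values; the real constants are defined in ModelControllerProtocol.
    static final byte PARAM_COMMIT = 1;
    static final byte PARAM_ROLLBACK = 2;

    enum Outcome { PENDING, COMMITTED, ROLLED_BACK }

    // Applies the params in the order they arrive.
    // Assumption: the first message completes the tx; later ones cannot change the outcome.
    static Outcome process(byte... arrivalOrder) {
        Outcome outcome = Outcome.PENDING;
        for (byte param : arrivalOrder) {
            if (outcome != Outcome.PENDING) {
                continue; // tx already completed
            }
            outcome = (param == PARAM_COMMIT) ? Outcome.COMMITTED : Outcome.ROLLED_BACK;
        }
        return outcome;
    }
}
```

With this model, process(PARAM_COMMIT, PARAM_ROLLBACK) ends up committed, while process(PARAM_ROLLBACK, PARAM_COMMIT), i.e. the cancel overtaking the commit, ends up rolled back.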

A simple way to reduce the possibility of this is to make DomainSlaveHandler a bit more patient when waiting for the final response from slave HCs when it knows it needs to cancel. Even if we can do more for current version slaves (see below), we should still do this in order to reduce the risk for legacy slaves.
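
One way the "more patient" wait could look is sketched below: give the final response a grace period, and only send the cancel if it has not arrived by then. The class, the method name and the grace period parameter are hypothetical; how DomainSlaveHandler actually tracks the final response is not shown here.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; not part of DomainSlaveHandler.
final class PatientCancelSketch {
    // Returns true if a cancel message still needs to be sent after the grace period.
    static boolean shouldSendCancel(Future<?> finalResponse, long graceMillis) {
        try {
            finalResponse.get(graceMillis, TimeUnit.MILLISECONDS);
            return false; // final response arrived; nothing left to cancel
        } catch (TimeoutException e) {
            return true; // still waiting; fall back to cancelling
        } catch (ExecutionException e) {
            return false; // a response arrived, even though it was a failure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return true;
        }
    }
}
```

This only narrows the window. A slave that takes longer than the grace period can still see the cancel before the commit.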

Beyond that, we can *perhaps* change the messages to distinguish the cancel case from the rollback case. A whole new message type is one possibility, but then we'd need a new protocol version and would have to track the protocol version used by the slave and continue to send the old COMPLETE_TX_REQUEST message to legacy slaves. Simpler may be to introduce a new PARAM_CANCEL value for the message payload. Legacy slaves will still treat that as a rollback instruction, since by luck the TransactionalProtocolOperationHandler.CompleteTxOperationHandler treats any param value other than PARAM_COMMIT as meaning rollback. Current slaves though can use new handling for PARAM_CANCEL.
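
The backward-compatibility argument can be sketched as follows. The byte values are placeholders, PARAM_CANCEL does not exist yet, and both decode methods are simplified stand-ins for what CompleteTxOperationHandler does, not copies of it.

```java
// Hypothetical decoding of the COMPLETE_TX_REQUEST param; not the real handler.
final class ParamCancelCompatSketch {
    static final byte PARAM_COMMIT = 1;   // placeholder value
    static final byte PARAM_ROLLBACK = 2; // placeholder value
    static final byte PARAM_CANCEL = 3;   // proposed new value, hypothetical

    enum Action { COMMIT, ROLLBACK, CANCEL }

    // Legacy behavior as described above: anything other than PARAM_COMMIT means rollback.
    static Action legacyDecode(byte param) {
        return param == PARAM_COMMIT ? Action.COMMIT : Action.ROLLBACK;
    }

    // Possible behavior for current slaves: PARAM_CANCEL gets its own handling.
    static Action currentDecode(byte param) {
        if (param == PARAM_COMMIT) {
            return Action.COMMIT;
        }
        if (param == PARAM_CANCEL) {
            return Action.CANCEL;
        }
        return Action.ROLLBACK;
    }
}
```

A legacy slave receiving PARAM_CANCEL falls through to rollback, so no protocol version bump is needed, but it also remains exposed to the race.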

*Possible* handling for PARAM_CANCEL:

1) if ExecuteRequestContext.prepared == false --> cancel before prepare; go ahead and cancel
2) else if ERC.txCompleted == false --> set a new ERC.cancelled flag and wait for commit/rollback
3) else commit/rollback has already arrived; go ahead and cancel

The handling for commit/rollback would check ERC.cancelled after doing the commit/rollback and, if true, cancel the request.
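
The three cases and the follow-up check could be sketched roughly like this. The class is a simplified stand-in for ExecuteRequestContext: the method names, the requestCancelled and committed fields and the use of synchronized are assumptions, and cancelRequest() only records that the cancel happened.

```java
// Hypothetical stand-in for ExecuteRequestContext; not the real class.
final class ExecuteRequestContextSketch {
    boolean prepared;
    boolean txCompleted;
    boolean cancelled;        // the proposed new flag: cancel deferred until commit/rollback
    boolean requestCancelled; // illustration only: the cancel has actually been carried out
    Boolean committed;        // null until commit/rollback has been processed

    synchronized void onPrepared() {
        prepared = true;
    }

    // Handling for a PARAM_CANCEL message.
    synchronized void onCancel() {
        if (!prepared) {
            cancelRequest();  // 1) cancel before prepare
        } else if (!txCompleted) {
            cancelled = true; // 2) remember the cancel and wait for commit/rollback
        } else {
            cancelRequest();  // 3) commit/rollback has already arrived
        }
    }

    // Handling for a commit or rollback message.
    synchronized void onCompleteTx(boolean commit) {
        if (txCompleted) {
            return;
        }
        committed = commit;
        txCompleted = true;
        if (cancelled) {
            cancelRequest();  // a cancel overtook this message; apply it now
        }
    }

    private void cancelRequest() {
        requestCancelled = true;
    }
}
```

In the problematic ordering (prepared, then cancel, then commit) the tx is still committed, and the request is cancelled only afterwards.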

This possible approach needs careful thought before being introduced though.



--
This message was sent by Atlassian JIRA
(v6.4.11#64026)

