[infinispan-issues] [JBoss JIRA] (ISPN-4137) Transaction executed multiple times due to forwarded CommitCommand

Wed Mar 26 05:37:12 EDT 2014

    [ https://issues.jboss.org/browse/ISPN-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12956438#comment-12956438 ] 

Dan Berindei commented on ISPN-4137:
------------------------------------

{quote}
I don't see the point in committing indefinitely. There's no way the CommitCommand can be lost - JGroups should deliver it reliably, and we are not dropping delivered commands anywhere in Infinispan (or are we?). It can just take a while before it is delivered and responded to. The only consideration for any resending is sending to new nodes.
{quote}

The CommitCommand won't be lost, but that's not the problem. The problem is that once the originator times out waiting for a backup's response, it sends a RollbackCommand, which releases the lock on the primary owner, so it's no longer safe to run the CommitCommand on the backup. If we retried the CommitCommand instead, the locks on the primary would still be held at the time (one of) the CommitCommands is executed on the backup. Mircea suggested another solution here: change RollbackCommand to no longer release the locks, and require a separate TxCompletionNotificationCommand to release them. That way, the backup will either commit while the lock is still held, or roll back.
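
To make the ordering concrete, here is a minimal toy model of the two variants. It is only a sketch: the class and method names are made up for illustration and are not Infinispan's API, and it assumes the completion notification is sent only after the delayed CommitCommand has run on the backup.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these classes are hypothetical and are not Infinispan's API.
final class LockReleaseSketch {

    /** Lock table on the primary owner: key -> id of the transaction holding it. */
    static final class PrimaryOwner {
        private final Map<String, String> locks = new HashMap<>();

        /** Returns true if tx now holds (or already held) the lock on key. */
        boolean lock(String key, String tx) {
            String owner = locks.putIfAbsent(key, tx);
            return owner == null || owner.equals(tx);
        }

        /** Current behaviour: releaseLocks = true. Proposed: releaseLocks = false. */
        void rollback(String tx, boolean releaseLocks) {
            if (releaseLocks) {
                release(tx);
            }
        }

        /** In the proposal, only this separate notification releases the locks. */
        void txCompletionNotification(String tx) {
            release(tx);
        }

        boolean holds(String key, String tx) {
            return tx.equals(locks.get(key));
        }

        private void release(String tx) {
            locks.values().removeIf(tx::equals);
        }
    }

    /**
     * Replays the scenario: tx1 locks "k", its commit to the backup is delayed,
     * the originator times out and sends a rollback, then tx2 tries to lock "k".
     * Returns true if tx1 still owns the lock (and tx2 was kept out) at the
     * moment the delayed commit would run on the backup.
     */
    static boolean lateCommitIsProtected(boolean rollbackReleasesLocks) {
        PrimaryOwner primary = new PrimaryOwner();
        primary.lock("k", "tx1");

        // Originator timed out waiting for the backup and sends the rollback.
        primary.rollback("tx1", rollbackReleasesLocks);

        // A second transaction tries to lock the same key in the meantime.
        boolean tx2GotLock = primary.lock("k", "tx2");

        // The delayed commit of tx1 reaches the backup at this point.
        boolean tx1StillHoldsLock = primary.holds("k", "tx1");

        // Assumption: the notification is sent only after the backup is done.
        primary.txCompletionNotification("tx1");

        return tx1StillHoldsLock && !tx2GotLock;
    }

    public static void main(String[] args) {
        System.out.println("rollback releases locks -> protected: "
                + lateCommitIsProtected(true));
        System.out.println("rollback keeps locks    -> protected: "
                + lateCommitIsProtected(false));
    }
}
```

In the first variant tx2 acquires the lock before the delayed commit of tx1 runs, which is the overwrite described in this issue. In the second variant tx2 is kept out until the completion notification.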

{quote}
What's the contract, anyway? When commit() throws an exception, are there any guarantees that none of the operations were written? Is this described anywhere?
{quote}

I'm not sure if it's written anywhere, but if we throw an exception during commit then there are no guarantees as to what is written on which nodes. However, I think you were right to create this issue, as overwriting another transaction's data shouldn't be allowed.

{quote}
If there are no such guarantees, trying to finish the TX with commit even if an exception was reported on the originator is IMO better than sending a rollback (and hoping things will settle) or keeping the locks stale. If there are any such guarantees, we can't do anything, and we should rather keep the lock stale (blocking further txs) than break the contract. Thinking about it again, there can't be any guarantees because the commit may already have been executed - the contract would be broken.
{quote}

Trying to commit the tx without holding the proper locks isn't good either, as you risk breaking other transactions as well. We could try to eliminate the replication timeout for commit commands completely, but that would still leave a stale lock if there is a bug somewhere in the commit code. I think it would be better to allow a bigger timeout for commit commands, but keep a timeout nonetheless.
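
Roughly what I mean by a bigger timeout, as a sketch only: the enum, the method and the multiplier parameter below are hypothetical names and do not correspond to an existing Infinispan configuration option.

```java
// Sketch only: hypothetical names, not an existing Infinispan setting.
final class CommitTimeoutSketch {

    enum CommandKind { PREPARE, COMMIT, ROLLBACK }

    /**
     * Picks the replication timeout for a command. Commit commands get the
     * regular timeout multiplied by commitMultiplier; everything else keeps
     * the regular timeout. The timeout stays finite in all cases.
     */
    static long timeoutMillis(CommandKind kind, long replTimeoutMillis, int commitMultiplier) {
        if (replTimeoutMillis <= 0 || commitMultiplier < 1) {
            throw new IllegalArgumentException("timeout must be > 0 and multiplier >= 1");
        }
        return kind == CommandKind.COMMIT
                ? Math.multiplyExact(replTimeoutMillis, (long) commitMultiplier)
                : replTimeoutMillis;
    }

    public static void main(String[] args) {
        // Example values chosen for illustration only.
        System.out.println(timeoutMillis(CommandKind.PREPARE, 15_000L, 4));
        System.out.println(timeoutMillis(CommandKind.COMMIT, 15_000L, 4));
    }
}
```

Keeping the timeout finite means a bug in the commit path still surfaces as a timeout instead of a lock that is never released.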

{quote}
When the originator dies after prepare, the transaction keeps hanging anyway. Is it then reported in-doubt in recovery?
{quote}

Yes, {{RecoveryAwareTransactionTable.cleanupStaleTransactions()}} moves the tx from the tx table to the recovery cache. If recovery is not enabled, the tx is just rolled back.
Either way, there is no coordination between the owners at this point, so if there is a failure during commit, some of the owners may commit (without holding the key locks) and some may roll back. There is already a bug for this, but I can't find it at the moment.

> Transaction executed multiple times due to forwarded CommitCommand
> ------------------------------------------------------------------
>
>                 Key: ISPN-4137
>                 URL: https://issues.jboss.org/browse/ISPN-4137
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State Transfer, Transactions
>            Reporter: Radim Vansa
>            Assignee: Dan Berindei
>            Priority: Critical
>
> When the {{StateTransferInterceptor}} forwards a CommitCommand for the new topology, multiple CommitCommands may be broadcast across the cluster. If the command (forwarded already from originator) times out, the transaction may be correctly finished by the first one and the application considers TX as succeeded (useSynchronizations=true), although one more Rollback is sent as well.
> Then, again in STI, when the CommitCommand arrives with higher topologyId than the one used for the first TX execution, another artificial Prepare (followed by the commit) is executed - see {{STI.visitCommitCommand}}.
> However, this execution may be delayed a lot and originator may have already executed another TX on the same entries. Then, this forwarded Commit will overwrite the already updated entries, causing inconsistency of data.
