[JBoss JIRA] (ISPN-4137) Transaction executed multiple times due to forwarded CommitCommand

Wednesday, 26 March 2014

    [
https://issues.jboss.org/browse/ISPN-4137?page=com.atlassian.jira.plugin....
] 

Dan Berindei commented on ISPN-4137:
------------------------------------

{quote}
I don't see the point in committing indefinitely. There's no way the
CommitCommand can be lost - JGroups should deliver it reliably, and we are not dropping
delivered commands anywhere in Infinispan (or are we?). It can just take a while before it
is delivered and answered. The only consideration for any resending is sending to new
nodes.
{quote}

The CommitCommand won't be lost, but that's not the problem. The problem is that
once the originator times out waiting for a backup's response, it will send a
RollbackCommand which will release the lock on the primary owner, so it's no longer ok
to run the CommitCommand on the backup. If we retried the CommitCommand instead of
rolling back, the locks on the primary would still be held when (one of) the
CommitCommands is executed on the backup. Mircea suggested another solution here: change
RollbackCommand to no longer release the locks, and require a separate
TxCompletionNotificationCommand to release them. That way, the backup will either commit
while the lock is still held, or roll back.
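
To make the difference between the two rollback behaviours concrete, here is a minimal sketch. It is not Infinispan code: {{PrimaryLocks}}, {{rollback}}, {{txCompleted}} and the key/tx names are hypothetical stand-ins for the primary owner's lock table and the commands discussed above.

```java
import java.util.HashMap;
import java.util.Map;

class RollbackLockSketch {
    // Hypothetical stand-in for the primary owner's lock table: key -> owning tx.
    static final class PrimaryLocks {
        private final Map<String, String> owners = new HashMap<>();
        private final boolean rollbackReleasesLocks;

        PrimaryLocks(boolean rollbackReleasesLocks) {
            this.rollbackReleasesLocks = rollbackReleasesLocks;
        }

        // Succeeds if the key is free or already locked by the same tx.
        boolean tryLock(String key, String tx) {
            String owner = owners.putIfAbsent(key, tx);
            return owner == null || owner.equals(tx);
        }

        // Current behaviour releases the lock here; the proposed one does not.
        void rollback(String key, String tx) {
            if (rollbackReleasesLocks) {
                owners.remove(key, tx);
            }
        }

        // Stand-in for a separate completion notification that always releases.
        void txCompleted(String key, String tx) {
            owners.remove(key, tx);
        }
    }

    // Can tx2 lock the key after tx1's rollback, but before tx1's completion notification?
    static boolean secondTxCanLockAfterRollback(boolean rollbackReleasesLocks) {
        PrimaryLocks primary = new PrimaryLocks(rollbackReleasesLocks);
        primary.tryLock("k", "tx1");
        primary.rollback("k", "tx1");
        return primary.tryLock("k", "tx2");
    }

    public static void main(String[] args) {
        System.out.println("rollback releases locks: " + secondTxCanLockAfterRollback(true));
        System.out.println("rollback keeps locks:    " + secondTxCanLockAfterRollback(false));
    }
}
```

When rollback releases the lock, tx2 can acquire it while a delayed CommitCommand for tx1 may still be in flight to the backup. When only the completion notification releases it, tx2 has to wait.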

{quote}
What's the contract, anyway? When commit() throws an exception, are there any
guarantees that none of the operations were written? Is this described anywhere?
{quote}

I'm not sure if it's written anywhere, but if we throw an exception during commit
then there are no guarantees as to what is written on which nodes. However, I think you
were right to create this issue, as overwriting another transaction's data
shouldn't be allowed.

{quote}
If there are no such guarantees, trying to finish the TX with commit even if an exception
was reported on the originator is IMO better than sending a rollback (and hoping things
will settle) or keeping the locks stale. If there are any such guarantees, we can't do
anything, and we should rather keep the lock stale (blocking further txs) than break the
contract. Thinking about it again, there can't be any guarantees because the commit can
already have been executed - the contract would be broken.
{quote}

Trying to commit the tx without holding the proper locks isn't good either, as you
risk breaking other transactions as well. We could try to eliminate the replication
timeout for commit commands completely, but that would still leave a stale lock if there
is a bug somewhere in the commit code. I think it would be better to allow a bigger
timeout for commit commands, but keep a timeout nonetheless.
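
As a rough illustration of "a bigger timeout for commit commands, but a timeout nonetheless", here is a sketch. The {{Command}} enum, the {{commitFactor}} multiplier and both method names are assumptions made for the example, not existing Infinispan configuration or API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class CommitTimeoutSketch {
    enum Command { PREPARE, COMMIT, ROLLBACK }

    // Hypothetical policy: commits wait commitFactor times longer than other commands.
    static long timeoutMillis(Command command, long baseTimeoutMillis, int commitFactor) {
        return command == Command.COMMIT ? baseTimeoutMillis * commitFactor : baseTimeoutMillis;
    }

    // Returns true if any response (success or failure) arrived within the timeout.
    static boolean respondedInTime(CompletableFuture<?> response, Command command,
                                   long baseTimeoutMillis, int commitFactor) {
        try {
            response.get(timeoutMillis(command, baseTimeoutMillis, commitFactor),
                         TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // still bounded, so a bug in commit cannot hold the lock forever
        } catch (ExecutionException e) {
            return true;  // a failed response is still a response
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(timeoutMillis(Command.PREPARE, 15000, 4));
        System.out.println(timeoutMillis(Command.COMMIT, 15000, 4));
    }
}
```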

{quote}
When the originator dies after prepare, the transaction keeps hanging anyway. Is it then
reported in-doubt in recovery?
{quote}

Yes, {{RecoveryAwareTransactionTable.cleanupStaleTransactions()}} moves the tx from the tx
table to the recovery cache. If recovery is not enabled, the tx is just rolled back.
Either way, there is no coordination between the owners at this point, so if there is a
failure during commit, some of the owners may commit (without holding the key locks) and
some may roll back. There is already a bug for this, but I can't find it at the
moment.

...
 Transaction executed multiple times due to forwarded CommitCommand
 ------------------------------------------------------------------

                 Key: ISPN-4137
                 URL: https://issues.jboss.org/browse/ISPN-4137
             Project: Infinispan
          Issue Type: Bug
          Components: State Transfer, Transactions
            Reporter: Radim Vansa
            Assignee: Dan Berindei
            Priority: Critical

 When the {{StateTransferInterceptor}} (STI) forwards a CommitCommand for the new topology,
multiple CommitCommands may be broadcast across the cluster. If the command (already
forwarded from the originator) times out, the transaction may still be finished correctly
by the first CommitCommand and the application considers the TX successful
(useSynchronizations=true), although a Rollback is sent as well.
 Then, again in STI, when the CommitCommand arrives with a higher topologyId than the one
used for the first TX execution, another artificial Prepare (followed by the commit) is
executed - see {{STI.visitCommitCommand}}.
 However, this execution may be delayed a lot and the originator may have already executed
another TX on the same entries. The forwarded Commit will then overwrite the already
updated entries, causing data inconsistency.
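
 The ordering problem described above can be reduced to a minimal sketch. The {{Owner}} class and the tx/key names are hypothetical and only model "commit applies the modifications"; no Infinispan code is involved.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class ForwardedCommitSketch {
    // Hypothetical owner node: applying a commit simply writes the tx's modifications.
    static final class Owner {
        final Map<String, String> data = new LinkedHashMap<>();

        void commit(Map<String, String> modifications) {
            data.putAll(modifications);
        }
    }

    // Replays the sequence from the description and returns the final value of "k".
    static String replay() {
        Owner owner = new Owner();
        Map<String, String> tx1 = Collections.singletonMap("k", "v1");
        Map<String, String> tx2 = Collections.singletonMap("k", "v2");

        owner.commit(tx1); // first CommitCommand for tx1 finishes the transaction
        owner.commit(tx2); // originator runs another TX on the same entry
        owner.commit(tx1); // delayed, forwarded CommitCommand for tx1 is executed again
        return owner.data.get("k");
    }

    public static void main(String[] args) {
        System.out.println("final value of k: " + replay());
    }
}
```

 The entry ends up with tx1's value even though tx2 committed later, which is the inconsistency reported here.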
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
