[infinispan-issues] [JBoss JIRA] (ISPN-4137) Transaction executed multiple times due to forwarded CommitCommand

Wed Mar 26 08:51:13 EDT 2014

    [ https://issues.jboss.org/browse/ISPN-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12956528#comment-12956528 ] 

Dan Berindei commented on ISPN-4137:
------------------------------------

{quote}
I agree that the Rollback should not be sent after timeout in commit. But I don't understand what would we gain if the lock release was in TxCompletion - the Rollback might then release the few resources. For further writes, the lock is held all the time - no difference - and reads without force write lock can return current or future value anyway (even in correct situations the transaction commit is not atomic with respect to reads). It could be even worse - if you allowed to rollback after commit, you could read uncommitted value for a while.
Maybe an example sequence of commands would help me understand.
{quote}

What we gain is that we can release the locks with the TxCompletionNotificationCommand, knowing that the backup won't try to write anything after the primary has released the lock. The rollback command won't release any locks, but it will mark the transaction as completed, and that will prevent the commit command from writing anything.
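To make the "mark as completed" idea concrete, here is a minimal sketch of a backup that refuses to apply a commit once the transaction has been marked completed. It is not Infinispan code: the class {{CompletedTxSketch}}, its methods and the string transaction ids are hypothetical, and having a successful commit mark the transaction as completed too (so a duplicate commit is ignored) is an assumption of the sketch.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch, not Infinispan code: a backup that ignores commits for completed transactions. */
final class CompletedTxSketch {
    // Ids of transactions that were rolled back or already committed on this node.
    private final Set<String> completed = ConcurrentHashMap.newKeySet();
    private final Map<String, String> data = new ConcurrentHashMap<>();

    /** Rollback releases no locks here; it only marks the transaction as completed. */
    void rollback(String txId) {
        completed.add(txId);
    }

    /**
     * Applies the write only if the transaction was not completed before.
     * Assumption of this sketch: a commit also marks the transaction as completed,
     * so a late or forwarded duplicate commit becomes a no-op.
     *
     * @return true if the write was applied, false if the commit was ignored
     */
    boolean commit(String txId, String key, String value) {
        if (!completed.add(txId)) {
            return false; // already rolled back or already committed
        }
        data.put(key, value);
        return true;
    }

    String get(String key) {
        return data.get(key);
    }
}
```

With this shape, a commit that arrives after {{rollback("tx1")}} returns false and leaves the data untouched, and a second commit for the same transaction id is ignored as well.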

{quote}
I am not suggesting that - if Prepare was successful, once we sent the Commit, we have to kick all owners to commit the transaction, eventually. Only in the case that all nodes are found prepared and originator is dead, we may rollback the commit.
If primary keeps alive, it should synchronize further transactions by holding the lock (and releasing with commit) - backups don't matter, they will either commit the new value as well (the lock is held on primary), or the one change won't be spotted. If the primary dies, new primary should acquire the lock and not release it until it receives the commit command, synchronizing the writes again.
{quote}

The primary has to receive a TxCompletionNotificationCommand in order to release the locks. Who do you suggest should send the TxCompletionNotificationCommand, if not the originator, after it has received the responses from all the owners?
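As an illustration of the originator's role, the following sketch tracks which owners have answered the commit and reports the moment when the completion notification could be sent. The class {{CompletionTrackerSketch}} and the string owner names are hypothetical and only model the bookkeeping, not the actual RPC layer.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: originator-side bookkeeping of commit responses. */
final class CompletionTrackerSketch {
    // Owners that have not yet responded to the commit.
    private final Set<String> pendingOwners;

    CompletionTrackerSketch(Set<String> owners) {
        this.pendingOwners = new HashSet<>(owners);
    }

    /**
     * Records a commit response from one owner.
     *
     * @return true exactly once, when the last pending owner has responded;
     *         that is the point at which the originator would send the
     *         completion notification so the primary can release its locks
     */
    synchronized boolean onCommitResponse(String owner) {
        boolean wasPending = pendingOwners.remove(owner);
        return wasPending && pendingOwners.isEmpty();
    }
}
```

Duplicate responses from the same owner return false, so the notification is triggered at most once per transaction in this model.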

Off-topic, if the primary dies, the new primary can't acquire any real locks - there may be more than one prepared tx writing to the same key.

{quote}
There should be no reason for a prepared transaction to fail. Network failures should be fixed by JGroups, eventually. Timeouts should release application threads, potentially some worker threads to do another work, but these should not change the logical output of the operation.
{quote}

That's an interesting approach, but I don't see any way of implementing that right now - when we get a timeout from an RPC, we can't spawn another thread to wait for the "real" response. So we have to either wait forever, or set a timeout and somehow release the locks when a timeout occurs without causing more inconsistencies than necessary.

There's also a problem with reporting success before the transaction committed on all the owners. A subsequent get(k) on the same thread may return the value from the node that didn't commit put(k, v) yet, so the user would see an inconsistency.

{quote}
Bugs in code may happen, but user should find out that there's something wrong (locking repeatedly fails due to stale lock) - that's where the human intervention through recovery could be useful. For handling bugs in production.
{quote}

The user is already notified if the commit fails on one of the nodes - they get a heuristic exception from {{commit()}} (unless using synchronizations, but that's a separate issue).

Keeping the locks may be better for some (or most) of the users. But considering how long it may take an administrator to notice the in-doubt transaction, I don't see it as clearly better than the option of releasing the locks and re-acquiring them when the administrator force-commits the tx.

{quote}
So this is not correct either - some nodes may have already committed it. We can't rollback unless we're sure that nobody committed, that should be a firm rule when designing the system.
{quote}

Yeah, it's not correct, that's why we have a bug for it - ISPN-3421.

But I can't agree with you on that rule: the constant {{XAException.XA_HEURMIX}} exists exactly for this kind of situation.
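For reference, a small sketch of how a mixed outcome could be reported through the standard XA error codes. Only {{XAException}} and its constants come from {{javax.transaction.xa}}; the class {{HeuristicOutcomeSketch}}, the method {{checkOutcome}} and the idea of counting per-owner outcomes are hypothetical and not how Infinispan necessarily decides this.

```java
import javax.transaction.xa.XAException;

/** Hypothetical sketch: mapping per-owner commit outcomes to XA heuristic codes. */
final class HeuristicOutcomeSketch {

    /**
     * @param committed  number of owners that applied the commit
     * @param rolledBack number of owners that rolled the transaction back
     * @throws XAException XA_HEURMIX if some owners committed and some rolled back,
     *                     XA_HEURRB if every owner that answered rolled back
     */
    static void checkOutcome(int committed, int rolledBack) throws XAException {
        if (committed > 0 && rolledBack > 0) {
            // Part of the transaction was committed and part was rolled back.
            throw new XAException(XAException.XA_HEURMIX);
        }
        if (committed == 0 && rolledBack > 0) {
            // Nothing was committed although a commit was requested.
            throw new XAException(XAException.XA_HEURRB);
        }
        // All owners committed: nothing to report.
    }
}
```

A caller can inspect {{e.errorCode}} on the caught {{XAException}} to tell the mixed outcome apart from a plain heuristic rollback.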

> Transaction executed multiple times due to forwarded CommitCommand
> ------------------------------------------------------------------
>
>                 Key: ISPN-4137
>                 URL: https://issues.jboss.org/browse/ISPN-4137
>             Project: Infinispan
>          Issue Type: Bug
>          Components: State Transfer, Transactions
>            Reporter: Radim Vansa
>            Assignee: Dan Berindei
>            Priority: Critical
>
> When the {{StateTransferInterceptor}} forwards a CommitCommand for the new topology, multiple CommitCommands may be broadcast across the cluster. If the command (forwarded already from originator) times out, the transaction may be correctly finished by the first one and the application considers TX as succeeded (useSynchronizations=true), although one more Rollback is sent as well.
> Then, again in STI, when the CommitCommand arrives with higher topologyId than the one used for the first TX execution, another artificial Prepare (followed by the commit) is executed - see {{STI.visitCommitCommand}}.
> However, this execution may be delayed a lot and originator may have already executed another TX on the same entries. Then, this forwarded Commit will overwrite the already updated entries, causing inconsistency of data.
