[
https://issues.jboss.org/browse/ISPN-4137?page=com.atlassian.jira.plugin....
]
Radim Vansa commented on ISPN-4137:
-----------------------------------
{quote}
The CommitCommand won't be lost, but that's not the problem. The problem is that
once the originator times out waiting for a backup's response, it will send a
RollbackCommand which will release the lock on the primary owner, so it's no longer ok
to run the CommitCommand on the backup. If we retried the CommitCommand, that would mean
the locks on the primary would still be valid at the time (one of) the CommitCommands is
executed on the backup. Mircea suggested another solution here: change RollbackCommand to
no longer release the locks, and to require a separate TxCompletionNotificationCommand.
That way, the backup will either commit while holding the lock, or roll back.
{quote}
I agree that the Rollback should not be sent after a timeout in commit. But I don't
understand what we would gain if the lock release were moved to TxCompletion - the Rollback
might then just release a few resources. For further writes, the lock is held the whole
time - no difference - and reads without the force write lock can return the current or the
future value anyway (even in correct situations the transaction commit is not atomic with
respect to reads).
It could be even worse - if you allowed a rollback after a commit, you could read an
uncommitted value for a while.
Maybe an example sequence of commands would help me understand.
{quote}
Trying to commit the tx without holding the proper locks isn't good either, as you
risk breaking other transactions as well. We could try to eliminate the replication
timeout for commit commands completely, but that would still leave a stale lock if there
is a bug somewhere in the commit code. I think it would be better to allow a bigger
timeout for commit commands, but keep a timeout nonetheless.
{quote}
I am not suggesting that - if Prepare was successful, then once we have sent the Commit, we
have to push all owners to commit the transaction, eventually. Only if all nodes are found
prepared and the originator is dead may we roll back the transaction.
If the primary stays alive, it should synchronize further transactions by holding the lock
(and releasing it on commit) - backups don't matter, they will either commit the new
value as well (the lock is held on the primary), or the one change won't be spotted. If
the primary dies, the new primary should acquire the lock and not release it until it
receives the commit command, synchronizing the writes again.
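To make the locking argument above concrete, here is a minimal sketch of a primary owner that holds a key lock from prepare until commit, and of a new primary re-acquiring locks for prepared transactions after a takeover. All names ({{PrimaryLockSketch}}, {{tryLock}}, {{onCommit}}, {{takeOver}}) are hypothetical and are not Infinispan APIs; this is a toy model, not the actual lock manager.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of the lock table on a primary owner.
class PrimaryLockSketch {
    // key -> id of the transaction currently holding the lock
    private final Map<String, String> lockOwners = new HashMap<>();

    // Returns true if txId holds the lock on key after this call.
    boolean tryLock(String key, String txId) {
        String owner = lockOwners.putIfAbsent(key, txId);
        return owner == null || owner.equals(txId);
    }

    // The lock is released only when the commit for the holding tx arrives.
    void onCommit(String key, String txId) {
        lockOwners.remove(key, txId);
    }

    // Assumption: a new primary learns which transactions are prepared
    // (key -> txId) and re-acquires their locks before serving new writes.
    static PrimaryLockSketch takeOver(Map<String, String> preparedKeyToTx) {
        PrimaryLockSketch newPrimary = new PrimaryLockSketch();
        preparedKeyToTx.forEach(newPrimary::tryLock);
        return newPrimary;
    }

    public static void main(String[] args) {
        PrimaryLockSketch primary = takeOver(Map.of("k", "tx1"));
        System.out.println(primary.tryLock("k", "tx2")); // false: tx1 is still prepared
        primary.onCommit("k", "tx1");
        System.out.println(primary.tryLock("k", "tx2")); // true: tx1 committed
    }
}
```

In this model a later transaction on the same key cannot proceed until the earlier commit has been applied on the primary, which is the synchronization the paragraph above relies on.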
There should be no reason for a prepared transaction to fail. Network failures should be
fixed by JGroups, eventually. Timeouts should release application threads, and potentially
some worker threads to do other work, but they should not change the logical outcome of
the operation.
Bugs in code may happen, but the user should find out that something is wrong
(locking repeatedly fails due to a stale lock) - that's where human intervention
through recovery could be useful: for handling bugs in production.
{quote}
If recovery is not enabled, the tx is just rolled back.
{quote}
So this is not correct either - some nodes may have already committed it. We can't
roll back unless we're sure that nobody committed; that should be a firm rule when
designing the system.
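A minimal sketch of that rule, combined with the "all nodes prepared and originator dead" condition from earlier in this comment. The class, the enums and the {{UNREACHABLE}} state are hypothetical names introduced only for illustration; they are not part of Infinispan.

```java
import java.util.List;

// Hypothetical decision rule for a transaction whose Prepare succeeded.
class PreparedTxOutcome {
    enum OwnerState { PREPARED, COMMITTED, UNREACHABLE }
    enum Decision { COMMIT, ROLLBACK, KEEP_WAITING }

    static Decision decide(List<OwnerState> owners, boolean originatorAlive) {
        // If anybody committed, everybody has to commit eventually.
        if (owners.contains(OwnerState.COMMITTED)) {
            return Decision.COMMIT;
        }
        boolean allPrepared = !owners.isEmpty()
                && owners.stream().allMatch(s -> s == OwnerState.PREPARED);
        // Rollback is allowed only when we are sure nobody committed
        // and the originator can no longer drive the commit.
        if (allPrepared && !originatorAlive) {
            return Decision.ROLLBACK;
        }
        // An unreachable owner might have committed, and a live originator
        // may still send the Commit: keep the locks and wait.
        return Decision.KEEP_WAITING;
    }

    public static void main(String[] args) {
        System.out.println(decide(
                List.of(OwnerState.PREPARED, OwnerState.COMMITTED), false)); // COMMIT
        System.out.println(decide(
                List.of(OwnerState.PREPARED, OwnerState.PREPARED), false));  // ROLLBACK
        System.out.println(decide(
                List.of(OwnerState.PREPARED, OwnerState.UNREACHABLE), false)); // KEEP_WAITING
    }
}
```

A timeout never appears as an input here: in this sketch it can release threads, but it cannot by itself turn the outcome into a rollback.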
Transaction executed multiple times due to forwarded CommitCommand
------------------------------------------------------------------
Key: ISPN-4137
URL: https://issues.jboss.org/browse/ISPN-4137
Project: Infinispan
Issue Type: Bug
Components: State Transfer, Transactions
Reporter: Radim Vansa
Assignee: Dan Berindei
Priority: Critical
When the {{StateTransferInterceptor}} forwards a CommitCommand for the new topology,
multiple CommitCommands may be broadcast across the cluster. If the command (already
forwarded from the originator) times out, the transaction may be correctly finished by the
first one and the application considers the TX successful (useSynchronizations=true),
although one more Rollback is sent as well.
Then, again in STI, when the CommitCommand arrives with a higher topologyId than the one
used for the first TX execution, another artificial Prepare (followed by the commit) is
executed - see {{STI.visitCommitCommand}}.
However, this execution may be delayed a lot and the originator may have already executed
another TX on the same entries. Then this forwarded Commit will overwrite the already
updated entries, causing data inconsistency.
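The sequence described above can be reproduced in a toy model. This is only an illustration of the ordering problem, not Infinispan code: {{ForwardedCommitReplay}} and {{Backup}} are hypothetical names, and "commit" here simply stands for the artificial Prepare plus Commit applied on an owner.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of one owner receiving a delayed, forwarded commit.
class ForwardedCommitReplay {
    static final class Backup {
        final Map<String, String> data = new HashMap<>();

        // Stands for (artificial) Prepare followed by Commit: apply the writes.
        void commit(Map<String, String> writes) {
            data.putAll(writes);
        }
    }

    static String replay() {
        Backup backup = new Backup();
        Map<String, String> tx1 = Map.of("k", "v1");
        Map<String, String> tx2 = Map.of("k", "v2");

        backup.commit(tx1); // first CommitCommand finishes TX1
        backup.commit(tx2); // originator runs another TX on the same entry
        backup.commit(tx1); // delayed forwarded CommitCommand re-executes TX1
        return backup.data.get("k");
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints "v1": the newer value "v2" is lost
    }
}
```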