[
https://issues.jboss.org/browse/ISPN-4137?page=com.atlassian.jira.plugin....
]
Dan Berindei commented on ISPN-4137:
------------------------------------
{quote}
I agree that the Rollback should not be sent after a timeout in commit. But I don't
understand what we would gain if the lock release was in TxCompletion - the Rollback might
then release the few resources. For further writes, the lock is held all the time - no
difference - and reads without force write lock can return current or future value anyway
(even in correct situations the transaction commit is not atomic with respect to reads).
It could be even worse - if you allowed a rollback after commit, you could read an
uncommitted value for a while.
Maybe an example sequence of commands would help me understand.
{quote}
What we gain is that we can release the locks with the TxCompletionNotificationCommand,
knowing that the backup won't try to write anything after the primary released the
lock. The rollback command won't release any locks, but it will mark the transaction
as completed and it will prevent the commit command from writing anything.
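To make the idea concrete, here is a minimal sketch of a rollback that releases no locks but records the transaction as completed, so that a late commit writes nothing. This is not Infinispan code: the class {{CompletedTxSketch}}, its methods and the string transaction ids are all hypothetical. It also assumes that a commit marks the transaction as completed, so a duplicate commit is ignored as well.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not Infinispan code: all names are invented for illustration.
final class CompletedTxSketch {
    // Ids of transactions that were already rolled back or committed on this node.
    private final Set<String> completedTx = ConcurrentHashMap.newKeySet();
    private final Map<String, String> data = new ConcurrentHashMap<>();

    /** Rollback releases no locks here; it only records the tx as completed. */
    void rollback(String txId) {
        completedTx.add(txId);
    }

    /**
     * Applies the write unless the tx was already completed.
     * Set.add is atomic, so either the rollback or the commit wins, never both.
     */
    boolean commit(String txId, String key, String value) {
        if (!completedTx.add(txId)) {
            return false; // late or duplicate commit: write nothing
        }
        data.put(key, value);
        return true;
    }

    String get(String key) {
        return data.get(key);
    }

    public static void main(String[] args) {
        CompletedTxSketch node = new CompletedTxSketch();
        node.rollback("tx1");
        // The commit for tx1 arrives after the rollback and is ignored.
        System.out.println(node.commit("tx1", "k", "v"));
        System.out.println(node.get("k"));
    }
}
```

In this sketch the lock release itself is left out on purpose; it would happen separately, when the completion notification arrives.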
{quote}
I am not suggesting that - if Prepare was successful, once we have sent the Commit, we have to
kick all owners to commit the transaction, eventually. Only in the case that all nodes are
found prepared and the originator is dead, we may roll back the commit.
If the primary stays alive, it should synchronize further transactions by holding the lock
(and releasing it with the commit) - backups don't matter, they will either commit the new
value as well (the lock is held on the primary), or the one change won't be spotted. If
the primary dies, the new primary should acquire the lock and not release it until it receives
the commit command, synchronizing the writes again.
{quote}
The primary has to receive a TxCompletionNotificationCommand in order to release the locks.
Who do you suggest should send the TxCompletionNotificationCommand, if not the originator,
after it received the response from all the owners?
Off-topic, if the primary dies, the new primary can't acquire any real locks - there
may be more than one prepared tx writing to the same key.
{quote}
There should be no reason for a prepared transaction to fail. Network failures should be
fixed by JGroups, eventually. Timeouts should release application threads, and potentially
some worker threads to do other work, but these should not change the logical outcome of
the operation.
{quote}
That's an interesting approach, but I don't see any way of implementing that right
now - when we get a timeout from an RPC, we can't spawn another thread to wait for the
"real" response. So we have to either wait forever, or set a timeout and somehow
release the locks when a timeout occurs without causing more inconsistencies than
necessary.
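As a rough illustration of the second option (set a timeout, then clean up), the sketch below waits for a commit response for a bounded time and runs a release callback if the wait times out. It is not how Infinispan's RPC layer works: {{RpcTimeoutSketch}}, {{awaitCommit}} and the {{releaseLocks}} callback are hypothetical names, and the response is modelled as a plain {{CompletableFuture}}.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not Infinispan code: all names are invented for illustration.
final class RpcTimeoutSketch {
    enum Outcome { COMMITTED, TIMED_OUT }

    /**
     * Waits up to timeoutMillis for the commit response.
     * On timeout the caller stops waiting and releases its locks; the remote
     * node may still apply the commit later, which is the source of the
     * inconsistencies discussed above.
     */
    static Outcome awaitCommit(CompletableFuture<Void> response,
                               long timeoutMillis,
                               Runnable releaseLocks) {
        try {
            response.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.COMMITTED;
        } catch (TimeoutException e) {
            releaseLocks.run();
            return Outcome.TIMED_OUT;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("remote commit failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // A future that never completes stands in for a lost response.
        CompletableFuture<Void> lost = new CompletableFuture<>();
        Outcome outcome = awaitCommit(lost, 50, () -> System.out.println("releasing locks"));
        System.out.println(outcome);
    }
}
```

In the normal (non-timeout) path the sketch releases nothing, on the assumption that the locks are released later by the completion notification.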
There's also a problem with reporting success before the transaction committed on all
the owners. A subsequent get(k) on the same thread may return the value from the node that
didn't commit put(k, v) yet, so the user would see an inconsistency.
{quote}
Bugs in code may happen, but the user should find out that there's something wrong
(locking repeatedly fails due to a stale lock) - that's where human intervention
through recovery could be useful. For handling bugs in production.
{quote}
The user is already notified if the commit fails on one of the nodes - they get a
heuristic exception from {{commit()}} (unless using synchronization, but that's a
separate issue).
Keeping the locks may be better for some (or most) of the users. But considering how long
it may take an administrator to notice the in-doubt transaction, I don't see it as
clearly better than the option of releasing the locks and re-acquiring them when the
administrator force-commits the tx.
{quote}
So this is not correct either - some nodes may have already committed it. We can't
roll back unless we're sure that nobody committed; that should be a firm rule when
designing the system.
{quote}
Yeah, it's not correct, that's why we have a bug for it - ISPN-3421.
But I can't agree with you on that rule: the constant {{XAException.XA_HEURMIX}} is
exactly for this kind of situation.
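For illustration only, the sketch below maps per-owner commit results to an XA outcome code. The constants {{XAException.XA_HEURMIX}}, {{XAException.XA_HEURRB}} and {{XAResource.XA_OK}} are the standard ones from {{javax.transaction.xa}}; the helper {{HeuristicOutcomeSketch.outcomeCode}} and the mapping it applies are assumptions, not what Infinispan actually does.

```java
import java.util.List;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;

// Hypothetical helper, not part of Infinispan: the mapping below is an assumption.
final class HeuristicOutcomeSketch {
    /**
     * @param committedPerOwner one flag per owner: true if that owner applied
     *                          the commit, false if it rolled back instead
     * @return XA_OK if nobody rolled back, otherwise a heuristic XA error code
     */
    static int outcomeCode(List<Boolean> committedPerOwner) {
        boolean anyCommitted = committedPerOwner.contains(Boolean.TRUE);
        boolean anyRolledBack = committedPerOwner.contains(Boolean.FALSE);
        if (anyCommitted && anyRolledBack) {
            return XAException.XA_HEURMIX; // some owners committed, some rolled back
        }
        if (anyRolledBack) {
            return XAException.XA_HEURRB; // every owner rolled back
        }
        return XAResource.XA_OK;
    }

    public static void main(String[] args) {
        // One owner committed and one rolled back: a mixed heuristic outcome.
        int code = outcomeCode(List.of(true, false));
        System.out.println(code == XAException.XA_HEURMIX);
    }
}
```

A resource manager would typically report such a code by throwing {{new XAException(code)}} from {{commit()}}, which the transaction manager can surface to the application as a heuristic exception.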
Transaction executed multiple times due to forwarded CommitCommand
------------------------------------------------------------------
Key: ISPN-4137
URL: https://issues.jboss.org/browse/ISPN-4137
Project: Infinispan
Issue Type: Bug
Components: State Transfer, Transactions
Reporter: Radim Vansa
Assignee: Dan Berindei
Priority: Critical
When the {{StateTransferInterceptor}} forwards a CommitCommand for the new topology,
multiple CommitCommands may be broadcast across the cluster. If the command (already
forwarded from the originator) times out, the transaction may be correctly finished by the first
one and the application considers the TX successful (useSynchronizations=true), although one
more Rollback is sent as well.
Then, again in STI, when the CommitCommand arrives with a higher topologyId than the one
used for the first TX execution, another artificial Prepare (followed by the commit) is
executed - see {{STI.visitCommitCommand}}.
However, this execution may be delayed a lot and the originator may have already executed
another TX on the same entries. Then, this forwarded Commit will overwrite the already
updated entries, causing data inconsistency.