[
https://issues.jboss.org/browse/ISPN-4137?page=com.atlassian.jira.plugin....
]
Dan Berindei commented on ISPN-4137:
------------------------------------
While trying to write a test for the issue, I realized that it isn't actually related to
state transfer. The only link with state transfer is that a commit command may be more
likely to time out if it's waiting for a node to install a new topology, or if it is
forwarded to a new node which is itself waiting for a new topology.
Let's say we have a transaction with a put(k, v) command, the originator is A, and the
key owners are B (primary) and C (backup).
Let's also assume the local commit on either node can't fail; the only possible
failure is a replication timeout.
If the commit command sent from A to C times out, A will send a rollback command to B and
C, and there are two cases:
1. C applies the commit before receiving the rollback command, and writes {{k=v}} in the
cache without B holding the lock on {{k}} - which allows it to overwrite another
transaction's write.
2. C receives the rollback command and skips the commit command, leaving {{k=v}} on B and
{{k=null}} on C.
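To make the two cases concrete, here is a minimal, self-contained sketch of the race. It is a toy model, not Infinispan code: the {{Node}} class, its {{prepared}} set and the {{run}} method are hypothetical names invented for illustration, and the model assumes a commit is applied only while the transaction is still prepared on that node.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class CommitRollbackRace {
    // Toy stand-in for a cache node (hypothetical, not an Infinispan class).
    static final class Node {
        final Map<String, String> data = new HashMap<>();
        final Set<String> prepared = new HashSet<>();

        void prepare(String tx) { prepared.add(tx); }

        // Assumption: the write is applied only if the tx is still prepared here.
        void commit(String tx, String key, String value) {
            if (prepared.remove(tx)) {
                data.put(key, value);
            }
        }

        // Rollback forgets the tx, so a later commit for it is skipped.
        void rollback(String tx) { prepared.remove(tx); }
    }

    // A is the originator; B (primary) and C (backup) own k.
    static String run(boolean commitReachesCFirst) {
        Node b = new Node();
        Node c = new Node();
        b.prepare("tx1");
        c.prepare("tx1");

        b.commit("tx1", "k", "v");   // commit on B succeeds
        // The commit sent to C times out on A, so A sends a rollback to B and C.
        b.rollback("tx1");           // no effect on B: tx1 is already committed there

        if (commitReachesCFirst) {   // case 1: C applies the late commit first
            c.commit("tx1", "k", "v");
            c.rollback("tx1");
        } else {                     // case 2: C sees the rollback first
            c.rollback("tx1");
            c.commit("tx1", "k", "v");
        }
        return "B: k=" + b.data.get("k") + ", C: k=" + c.data.get("k");
    }

    public static void main(String[] args) {
        System.out.println("case 1 -> " + run(true));
        System.out.println("case 2 -> " + run(false));
    }
}
```

In this model case 1 ends with {{k=v}} on both owners, but C wrote it at a point where nothing protected {{k}}; case 2 ends with the owners disagreeing.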
The only way out of this is to not send the rollback command at all, and instead use
recovery to force the commit or rollback on A, while blocking any transactions that want
to write to {{k}} in the meantime. When recovery is enabled, this is what my fix does, but
I'm not sure if holding the lock on {{k}} for in-doubt transactions is ok.
[~mircea.markus], WDYT?
There is a slightly different problem that my PR does fix: if the commit succeeds on both
B and C, but A sees a topology change, it will re-send the commit command to both B and C.
Without the change, B and C will each replay both the prepare and the commit, allowing for
inconsistencies. But with the change, the transaction is seen as already completed and B
and C do nothing.
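The effect of treating the transaction as already completed can be sketched as follows. This is only an illustration under assumptions: the {{Node}} class, the {{completed}} set and the {{trackCompleted}} flag are hypothetical and do not reflect how the PR actually records completed transactions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ResentCommitSketch {
    // Toy stand-in for an owner node (hypothetical, not an Infinispan class).
    static final class Node {
        final Map<String, String> data = new HashMap<>();
        final Set<String> completed = new HashSet<>();
        final boolean trackCompleted;

        Node(boolean trackCompleted) { this.trackCompleted = trackCompleted; }

        void commit(String tx, String key, String value) {
            // With tracking on, a commit for an already completed tx is ignored.
            if (trackCompleted && !completed.add(tx)) {
                return;
            }
            data.put(key, value);
        }
    }

    static String finalValue(boolean trackCompleted) {
        Node b = new Node(trackCompleted);
        b.commit("tx1", "k", "v1");  // original commit of tx1
        b.commit("tx2", "k", "v2");  // a later transaction updates the same key
        b.commit("tx1", "k", "v1");  // A re-sends tx1's commit after a topology change
        return b.data.get("k");
    }

    public static void main(String[] args) {
        System.out.println("without tracking: k=" + finalValue(false));
        System.out.println("with tracking:    k=" + finalValue(true));
    }
}
```

Without tracking, the re-sent commit of tx1 overwrites the newer value written by tx2; with tracking, the duplicate is dropped and the newer value survives.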
Transaction executed multiple times due to forwarded CommitCommand
------------------------------------------------------------------
Key: ISPN-4137
URL: https://issues.jboss.org/browse/ISPN-4137
Project: Infinispan
Issue Type: Bug
Components: State Transfer, Transactions
Reporter: Radim Vansa
Assignee: Dan Berindei
Priority: Critical
When the {{StateTransferInterceptor}} forwards a CommitCommand for the new topology,
multiple CommitCommands may be broadcast across the cluster. If the command (forwarded
already from originator) times out, the transaction may be correctly finished by the first
one and the application considers TX as succeeded (useSynchronizations=true), although one
more Rollback is sent as well.
Then, again in STI, when the CommitCommand arrives with higher topologyId than the one
used for the first TX execution, another artificial Prepare (followed by the commit) is
executed - see {{STI.visitCommitCommand}}.
However, this execution may be delayed a lot and the originator may have already executed
another TX on the same entries. Then, this forwarded Commit will overwrite the already
updated entries, causing data inconsistency.