[
https://issues.jboss.org/browse/ISPN-3421?page=com.atlassian.jira.plugin....
]
Dan Berindei commented on ISPN-3421:
------------------------------------
I think the problem is that A leaves the cluster before it has managed to send the
CommitCommand to all the replicas. E.g. B receives the commit, but C doesn't. When C
sees a new view without A in it, it deletes all transactions that originated on A in
{{TransactionTable.cleanupStaleTransactions()}} (at least if recovery is not enabled).
[~an1310], did I get it right?
I think fixing this would require changing the CH update that happens after a leave to be
done in two phases, just like rebalancing. In the first phase, all the members would
install the new CH and would stop accepting commands from the leaver. In the second phase,
each member would delete the pending transactions from the leaver, but first it would
check with the other owners whether any of them received a commit command. If any owner
received the commit, the transaction should be committed; otherwise it should be removed.
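A minimal sketch of the second-phase decision rule described above, in plain Java. It is not Infinispan code: the class {{LeaverTxResolver}}, the {{Outcome}} enum, the node names and the transaction ids are all hypothetical, and the map of "commits seen" stands in for whatever the owners would actually report to each other.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Infinispan: decides the fate of a
// transaction that originated on a node that has left the cluster.
final class LeaverTxResolver {
    enum Outcome { COMMIT, ROLLBACK }

    /**
     * @param txId               id of the pending transaction from the leaver
     * @param commitsSeenByOwner for each surviving owner, the ids of the
     *                           leaver's transactions it received a commit for
     */
    static Outcome resolve(String txId, Map<String, Set<String>> commitsSeenByOwner) {
        for (Set<String> seen : commitsSeenByOwner.values()) {
            if (seen.contains(txId)) {
                // At least one owner received the commit, so every owner commits.
                return Outcome.COMMIT;
            }
        }
        // No owner received the commit, so the transaction can be removed.
        return Outcome.ROLLBACK;
    }

    public static void main(String[] args) {
        // B received the commit for tx1, C received nothing.
        Map<String, Set<String>> seen =
                Map.of("B", Set.of("tx1"), "C", Set.<String>of());
        System.out.println(resolve("tx1", seen)); // COMMIT
        System.out.println(resolve("tx2", seen)); // ROLLBACK
    }
}
```

The rule only works if the first phase has already stopped members from accepting commands from the leaver; otherwise a late commit could arrive after an owner has answered "not received".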
Enabling recovery may help a bit, since C won't delete transactions that originated on A
outright. On the other hand, it doesn't keep backup locks for them: new
transactions can update the same data while the originator is down, so we would still have
inconsistencies.
State Transfer can leave keys on < numOwners nodes for transactional caches.
----------------------------------------------------------------------------
Key: ISPN-3421
URL: https://issues.jboss.org/browse/ISPN-3421
Project: Infinispan
Issue Type: Bug
Components: State transfer
Affects Versions: 5.2.7.Final
Reporter: Erik Salter
Assignee: Dan Berindei
Priority: Critical
There's a hole in the state transfer mechanism that can occur when a node leaves the
cluster after creating entries that it was only able to replicate to some
of the nodes.
The problem occurs when the segment ownership of the node doesn't change after the
rebalance. Since state transfer does not request state for keys of which a node is already an
owner, the cache could be left in a state where a key is resident on fewer than numOwners nodes.
In addition, this could be any subset of the primary OR backup nodes.
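To make the symptom concrete, here is a small standalone sketch that finds keys held by fewer than numOwners nodes. It is only an illustration of the check: {{UnderReplicatedKeys}}, the node names and the keys are hypothetical, and the per-node key sets are assumed to have been collected some other way (for example in a test).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical diagnostic helper, not part of Infinispan.
final class UnderReplicatedKeys {
    /** Returns the keys that are present on fewer than numOwners nodes. */
    static Set<String> find(Map<String, Set<String>> keysByNode, int numOwners) {
        // Count how many nodes hold a copy of each key.
        Map<String, Integer> copies = new HashMap<>();
        for (Set<String> keys : keysByNode.values()) {
            for (String key : keys) {
                copies.merge(key, 1, Integer::sum);
            }
        }
        // Keep only the keys with too few copies (sorted for stable output).
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : copies.entrySet()) {
            if (e.getValue() < numOwners) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // numOwners = 2. A left after writing k1 to B only; k2 reached B and C.
        Map<String, Set<String>> keysByNode =
                Map.of("B", Set.of("k1", "k2"), "C", Set.of("k2"));
        System.out.println(find(keysByNode, 2)); // [k1]
    }
}
```

In this example B and C both keep their segment ownership after A leaves, so neither requests state for k1 and the key stays on a single node.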