Dan Berindei commented on ISPN-5016:
> Because data is stored in the shared store after the invalidation
command was executed on all the nodes, it’s also possible for a node to read the old value
from the shared store after it has executed the invalidation.
I don't understand the reason very well, could you elaborate a bit more?
> Invalidations triggered by write operations are synchronous and
the operation will fail with a SuspectException if another node crashes. The entry will
not be updated on the originator, but invalidation will still be performed on the
I thought SuspectExceptions are silently handled, aren't they?
Nope, they are silently handled only in some cases (e.g. L1 invalidation commands), but
not in all cases (e.g. prepare commands, or regular invalidation commands). It does sound
like a bug, I'll create one.
> For multi-key operations like putAll(), the operation is sent to
all the primary owners of the affected keys. When there is a single primary owner for all
the affected keys, the operation succeeds and its effects are visible atomically. When
there are multiple primary owners, the operation is not atomic: overlapping read
operations will see random subsets of the updated values.
As reads don't acquire any locks, I think that even update on single primary owner
can be seen as non-atomic.
Yeah, I was trying to say that it preserve the semantics of a local cache, but multi-key
operations in a local cache aren't atomic either.
> If a node is suspected because of a Full GC, it might go from the
initial JGroups view straight to the merge view. If that happens, its topology will be the
largest one, and it will not be wiped, neither will it receive new data. Instead, it will
keep the (possibly stale) entries it had before the Full GC.
Sorry, this is not very clear to me. Why largest topology ID and why wouldn't that
propagate its state to all other members (wiping their data)?
It's because the topology with the most nodes is elected as the merge topology. It
seemed like a good idea when I implemented it, but perhaps it would have been better to
use the one with the largest topology id instead.
> If at least half of the nodes in the stable topology leave in
What is quick succession? Does that depend on JGroups view installation, any timeout or
the duration of rebalance?
It means before the stable topology is updated again. In replicated mode there is no state
transfer when a node leaves, that means the coordinator just needs to receive the
rebalance confirmation from all the nodes.
> And if some of the nodes in the Available partition’s consistent
hash are not really accessible after the merge, the cache might stay Degraded.
'the cache'? We should be talking rather about partitions, not the whole cache. I
understand the first part of the special case, but not the latter.
Changed to 'the merged partition'.
> If a node joins and becomes a backup owner after a write command
was sent from the primary owner to the backups, but before the primary owner updates its
own data container, it may not receive the value neither as a write command nor via state
Sounds like a bug to me - is there a JIRA that could be linked? We could tolerate
inconsistencies when the node crashes (if we can't fix it), but join or graceful leave
should keep the cluster consistent.
I don't think it's any worse than the inconsistency in the previous paragraph.
> When a write to the store fails, it will fail the write
Does this hold for write-behind, too?
That's the next paragraph...
> With write-behind or asynchronous replication enabled, store
write failures are hidden from the user (unless the originator is the primary owner, with
When originator == primary and write-behind, how can the failure to write to the store be
propagated to the user? I thought that user thread initiates both write to the store and
async replication and returns.
Reworded, I meant with async replication but without write-behind.
> When the partitions merge back, there is no effort to replicate
the values from one partition to another.
Why is that different from non-tx mode, where the partitions with non-highest topology id
are wiped? Moreover, in optimistic tx you write
> Same as with pessimistic and non-transactional caches.
- what version, then?
It's actually the same, I added that to the pessimistic mode description.
> If the primary owners of the keys written by the transaction are
all in the local transaction,
> If one partition stays available, its entries will replace all
the other partitions' entries on merge, undoing partial commits in those partitions.
Do I understand correctly that degraded partition may commit transaction, and this
transaction will be later ignored (the data will later be overwritten to the previous
values). Why is this behaviour desired?
Simplicity: whenever we have a merge, we pick one winning partition, and we wipe the nodes
in other partitions.
Ideally one partition shouldn't be able to commit when it doesn't have a majority
of nodes, but there's no way to prevent partial commits with single-phase commit.
> Transactions already prepared, however, will commit successfully
even in minority partitions
Is that true even if the originator is not in this minority partition?
I intended to say that the tx will succeed (will not throw any exception) even if the
originator ends up in a minority partition.
Obviously the same applies if the originator's partition stays Available.
> When a transaction needs to acquire more than one key lock with
the same primary node, they are always acquired in the same order, so this will not cause
If the keys have the same hashCodes, they can be locked in different order, though.
> The commit is always synchronous on the originator, so a
transaction T3 started on node A after T1 finished will see T1’s updates.
Will it see all T1's updates, or just updates on those entries owned by A?
This is replicated mode, so it's all the updates.
> The write to the attached cache store(s) is performed during the
one-phase prepare command or the commit command, depending on the configuration.
What configuration, exactly?
Basically with pessimistic locking or async replication, you have a one-phase prepare
command, otherwise you have separate prepare and commit commands.
I've also added a section for use1PcForAutoCommitTransactions, which uses another
one-phase commit mode.
> TODO What happens if a node is a backup owner for two of the
Hmm? The primary owner sends the PutMapCommand to all nodes that are backup owners of at
least one of the keys primary-owned by this node. But you're right that I am not sure
whether this command makes the backup node write only
Yeah, the backup doesn't do any check on who forwarded the command so it can update
the wrong entries.
> When a node tries to read a key it does not own, it sends a
AFAIK the node checks if the entry is in data container, not whether it is owned.
That changed with ISPN-4444.
Regarding the situation that an entry is lost, I think that you
should note somewhere that further attempts to read it will simply return null, not
throwing any exceptions (the user could expect that).
Both can happen, depending on the partition handling configuration, so I've added a
I lack explicit resolution on ISPN-4995, regarding reading and
writing two (or more) entries and expectation whether any read is already visible.
I've added another paragraph to cover the "after the update is finished, other
processes can see only the new value" rule in the ISPN-4995 scenario.
I only had it covered in replicated mode, where minority partition nodes can also read
stale values until they enter degraded mode.
Specify and document cache consistency guarantees
Issue Type: Task
Affects Versions: 7.0.2.Final
Reporter: Radim Vansa
Assignee: Dan Berindei
We can't simply use the consistency model defined by Java Specification and broaden
it for whole cache (maybe the expression "can't" is too strong, but we
definitely don't want to do that in some cases).
By consistency guarantees/model I mean mostly in which order are
writes allowed to be observed: and we can't boil it down to simply
causal, PRAM or any other consistency model as writes can be observed as non-atomic in
Infinispan documentation is quite scarce about that, the only trace I've
found is in Glossarry  "Infinispan has traditionally followed ACID
principles as far as possible, however an eventually consistent mode
embracing BASE is on the roadmap."
This message was sent by Atlassian JIRA