On 17 Apr 2013, at 16:53, Manik Surtani wrote:
> Of course, you could argue that the cluster already lost those
keys when we allowed the majority partition to continue working without having those
keys... We could also rely on the topology information, and say that we only support
partitioning when numOwners >= numSites (or numRacks, if there is only one site, or
numMachines, if there is a single rack).
This is only true for an embedded app. For an app communicating with the cluster over
Hot Rod, this isn't the case as it could directly read from the minority partition.
I don't think it's that simple:
- k1 is in the minority partition *only*
- it is removed/updated in the majority partition
- then read in the minority partition
> One other option is to perform a more complicated post-merge state transfer, in which
each partition sends all the data it has to all the other partitions, and on the receiving
end each node has a "conflict resolution" component that can merge two values.
That is definitely more complicated than just going with a primary partition, though.
That sounds massively expensive.
if you move only the keys + version around (not
the values) shouldn't be that bad from a load perspective.
I think the right solution at this point is entry versioning using
vector clocks and the vector clocks are exchanged and compared during a merge. Not the
entire dataset.
+1
> One final point... when a node comes back online and it has a local cache store, it
is very much as if we had a merge view. The current approach is to join as if the node
didn't have any data, then delete everything from the cache store that is not mapped
to the node in the consistent hash. Obviously that can lead to consistency problems, just
like our current merge algorithm. It would be nice if we could handle both these cases the
same way.
Cheers,
--
Mircea Markus
Infinispan lead (
www.infinispan.org)