On 17 Apr 2013, at 13:23, Dan Berindei wrote:
I like the idea of always clearing the state in members of the
minority partition(s), but one problem with that is that there may be some keys that only
had owners in the minority partition(s). If we wiped the state of the minority partition
members, those keys would be lost.
Indeed, data consistency is lost the moment we
have a partition with numOwners members. So with the read-only cluster approach, we can
only target *eventual* consistency - that's when the partitions are merged.
Of course, you could argue that the cluster already lost those keys when we allowed the
majority partition to continue working without having those keys... We could also rely on
the topology information, and say that we only support partitioning when numOwners >=
numSites (or numRacks, if there is only one site, or numMachines, if there is a single
rack).
Good point re: topology.
That assumes that there won't be any split brains in the same site (or rack), which
I'm not sure holds true in general. Bela, care to comment?
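
For illustration only, the topology rule discussed above could be sketched roughly as follows. The class and method names are hypothetical (nothing here is an existing Infinispan API), and the sketch assumes the counts of sites, racks and machines are already known from the topology information:

```java
// Hypothetical helper, not part of Infinispan: a sketch of the rule
// "only support partitioning when numOwners >= numSites (or numRacks if
// there is only one site, or numMachines if there is a single rack)".
final class PartitionSupportCheck {
    private PartitionSupportCheck() {}

    static boolean supportsPartitioning(int numOwners, int numSites,
                                        int numRacks, int numMachines) {
        int failureDomains;
        if (numSites > 1) {
            failureDomains = numSites;      // several sites: compare against sites
        } else if (numRacks > 1) {
            failureDomains = numRacks;      // one site, several racks
        } else {
            failureDomains = numMachines;   // single rack: compare against machines
        }
        return numOwners >= failureDomains;
    }
}
```

As noted above, this only helps if a split never happens inside a single site (or rack), which is an assumption rather than a guarantee.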
One other option is to perform a more complicated post-merge state transfer, in which
each partition sends all the data it has to all the other partitions, and on the receiving
end each node has a "conflict resolution" component that can merge two values.
That is definitely more complicated than just going with a primary partition, though.
+1
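
To make the "conflict resolution" idea a bit more concrete, here is a minimal sketch, assuming each partition's data can be viewed as a plain map and that the resolver is a user-supplied function over two non-null values. `PostMergeSketch` and `mergePartitions` are made-up names, not existing Infinispan code, and a real state transfer would of course be far more involved:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical sketch of a post-merge step: combine the data held by two
// partitions, calling a resolver whenever both sides hold the same key.
final class PostMergeSketch {
    private PostMergeSketch() {}

    static <K, V> Map<K, V> mergePartitions(Map<K, V> local,
                                            Map<K, V> remote,
                                            BinaryOperator<V> resolver) {
        Map<K, V> merged = new HashMap<>(local);
        for (Map.Entry<K, V> e : remote.entrySet()) {
            // Key only on the remote side: copied as-is.
            // Key on both sides: resolver decides the surviving value.
            merged.merge(e.getKey(), e.getValue(), resolver);
        }
        return merged;
    }
}
```

Note that a resolver of this shape cannot tell a key that was deleted in one partition from a key that never existed there, so deletions would need extra information (tombstones, for example) to survive the merge.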
One final point... when a node comes back online and it has a local cache store, it is
very much as if we had a merge view. The current approach is to join as if the node
didn't have any data, then delete everything from the cache store that is not mapped
to the node in the consistent hash.
With this approach, a value that has been deleted
within the cluster might be resurrected. Wouldn't it be better to delete everything from
the cache store?
Obviously that can lead to consistency problems, just like our
current merge algorithm. It would be nice if we could handle both these cases the same
way.
+1. The cache store is the equivalent of the read-only partition.
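
The two rejoin options discussed above can be contrasted with a small sketch. It assumes the local cache store can be modelled as a mutable map and that ownership in the consistent hash is available as a predicate; `RejoinStoreSketch` and both method names are hypothetical, not the actual cache store interface:

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of what a rejoining node could do with its local store.
final class RejoinStoreSketch {
    private RejoinStoreSketch() {}

    // Current approach as described: keep only entries that the consistent
    // hash maps to this node. Entries deleted cluster-wide while the node
    // was offline survive if the node still owns their keys.
    static <K, V> void purgeNotOwned(Map<K, V> store, Predicate<K> isLocalOwner) {
        store.keySet().removeIf(isLocalOwner.negate());
    }

    // Alternative raised above: wipe the store and rely on state transfer.
    static <K, V> void purgeAll(Map<K, V> store) {
        store.clear();
    }
}
```

With `purgeNotOwned`, a stale entry for a key the node still owns stays in the store, which is how a deleted value can come back; `purgeAll` avoids that at the cost of losing any key whose only copy was in this store.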
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)