On 17 Apr 2013, at 08:23, Dan Berindei <dan.berindei@gmail.com> wrote:
I like the idea of always clearing the state in members of the
minority partition(s), but one problem with that is that there may be some keys that only
had owners in the minority partition(s). If we wiped the state of the minority partition
members, those keys would be lost.
Right, this is my concern with such a wipe as well.
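To make the hazard concrete: before wiping a minority you would have to verify that no segment had all of its owners inside it. A rough sketch of that check, with made-up types standing in for the real consistent hash:

import java.util.List;
import java.util.Set;

// Made-up types for illustration; the real consistent hash API looks different.
final class WipeSafetyCheck {

    interface SegmentOwners {
        int numSegments();
        List<String> ownersOf(int segment);   // addresses of nodes owning a segment
    }

    // True only if every segment keeps at least one owner outside the minority,
    // i.e. wiping the minority cannot destroy the last copy of any key.
    static boolean safeToWipe(SegmentOwners hash, Set<String> minorityNodes) {
        for (int segment = 0; segment < hash.numSegments(); segment++) {
            boolean survivingOwner = false;
            for (String owner : hash.ownersOf(segment)) {
                if (!minorityNodes.contains(owner)) {
                    survivingOwner = true;
                    break;
                }
            }
            if (!survivingOwner) {
                return false;   // every owner of this segment is in the minority
            }
        }
        return true;
    }
}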
Of course, you could argue that the cluster already lost those keys
when we allowed the majority partition to continue working without having those keys... We
could also rely on the topology information, and say that we only support partitioning
when numOwners >= numSites (or numRacks, if there is only one site, or numMachines, if
there is a single rack).
This is only true for an embedded app. For an app communicating with the cluster over Hot
Rod it doesn't hold, since the client could read directly from the minority partition.
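For the embedded case the quoted constraint is simple enough to codify; something along the lines of the sketch below (the ClusterTopology holder is invented for illustration, not our API). It still doesn't help a Hot Rod client that happens to be talking to the minority.

// Hypothetical sketch; ClusterTopology and its fields are invented for illustration.
final class PartitionSupportCheck {

    static final class ClusterTopology {
        final int numOwners;
        final int numSites;
        final int numRacks;
        final int numMachines;

        ClusterTopology(int numOwners, int numSites, int numRacks, int numMachines) {
            this.numOwners = numOwners;
            this.numSites = numSites;
            this.numRacks = numRacks;
            this.numMachines = numMachines;
        }
    }

    // The quoted rule: with topology-aware distribution, numOwners >= numSites
    // means every site holds a copy, so a split along site boundaries cannot
    // lose data; the same reasoning applies one level down for racks and machines.
    static boolean partitioningSupported(ClusterTopology t) {
        if (t.numSites > 1) {
            return t.numOwners >= t.numSites;     // multi-site deployment
        } else if (t.numRacks > 1) {
            return t.numOwners >= t.numRacks;     // single site, multiple racks
        } else {
            return t.numOwners >= t.numMachines;  // single rack, multiple machines
        }
    }
}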
One other option is to perform a more complicated post-merge state
transfer, in which each partition sends all the data it has to all the other partitions,
and on the receiving end each node has a "conflict resolution" component that
can merge two values. That is definitely more complicated than just going with a primary
partition, though.
That sounds massively expensive. I think the right solution at this point is entry
versioning using vector clocks, where only the vector clocks, not the entire dataset,
are exchanged and compared during a merge.
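Very rough sketch of what I have in mind, just to be explicit about what travels over the wire at merge time (none of this is existing code, the classes are made up):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a per-entry vector clock that is exchanged and compared
// at merge time, so full values only move where there is something to reconcile.
final class VectorClock {
    enum Ordering { BEFORE, AFTER, EQUAL, CONCURRENT }

    private final Map<String, Long> counters = new HashMap<String, Long>();

    // Bump this node's counter on every local write to the entry.
    void increment(String nodeAddress) {
        Long current = counters.get(nodeAddress);
        counters.put(nodeAddress, current == null ? 1L : current + 1L);
    }

    Ordering compare(VectorClock other) {
        boolean less = false, greater = false;
        Set<String> nodes = new HashSet<String>(counters.keySet());
        nodes.addAll(other.counters.keySet());
        for (String node : nodes) {
            long mine = counters.containsKey(node) ? counters.get(node) : 0L;
            long theirs = other.counters.containsKey(node) ? other.counters.get(node) : 0L;
            if (mine < theirs) less = true;
            if (mine > theirs) greater = true;
        }
        if (less && greater) return Ordering.CONCURRENT;
        if (less) return Ordering.BEFORE;
        if (greater) return Ordering.AFTER;
        return Ordering.EQUAL;
    }
}

final class MergeReconciler {
    // At merge time each partition sends only (key -> clock); values are
    // transferred, and a conflict resolver invoked, only where the clocks demand it.
    <K> void reconcile(Map<K, VectorClock> localClocks, Map<K, VectorClock> remoteClocks) {
        for (Map.Entry<K, VectorClock> remote : remoteClocks.entrySet()) {
            VectorClock local = localClocks.get(remote.getKey());
            if (local == null) {
                continue;   // unknown locally: pull the value from the other partition
            }
            switch (local.compare(remote.getValue())) {
                case BEFORE:     break;   // remote is newer: pull its value
                case AFTER:      break;   // local is newer: push our value
                case CONCURRENT: break;   // true conflict: hand both values to a resolver
                case EQUAL:      break;   // identical history: nothing to transfer
            }
        }
    }
}

Only entries whose clocks turn out to be concurrent need a value transfer and a conflict resolver; everything else is decided from the clocks alone.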
One final point... when a node comes back online and it has a local
cache store, it is very much as if we had a merge view. The current approach is to join as
if the node didn't have any data, then delete everything from the cache store that is
not mapped to the node in the consistent hash. Obviously that can lead to consistency
problems, just like our current merge algorithm. It would be nice if we could handle both
these cases the same way.
+1
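For reference, the purge-on-rejoin behaviour Dan describes boils down to roughly the sketch below (the store and consistent-hash interfaces are simplified stand-ins, not the real SPI). If rejoin went through the same versioned-merge path instead, both cases would indeed be handled identically.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for illustration only; not the real Infinispan SPI.
interface SimpleCacheStore<K> {
    Iterator<K> keys();
    void delete(K key);
}

interface SimpleConsistentHash<K> {
    boolean isOwner(K key, String localNodeAddress);
}

final class RejoinPurge<K> {

    // Current behaviour, roughly: the node joins as if it were empty, then drops
    // every persisted entry that the new consistent hash no longer maps to it.
    // If the rest of the cluster lost those keys while the node was away, they
    // are gone for good, which is the same hazard as a bad merge.
    void purgeUnowned(SimpleCacheStore<K> store, SimpleConsistentHash<K> ch, String localNode) {
        List<K> toDelete = new ArrayList<K>();
        for (Iterator<K> it = store.keys(); it.hasNext(); ) {
            K key = it.next();
            if (!ch.isOwner(key, localNode)) {
                toDelete.add(key);
            }
        }
        for (K key : toDelete) {
            store.delete(key);
        }
    }
}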
--
Manik Surtani
manik@jboss.org
twitter.com/maniksurtani
Platform Architect, JBoss Data Grid
http://red.ht/data-grid