On 18 Apr 2013, at 06:44, Dan Berindei <dan.berindei(a)gmail.com> wrote:
On Wed, Apr 17, 2013 at 5:53 PM, Manik Surtani <msurtani(a)redhat.com> wrote:
On 17 Apr 2013, at 08:23, Dan Berindei <dan.berindei(a)gmail.com> wrote:
> I like the idea of always clearing the state in members of the minority partition(s),
> but one problem with that is that there may be some keys that only had owners in the
> minority partition(s). If we wiped the state of the minority partition members, those
> keys would be lost.
Right, this is my concern with such a wipe as well.
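To make the failure mode concrete: a key is unrecoverable after the wipe exactly when every one of its owners sits in the minority. A toy sketch of that check (the `owners` map and all names are illustrative, not Infinispan's consistent-hash API):

```java
import java.util.*;

// Toy sketch: find keys that would be lost if the minority partition's
// members were wiped. Illustrative only; not the real consistent hash.
public class LostKeyCheck {
    // key -> owner nodes, as assigned by the consistent hash
    static final Map<String, List<String>> owners = Map.of(
            "k1", List.of("A", "B"),
            "k2", List.of("C", "D"),   // both owners in the minority
            "k3", List.of("B", "C"));

    static Set<String> keysLostIfWiped(Set<String> minority) {
        Set<String> lost = new HashSet<>();
        for (var e : owners.entrySet())
            if (minority.containsAll(e.getValue())) // every owner is minority-side
                lost.add(e.getKey());
        return lost;
    }

    public static void main(String[] args) {
        // A 4-node cluster {A,B,C,D} splits; {C,D} is the minority.
        System.out.println(keysLostIfWiped(Set.of("C", "D"))); // prints [k2]
    }
}
```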
> Of course, you could argue that the cluster already lost those keys when we allowed
> the majority partition to continue working without having those keys... We could also
> rely on the topology information, and say that we only support partitioning when
> numOwners >= numSites (or numRacks, if there is only one site, or numMachines, if
> there is a single rack).
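The quoted condition could be sketched as a simple guard on the widest topology level that actually has more than one unit (names are mine, purely illustrative):

```java
// Sketch of the proposed safety condition: only promise partition
// tolerance when numOwners covers the widest non-trivial topology level,
// so every key can have an owner on each side of a likely split.
public class PartitionSafety {
    static boolean partitionSafe(int numOwners, int numSites,
                                 int numRacks, int numMachines) {
        if (numSites > 1) return numOwners >= numSites;
        if (numRacks > 1) return numOwners >= numRacks;
        return numOwners >= numMachines;
    }

    public static void main(String[] args) {
        // numOwners=2 across 2 sites: each key can have an owner per site.
        System.out.println(partitionSafe(2, 2, 4, 8)); // prints true
        // numOwners=2 across 3 sites: some site misses owners for some keys.
        System.out.println(partitionSafe(2, 3, 3, 6)); // prints false
    }
}
```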
This is only true for an embedded app. For an app communicating with the cluster over
Hot Rod, this isn't the case as it could directly read from the minority partition.
For that to happen, the client would have to be able to keep two (or more) active
consistent hashes at the same time. I think, at least in the first phase, the servers in a
minority partition should send a "look somewhere else" response to any request
from the client, so that it installs the topology update of the majority partition and not
the topology of one of the minority partitions.
That might work with Hot Rod, but not with the REST or memcached endpoints.
> One other option is to perform a more complicated post-merge state transfer, in which
> each partition sends all the data it has to all the other partitions, and on the
> receiving end each node has a "conflict resolution" component that can merge two
> values. That is definitely more complicated than just going with a primary partition,
> though.
That sounds massively expensive. I think the right solution at this point is entry
versioning using vector clocks, where only the vector clocks, not the entire dataset,
are exchanged and compared during a merge.
True, it would be very expensive. I think for many applications just selecting a winner
should be fine, though, so it might be worth implementing this algorithm with the
versioning support we already have as a POC.
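For reference, the comparison that would run at merge time is the standard vector-clock partial order; a minimal self-contained sketch (not Infinispan's actual versioning API):

```java
import java.util.*;

// Minimal vector-clock comparison, as a sketch of what two merging
// partitions would exchange and compare per entry. CONCURRENT is the
// case that needs conflict resolution (or a "select a winner" policy).
public class VectorClock {
    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    static Order compare(Map<String, Integer> a, Map<String, Integer> b) {
        boolean aLess = false, bLess = false;
        Set<String> nodes = new HashSet<>(a.keySet());
        nodes.addAll(b.keySet());
        for (String n : nodes) {
            int x = a.getOrDefault(n, 0), y = b.getOrDefault(n, 0);
            if (x < y) aLess = true;
            if (y < x) bLess = true;
        }
        if (aLess && bLess) return Order.CONCURRENT;
        if (aLess) return Order.BEFORE;
        if (bLess) return Order.AFTER;
        return Order.EQUAL;
    }

    public static void main(String[] args) {
        // Writes on both sides of the split produce concurrent clocks:
        System.out.println(compare(Map.of("A", 2, "B", 1),
                                   Map.of("A", 1, "B", 2))); // prints CONCURRENT
    }
}
```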
I presume versioning (based on Vector Clocks) is now in master, as a part of Pedro's
work on total order?
> One final point... when a node comes back online and it has a local cache store, it
> is very much as if we had a merge view. The current approach is to join as if the
> node didn't have any data, then delete everything from the cache store that is not
> mapped to the node in the consistent hash. Obviously that can lead to consistency
> problems, just like our current merge algorithm. It would be nice if we could handle
> both these cases the same way.
+1
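The purge step described in the quote boils down to dropping every persisted entry the new consistent hash no longer maps here; a toy sketch (illustrative only, not the real cache-store SPI):

```java
import java.util.*;
import java.util.function.Predicate;

// Toy sketch of the current rejoin behavior: after joining with empty
// in-memory state, purge every persisted entry that the new consistent
// hash no longer maps to this node.
public class StalePurge {
    static void purgeStaleEntries(Map<String, byte[]> store,
                                  Predicate<String> ownedByThisNode) {
        store.keySet().removeIf(k -> !ownedByThisNode.test(k));
    }

    public static void main(String[] args) {
        Map<String, byte[]> store = new HashMap<>();
        store.put("k1", new byte[0]);
        store.put("k2", new byte[0]);
        // Pretend the new consistent hash only maps "k1" to this node.
        purgeStaleEntries(store, k -> k.equals("k1"));
        System.out.println(store.keySet()); // prints [k1]
    }
}
```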
I've been thinking about this some more... the problem with the local cache store is
that nodes don't necessarily start in the same order in which they were shut down. So
you might have enough nodes for a cluster to consider itself "available", but
only a slight overlap with the cluster as it looked the last time it was
"available" - so you would have stale data.
We might be able to save the topology on shutdown and block startup until the same nodes
that were in the last "available" partition are all up, but it all sounds a bit
fragile.
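The blocking condition would be little more than a set check; a hypothetical sketch (not an existing Infinispan feature, names are mine):

```java
import java.util.Set;

// Hypothetical sketch: persist the member list on clean shutdown and
// refuse to declare the cluster "available" until every member of that
// last available topology has rejoined.
public class StartupGate {
    static boolean canBecomeAvailable(Set<String> lastAvailableMembers,
                                      Set<String> currentMembers) {
        return currentMembers.containsAll(lastAvailableMembers);
    }

    public static void main(String[] args) {
        Set<String> last = Set.of("A", "B", "C");
        System.out.println(canBecomeAvailable(last, Set.of("A", "B")));      // prints false
        System.out.println(canBecomeAvailable(last, Set.of("A", "B", "C"))); // prints true
    }
}
```

The fragility is visible in the sketch itself: the gate never opens if one of the remembered members stays down for good.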
Yeah, those nodes may never come up again.
- M
--
Manik Surtani
manik(a)jboss.org
twitter.com/maniksurtani
Platform Architect, JBoss Data Grid
http://red.ht/data-grid