[infinispan-dev] ISPN-263 and handling partitions

Dan Berindei dan.berindei at gmail.com
Thu Apr 18 05:44:20 EDT 2013


On Wed, Apr 17, 2013 at 5:53 PM, Manik Surtani <msurtani at redhat.com> wrote:

>
> On 17 Apr 2013, at 08:23, Dan Berindei <dan.berindei at gmail.com> wrote:
>
> I like the idea of always clearing the state in members of the minority
> partition(s), but one problem with that is that there may be some keys that
> only had owners in the minority partition(s). If we wiped the state of the
> minority partition members, those keys would be lost.
>
>
> Right, this is my concern with such a wipe as well.
>
> Of course, you could argue that the cluster already lost those keys when
> we allowed the majority partition to continue working without having those
> keys... We could also rely on the topology information, and say that we
> only support partitioning when numOwners >= numSites (or numRacks, if there
> is only one site, or numMachines, if there is a single rack).
>
>
> This is only true for an embedded app.  For an app communicating with the
> cluster over Hot Rod, this isn't the case as it could directly read from
> the minority partition.
>
>
For that to happen, the client would have to be able to keep two (or more)
active consistent hashes at the same time. I think, at least in the first
phase, the servers in a minority partition should send a "look somewhere
else" response to any request from the client, so that it installs the
topology update of the majority partition and not the topology of one of
the minority partitions.
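The "look somewhere else" idea can be sketched roughly like this; all names here are illustrative, not actual Infinispan APIs, and a real implementation would hang off the server's view of the last stable topology:

```java
// Hypothetical sketch: a server that finds itself in a minority partition
// refuses client requests, so a Hot Rod client only ever installs the
// topology of the majority partition. Names are illustrative only.
public class MinorityCheck {

    // A partition is a majority iff it holds more than half of the
    // members of the last stable view (a strict majority; ties lose).
    static boolean isMajority(int currentMembers, int lastStableMembers) {
        return currentMembers * 2 > lastStableMembers;
    }

    // What a server might answer to a client request.
    static String handleRequest(int currentMembers, int lastStableMembers) {
        if (!isMajority(currentMembers, lastStableMembers)) {
            // Redirect the client to re-resolve the topology elsewhere
            // instead of serving possibly stale data from this partition.
            return "LOOK_SOMEWHERE_ELSE";
        }
        return "OK";
    }

    public static void main(String[] args) {
        // 2 of 5 nodes: minority, must redirect the client.
        System.out.println(handleRequest(2, 5)); // LOOK_SOMEWHERE_ELSE
        // 3 of 5 nodes: majority, may serve the request.
        System.out.println(handleRequest(3, 5)); // OK
    }
}
```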

One other option is to perform a more complicated post-merge state

> transfer, in which each partition sends all the data it has to all the
> other partitions, and on the receiving end each node has a "conflict
> resolution" component that can merge two values. That is definitely more
> complicated than just going with a primary partition, though.
>
>
> That sounds massively expensive.  I think the right solution at this point
> is entry versioning using vector clocks: the vector clocks are exchanged
> and compared during a merge, not the entire dataset.
>
>
True, it would be very expensive. I think for many applications just
selecting a winner should be fine, though, so it might be worth
implementing this algorithm with the versioning support we already have as
a POC.
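The "just selecting a winner" variant could look roughly like the sketch below, assuming a simple monotonically increasing per-entry version in the spirit of the existing versioning support; `VersionedValue` and `resolve` are hypothetical names, not Infinispan classes:

```java
// Minimal sketch of winner selection during a merge: compare per-entry
// versions from the two partitions and keep the higher one. The version
// here stands in for whatever clustered version the cache already tracks.
public class MergeResolver {

    static final class VersionedValue {
        final String value;
        final long version; // illustrative: a simple clustered version counter
        VersionedValue(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    // Conflict resolution: higher version wins; ties keep the local value.
    static VersionedValue resolve(VersionedValue local, VersionedValue remote) {
        if (remote == null) return local;
        if (local == null) return remote;
        return remote.version > local.version ? remote : local;
    }

    public static void main(String[] args) {
        VersionedValue a = new VersionedValue("from-partition-A", 7);
        VersionedValue b = new VersionedValue("from-partition-B", 9);
        System.out.println(resolve(a, b).value); // from-partition-B
    }
}
```

With vector clocks instead of a plain counter, `resolve` would additionally have to surface concurrent (incomparable) versions to an application-level conflict handler rather than silently picking a side.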

> One final point... when a node comes back online and it has a local cache
> store, it is very much as if we had a merge view. The current approach is
> to join as if the node didn't have any data, then delete everything from
> the cache store that is not mapped to the node in the consistent hash.
> Obviously that can lead to consistency problems, just like our current
> merge algorithm. It would be nice if we could handle both these cases the
> same way.
>
>
> +1
>
>

I've been thinking about this some more... the problem with the local cache
store is that nodes don't necessarily start in the same order in which they
were shut down. So you might have enough nodes for the cluster to consider
itself "available", but only a slight overlap with the cluster as it looked
the last time it was "available" - so you would have stale data.

We might be able to save the topology on shutdown and block startup until
the same nodes that were in the last "available" partition are all up, but
it all sounds a bit fragile.
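The save-and-block idea amounts to a startup gate along these lines; `RestartGate` and its methods are hypothetical names for a sketch, assuming the member list of the last "available" topology was persisted on clean shutdown:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of "save the topology on shutdown": after a restart, only declare
// the cluster available once every member of the last "available" topology
// has rejoined. All names here are hypothetical, not Infinispan APIs.
public class RestartGate {

    private final Set<String> lastAvailableMembers;

    RestartGate(Set<String> lastAvailableMembers) {
        // Persisted on clean shutdown; loaded here on startup.
        this.lastAvailableMembers = new HashSet<>(lastAvailableMembers);
    }

    // The cluster may become available only when the current view covers
    // the saved one, so no node serves data that could be stale relative
    // to a still-missing peer.
    boolean canBecomeAvailable(Set<String> currentView) {
        return currentView.containsAll(lastAvailableMembers);
    }

    public static void main(String[] args) {
        RestartGate gate = new RestartGate(Set.of("A", "B", "C"));
        System.out.println(gate.canBecomeAvailable(Set.of("A", "B")));           // false
        System.out.println(gate.canBecomeAvailable(Set.of("A", "B", "C", "D"))); // true
    }
}
```

The fragility mentioned above shows up directly here: if any saved member is permanently lost, this gate blocks forever unless an administrator can override it.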

Cheers
Dan