[infinispan-dev] design for cluster events (wiki page)

Bela Ban bban at redhat.com
Fri Nov 1 03:41:22 EDT 2013



On 10/31/13 10:50 PM, Erik Salter wrote:
> Thanks for the wiki and the discussion.  Since I'm encountering this in the
> wild, allow me to offer my thoughts.
>
> Basically my feeling is that no matter what you do in split-brain handling,
> it's going to be wrong.
> In my use case, I have individual blades, where each blade runs a suite of
> application nodes, one of which is a data grid node.  Each node is
> single-homed, and the blades wire into the same switch.  This setup is
> mirrored in a second data center, reachable over a WAN, and in this
> deployment the two DCs make up a single cluster.  There is a concept of a
> set of keys for my caches being "owned" by a site, i.e. only one set of
> clients will access those keys.  These keys are striped across the WAN with
> a TACH (topology-aware consistent hash).
>
> So a split brain within a local data center can only occur when a NIC on one
> of the blades goes bad while the node is still running.  The merge will
> always be of the [subgroups=N-1, 1] variety, where N is the number of
> running nodes in the cluster.  Since these nodes are single-homed, they
> cannot receive requests while the NIC is down.  I don't have to worry about
> state collision, but I DO have to worry about stale state from the merged
> node.

In my experience, partitions are almost never caused by malfunctioning 
hardware, but by GC pauses, high CPU spikes and other blocking behavior 
which causes FD/FD_ALL to falsely suspect a node.

> In this case, it's easy to tell when I might be in a split-brain.  The FD
> protocol will suspect and exclude a node.  Currently, though, I have no way
> of knowing how or why a node was excluded.

We *could* detect graceful leaves... this would narrow the exclusion 
cases to crashed and partitioned nodes.

> If the WAN goes down, I have a rather large problem.  First off is
> detection.  If there's an ACL blockage, or worse, a unidirectional outage
> (i.e. east can see west, but not vice versa), it takes the cluster about 60
> seconds to figure things out.

One minute because FD/FD_ALL is configured with a 60s timeout, correct?

I would definitely *not* lower this threshold, as we don't want entire 
sites to be falsely suspected, only to later get merged back.
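
Just to illustrate why (a simplified sketch, not the actual FD/FD_ALL 
implementation): this style of failure detection boils down to tracking 
the last heartbeat received from every member and suspecting anyone 
whose silence exceeds the timeout. A node stuck in a GC pause longer 
than the timeout stops heartbeating and gets suspected although it is 
perfectly healthy:

// Simplified sketch of heartbeat-based failure detection (illustration only,
// not JGroups code). Every member is expected to heartbeat at least once per
// 'timeoutMs'; a member whose last heartbeat is older than that is suspected.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatFailureDetector {
    private final long timeoutMs;   // e.g. 60_000 ms, matching the 60s above
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    HeartbeatFailureDetector(long timeoutMs) { this.timeoutMs = timeoutMs; }

    void heartbeatReceived(String member) {
        lastHeartbeat.put(member, System.currentTimeMillis());
    }

    // Called periodically. A GC pause longer than timeoutMs on 'member' makes
    // this return true even though the member is alive -- the false suspicion
    // mentioned above.
    boolean isSuspect(String member) {
        Long last = lastHeartbeat.get(member);
        return last != null && System.currentTimeMillis() - last > timeoutMs;
    }
}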

>  One side will have spurious MERGEs, the other side will have leaves from either FD_SOCK or FD.

You're not referring to xsite here, are you? This is your striped 
architecture, where you have a set of TCP-based stripes (clusters) 
*across* sites, right?

If so, having a cluster spread across sites is challenging, to say the 
least. The risk of partitions is greater than for a purely local 
cluster, as the WAN increases the risk of intermediate switches 
crashing, and of failure detection messages getting lost or delayed by 
high latency.

> So again, there's a problem of detection.  It's a pretty safe bet, though,
> that if I don't see nodes from the other data center, I'm in a split brain.
> But I still can't differentiate between a normal leave, a leave caused by a
> socket error, or a leave caused by an FD exclusion.


You *could* detect the first case. Yes, you can't detect the other two 
cases.
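
For the heuristic you mention above ("if I don't see nodes from the 
other data center, I'm in a split brain"), a plain view listener already 
gets you fairly far. The sketch below is an illustration only and 
assumes your node logical names encode the site (e.g. "east-1", 
"west-2") -- that naming convention is my assumption, not something 
JGroups gives you. A MergeView then tells you when the partition has 
healed:

// Sketch: flag a suspected split and a subsequent merge from JGroups view
// changes. Assumes logical member names are prefixed with the site name.
import org.jgroups.MergeView;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class SplitDetector extends ReceiverAdapter {
    private final String remoteSitePrefix;   // e.g. "west-" when running in East

    public SplitDetector(String remoteSitePrefix) {
        this.remoteSitePrefix = remoteSitePrefix;
    }

    @Override
    public void viewAccepted(View view) {
        if (view instanceof MergeView) {
            // The partition has healed; the subgroups show who had been split off.
            System.out.println("Merge detected, subgroups: " + ((MergeView) view).getSubgroups());
            return;
        }
        boolean seesRemoteSite = view.getMembers().stream()
                .anyMatch(addr -> addr.toString().startsWith(remoteSitePrefix));
        if (!seesRemoteSite)
            System.out.println("No members from the remote site in " + view + " -- possible split brain");
    }
}

Install it with channel.setReceiver(new SplitDetector("west-")) on the 
East nodes (and vice versa); it still won't tell you *why* the remote 
members disappeared, only that they did.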


> But if a WAN goes down, both sides are available to their respective clients.

OK

> Since my use case is such that the keys are unique to each side, all the
> CRUD operations on a key come from clients on one side of the WAN, even
> though the primary owner of that key may be on the other side.  Ex: East and
> West are a single cluster.  East makes modifications to key K, which has its
> primary owner in West (suboptimal, but it works).  The WAN splits apart.
> Clients from East continue to make modifications to K.  If clients fail over
> to West, they are still the only ones making modifications to K.

But the modifications made in East to K won't make it to West...


> The real problem occurs when the WAN event is over and the clusters merge.
> The merge handling has no way of knowing where to source K from.
>
> So I think we're missing vector clocking.


I agree: eventual consistency would help here. Your TACH-based 
consistent hash would guarantee that each key is present at least once 
in each site, so we won't have the issue of non-existent keys. However, 
as you mention, after the partition heals, you need a way to resolve 
conflicts.

> And taking it back a step, this doesn't have to be a WAN.  The use case is:
> split brain, but clients can only write to one side.  I would tend to think
> that's pretty common.  This can be solved with vector clocking (or any other
> versioning scheme).

But then you lose strict consistency (the system becomes AP). Plus, the 
application has to provide code to merge conflicts that the system 
cannot resolve automatically.
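
As a sketch of what such application-level merge code could look like 
(hypothetical, not an existing Infinispan API): version each value with 
a vector clock, keep the causally later version when the clocks are 
ordered, and fall back to "pick the latest" wall-clock time (or a user 
callback) when they are concurrent:

// Hypothetical sketch, not an existing Infinispan API: vector-clock versioned
// values with a "pick the latest" fallback for concurrent updates.
import java.util.HashMap;
import java.util.Map;

class VectorClock {
    final Map<String, Long> counters = new HashMap<>();

    void increment(String node) {
        counters.merge(node, 1L, Long::sum);
    }

    // true if every counter in this clock is <= the corresponding counter in 'other'
    boolean happenedBefore(VectorClock other) {
        for (Map.Entry<String, Long> e : counters.entrySet())
            if (e.getValue() > other.counters.getOrDefault(e.getKey(), 0L))
                return false;
        return true;
    }
}

class Versioned<V> {
    final V value;
    final VectorClock clock;
    final long wallClockMillis;   // last modification time, used only as a tie-breaker

    Versioned(V value, VectorClock clock, long wallClockMillis) {
        this.value = value;
        this.clock = clock;
        this.wallClockMillis = wallClockMillis;
    }
}

class MergePolicy {
    // Called for each key that exists on both sides after the partition heals.
    static <V> Versioned<V> resolve(Versioned<V> east, Versioned<V> west) {
        if (east.clock.happenedBefore(west.clock)) return west;   // west saw east's update
        if (west.clock.happenedBefore(east.clock)) return east;   // east saw west's update
        // concurrent updates: "pick the latest" by wall clock, or hand off to the app
        return east.wallClockMillis >= west.wallClockMillis ? east : west;
    }
}

The hard part is of course the last branch; that's exactly the 
application-level interface you allude to below.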

> In my use case, it would be "pick the latest".  I mentioned vector clocking
> because if you detected a split-brain event at time T and were trying to
> merge keys modified after that time, some application-level interface would
> need to be called.
>
> When we talk about controlling availability, I tend to think users really
> care about when the event started and how to detect it.   Availability is a
> "damned if you do, damned if you don't" scenario, due to the complex
> networking scenarios you can envision.  From my perspective, I really care
> about "how do I detect it?"  Then it becomes "how do I recover?"
>
> Just some disparate thoughts,
>
> Erik
>
> P.S.  (Yes, I know about XSite.  It has some limitations that necessitated
> the architecture above)
>
>
> -----Original Message-----
> From: infinispan-dev-bounces at lists.jboss.org
> [mailto:infinispan-dev-bounces at lists.jboss.org] On Behalf Of Mircea Markus
> Sent: Wednesday, October 30, 2013 3:05 PM
> To: infinispan -Dev List
> Subject: [infinispan-dev] design for cluster events (wiki page)
>
> Hi,
>
> I've added a wiki page[1] capturing our discussions around cluster events.
> Any feedback welcomed!
>
> [1]
> https://github.com/infinispan/infinispan/wiki/Handling-cluster-partitions
>
> Cheers,
>

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)

