On Nov 1, 2013, at 7:41 AM, Bela Ban <bban(a)redhat.com> wrote:
On 10/31/13 10:50 PM, Erik Salter wrote:
> Thanks for the wiki and the discussion. Since I'm encountering this in the
> wild, allow me to offer my thoughts.
>
> Basically my feeling is that no matter what you do in split-brain handling,
> it's going to be wrong.
> In my use case, I have individual blades where each blade runs a suite of
> application nodes; one of which is a data grid node. Each node is
> single-homed. And they wire into the same switch. This setup is orthogonal
> across a data center (WAN). In this deployment, these two DCs make up a
> single cluster. There is a concept of a set of keys for my caches being
> "owned" by a site, i.e. only one set of clients will access these keys.
> These keys are striped across the WAN with a TACH (topology-aware consistent hash).
>
> So a split brain on a local data center only can occur when a NIC on one of
> the blades goes bad and the node is still running. The merge will always be
> of the [subgroups=N-1, 1] variety, where N is the number of running nodes in
> the cluster. Since these nodes are single-homed, they cannot receive
> requests if they are "offline" from the NIC. I don't have to worry about
> state collision, but I DO have to worry about stale state from the merged
> node.
In my experience, partitions are almost never caused by malfunctioning
hardware, but by GC pauses, CPU spikes and other blocking behavior
that causes FD/FD_ALL to falsely suspect a node.
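One common mitigation on the JGroups side (a sketch, not Erik's actual stack; the attribute values below are illustrative, not a recommendation) is to pair FD_ALL with VERIFY_SUSPECT, which re-checks a suspicion before the node is excluded, giving a long GC pause a chance to end before the view changes:

```xml
<!-- Illustrative JGroups protocol stack fragment; timeout values are examples only -->
<FD_SOCK/>                                  <!-- catches hard crashes quickly via TCP socket close -->
<FD_ALL timeout="60000" interval="8000"/>   <!-- heartbeat-based detection; a generous timeout tolerates GC pauses -->
<VERIFY_SUSPECT timeout="1500"/>            <!-- re-pings a suspected member before it is excluded -->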
> In this case, it's easy to tell when I might be in a split-brain. The FD
> protocol will suspect and exclude a node. Currently, though, I have no way
> of knowing how or why a node was excluded.
We *could* detect graceful leaves... this would narrow the exclusion
cases to crashed and partitioned nodes.
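Until that exists, one way to at least react to a merge after the fact is via the JGroups view callback: a MergeView carries the subgroups that are being merged, so in Erik's [subgroups=N-1, 1] case the singleton subgroup identifies the node whose state may be stale. A sketch using the public JGroups API (not Erik's code):

```java
import org.jgroups.MergeView;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class MergeAwareReceiver extends ReceiverAdapter {
    @Override
    public void viewAccepted(View view) {
        if (view instanceof MergeView) {
            MergeView merge = (MergeView) view;
            // In the [subgroups=N-1, 1] case, the singleton subgroup is the
            // node that was partitioned off and may carry stale state.
            for (View sub : merge.getSubgroups()) {
                if (sub.size() == 1)
                    System.out.println("Rejoined after partition: " + sub.getMembers());
            }
        }
    }
}
```

This distinguishes merges from plain view changes, but it still can't tell a crashed node from a partitioned one, which is the gap graceful-leave detection would close.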
> If the WAN goes down, I have a rather large problem. First off is
> detection. If there's an ACL blockage, or worse, a unidirectional outage
> (i.e. east can see west, but not vice-versa), it takes the cluster a minute
> (really, about 60 seconds) to figure things out.
One minute because FD/FD_ALL is configured with a 60-second timeout, correct?
I would definitely *not* lower this threshold, as we don't want entire
sites to be falsely suspected, only to later get merged back.
> One side will have spurious MERGEs, the other side will have leaves from
> either FD_SOCK or FD.
You're not referring to xsite here, are you ? This is your striped
architecture, where you have a set of TCP-based stripes (clusters)
*across* sites, right ?
If so, having a cluster spread across sites is challenging, to say the
least. The risk of partitions is greater than for a purely local
cluster, as the WAN increases the risk of intermediate switches
crashing, of failure-detection messages getting lost or being delayed
by high latency, etc.
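For a cluster stretched across a WAN like this, the MERGE3 protocol governs how quickly partitioned subgroups find each other and merge back. A sketch (values are illustrative, not tuned for Erik's deployment):

```xml
<!-- Illustrative MERGE3 settings for a WAN-stretched cluster -->
<MERGE3 min_interval="10000" max_interval="30000"/>
<!-- Members periodically announce their views; differing views trigger a merge.
     Longer intervals mean less WAN chatter but slower merge detection. -->
```

The trade-off mirrors the FD timeout discussion above: aggressive intervals heal partitions faster but generate more cross-WAN traffic and more spurious merge attempts under latency spikes.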
Erik, why not use xsite in this deployment? Lack of state transfer perhaps?
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)