[infinispan-dev] design for cluster events (wiki page)
Mircea Markus
mmarkus at redhat.com
Fri Nov 1 07:03:50 EDT 2013
On Nov 1, 2013, at 7:41 AM, Bela Ban <bban at redhat.com> wrote:
>
>
> On 10/31/13 10:50 PM, Erik Salter wrote:
>> Thanks for the wiki and the discussion. Since I'm encountering this in the
>> wild, allow me to offer my thoughts.
>>
>> Basically my feeling is that no matter what you do in split-brain handling,
>> it's going to be wrong.
>> In my use case, I have individual blades where each blade runs a suite of
>> application nodes, one of which is a data grid node. Each node is
>> single-homed, and they all wire into the same switch. This setup is
>> mirrored across data centers (connected over a WAN). In this deployment,
>> the two DCs make up a
>> single cluster. There is a concept of a set of keys for my caches being
>> "owned" by a site, i.e. only one set of clients will access these keys.
>> These keys are striped across the WAN with a TACH (topology-aware
>> consistent hash).
>>
>> So a split brain local to one data center can only occur when a NIC on one
>> of the blades goes bad while the node is still running. The merge will always be
>> of the [subgroups=N-1, 1] variety, where N is the number of running nodes in
>> the cluster. Since these nodes are single-homed, they cannot receive
>> requests if they are "offline" from the NIC. I don't have to worry about
>> state collision, but I DO have to worry about stale state from the merged
>> node.
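
For reference, that [subgroups=N-1, 1] pattern is visible to the application
through the cache manager's merge notification. A minimal sketch, assuming
the @Merged listener API; markStateStale() is a hypothetical hook standing in
for whatever stale-state handling the application needs:

import java.util.List;

import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.Merged;
import org.infinispan.notifications.cachemanagerlistener.event.MergeEvent;
import org.infinispan.remoting.transport.Address;

@Listener
public class StaleStateMergeListener {

   @Merged
   public void onMerge(MergeEvent event) {
      // Each inner list is one subgroup that has just been merged back.
      for (List<Address> subgroup : event.getSubgroupsMerged()) {
         // The [subgroups=N-1, 1] variety: a singleton rejoining the
         // majority. Its state has to be treated as stale.
         if (subgroup.size() == 1)
            markStateStale(subgroup.get(0));
      }
   }

   // Hypothetical application hook: refresh or invalidate whatever the
   // rejoined node may have served while it was cut off.
   private void markStateStale(Address rejoined) {
      System.out.println("Treating state from " + rejoined + " as stale");
   }
}

The listener is registered on the manager, e.g.
cacheManager.addListener(new StaleStateMergeListener()).
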
>
> In my experience, partitions are almost never caused by malfunctioning
> hardware, but by GC pauses, high CPU spikes and other blocking behavior
> which causes FD/FD_ALL to falsely suspect a node.
>
>> In this case, it's easy to tell when I might be in a split-brain. The FD
>> protocol will suspect and exclude a node. Currently, though, I have no way
>> of knowing how or why a node was excluded.
>
> We *could* detect graceful leaves... this would narrow the exclusion
> cases to crashed and partitioned nodes.
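
Today all the application can do is diff the two views and treat every
disappearance the same way. A minimal sketch using the @ViewChanged
cache manager notification (class name is illustrative):

import java.util.ArrayList;
import java.util.List;

import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;
import org.infinispan.remoting.transport.Address;

@Listener
public class ExclusionListener {

   @ViewChanged
   public void onViewChange(ViewChangedEvent event) {
      // Members present in the old view but missing from the new one.
      List<Address> excluded = new ArrayList<Address>(event.getOldMembers());
      excluded.removeAll(event.getNewMembers());
      // A crash, a graceful leave and a partition all look identical
      // here: the member is simply gone from the new view.
      for (Address node : excluded)
         System.out.println("Excluded from view: " + node);
   }
}
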
>
>> If the WAN goes down, I have a rather large problem. First off is
>> detection. If there's an ACL blockage, or worse, a unidirectional outage
>> (i.e. east can see west, but not vice-versa), it takes the cluster a minute
>> (really, about 60 seconds) to figure things out.
>
> 1 minute because FD/FD_ALL is configured with a 60s timeout, correct?
>
> I would definitely *not* lower this threshold, as we don't want entire
> sites to be falsely suspected, only to later get merged back.
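
For context, the one minute comes from the failure detection fragment of the
stack; typically something like this (values assumed, along the lines of the
udp.xml defaults):

   <FD_SOCK/>
   <FD_ALL timeout="60000" interval="3000"/>
   <VERIFY_SUSPECT timeout="1500"/>

With timeout="60000", a member that stops sending heartbeats (hung,
partitioned or dead) is only suspected after a full minute, which matches
the delay Erik describes.
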
>
>> One side will see spurious MERGE events; the other side will see leaves triggered by either FD_SOCK or FD.
>
> You're not referring to xsite here, are you? This is your striped
> architecture, where you have a set of TCP-based stripes (clusters)
> *across* sites, right ?
>
> If so, having a cluster spread across sites is challenging, to say the
> least. The risk of partitions is greater than for a purely local
> cluster: the WAN adds the risk of intermediate switches crashing and of
> failure detection messages getting lost or delayed by high latency.
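
The alternative is two independent local clusters bridged by cross-site
replication, with each cache backing up its data to the remote site over
RELAY2. Roughly, assuming the programmatic sites() configuration builder
(a sketch; the site name "WEST" is made up):

import org.infinispan.configuration.cache.BackupConfiguration.BackupStrategy;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class XSiteBackupConfig {
   public static Configuration eastCacheConfig() {
      ConfigurationBuilder builder = new ConfigurationBuilder();
      // The cache runs in a purely local cluster and asynchronously
      // backs up writes to site "WEST" over the RELAY2 bridge.
      builder.sites().addBackup()
             .site("WEST")
             .strategy(BackupStrategy.ASYNC);
      return builder.build();
   }
}
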
Erik, why not use xsite in this deployment? Lack of state transfer, perhaps?
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)