[infinispan-dev] design for cluster events (wiki page)
Mircea Markus
mmarkus at redhat.com
Fri Nov 1 07:03:50 EDT 2013
On Nov 1, 2013, at 7:41 AM, Bela Ban <bban at redhat.com> wrote:
>
>
> On 10/31/13 10:50 PM, Erik Salter wrote:
>> Thanks for the wiki and the discussion. Since I'm encountering this in the
>> wild, allow me to offer my thoughts.
>>
>> Basically my feeling is that no matter what you do in split-brain handling,
>> it's going to be wrong.
>> In my use case, I have individual blades where each blade runs a suite of
>> application nodes, one of which is a data grid node. Each node is
>> single-homed, and they all wire into the same switch. This setup is
>> mirrored across data centers (connected over a WAN). In this deployment,
>> the two DCs make up a
>> single cluster. There is a concept of a set of keys for my caches being
>> "owned" by a site, i.e. only one set of clients will access these keys.
>> These keys are striped across the WAN with a TACH (topology-aware
>> consistent hash).
>>
>> So a split brain local to one data center can only occur when a NIC on one
>> of the blades goes bad while the node is still running. The merge will always be
>> of the [subgroups=N-1, 1] variety, where N is the number of running nodes in
>> the cluster. Since these nodes are single-homed, they cannot receive
>> requests if they are "offline" from the NIC. I don't have to worry about
>> state collision, but I DO have to worry about stale state from the merged
>> node.
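
For reference, that [subgroups=N-1, 1] pattern is visible to the application
through the cache manager's merge notification. A minimal sketch, assuming
the @Merged listener API; markStateStale() is a hypothetical hook standing in
for whatever stale-state handling the application needs:

import java.util.List;

import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.Merged;
import org.infinispan.notifications.cachemanagerlistener.event.MergeEvent;
import org.infinispan.remoting.transport.Address;

@Listener
public class StaleStateMergeListener {

   @Merged
   public void onMerge(MergeEvent event) {
      // Each inner list is one subgroup that has just been merged back.
      for (List<Address> subgroup : event.getSubgroupsMerged()) {
         // The [subgroups=N-1, 1] variety: a singleton rejoining the
         // majority. Its state has to be treated as stale.
         if (subgroup.size() == 1)
            markStateStale(subgroup.get(0));
      }
   }

   // Hypothetical application hook: refresh or invalidate whatever the
   // rejoined node may have served while it was cut off.
   private void markStateStale(Address rejoined) {
      System.out.println("Treating state from " + rejoined + " as stale");
   }
}

The listener is registered on the manager, e.g.
cacheManager.addListener(new StaleStateMergeListener()).
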
>
> In my experience, partitions are almost never caused by malfunctioning
> hardware, but by GC pauses, high CPU spikes and other blocking behavior
> which causes FD/FD_ALL to falsely suspect a node.
>
>> In this case, it's easy to tell when I might be in a split-brain. The FD
>> protocol will suspect and exclude a node. Currently, though, I have no way
>> of knowing how or why a node was excluded.
>
> We *could* detect graceful leaves... this would narrow the exclusion
> cases to crashed and partitioned nodes.
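
Today all the application can do is diff the two views and treat every
disappearance the same way. A minimal sketch using the @ViewChanged
cache manager notification (class name is illustrative):

import java.util.ArrayList;
import java.util.List;

import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;
import org.infinispan.remoting.transport.Address;

@Listener
public class ExclusionListener {

   @ViewChanged
   public void onViewChange(ViewChangedEvent event) {
      // Members present in the old view but missing from the new one.
      List<Address> excluded = new ArrayList<Address>(event.getOldMembers());
      excluded.removeAll(event.getNewMembers());
      // A crash, a graceful leave and a partition all look identical
      // here: the member is simply gone from the new view.
      for (Address node : excluded)
         System.out.println("Excluded from view: " + node);
   }
}
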
>
>> If the WAN goes down, I have a rather large problem. First off is
>> detection. If there's an ACL blockage, or worse, a unidirectional outage
>> (i.e. east can see west, but not vice-versa), it takes the cluster a minute
>> (really, about 60 seconds) to figure things out.
>
> 1 minute because FD/FD_ALL is configured with a 60s timeout, correct?
>
> I would definitely *not* lower this threshold, as we don't want entire
> sites to be falsely suspected, only to later get merged back.
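
For context, the one minute comes from the failure detection fragment of the
stack; typically something like this (values assumed, along the lines of the
udp.xml defaults):

   <FD_SOCK/>
   <FD_ALL timeout="60000" interval="3000"/>
   <VERIFY_SUSPECT timeout="1500"/>

With timeout="60000", a member that stops sending heartbeats (hung,
partitioned or dead) is only suspected after a full minute, which matches
the delay Erik describes.
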
>
>> One side will see spurious MERGE events; the other side will see leaves triggered by either FD_SOCK or FD.
>
> You're not referring to xsite here, are you? This is your striped
> architecture, where you have a set of TCP-based stripes (clusters)
> *across* sites, right ?
>
> If so, having a cluster spread across sites is challenging, to say the
> least. The risk of partitions is greater than for a purely local
> cluster: the WAN adds the risk of intermediate switches crashing and of
> failure detection messages getting lost or delayed by high latency.
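
The alternative is two independent local clusters bridged by cross-site
replication, with each cache backing up its data to the remote site over
RELAY2. Roughly, assuming the programmatic sites() configuration builder
(a sketch; the site name "WEST" is made up):

import org.infinispan.configuration.cache.BackupConfiguration.BackupStrategy;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class XSiteBackupConfig {
   public static Configuration eastCacheConfig() {
      ConfigurationBuilder builder = new ConfigurationBuilder();
      // The cache runs in a purely local cluster and asynchronously
      // backs up writes to site "WEST" over the RELAY2 bridge.
      builder.sites().addBackup()
             .site("WEST")
             .strategy(BackupStrategy.ASYNC);
      return builder.build();
   }
}
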
Erik, why not use xsite in this deployment? Lack of state transfer, perhaps?
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)