[JBoss JIRA] (ISPN-5290) Better automatic merge for caches with enabled partition handling

Wednesday, 11 March 2015

Wolf-Dieter Fink created ISPN-5290:
--------------------------------------

             Summary: Better automatic merge for caches with enabled partition handling
                 Key: ISPN-5290
                 URL: https://issues.jboss.org/browse/ISPN-5290
             Project: Infinispan
          Issue Type: Feature Request
         Environment: JDG cluster with partitionHandling enabled
            Reporter: Wolf-Dieter Fink

At the moment there is no detection of whether a node that joins a cluster is one of the
nodes known from the "last stable view" or not.
The drawback is that the cluster will remain in DEGRADED_MODE if some nodes were
restarted during the split-brain.

Assume the cluster split is caused by a power failure of some nodes; the remaining nodes are
DEGRADED because >=numOwners nodes are lost.
If the failed nodes are restarted (say, for an application that uses library
mode in EAP), these instances are identified as new nodes because their node IDs are
different.
When these nodes join the 'cluster', all nodes remain degraded because the
restarted nodes are treated as different nodes rather than as the lost ones, so the cluster
will not heal and return to AVAILABLE.
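
The problem described above can be illustrated with a minimal sketch. It is not
Infinispan code: the class name StableViewCheck, its methods, and the use of plain
strings as node IDs are all assumptions made for illustration. It only shows why a
restarted node with a fresh ID never satisfies a "last stable view" check, while a
node that keeps its ID does.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Infinispan: compares the current cluster
// view with the last stable view using node IDs represented as strings.
public class StableViewCheck {

    /** Returns the members of the last stable view that are absent from the current view. */
    static Set<String> missingMembers(Set<String> lastStableView, Set<String> currentView) {
        Set<String> missing = new HashSet<>(lastStableView);
        missing.removeAll(currentView);
        return missing;
    }

    /** True only if every member of the last stable view is present again. */
    static boolean allStableMembersBack(Set<String> lastStableView, Set<String> currentView) {
        return currentView.containsAll(lastStableView);
    }

    public static void main(String[] args) {
        Set<String> lastStable = Set.of("A", "B", "C", "D");

        // C and D were restarted and came back with new IDs E and F:
        // the check fails, so the cluster would stay DEGRADED.
        System.out.println(allStableMembersBack(lastStable, Set.of("A", "B", "E", "F")));

        // C and D were restarted but kept their IDs: the check succeeds.
        System.out.println(allStableMembersBack(lastStable, Set.of("A", "B", "C", "D")));
    }
}
```

With restart-stable node IDs the second case applies, which is the precondition for
the enhancements proposed in this issue.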

Some of these cases can be prevented by using server hinting to ensure that
at least one owner survives.
But there are other cases where it would be good to have a different strategy for getting
the cluster back to AVAILABLE mode.

During the split-brain there is no way to continue, as there is no possibility of knowing
whether "the other" part is gone or still accessible but not visible.
With a shared persistence it might be possible, but synchronizing through it with locking
and version columns would be a huge drawback in the normal working state.

If the node ID can be kept, I see the following enhancements:

- with a shared persistence there should be no data loss; if all nodes are back in the
cluster it can go AVAILABLE and reload the missing entries
- for a 'side' cache the values are calculated or retrieved from other (slow)
systems, so the cluster can go AVAILABLE and reload the entries
- in other cases there might be a WARNING/ERROR that all members are back from the split
but some data may have been lost, and the cluster is set back to AVAILABLE automatically
or manually

It might be complicated to determine these modes automatically, but a configuration option
for partition-handling could let the administrator decide which
behaviour is appropriate for a cache.

e.g.
<partition-handling enabled="true" healing="HEALING.MODE"/>
where the modes are
AVAILABLE_NO_WARNING          back to AVAILABLE after all nodes from the "last stable" view are back
AVAILABLE_WARNING_DATALOST    same, but log a warning that some data may be lost
WARNING_DATALOST              only log a warning and a hint on how to set AVAILABLE manually
NONE                          same as current behaviour (if necessary; maybe WARNING_DATALOST is similar or better)
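
A rough sketch of how the proposed modes could map to behaviour once all nodes from
the "last stable view" are back. Nothing here exists in Infinispan: the enum values
are taken from the proposal above, while the HealingSketch class, the Outcome type,
and the method name are hypothetical and only illustrate the intended decision.

```java
// Hypothetical sketch of the proposed "healing" attribute; not an Infinispan API.
public class HealingSketch {

    /** Mode names as proposed in this issue. */
    enum HealingMode {
        AVAILABLE_NO_WARNING,
        AVAILABLE_WARNING_DATALOST,
        WARNING_DATALOST,
        NONE
    }

    /** What the cache would do after all last-stable members have rejoined. */
    static final class Outcome {
        final boolean becomeAvailable; // switch back to AVAILABLE automatically
        final boolean logWarning;      // warn that some data may have been lost

        Outcome(boolean becomeAvailable, boolean logWarning) {
            this.becomeAvailable = becomeAvailable;
            this.logWarning = logWarning;
        }
    }

    /** Decision taken when every member of the last stable view is back. */
    static Outcome onAllStableMembersBack(HealingMode mode) {
        switch (mode) {
            case AVAILABLE_NO_WARNING:
                return new Outcome(true, false);
            case AVAILABLE_WARNING_DATALOST:
                return new Outcome(true, true);
            case WARNING_DATALOST:
                // Stay DEGRADED; the administrator must switch back manually.
                return new Outcome(false, true);
            default:
                // NONE: current behaviour, the cache stays DEGRADED silently.
                return new Outcome(false, false);
        }
    }

    public static void main(String[] args) {
        Outcome o = onAllStableMembersBack(HealingMode.AVAILABLE_WARNING_DATALOST);
        System.out.println(o.becomeAvailable + " " + o.logWarning);
    }
}
```

Keeping the decision in one per-cache setting matches the idea that the administrator,
not the cluster, judges whether possible data loss is acceptable for a given cache.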

--
This message was sent by Atlassian JIRA
(v6.3.11#6341)
