Wolf-Dieter Fink created ISPN-5290:
--------------------------------------
Summary: Better automatic merge for caches with enabled partition handling
Key: ISPN-5290
URL: https://issues.jboss.org/browse/ISPN-5290
Project: Infinispan
Issue Type: Feature Request
Environment: JDG cluster with partitionHandling enabled
Reporter: Wolf-Dieter Fink
At the moment there is no detection of whether a node which joins a cluster is one of the
nodes known from the "last stable view" or not.
The drawback is that the cluster will still be in DEGRADED_MODE if some nodes were
restarted during the split-brain.
Assuming the cluster split is caused by a power failure of some nodes, the remaining nodes
are DEGRADED as >=numOwners nodes are lost.
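To make the rule concrete, here is a minimal sketch of the check as described above: a
partition counts as degraded once at least numOwners members of the last stable view are
missing. DegradedCheck and its local AvailabilityMode enum are hypothetical names for
illustration only, not the Infinispan implementation, which may take further conditions
into account.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not Infinispan API: a simplified form of the rule described above. */
final class DegradedCheck {

    /** Local stand-in for the two availability states discussed in this issue. */
    enum AvailabilityMode { AVAILABLE, DEGRADED_MODE }

    /**
     * Returns DEGRADED_MODE when at least numOwners members of the last
     * stable view are missing from the current view, AVAILABLE otherwise.
     */
    static AvailabilityMode evaluate(Set<String> lastStableView,
                                     Set<String> currentView,
                                     int numOwners) {
        // Members that were part of the last stable view but are gone now.
        Set<String> lost = new HashSet<>(lastStableView);
        lost.removeAll(currentView);
        return lost.size() >= numOwners
                ? AvailabilityMode.DEGRADED_MODE
                : AvailabilityMode.AVAILABLE;
    }
}
```

With a last stable view of four nodes and numOwners=2, losing one node keeps this sketch
AVAILABLE, while losing two nodes switches it to DEGRADED_MODE.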
If the failed nodes are restarted, let's say for an application which uses library
mode in EAP, these instances are now identified as new nodes as their node IDs are
different.
If these nodes join the 'cluster', all the nodes are still degraded as the
restarted nodes are now known as different nodes and not as the lost nodes, so the cluster
will not heal and come back to AVAILABLE.
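The effect of the changed node IDs can be illustrated with a small sketch that compares
the last stable view with the current members by ID. RejoinCheck and the node names used
in the note below are hypothetical; this only mirrors the reasoning above and is not how
Infinispan tracks members internally.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not Infinispan API: shows why restarted nodes do not heal the cluster. */
final class RejoinCheck {

    /** Returns the members of the last stable view that are absent from the current view. */
    static Set<String> stillMissing(Set<String> lastStableView, Set<String> currentView) {
        Set<String> missing = new HashSet<>(lastStableView);
        missing.removeAll(currentView);
        return missing;
    }

    /** In this sketch the cluster can only heal when every stable member is seen again. */
    static boolean allStableMembersBack(Set<String> lastStableView, Set<String> currentView) {
        return stillMissing(lastStableView, currentView).isEmpty();
    }
}
```

If the last stable view is {A, B, C, D} and C and D come back as E and F, C and D are
still reported as missing, so the lost owners never appear to return. If C and D keep
their IDs across the restart, nothing is missing and the cluster could be healed.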
There is a way to prevent some of these cases by using server hinting to ensure that
at least one owner will survive.
But there are other cases where it would be good to have a different strategy to get the
cluster back to AVAILABLE mode.
During the split-brain there is no way to continue as there is no possibility to know
whether "the other" part is gone or still accessible but not seen.
With a shared persistence it might be possible, but synchronizing it with locking and
version columns would be a huge drawback for the normal working state.
If the node ID can be kept I see the following enhancements:
- with a shared persistence there should be no data lost; if all nodes are back in the
cluster it can go AVAILABLE and reload the missing entries
- for a 'side' cache the values are calculated or retrieved from other (slow)
systems, so the cluster can be AVAILABLE and reload the entries
- in other cases there might be a WARNING/ERROR that all members are back from the split
and some data may be lost, and the cluster is automatically or manually set back to
AVAILABLE
It might be complicated to calculate these modes, but a configuration for
partition-handling might give the administrator the possibility to decide which
behaviour is appropriate for a cache
e.g.
<partition-handling enabled="true" healing="HEALING.MODE"/>
where the modes are
  AVAILABLE_NO_WARNING        back to AVAILABLE after all nodes from the "last stable"
                              view are back
  AVAILABLE_WARNING_DATALOST  as above, but log a warning that some data may be lost
  WARNING_DATALOST            only log a warning and a hint on how to set the cache back
                              to AVAILABLE manually
  NONE                        same as current behaviour (if necessary; maybe
                              WARNING_DATALOST is similar or better)
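As a rough sketch of how such a setting could be interpreted once all nodes from the
"last stable" view are back: the HealingMode enum uses the mode names proposed above, and
HealingDecision is a hypothetical helper. Neither the "healing" attribute nor these
classes exist in Infinispan; the warning texts are placeholders.

```java
import java.util.Optional;

/** Hypothetical healing modes, named after the proposal above. */
enum HealingMode {
    AVAILABLE_NO_WARNING,
    AVAILABLE_WARNING_DATALOST,
    WARNING_DATALOST,
    NONE
}

/** Hypothetical result: whether to switch back to AVAILABLE and what to log, if anything. */
final class HealingDecision {
    final boolean becomeAvailable;
    final Optional<String> warning;

    private HealingDecision(boolean becomeAvailable, Optional<String> warning) {
        this.becomeAvailable = becomeAvailable;
        this.warning = warning;
    }

    /** Called in this sketch once every member of the last stable view has rejoined. */
    static HealingDecision onAllStableMembersBack(HealingMode mode) {
        switch (mode) {
            case AVAILABLE_NO_WARNING:
                return new HealingDecision(true, Optional.empty());
            case AVAILABLE_WARNING_DATALOST:
                return new HealingDecision(true,
                        Optional.of("All members are back; some data may have been lost"));
            case WARNING_DATALOST:
                return new HealingDecision(false,
                        Optional.of("All members are back; some data may have been lost. "
                                + "Set the cache back to AVAILABLE manually"));
            case NONE:
            default:
                // Current behaviour: stay degraded and log nothing extra.
                return new HealingDecision(false, Optional.empty());
        }
    }
}
```

Keeping the decision in one place like this would let each cache pick its own trade-off
between automatic recovery and explicit operator confirmation.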