Dan Berindei commented on ISPN-7800:
------------------------------------
The {{CacheNotFoundResponse}} is expected. Initially I thought there was a problem with
the retry logic in the cluster recovery process, but now I think this is a duplicate of
ISPN-5290.
The nodes are restarted with different JGroups addresses, so the number of running nodes
matching the latest stable cache topology stays the same, and the cache stays in degraded
mode.
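The effect described above can be illustrated with a minimal sketch (hypothetical names, not the actual Infinispan implementation): availability is only restored once a majority of the members of the last stable topology are running again, and members are compared by JGroups address. Restarted nodes join with brand-new addresses, so they never match the old member list and the matching count never grows.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the availability check described above.
// A partition leaves degraded mode only when a majority of the
// *last stable topology's* members are present; because restarted
// nodes rejoin with fresh JGroups addresses, they never match.
public class AvailabilityCheck {
    static boolean canRestoreAvailability(List<String> stableMembers,
                                          List<String> currentMembers) {
        Set<String> current = new HashSet<>(currentMembers);
        long matching = stableMembers.stream().filter(current::contains).count();
        return matching > stableMembers.size() / 2; // simple majority
    }

    public static void main(String[] args) {
        List<String> stable = List.of("node01-47572", "node02-32959", "node03-48579");
        // node01 and node02 were killed and restarted; the address suffixes
        // below are invented to show they no longer match the stable list:
        List<String> current = List.of("node01-51123", "node02-40871", "node03-48579");
        System.out.println(canRestoreAvailability(stable, current)); // only 1 of 3 matches
    }
}
```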
Cluster always in Degraded Mode
-------------------------------
Key: ISPN-7800
URL:
https://issues.jboss.org/browse/ISPN-7800
Project: Infinispan
Issue Type: Bug
Affects Versions: 8.2.6.Final, 9.0.0.Final
Reporter: Pedro Ruivo
Scenario:
* 3 nodes, server mode with Partition handling enabled
* 2 nodes are killed and brought back online
* the nodes are unable to merge and the cluster remains in degraded mode.
I suspect that the FORK channel/protocol is the culprit since the heartbeat command is
never handled in the joiner node, but the coordinator receives a {{CacheNotFoundResponse}}
quickly (i.e. without timeout). The request is received and "delivered" but
never reaches Infinispan.
When starting node 1 (logs from coordinator):
{code}
Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
//heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
Responses: value=CacheNotFoundResponse, received=true, suspected=false
Node node01-47572 left while updating cache members
//the view is not handled
{code}
When starting node 2:
{code}
Received new cluster view: 6, isCoordinator = true, old status = COORDINATOR
Updating cluster members for all the caches. New list is [node03-48579, node01-47572,
node02-32959]
//heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
Responses: Responses{
node01-47572: value=SuccessfulResponse{responseValue=true} , received=true,
suspected=false
node02-32959: value=CacheNotFoundResponse, received=true, suspected=false}
Node node02-32959 left while updating cache members
//the view is not handled
{code}
It is always reproducible. The configuration is:
{code:xml}
<replicated-cache name="default" mode="SYNC" batching="true">
   <partition-handling enabled="true"/>
   <locking isolation="REPEATABLE_READ"/>
   <state-transfer enabled="false"/>
</replicated-cache>
{code}