Dan Berindei closed ISPN-7800.
------------------------------
Resolution: Duplicate Issue
Duplicate of ISPN-5290.
I also created ISPN-9112 so that servers do not wait for the initial transfer by default and can restart without the timeout error.
Cluster always in Degraded Mode
-------------------------------
Key: ISPN-7800
URL: https://issues.jboss.org/browse/ISPN-7800
Project: Infinispan
Issue Type: Bug
Affects Versions: 8.2.6.Final, 9.0.0.Final
Reporter: Pedro Ruivo
Scenario:
* 3 nodes, server mode with Partition handling enabled
* 2 nodes are killed and brought back online
* the nodes are unable to merge and the cluster remains in degraded mode.
I suspect the FORK channel/protocol is the culprit: the heartbeat command is never handled on the joining node, yet the coordinator receives a {{CacheNotFoundResponse}} quickly (i.e. without a timeout). The request is received and "delivered" but never reaches Infinispan.
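To make the suspected failure mode concrete, here is a minimal, hypothetical Java model of the joiner side. It is not Infinispan or JGroups code: `ResponderModel`, `Reply`, `register` and `deliver` are invented names. The key assumption (inferred from the observed behaviour, not from the source) is that a delivered command with no registered handler is answered immediately with a "not found" reply instead of being held until the handler exists.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical, simplified model of the joining node; NOT Infinispan/JGroups code.
// Incoming commands are dispatched by a target name. If nothing is registered
// for that target yet, the node replies at once with CACHE_NOT_FOUND, which
// would explain why the coordinator gets its answer without any timeout.
public class ResponderModel {
    public enum Reply { SUCCESSFUL_TRUE, CACHE_NOT_FOUND }

    private final Map<String, Function<String, Reply>> handlers = new HashMap<>();

    // Called once the component above the channel is wired up (assumption).
    public void register(String target, Function<String, Reply> handler) {
        handlers.put(target, handler);
    }

    // The message has been delivered to the node; the question is whether
    // anything above the channel actually handles it.
    public Reply deliver(String target, String command) {
        Function<String, Reply> handler = handlers.get(target);
        if (handler == null) {
            // Delivered but never reaches the handler: immediate "not found".
            return Reply.CACHE_NOT_FOUND;
        }
        return handler.apply(command);
    }

    public static void main(String[] args) {
        ResponderModel joiner = new ResponderModel();
        // Heartbeat arrives before the handler is registered.
        System.out.println(joiner.deliver("heartbeat-target", "HEARTBEAT")); // CACHE_NOT_FOUND
        joiner.register("heartbeat-target", cmd -> Reply.SUCCESSFUL_TRUE);
        System.out.println(joiner.deliver("heartbeat-target", "HEARTBEAT")); // SUCCESSFUL_TRUE
    }
}
```

If that assumption holds, the quick {{CacheNotFoundResponse}} would correspond to the window between message delivery and handler registration on the joiner.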
When starting node 1 (logs from the coordinator):
{code}
Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
//heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
Responses: value=CacheNotFoundResponse, received=true, suspected=false
Node node01-47572 left while updating cache members
//the view is not handled
{code}
When starting node 2 (logs from the coordinator):
{code}
Received new cluster view: 6, isCoordinator = true, old status = COORDINATOR
Updating cluster members for all the caches. New list is [node03-48579, node01-47572, node02-32959]
//heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
Responses: Responses{
  node01-47572: value=SuccessfulResponse{responseValue=true} , received=true, suspected=false
  node02-32959: value=CacheNotFoundResponse, received=true, suspected=false}
Node node02-32959 left while updating cache members
//the view is not handled
{code}
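The coordinator-side decision visible in the two log excerpts can be sketched as follows. This is a simplified model reconstructed from the log output only, not the real logic of {{ClusterTopologyManagerImpl.confirmMembersAvailable()}}; `CoordinatorModel`, `Reply` and `memberTreatedAsLeft` are hypothetical names. It assumes a single {{CacheNotFoundResponse}} is enough for the coordinator to treat that member as gone and abandon the view update.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model of the coordinator's heartbeat check as seen
// in the logs; NOT ClusterTopologyManagerImpl's actual implementation.
public class CoordinatorModel {
    public enum Reply { SUCCESSFUL_TRUE, CACHE_NOT_FOUND }

    // Returns the first member whose heartbeat reply is CACHE_NOT_FOUND, i.e.
    // the member treated as having left; empty if every member confirmed.
    public static Optional<String> memberTreatedAsLeft(Map<String, Reply> responses) {
        for (Map.Entry<String, Reply> e : responses.entrySet()) {
            if (e.getValue() == Reply.CACHE_NOT_FOUND) {
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Responses recorded when node 2 joined (view 6).
        Map<String, Reply> responses = new LinkedHashMap<>();
        responses.put("node01-47572", Reply.SUCCESSFUL_TRUE);
        responses.put("node02-32959", Reply.CACHE_NOT_FOUND);

        Optional<String> left = memberTreatedAsLeft(responses);
        if (left.isPresent()) {
            // The view is not handled: cache members are never updated.
            System.out.println("Node " + left.get() + " left while updating cache members");
        } else {
            System.out.println("All members confirmed; updating cache members");
        }
    }
}
```

Run against the responses recorded for view 6, the model prints `Node node02-32959 left while updating cache members`, matching the log. If the joiner answers the same way for every new view, the update can never succeed, which would be consistent with the cluster staying in degraded mode.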
It is always reproducible. The configuration is:
{code:xml}
<replicated-cache name="default" mode="SYNC" batching="true">
    <partition-handling enabled="true"/>
    <locking isolation="REPEATABLE_READ"/>
    <state-transfer enabled="false"/>
</replicated-cache>
{code}