[infinispan-issues] [JBoss JIRA] (ISPN-7800) Cluster always in Degraded Mode

Thu May 4 10:11:00 EDT 2017

Pedro Ruivo created ISPN-7800:
---------------------------------

             Summary: Cluster always in Degraded Mode
                 Key: ISPN-7800
                 URL: https://issues.jboss.org/browse/ISPN-7800
             Project: Infinispan
          Issue Type: Bug
    Affects Versions: 9.0.0.Final, 8.2.6.Final
            Reporter: Pedro Ruivo

Scenario:

* 3 nodes, server mode with Partition handling enabled
* 2 nodes are killed and brought back online
* the nodes are unable to merge and the cluster remains in degraded mode.

I suspect that the FORK channel/protocol is the culprit since the heartbeat command is never handled in the joiner node, but the coordinator receives a {{CacheNotFoundResponse}} quickly (i.e. without timeout). The request is received and "delivered" but never reaches Infinispan.

When I started node 1 (logs from the coordinator):

{code}
Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
//heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
Responses: value=CacheNotFoundResponse, received=true, suspected=false
Node node01-47572 left while updating cache members
//the view is not handled
{code}

When I started node 2:

{code}
Received new cluster view: 6, isCoordinator = true, old status = COORDINATOR
Updating cluster members for all the caches. New list is [node03-48579, node01-47572, node02-32959]
//heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
Responses: Responses{
  node01-47572: value=SuccessfulResponse{responseValue=true} , received=true, suspected=false
  node02-32959: value=CacheNotFoundResponse, received=true, suspected=false}
Node node02-32959 left while updating cache members
//the view is not handled
{code}
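
The coordinator's behaviour in the two logs above can be modelled with a minimal sketch. This is not Infinispan's implementation: {{HeartbeatModel}}, {{Response}}, {{firstUnconfirmed}} and {{handleView}} are hypothetical names, and the rule "any non-successful heartbeat response aborts the view update" is an assumption inferred only from the log output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of the behaviour seen in the logs; NOT Infinispan code.
public class HeartbeatModel {

    // Hypothetical response kinds, named after the values printed in the logs.
    enum Response { SUCCESSFUL, CACHE_NOT_FOUND }

    // Returns the first member whose heartbeat was not confirmed, or null if
    // every member answered successfully.
    static String firstUnconfirmed(Map<String, Response> responses) {
        for (Map.Entry<String, Response> entry : responses.entrySet()) {
            if (entry.getValue() != Response.SUCCESSFUL) {
                return entry.getKey();
            }
        }
        return null;
    }

    // Assumed rule: a single unconfirmed member is treated as having left,
    // and the new view is then not handled.
    static String handleView(int viewId, Map<String, Response> responses) {
        String missing = firstUnconfirmed(responses);
        if (missing != null) {
            return "Node " + missing + " left while updating cache members";
        }
        return "View " + viewId + " handled";
    }

    public static void main(String[] args) {
        // Heartbeat responses as reported for cluster view 6.
        Map<String, Response> responses = new LinkedHashMap<>();
        responses.put("node01-47572", Response.SUCCESSFUL);
        responses.put("node02-32959", Response.CACHE_NOT_FOUND);
        System.out.println(handleView(6, responses));
    }
}
```

Under that assumption, an immediate {{CacheNotFoundResponse}} from a joiner is indistinguishable from the joiner leaving, which would be consistent with the view never being handled and the cluster staying in degraded mode.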

The problem is always reproducible. The configuration is:
{code:xml}
<replicated-cache name="default" mode="SYNC" batching="true">
  <partition-handling enabled="true"/>
  <locking isolation="REPEATABLE_READ"/>
  <state-transfer enabled="false"/>
</replicated-cache>
{code}

--
This message was sent by Atlassian JIRA
(v7.2.3#72005)