[infinispan-issues] [JBoss JIRA] (ISPN-7800) Cluster always in Degraded Mode

Mon May 8 11:47:00 EDT 2017

    [ https://issues.jboss.org/browse/ISPN-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403081#comment-13403081 ] 

Dan Berindei commented on ISPN-7800:
------------------------------------

The {{CacheNotFoundResponse}} is expected. Initially I thought there was a problem with the retry logic in the cluster recovery process, but now I think this is a duplicate of ISPN-5290.

The nodes are restarted with different JGroups addresses, so they do not count as members of the latest stable cache topology. The number of running nodes matching that topology therefore stays the same after the restart, and the cache stays in degraded mode.
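
As a rough illustration of that reasoning (a minimal sketch, not Infinispan's actual implementation): assume the cache may only leave degraded mode when a strict majority of the latest stable topology's members are present in the current view, and that members are compared by address. The class name, the method names and the two "old" addresses below are hypothetical; the three current addresses are the ones from the coordinator log in the issue description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not Infinispan code.
class StableTopologyCheck {

    // Counts the members of the latest stable topology that are still in the current view.
    static int survivingStableMembers(List<String> stableMembers, List<String> currentView) {
        Set<String> view = new HashSet<>(currentView);
        int count = 0;
        for (String member : stableMembers) {
            if (view.contains(member)) {
                count++;
            }
        }
        return count;
    }

    // Assumed rule: a strict majority of the stable members must be present.
    static boolean hasMajority(List<String> stableMembers, List<String> currentView) {
        return survivingStableMembers(stableMembers, currentView) * 2 > stableMembers.size();
    }

    public static void main(String[] args) {
        // The addresses of node01 and node02 before the restart are made up for this example.
        List<String> stable = Arrays.asList("node03-48579", "node01-11111", "node02-22222");
        // Current view after both nodes were restarted (addresses from the coordinator log).
        List<String> view = Arrays.asList("node03-48579", "node01-47572", "node02-32959");

        System.out.println("surviving=" + survivingStableMembers(stable, view)
                + " majority=" + hasMajority(stable, view));
    }
}
```

Under these assumptions only node03 matches the stable topology, so the view of three running nodes still counts as one stable member out of three.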

> Cluster always in Degraded Mode
> -------------------------------
>
>                 Key: ISPN-7800
>                 URL: https://issues.jboss.org/browse/ISPN-7800
>             Project: Infinispan
>          Issue Type: Bug
>    Affects Versions: 8.2.6.Final, 9.0.0.Final
>            Reporter: Pedro Ruivo
>
> Scenario:
> * 3 nodes, server mode with Partition handling enabled
> * 2 nodes are killed and brought back online
> * the nodes are unable to merge and the cluster remains in degraded mode.
> I suspect that the FORK channel/protocol is the culprit since the heartbeat command is never handled in the joiner node, but the coordinator receives a {{CacheNotFoundResponse}} quickly (i.e. without timeout). The request is received and "delivered" but never reaches Infinispan.
> When starting node 1 (logs from coordinator):
> {code}
> Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
> Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
> //heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
> Responses: value=CacheNotFoundResponse, received=true, suspected=false
> Node node01-47572 left while updating cache members
> //the view is not handled
> {code}
> When I started node 2:
> {code}
> Received new cluster view: 6, isCoordinator = true, old status = COORDINATOR
> Updating cluster members for all the caches. New list is [node03-48579, node01-47572, node02-32959]
> //heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
> Responses: Responses{
>   node01-47572: value=SuccessfulResponse{responseValue=true} , received=true, suspected=false
>   node02-32959: value=CacheNotFoundResponse, received=true, suspected=false}
> Node node02-32959 left while updating cache members
> //the view is not handled
> {code}
> It is always reproducible. The configuration is
> {code:xml}
> <replicated-cache name="default" mode="SYNC" batching="true">
>   <partition-handling enabled="true"/>
>   <locking isolation="REPEATABLE_READ"/>
>   <state-transfer enabled="false"/>
> </replicated-cache>
> {code}
