[infinispan-issues] [JBoss JIRA] (ISPN-7800) Cluster always in Degraded Mode

Thu May 4 10:14:00 EDT 2017

     [ https://issues.jboss.org/browse/ISPN-7800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pedro Ruivo updated ISPN-7800:
------------------------------
    Forum Reference: https://developer.jboss.org/message/971397#971397


> Cluster always in Degraded Mode
> -------------------------------
>
>                 Key: ISPN-7800
>                 URL: https://issues.jboss.org/browse/ISPN-7800
>             Project: Infinispan
>          Issue Type: Bug
>    Affects Versions: 8.2.6.Final, 9.0.0.Final
>            Reporter: Pedro Ruivo
>
> Scenario:
> * 3 nodes, server mode with Partition handling enabled
> * 2 nodes are killed and brought back online
> * the nodes are unable to merge and the cluster remains in degraded mode.
> I suspect that the FORK channel/protocol is the culprit since the heartbeat command is never handled in the joiner node, but the coordinator receives a {{CacheNotFoundResponse}} quickly (i.e. without timeout). The request is received and "delivered" but never reaches Infinispan.
> When starting node 1 (logs from coordinator):
> {code}
> Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
> Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
> //heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
> Responses: value=CacheNotFoundResponse, received=true, suspected=false
> Node node01-47572 left while updating cache members
> //the view is not handled
> {code}
> When I started node 2:
> {code}
> Received new cluster view: 6, isCoordinator = true, old status = COORDINATOR
> Updating cluster members for all the caches. New list is [node03-48579, node01-47572, node02-32959]
> //heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
> Responses: Responses{
>   node01-47572: value=SuccessfulResponse{responseValue=true} , received=true, suspected=false
>   node02-32959: value=CacheNotFoundResponse, received=true, suspected=false}
> Node node02-32959 left while updating cache members
> //the view is not handled
> {code}
> It is always reproducible. The configuration is
> {code:xml}
> <replicated-cache name="default" mode="SYNC" batching="true">
>   <partition-handling enabled="true"/>
>   <locking isolation="REPEATABLE_READ"/>
>   <state-transfer enabled="false"/>
> {code}
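
For readers following the logs above, the outcome can be illustrated with a small stand-alone model. This is only a sketch under assumptions: HeartbeatModel, Reply, confirmedMembers and treatedAsLeft are hypothetical names invented for illustration and are not Infinispan classes or methods. The only behaviour taken from the report is that a node answering the heartbeat with CacheNotFoundResponse is then logged as having left, while a node answering with a successful response is not.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the coordinator-side outcome seen in the logs.
// None of these types exist in Infinispan; they only mirror the reported behaviour.
final class HeartbeatModel {

    // The two heartbeat replies that appear in the reported logs.
    enum Reply { SUCCESSFUL, CACHE_NOT_FOUND }

    // Members whose heartbeat reply was successful.
    static List<String> confirmedMembers(Map<String, Reply> replies) {
        return select(replies, Reply.SUCCESSFUL);
    }

    // Members whose reply was "cache not found"; per the logs, the coordinator
    // reports these as having left while updating cache members.
    static List<String> treatedAsLeft(Map<String, Reply> replies) {
        return select(replies, Reply.CACHE_NOT_FOUND);
    }

    // Collects, in iteration order, the nodes whose reply equals the wanted one.
    private static List<String> select(Map<String, Reply> replies, Reply wanted) {
        List<String> nodes = new ArrayList<>();
        for (Map.Entry<String, Reply> entry : replies.entrySet()) {
            if (entry.getValue() == wanted) {
                nodes.add(entry.getKey());
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Replies taken from the second log excerpt (cluster view 6).
        Map<String, Reply> replies = new LinkedHashMap<>();
        replies.put("node01-47572", Reply.SUCCESSFUL);
        replies.put("node02-32959", Reply.CACHE_NOT_FOUND);

        System.out.println("confirmed: " + confirmedMembers(replies));
        System.out.println("treated as left: " + treatedAsLeft(replies));
    }
}
```

In this model the last node to join is never confirmed, which matches the reported symptom that the cluster cannot leave degraded mode. Why the heartbeat never reaches Infinispan on the joiner is not modelled here.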

