Dan Berindei commented on ISPN-7800:
------------------------------------
The {{CacheNotFoundResponse}} is expected. Initially I thought there was a problem with
the retry logic in the cluster recovery process, but now I think this is a duplicate of
ISPN-5290.
The nodes are restarted with different JGroups addresses, so the number of running nodes
matching the latest stable cache topology stays the same, and the cache stays in degraded
mode.
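The effect described above can be illustrated with a minimal sketch (hypothetical names, not the actual Infinispan implementation): availability is only restored once a majority of the members of the last stable topology are running again, and members are compared by JGroups address. Restarted nodes join with brand-new addresses, so they never match the old member list and the matching count never grows.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the availability check described above.
// A partition leaves degraded mode only when a majority of the
// *last stable topology's* members are present; because restarted
// nodes rejoin with fresh JGroups addresses, they never match.
public class AvailabilityCheck {
    static boolean canRestoreAvailability(List<String> stableMembers,
                                          List<String> currentMembers) {
        Set<String> current = new HashSet<>(currentMembers);
        long matching = stableMembers.stream().filter(current::contains).count();
        return matching > stableMembers.size() / 2; // simple majority
    }

    public static void main(String[] args) {
        List<String> stable = List.of("node01-47572", "node02-32959", "node03-48579");
        // node01 and node02 were killed and restarted; the address suffixes
        // below are invented to show they no longer match the stable list:
        List<String> current = List.of("node01-51123", "node02-40871", "node03-48579");
        System.out.println(canRestoreAvailability(stable, current)); // only 1 of 3 matches
    }
}
```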
Cluster always in Degraded Mode
-------------------------------
Key: ISPN-7800
URL:
https://issues.jboss.org/browse/ISPN-7800
Project: Infinispan
Issue Type: Bug
Affects Versions: 8.2.6.Final, 9.0.0.Final
Reporter: Pedro Ruivo
Scenario:
* 3 nodes, server mode with Partition handling enabled
* 2 nodes are killed and brought back online
* the nodes are unable to merge and the cluster remains in degraded mode.
I suspect that the FORK channel/protocol is the culprit since the heartbeat command is
never handled in the joiner node, but the coordinator receives a {{CacheNotFoundResponse}}
quickly (i.e. without timeout). The request is received and "delivered" but
never reaches Infinispan.
When starting node 1 (logs from coordinator):
{code}
Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
Received new cluster view: 5, isCoordinator = true, old status = COORDINATOR
//heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
Responses: value=CacheNotFoundResponse, received=true, suspected=false
Node node01-47572 left while updating cache members
//the view is not handled
{code}
When starting node 2:
{code}
Received new cluster view: 6, isCoordinator = true, old status = COORDINATOR
Updating cluster members for all the caches. New list is [node03-48579, node01-47572,
node02-32959]
//heartbeat sent, ClusterTopologyManagerImpl.confirmMembersAvailable();
Responses: Responses{
node01-47572: value=SuccessfulResponse{responseValue=true} , received=true,
suspected=false
node02-32959: value=CacheNotFoundResponse, received=true, suspected=false}
Node node02-32959 left while updating cache members
//the view is not handled
{code}
It is always reproducible. The configuration is:
{code:xml}
<replicated-cache name="default" mode="SYNC" batching="true">
   <partition-handling enabled="true"/>
   <locking isolation="REPEATABLE_READ"/>
   <state-transfer enabled="false"/>
</replicated-cache>
{code}