[infinispan-issues] [JBoss JIRA] (ISPN-8713) "Initial state transfer timed out" in border case

Mon Jan 22 06:25:00 EST 2018

Walter Pongratz created ISPN-8713:
-------------------------------------

             Summary: "Initial state transfer timed out" in border case
                 Key: ISPN-8713
                 URL: https://issues.jboss.org/browse/ISPN-8713
             Project: Infinispan
          Issue Type: Bug
          Components: Core
    Affects Versions: 9.1.4.Final
            Reporter: Walter Pongratz
            Priority: Critical

There is a bug in Infinispan 9.1.4 in a cluster where not every cache is started on every node. For detailed steps to reproduce see below; in short: if a cache is not started on the coordinator and the last node running that cache leaves non-gracefully, the cache cannot be started again until the coordinator changes.

Known aspects:
* In Infinispan 8.2.6 the same steps do NOT lead to an issue

Likely cause: when the last node running a given cache exits non-gracefully, the cache is not removed from the cacheStatusMap in ClusterTopologyManagerImpl on the coordinator; the ClusterCacheStatus is only updated (see the TODO in ClusterCacheStatus.updateCurrentTopology()). On a graceful exit the entry is removed (see ClusterTopologyManagerImpl.handleLeave()).

A new node that then starts the cache in question somehow waits for the initial state transfer, which never happens because no other node runs this cache. In Infinispan 8.2.6 this did not seem to be a problem, but in 9.1.4 it is. The fix would be either to resolve this TODO and remove the ClusterCacheStatus from the map, or to make sure the new node does not wait for the initial state transfer in this case.
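
To illustrate the suspected mechanism, here is a minimal toy model in plain Java. It is not Infinispan code: ToyCoordinator, ToyCacheStatus, join, gracefulLeave and crash are hypothetical names, and the rule "an existing map entry makes a joiner wait for state transfer" is an assumption made for illustration, not a confirmed description of ClusterTopologyManagerImpl. The removeWhenEmpty flag models the first proposed fix (dropping the entry once no members remain).

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: none of these classes exist in Infinispan.
public class StaleCacheStatusSketch {

    /** Hypothetical stand-in for the coordinator's per-cache status. */
    static final class ToyCacheStatus {
        final Set<String> members = new HashSet<>();
    }

    /** Hypothetical stand-in for the coordinator's bookkeeping. */
    static final class ToyCoordinator {
        final Map<String, ToyCacheStatus> cacheStatusMap = new ConcurrentHashMap<>();
        final boolean removeWhenEmpty; // true models the proposed fix

        ToyCoordinator(boolean removeWhenEmpty) {
            this.removeWhenEmpty = removeWhenEmpty;
        }

        /** Returns true if the joiner would wait for an initial state transfer. */
        boolean join(String cache, String node) {
            ToyCacheStatus status = cacheStatusMap.get(cache);
            // Assumption: an existing entry means the joiner is not treated as first member.
            boolean mustWait = status != null;
            if (status == null) {
                status = new ToyCacheStatus();
                cacheStatusMap.put(cache, status);
            }
            status.members.add(node);
            return mustWait;
        }

        /** Graceful leave: the entry is dropped once the last member is gone. */
        void gracefulLeave(String cache, String node) {
            ToyCacheStatus status = cacheStatusMap.get(cache);
            if (status == null) return;
            status.members.remove(node);
            if (status.members.isEmpty()) cacheStatusMap.remove(cache);
        }

        /** Non-graceful exit: members are updated, the entry stays unless the fix is on. */
        void crash(String cache, String node) {
            ToyCacheStatus status = cacheStatusMap.get(cache);
            if (status == null) return;
            status.members.remove(node);
            if (removeWhenEmpty && status.members.isEmpty()) cacheStatusMap.remove(cache);
        }
    }

    static boolean rejoinWaitsAfterCrash(boolean removeWhenEmpty) {
        ToyCoordinator coordinator = new ToyCoordinator(removeWhenEmpty);
        coordinator.join("myCache", "nodeB");
        coordinator.crash("myCache", "nodeB");
        return coordinator.join("myCache", "nodeC");
    }

    static boolean rejoinWaitsAfterGracefulLeave() {
        ToyCoordinator coordinator = new ToyCoordinator(false);
        coordinator.join("myCache", "nodeB");
        coordinator.gracefulLeave("myCache", "nodeB");
        return coordinator.join("myCache", "nodeC");
    }

    public static void main(String[] args) {
        System.out.println("waits after crash, no fix:   " + rejoinWaitsAfterCrash(false)); // true
        System.out.println("waits after crash, with fix: " + rejoinWaitsAfterCrash(true));  // false
        System.out.println("waits after graceful leave:  " + rejoinWaitsAfterGracefulLeave()); // false
    }
}
```

In this model only the crash path without cleanup leaves a stale entry behind, which matches the observation that a graceful leave does not trigger the problem.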

I set the priority to Critical because the problem is difficult to fix in a production environment: once it occurs, the cache cannot be started ON ANY NODE OF THE CLUSTER until the coordinator changes. If a new node does not start, operations personnel would assume a problem with that node and try to restart it, which does not help because the stale state lives on the coordinator.

--
This message was sent by Atlassian JIRA
(v7.5.0#75005)