[infinispan-issues] [JBoss JIRA] (ISPN-8713) "Initial state transfer timed out" in border case

Thu Dec 13 06:10:02 EST 2018

     [ https://issues.jboss.org/browse/ISPN-8713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Walter Pongratz closed ISPN-8713.
---------------------------------
    Resolution: Done


Just retested it: in version 9.4.4.Final this issue no longer occurs.

> "Initial state transfer timed out" in border case
> -------------------------------------------------
>
>                 Key: ISPN-8713
>                 URL: https://issues.jboss.org/browse/ISPN-8713
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 9.1.4.Final
>            Reporter: Walter Pongratz
>            Priority: Critical
>
> There is a bug in Infinispan 9.1.4 in clusters where not every cache is started on every node. For detailed steps to reproduce see below; in short: if a cache is not started on the coordinator and the last node running that cache does not leave gracefully, the cache cannot be started again until the coordinator changes.
> Known aspects:
> * In Infinispan 8.2.6 the same steps do NOT lead to an issue
> Likely cause: on a non-graceful exit of the last node running a given cache, the cache is not removed from the cacheStatusMap in ClusterTopologyManagerImpl on the coordinator; the ClusterCacheStatus is only modified (see the TODO in ClusterCacheStatus.updateCurrentTopology()). On a graceful exit the removal does happen; see ClusterTopologyManagerImpl.handleLeave().
> A new node starting the cache in question then waits for the initial state transfer, which never happens because no other node has this cache. In Infinispan 8.2.6 this did not seem to be a problem, but in 9.1.4 it is. The fix would be either to resolve this TODO and remove the ClusterCacheStatus from the map, or to make sure the new node does not wait for initial state transfer in this case.
> I set the priority to Critical because this is hard to fix in a production environment: once the problem occurs, the cache cannot be started ON ANY NODE OF THE CLUSTER until the coordinator changes. If a new node does not start, operations personnel would assume a problem with that node and try to restart it.
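
The suspected mechanism described above can be illustrated with a small toy model. This is only a sketch and not Infinispan's code: the class StaleCacheStatusSketch, its methods, and the rule "a joiner waits for state transfer whenever the coordinator already holds a status entry for the cache" are all assumptions made for illustration. Only the name cacheStatusMap is taken from the report.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of a coordinator tracking cache members; hypothetical, not Infinispan code. */
public class StaleCacheStatusSketch {

    // Stand-in for the coordinator's cacheStatusMap: cache name -> nodes running it.
    private final Map<String, Set<String>> cacheStatusMap = new HashMap<>();

    /**
     * A node starts a cache.
     * Returns true if the joiner would wait for an initial state transfer.
     * Assumption: it waits whenever a status entry for the cache already exists.
     */
    public boolean join(String cache, String node) {
        boolean statusExisted = cacheStatusMap.containsKey(cache);
        cacheStatusMap.computeIfAbsent(cache, k -> new HashSet<>()).add(node);
        return statusExisted;
    }

    /** Graceful leave: the entry is dropped once the last member is gone. */
    public void leaveGracefully(String cache, String node) {
        Set<String> members = cacheStatusMap.get(cache);
        if (members == null) {
            return;
        }
        members.remove(node);
        if (members.isEmpty()) {
            cacheStatusMap.remove(cache);
        }
    }

    /** Non-graceful exit: the member is removed, but the (now empty) entry stays. */
    public void crash(String cache, String node) {
        Set<String> members = cacheStatusMap.get(cache);
        if (members != null) {
            members.remove(node);
        }
    }

    public static void main(String[] args) {
        // Node A is the coordinator and never starts the cache itself.
        StaleCacheStatusSketch graceful = new StaleCacheStatusSketch();
        graceful.join("myCache", "B");
        graceful.leaveGracefully("myCache", "B");
        System.out.println("after graceful leave, joiner waits: "
                + graceful.join("myCache", "C")); // false

        StaleCacheStatusSketch crashed = new StaleCacheStatusSketch();
        crashed.join("myCache", "B");
        crashed.crash("myCache", "B");
        System.out.println("after crash, joiner waits: "
                + crashed.join("myCache", "C")); // true
    }
}
```

In this model the joiner after a crash waits for a transfer that no member can provide, which matches the reported timeout. Either proposed fix maps onto it: drop the empty entry in crash(), or have join() check whether the existing entry still has any members.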


--
This message was sent by Atlassian Jira
(v7.12.1#712002)