[JBoss JIRA] (ISPN-1814) CacheViewsManagerImpl enters an infinite loop if a joining node is killed before installing the initial view

Monday, 30 January 2012

Dan Berindei created ISPN-1814:
----------------------------------

             Summary: CacheViewsManagerImpl enters an infinite loop if a joining node is killed before installing the initial view
                 Key: ISPN-1814
                 URL: https://issues.jboss.org/browse/ISPN-1814
             Project: Infinispan
          Issue Type: Bug
          Components: State transfer
    Affects Versions: 5.1.0.FINAL
            Reporter: Dan Berindei
            Assignee: Dan Berindei
             Fix For: 5.1.1.CR1, 5.1.1.FINAL, 5.2.0.FINAL
         Attachments: short_org.jboss.as.test.clustering.unmanaged.singleton.SingletonTestCase-output.txt

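When a node leaves the cluster gracefully, it is automatically removed from the set of
joiners and from the next cache view. If, however, it leaves without sending a
{{CacheViewControlCommand\{REQUEST_LEAVE\}}}, its departure is handled properly only if it
was part of the last committed view. A joiner that is killed before its initial view is
installed is therefore never dropped, and later cache views keep including it.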
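
The bookkeeping described above can be sketched as follows. This is only a minimal illustration, not the actual {{CacheViewsManagerImpl}} code: the class {{PendingViewTracker}}, its method names, and the {{#old}}/{{#new}} address suffixes are all hypothetical, and the pruning in {{onClusterViewChange}} is an assumed way of handling crashed joiners, not necessarily the fix that was committed.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the joiner/leaver bookkeeping; NOT the real CacheViewsManagerImpl.
class PendingViewTracker {
    private final Set<String> committed = new LinkedHashSet<>(); // last committed cache view
    private final Set<String> joiners = new LinkedHashSet<>();   // nodes waiting to be added
    private final Set<String> leavers = new LinkedHashSet<>();   // nodes to drop from the next view

    PendingViewTracker(Collection<String> committedMembers) {
        committed.addAll(committedMembers);
    }

    void requestJoin(String node) {
        joiners.add(node);
    }

    // Graceful leave: the node is removed from the joiners and from the next view.
    void requestLeave(String node) {
        joiners.remove(node);
        if (committed.contains(node)) {
            leavers.add(node);
        }
    }

    // Assumed crash handling: keep only joiners that the transport still reports,
    // and mark committed members that disappeared as leavers.
    void onClusterViewChange(Collection<String> clusterMembers) {
        joiners.retainAll(clusterMembers);
        for (String node : committed) {
            if (!clusterMembers.contains(node)) {
                leavers.add(node);
            }
        }
    }

    // Members of the next cache view: committed minus leavers, plus joiners.
    List<String> nextViewMembers() {
        List<String> next = new ArrayList<>(committed);
        next.removeAll(leavers);
        for (String node : joiners) {
            if (!next.contains(node)) {
                next.add(node);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        PendingViewTracker tracker = new PendingViewTracker(List.of("node-udp-0"));
        tracker.requestJoin("node-udp-1#old"); // instance that is killed before its first view
        tracker.requestJoin("node-udp-1#new"); // restarted instance
        System.out.println(tracker.nextViewMembers()); // [node-udp-0, node-udp-1#old, node-udp-1#new]

        // The transport eventually reports only the live instance.
        tracker.onClusterViewChange(List.of("node-udp-0", "node-udp-1#new"));
        System.out.println(tracker.nextViewMembers()); // [node-udp-0, node-udp-1#new]
    }
}
```

Without the {{retainAll}} call, the dead joiner would stay in the pending set, so every retry would compute the same 3-node membership and fail again.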

This is visible in the attached log (a simplified version of
https://issues.jboss.org/secure/attachment/12350962/org.jboss.as.test.clu...,
from ISPN-1806).

The test repeatedly kills a node ({{node-udp-1}}) and starts it up again. Somehow JGroups
didn't detect the killed node, and when the node was restarted we received a 3-node view
that lists {{node-udp-1}} twice:

{noformat}
20:22:49,552 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport]
(Incoming-13,null) ISPN000094: Received new cluster view: [node-udp-0/cluster|4]
[node-udp-0/cluster, node-udp-1/cluster, node-udp-1/cluster]
{noformat}

CacheViewsManagerImpl tried to install the new view, but the killed node could not
respond. I believe we got an exception only when the test timed out and stopped the
new {{node-udp-1}}:

{noformat}
20:23:20,170 ERROR [org.infinispan.cacheviews.CacheViewsManagerImpl]
(CacheViewInstaller-1,node-udp-0/cluster) ISPN000172: Failed to prepare view
CacheView{viewId=6, members=[node-udp-0/cluster, node-udp-1/cluster]} for cache  default,
rolling back to view CacheView{viewId=5, members=[node-udp-0/cluster]}:
java.util.concurrent.ExecutionException:
org.infinispan.remoting.transport.jgroups.SuspectException: Suspected member:
node-udp-1/cluster
{noformat}

However, because of the bug in CacheViewsManagerImpl, we kept trying to install a cache
view with 3 nodes:

{noformat}
20:23:20,226 ERROR [org.infinispan.cacheviews.CacheViewsManagerImpl]
(CacheViewInstaller-1,node-udp-0/cluster) ISPN000172: Failed to prepare view
CacheView{viewId=8, members=[node-udp-0/cluster, node-udp-1/cluster, node-udp-1/cluster]}
for cache  default, rolling back to view CacheView{viewId=7,
members=[node-udp-0/cluster]}: java.util.concurrent.ExecutionException:
org.infinispan.remoting.transport.jgroups.SuspectException: One or more nodes have left
the cluster while replicating command CacheViewControlCommand{cache=default,
type=PREPARE_VIEW, sender=node-udp-0/cluster, newViewId=8, newMembers=[node-udp-0/cluster,
node-udp-1/cluster, node-udp-1/cluster], oldViewId=7, oldMembers=[node-udp-0/cluster]}
{noformat}

The test couldn't really stop because it was blocked waiting for a transaction commit,
and the commit command was waiting for the cache view installation to end.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
