[jboss-jira] [JBoss JIRA] Updated: (JGRP-1299) Node does not re-join the cluster after several lost pings

Tuesday, 8 March 2011

     [ https://issues.jboss.org/browse/JGRP-1299?page=com.atlassian.jira.plugin.... ]

Igor M updated JGRP-1299:
-------------------------

    Steps to Reproduce: 
1. Start two apps and let them form a cluster
2. Run this on one of the nodes: pstop {PID} ; sleep 35 ; prun {PID}. I believe that in
production this pause happened on a node that was not the JGroups coordinator.
3. Watch jgroups logs

The sleep duration must exceed the time it takes for a node to become suspected PLUS
the time it spends on the SUSPECT list. In our case each is 15 seconds.
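
The timing rule above can be written out as a small calculation. This is only a sketch of the reporter's rule of thumb, not of JGroups internals: the function names and the 5-second margin are assumptions made for illustration, and the two 15-second values are the ones quoted for this setup.

```python
def min_pause_seconds(time_to_suspect_s, time_on_suspect_list_s):
    """Shortest pause that outlasts both phases, per the rule of thumb above."""
    return time_to_suspect_s + time_on_suspect_list_s


def sleep_seconds(time_to_suspect_s, time_on_suspect_list_s, margin_s=5):
    """Value for N in 'pstop {PID} ; sleep N ; prun {PID}'.

    margin_s is an assumed safety margin, not a value taken from JGroups.
    """
    return min_pause_seconds(time_to_suspect_s, time_on_suspect_list_s) + margin_s


threshold = min_pause_seconds(15, 15)
chosen = sleep_seconds(15, 15)
print(f"threshold={threshold}s, sleep={chosen}s")
```

With 15 seconds for each phase, the threshold is 30 seconds, which is why the reproduction step sleeps for 35.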

  was:
1. Start two apps and let them form a cluster
2. Run this on one of the nodes: pstop {PID} ; sleep 35 ; prun {PID}
3. Watch jgroups logs

The sleep timeout has to be set to exceed the time it takes to have a node suspected PLUS
the time on the SUSPECT list. In our case it is 15 seconds each

...
 Node does not re-join the cluster after several lost pings
 ----------------------------------------------------------

                 Key: JGRP-1299
                 URL: https://issues.jboss.org/browse/JGRP-1299
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 2.6.15
         Environment: Solaris OS 10 & Java 1.5 & 1.6
            Reporter: Igor M
            Assignee: Bela Ban
            Priority: Critical
         Attachments: Node1.log, Node2.log, stacks.xml

 This is what we see in production:
 1. Node 1 does not send pings for 25 seconds
 2. Node 2 notices 6 lost pings (in 15 seconds)
 3. Node 2 starts sending "broadcast SUSPECT"
 4. Node 1 replies to a few of them
 5. Node 2 does not receive the replies until more than 15 seconds after it suspected Node 1
 6. Node 2 removes Node 1 from the view
 7. Node 1 keeps sending "are-you-alive" messages, which Node 2 now discards
 At this point Node 1 believes there are still two nodes in the cluster, while Node 2
sees only itself.
 In the lab we were able to reproduce the problem by stopping Node 1 process:
 pstop {PID} ; sleep 35 ; prun {PID}
 Once the process is resumed, it can never rejoin the cluster.
 The first two lines of Node1.log show a 26-second interval between pings, where it should
have been 2.5 seconds.
 I traced the 26-second delay to a full GC cycle on Node 1. pstop/sleep/prun has
almost the same effect.
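
 The ping arithmetic in the description can be checked with a short sketch. It assumes that one lost ping is counted per ping interval and that a node is suspected after a fixed number of lost pings. The function names are hypothetical, and this is not the JGroups failure-detection code.

```python
def seconds_until_suspected(ping_interval_s, max_lost_pings):
    """Time a silent node stays unsuspected, assuming one lost ping per interval."""
    return ping_interval_s * max_lost_pings


def whole_intervals_in_gap(gap_s, ping_interval_s):
    """Number of complete ping intervals that fit into a silent gap."""
    return int(gap_s // ping_interval_s)


# Values quoted in this report: 2.5 s ping interval, 6 lost pings, 26 s gap.
print(seconds_until_suspected(2.5, 6))
print(whole_intervals_in_gap(26, 2.5))
```

 Under these assumptions, 6 lost pings at 2.5 seconds each match the 15 seconds Node 2 waited before suspecting Node 1, and a 26-second gap spans 10 ping intervals, well past that limit.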
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
