[jboss-jira] [JBoss JIRA] Updated: (JGRP-1299) Node does not re-join the cluster after several lost pings

Tuesday, 8 March 2011

     [ https://issues.jboss.org/browse/JGRP-1299?page=com.atlassian.jira.plugin.... ]

Igor M updated JGRP-1299:
-------------------------

    Description: 
This is what we see in production:

1. Node 1 does not send pings for 25 seconds
2. Node 2 notices 6 lost pings (in 15 seconds)
3. Node 2 starts sending "broadcast SUSPECT"
4. Node 1 replies to a few of them
5. Node 2 does not receive the replies until more than 15 seconds after it suspected Node 1
6. Node 2 removes Node 1 from the view
7. Node 1 keeps sending "are-you-alive" and Node 2 is now discarding them

At this point Node 1 believes there are still two nodes in the cluster, and Node 2 only sees
itself.
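
The timing in steps 1 and 2 can be sanity-checked with a small model. The sketch below is
not JGroups code: the class and method names are hypothetical, and it assumes the detector
simply counts whole ping intervals that pass in silence. The 2.5 second interval, the
6 lost pings and the 25 second pause are the figures from this report.

```java
/** Hypothetical model of a missed-ping failure detector; not JGroups code. */
public class MissedPingModel {

    /** Number of whole ping intervals that elapse while a node is silent. */
    public static long missedPings(long pauseMillis, long intervalMillis) {
        return pauseMillis / intervalMillis;
    }

    /** Silence, in milliseconds, after which a peer tolerating maxMissed lost pings raises suspicion. */
    public static long suspectAfterMillis(long intervalMillis, int maxMissed) {
        return intervalMillis * maxMissed;
    }

    /** True if the pause is long enough for the peer to suspect the silent node. */
    public static boolean isSuspected(long pauseMillis, long intervalMillis, int maxMissed) {
        return missedPings(pauseMillis, intervalMillis) >= maxMissed;
    }

    public static void main(String[] args) {
        long interval = 2500L;  // expected ping interval (2.5 seconds, from the report)
        int maxMissed = 6;      // lost pings Node 2 notices before suspecting (step 2)
        long pause = 25000L;    // time Node 1 sends no pings (step 1)

        System.out.println("suspect after ms: " + suspectAfterMillis(interval, maxMissed)); // 15000
        System.out.println("missed pings: " + missedPings(pause, interval));                // 10
        System.out.println("suspected: " + isSuspected(pause, interval, maxMissed));        // true
    }
}
```

Under these assumptions a 25 second pause is well past the 15 second suspicion threshold,
while a pause of 10 seconds (4 missed pings) would not trigger it.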

In the lab we were able to reproduce the problem by stopping Node 1 process:

pstop {PID} ; sleep 35 ; prun {PID}

Once the process is resumed it can never join the cluster.

The first two lines of Node1.log show a 26-second interval between pings, where it should
have been 2.5 seconds.
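
A gap like this can be found mechanically once the ping timestamps have been extracted
from a log. The following is a minimal sketch under stated assumptions: the class name,
the method and the sample timestamps are made up for illustration and are not taken from
Node1.log, and parsing the log format is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that flags oversized gaps between consecutive ping timestamps. */
public class PingGapFinder {

    /**
     * Returns each index i for which timestamps[i] - timestamps[i - 1]
     * is greater than expectedMillis * factor.
     */
    public static List<Integer> findGaps(long[] timestampsMillis, long expectedMillis, double factor) {
        List<Integer> gaps = new ArrayList<Integer>();
        for (int i = 1; i < timestampsMillis.length; i++) {
            long delta = timestampsMillis[i] - timestampsMillis[i - 1];
            if (delta > expectedMillis * factor) {
                gaps.add(i);
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Made-up timestamps: one 26 s gap followed by regular 2.5 s pings.
        long[] pings = {0L, 26000L, 28500L, 31000L};
        // Flag any gap longer than twice the expected 2.5 s interval.
        System.out.println(findGaps(pings, 2500L, 2.0)); // prints [1]
    }
}
```

The factor of 2 is an arbitrary tolerance chosen for the example; a real check would pick
it to match the normal jitter seen in the logs.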

I traced the 26-second delay to a full GC cycle on Node 1. The pstop/sleep/prun sequence has
almost the same effect.

  was:
This is what we see in production:

1. Node 1 does not send pings for 25 seconds
2. Node 2 notices 6 lost pings (in 15 seconds)
3. Node 2 starts sending "broadcast SUSPECT"
4. Node 1 replies to a few of them
5. Node 2 does not receive the replies until more than 15 seconds after it suspected Node 1
6. Node 2 removes Node 1 from the view
7. Node 1 keeps sending "are-you-alive" and Node 2 is now discarding them

At this point Node 1 believes there are two nodes in the cluster, and Node 2 only sees
itself.

In the lab we were able to reproduce the problem by stopping Node 1 process:

pstop {PID} ; sleep 35 ; prun {PID}

Once the process is resumed it can never join the cluster.

Here is the log snippet from Node 1. The first two lines show a 26-second interval between
pings, where it should have been 2.5 seconds. The Node 2 logs for the same time interval
follow the Node 1 logs.

I traced the 26-second delay to the GC cycle on Node 1. The pstop/sleep/prun sequence has
almost the same effect.

...
 Node does not re-join the cluster after several lost pings
 ----------------------------------------------------------

                 Key: JGRP-1299
                 URL: https://issues.jboss.org/browse/JGRP-1299
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 2.6.15
         Environment: Solaris OS 10 & Java 1.5 & 1.6
            Reporter: Igor M
            Assignee: Bela Ban
            Priority: Critical
         Attachments: Node1.log, Node2.log, stacks.xml

 This is what we see in production:
 1. Node 1 does not send pings for 25 seconds
 2. Node 2 notices 6 lost pings (in 15 seconds)
 3. Node 2 starts sending "broadcast SUSPECT"
 4. Node 1 replies to a few of them
 5. Node 2 does not receive the replies until more than 15 seconds after it suspected Node 1
 6. Node 2 removes Node 1 from the view
 7. Node 1 keeps sending "are-you-alive" and Node 2 is now discarding them
 At this point Node 1 believes there are still two nodes in the cluster, and Node 2 only
sees itself.
 In the lab we were able to reproduce the problem by stopping Node 1 process:
 pstop {PID} ; sleep 35 ; prun {PID}
 Once the process is resumed it can never join the cluster.
 The first two lines of Node1.log show a 26-second interval between pings, where it should
have been 2.5 seconds.
 I traced the 26-second delay to a full GC cycle on Node 1. The pstop/sleep/prun sequence
has almost the same effect.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
