[jboss-jira] [JBoss JIRA] Updated: (JGRP-1299) Node does not re-join the cluster after several lost pings

Tue Mar 8 08:07:55 EST 2011

     [ https://issues.jboss.org/browse/JGRP-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor M updated JGRP-1299:
-------------------------

    Description: 
This is what we see in production:

1. Node 1 does not send pings for 25 seconds
2. Node 2 notices 6 lost pings (in 15 seconds)
3. Node 2 starts sending "broadcast SUSPECT"
4. Node 1 replies to a few of them
5. Node 2 receives no replies until more than 15 seconds after it suspected Node 1
6. Node 2 removes Node 1 from the view
7. Node 1 keeps sending "are-you-alive" and Node 2 is now discarding them

At this point Node 1 believes there are still two nodes in the cluster, while Node 2 sees only itself.
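
The timing in the steps above can be checked with a small model. This is only a sketch: the 2.5-second ping interval and the 6 lost pings are taken from this report, while the function names and the idea that suspicion starts after exactly interval x missed-pings seconds of silence are assumptions made for illustration, not the JGroups implementation.

```python
PING_INTERVAL_S = 2.5   # expected gap between pings, as stated in the report
MAX_MISSED_PINGS = 6    # lost pings before Node 2 suspects Node 1, as stated in the report


def suspicion_timeout(interval_s: float = PING_INTERVAL_S,
                      max_missed: int = MAX_MISSED_PINGS) -> float:
    """Seconds of silence after which the observer suspects its peer (assumed model)."""
    return interval_s * max_missed


def is_suspected(pause_s: float) -> bool:
    """True if a silence of pause_s seconds is long enough to trigger suspicion."""
    return pause_s >= suspicion_timeout()


print(suspicion_timeout())  # 15.0
print(is_suspected(25.0))   # True: the 25-second silence seen in production
print(is_suspected(10.0))   # False: a shorter pause would go unnoticed
```

Under these assumptions, any pause of 15 seconds or more on Node 1 is enough to start the sequence described above.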

In the lab we were able to reproduce the problem by stopping the Node 1 process:

pstop {PID} ; sleep 35 ; prun {PID}

Once the process is resumed, it can never re-join the cluster.
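
On systems without pstop/prun, a similar freeze can probably be produced with SIGSTOP/SIGCONT. The sketch below assumes that suspending the whole process this way is close enough to pstop for the purpose of this test; pause_process is a hypothetical helper name, and it works on POSIX systems only.

```python
import os
import signal
import time


def pause_process(pid: int, seconds: float) -> None:
    """Suspend process `pid`, wait `seconds`, then resume it (POSIX only)."""
    os.kill(pid, signal.SIGSTOP)      # freeze the whole process, similar to a long full GC
    try:
        time.sleep(seconds)
    finally:
        os.kill(pid, signal.SIGCONT)  # always resume, even if the sleep is interrupted
```

Calling pause_process(pid, 35) with the PID of Node 1 would mirror the 35-second stop used in the command above.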

The first two lines of Node1.log show a 26-second interval between pings, where it should have been 2.5 seconds.
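
A gap like this can be found mechanically once the ping timestamps have been extracted from a log. The following is a rough sketch: find_gaps, its tolerance factor, and the sample timestamps are all made up for illustration and are not taken from Node1.log.

```python
from typing import List, Tuple


def find_gaps(timestamps_s: List[float], expected_s: float = 2.5,
              tolerance: float = 2.0) -> List[Tuple[float, float]]:
    """Return (start, gap) pairs where consecutive pings are more than
    tolerance * expected_s seconds apart."""
    gaps = []
    for earlier, later in zip(timestamps_s, timestamps_s[1:]):
        gap = later - earlier
        if gap > expected_s * tolerance:
            gaps.append((earlier, gap))
    return gaps


# Illustrative timestamps in seconds, NOT real data from Node1.log.
example = [0.0, 26.0, 28.5, 31.0]
print(find_gaps(example))  # [(0.0, 26.0)]
```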

I traced the 26-second delay to a full GC cycle on Node 1. pstop/sleep/prun has almost the same effect.

  was:
This is what we see in production:

1. Node 1 does not send pings for 25 seconds
2. Node 2 notices 6 lost pings (in 15 seconds)
3. Node 2 starts sending "broadcast SUSPECT"
4. Node 1 replies to a few of them
5. Node 2 receives no replies until more than 15 seconds after it suspected Node 1
6. Node 2 removes Node 1 from the view
7. Node 1 keeps sending "are-you-alive" and Node 2 is now discarding them

At this point Node 1 believes there are two nodes in the cluster, while Node 2 sees only itself.

In the lab we were able to reproduce the problem by stopping the Node 1 process:

pstop {PID} ; sleep 35 ; prun {PID}

Once the process is resumed, it can never re-join the cluster.

Here is the log snippet from Node 1. The first two lines show a 26-second interval between pings, where it should have been 2.5 seconds. The Node 2 logs for the same time interval follow the Node 1 logs.

I traced the 26-second delay to the GC cycle on Node 1. pstop/sleep/prun has almost the same effect.

> Node does not re-join the cluster after several lost pings
> ----------------------------------------------------------
>
>                 Key: JGRP-1299
>                 URL: https://issues.jboss.org/browse/JGRP-1299
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.6.15
>         Environment: Solaris OS 10 & Java 1.5 & 1.6
>            Reporter: Igor M
>            Assignee: Bela Ban
>            Priority: Critical
>         Attachments: Node1.log, Node2.log, stacks.xml
>
