[
https://issues.jboss.org/browse/JGRP-1299?page=com.atlassian.jira.plugin....
]
Igor M updated JGRP-1299:
-------------------------
Description:
This is what we see in production:
1. Node 1 does not send pings for 25 seconds
2. Node 2 notices 6 lost pings (in 15 seconds)
3. Node 2 starts sending "broadcast SUSPECT"
4. Node 1 replies to a few of them
5. Node 2 does not receive the replies until more than 15 seconds after it suspected Node 1
6. Node 2 removes Node 1 from the view
7. Node 1 keeps sending "are-you-alive" and Node 2 is now discarding them
At this point Node 1 believes there are still two nodes in the cluster, while Node 2 sees only
itself.
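The timing in steps 1-3 can be checked with a small sketch. It assumes the 2.5-second ping interval and the 6-lost-ping threshold quoted in this report; the class and function names are hypothetical and are not JGroups APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatConfig:
    interval_s: float = 2.5  # ping interval quoted in the report
    max_missed: int = 6      # lost pings before the peer is suspected

    @property
    def suspect_after_s(self) -> float:
        # Silence needed before the peer raises a suspicion.
        return self.interval_s * self.max_missed


def is_suspected(pause_s: float, cfg: HeartbeatConfig = HeartbeatConfig()) -> bool:
    # A node that stays silent at least this long gets suspected.
    return pause_s >= cfg.suspect_after_s


print(HeartbeatConfig().suspect_after_s)  # 15.0
print(is_suspected(25.0))                 # True
print(is_suspected(10.0))                 # False
```

Under these assumptions, the 25-second silence on Node 1 exceeds the 15-second window, so Node 2 suspects it.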
In the lab we were able to reproduce the problem by stopping the Node 1 process:
pstop {PID} ; sleep 35 ; prun {PID}
Once the process is resumed, it never re-joins the cluster.
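On systems without pstop/prun, a similar pause can probably be approximated with SIGSTOP and SIGCONT. This is only a rough stand-in, assuming a POSIX system; it is not the mechanism pstop uses on Solaris, and pause_process is a hypothetical helper name.

```python
import os
import signal
import time


def pause_process(pid: int, seconds: float) -> None:
    """Suspend a process, wait, then resume it."""
    os.kill(pid, signal.SIGSTOP)  # suspend the whole process
    try:
        time.sleep(seconds)
    finally:
        os.kill(pid, signal.SIGCONT)  # always resume, even if interrupted
```

Calling pause_process(pid, 35) with the Node 1 process ID would mirror the 35-second pause used above.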
The first two lines of Node1.log show a 26-second interval between pings, where it should
have been 2.5 seconds.
I traced the 26-second delay to a full GC cycle on Node 1. pstop/sleep/prun has almost
the same effect.
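A gap like this can be spotted mechanically once ping timestamps are extracted from a log. The sketch below assumes the timestamps are already available as seconds; the sample values are illustrative and are not taken from Node1.log, and find_gaps and its tolerance factor are hypothetical.

```python
def find_gaps(timestamps, expected_s=2.5, tolerance=2.0):
    """Return (start, end, gap) for spacings larger than expected_s * tolerance."""
    limit = expected_s * tolerance
    return [
        (a, b, b - a)
        for a, b in zip(timestamps, timestamps[1:])
        if b - a > limit
    ]


ping_times = [0.0, 26.0, 28.5, 31.0]  # illustrative values only
print(find_gaps(ping_times))  # [(0.0, 26.0, 26.0)]
```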
was:
This is what we see in production:
1. Node 1 does not send pings for 25 seconds
2. Node 2 notices 6 lost pings (in 15 seconds)
3. Node 2 starts sending "broadcast SUSPECT"
4. Node 1 replies to a few of them
5. Node 2 does not receive the replies until more than 15 seconds after it suspected Node 1
6. Node 2 removes Node 1 from the view
7. Node 1 keeps sending "are-you-alive" and Node 2 is now discarding them
At this point Node 1 believes there are two nodes in the cluster, while Node 2 sees only
itself.
In the lab we were able to reproduce the problem by stopping the Node 1 process:
pstop {PID} ; sleep 35 ; prun {PID}
Once the process is resumed, it never re-joins the cluster.
Here is the log snippet from Node 1. The first two lines show a 26-second interval between
pings, where it should have been 2.5 seconds. The Node 2 logs for the same time interval
follow the Node 1 logs.
I traced the 26-second delay to the GC cycle on Node 1. pstop/sleep/prun has almost the
same effect.
Node does not re-join the cluster after several lost pings
----------------------------------------------------------
Key: JGRP-1299
URL:
https://issues.jboss.org/browse/JGRP-1299
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6.15
Environment: Solaris OS 10 & Java 1.5 & 1.6
Reporter: Igor M
Assignee: Bela Ban
Priority: Critical
Attachments: Node1.log, Node2.log, stacks.xml