[
https://issues.jboss.org/browse/JGRP-1299?page=com.atlassian.jira.plugin....
]
Igor M commented on JGRP-1299:
------------------------------
I tried your stacks.xml with 2.6.15 and it made no difference. I cannot try 2.12
just yet, because it does not appear to be a drop-in replacement for 2.6.15: the cluster
would not initialize.
Node does not re-join the cluster after several lost pings
----------------------------------------------------------
Key: JGRP-1299
URL: https://issues.jboss.org/browse/JGRP-1299
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6.15
Environment: Solaris OS 10 & Java 1.5 & 1.6
Reporter: Igor M
Assignee: Bela Ban
Priority: Critical
Attachments: Node1.log, Node2.log, stacks.xml, stacks.xml
This is what we see in production:
1. Node 1 does not send pings for 25 seconds
2. Node 2 notices 6 lost pings (in 15 seconds)
3. Node 2 starts sending "broadcast SUSPECT"
4. Node 1 replies to a few of them
5. Node 2 does not receive the replies until more than 15 seconds after it suspected Node 1
6. Node 2 removes Node 1 from the view
7. Node 1 keeps sending "are-you-alive" and Node 2 is now discarding them
At this point Node 1 believes there are still two nodes in the cluster, while Node 2 only
sees itself.
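The timing in steps 1-3 can be checked with a small sketch. It assumes only the numbers given in this report (a ping every 2.5 seconds, suspicion after 6 lost pings); the class and method names are hypothetical, and this is not the JGroups failure-detection code or the configuration from stacks.xml.

```java
public class SuspectTimeline {
    // Assumed values taken from this report, not read from stacks.xml.
    static final long PING_INTERVAL_MS = 2500; // expected interval between pings
    static final int MAX_LOST_PINGS = 6;       // lost pings before a peer is suspected

    /** Length of silence (ms) after which a peer would be suspected. */
    static long suspectAfterMs() {
        return PING_INTERVAL_MS * MAX_LOST_PINGS;
    }

    /** True if a pause of the given length is long enough to trigger suspicion. */
    static boolean pauseTriggersSuspicion(long pauseMs) {
        return pauseMs >= suspectAfterMs();
    }

    public static void main(String[] args) {
        System.out.println("suspect after (ms): " + suspectAfterMs());
        System.out.println("25 s pause suspected: " + pauseTriggersSuspicion(25000));
        System.out.println("10 s pause suspected: " + pauseTriggersSuspicion(10000));
    }
}
```

Under these assumptions, any pause of 15 seconds or more on Node 1 is enough for Node 2 to start suspecting it, which is consistent with the 25-second gap described above.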
In the lab we were able to reproduce the problem by stopping Node 1 process:
pstop {PID} ; sleep 35 ; prun {PID}
Once the process is resumed, it never re-joins the cluster.
The first two lines of Node1.log show a 26-second interval between pings, where it should
have been 2.5 seconds.
I traced the 26-second delay to a full GC cycle on Node 1. The pstop/sleep/prun sequence
has almost the same effect.
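One way to confirm that a ping gap coincides with garbage collection is to sample the JVM's accumulated GC time once per ping interval. The sketch below uses the standard java.lang.management API (available since Java 5). The class name and the 2.5-second sampling interval are assumptions for illustration. Note also that the reported value is accumulated collection time, which for concurrent collectors is not necessarily a stop-the-world pause.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeProbe {
    /** Sum of accumulated collection time (ms) across all collectors. */
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcTimeMs();
        Thread.sleep(2500); // assumed sampling interval, matching the ping interval
        long spent = totalGcTimeMs() - before;
        System.out.println("GC time during last interval (ms): " + spent);
    }
}
```

If the time spent in GC during one interval approaches or exceeds the suspicion threshold, the node is likely to be suspected by its peers, as happened to Node 1 here.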