[jboss-jira] [JBoss JIRA] Commented: (JGRP-1299) Node does not re-join the cluster after several lost pings
Igor M (JIRA)
jira-events at lists.jboss.org
Tue Mar 8 09:54:45 EST 2011
[ https://issues.jboss.org/browse/JGRP-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586570#comment-12586570 ]
Igor M commented on JGRP-1299:
------------------------------
Will I have to make any code/config changes?
> Node does not re-join the cluster after several lost pings
> ----------------------------------------------------------
>
> Key: JGRP-1299
> URL: https://issues.jboss.org/browse/JGRP-1299
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.6.15
> Environment: Solaris OS 10 & Java 1.5 & 1.6
> Reporter: Igor M
> Assignee: Bela Ban
> Priority: Critical
> Attachments: Node1.log, Node2.log, stacks.xml
>
>
> This is what we see in production:
> 1. Node 1 does not send pings for 25 seconds
> 2. Node 2 notices 6 lost pings (in 15 seconds)
> 3. Node 2 starts sending "broadcast SUSPECT"
> 4. Node 1 replies to a few of them
> 5. Node 2 does not receive the replies until more than 15 seconds after it suspected Node 1
> 6. Node 2 removes Node 1 from the view
> 7. Node 1 keeps sending "are-you-alive" and Node 2 is now discarding them
> At this point Node 1 believes there are still two nodes in the cluster, while Node 2 sees only itself.
> In the lab we were able to reproduce the problem by stopping Node 1 process:
> pstop {PID} ; sleep 35 ; prun {PID}
> Once the process is resumed, it never rejoins the cluster.
> The first two lines of Node1.log show a 26-second interval between pings, where it should have been 2.5 seconds.
> I traced the 26-second delay to a full GC cycle on Node 1. pstop/sleep/prun has almost the same effect.
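The timing in the report can be checked with a little arithmetic: 6 lost pings at a 2.5-second interval give a 15-second suspicion window, and both the 26-second GC pause and the 35-second pstop pause exceed it. The sketch below only illustrates that arithmetic. The class and method names are hypothetical and are not part of the JGroups API, and it assumes suspicion starts as soon as the configured number of pings has been missed.

```java
// Hypothetical helper, not JGroups code: compares a process pause
// against the window after which a peer would start suspecting the node.
public class PauseVersusSuspicion {

    // Window after which a silent peer is suspected:
    // ping interval multiplied by the number of pings allowed to go missing.
    static long suspectAfterMillis(long pingIntervalMillis, int maxMissedPings) {
        return pingIntervalMillis * maxMissedPings;
    }

    // True if a pause of the given length outlasts the suspicion window.
    static boolean pauseTriggersSuspicion(long pauseMillis,
                                          long pingIntervalMillis,
                                          int maxMissedPings) {
        return pauseMillis > suspectAfterMillis(pingIntervalMillis, maxMissedPings);
    }

    public static void main(String[] args) {
        long pingIntervalMillis = 2500L; // 2.5 s between pings, from the report
        int maxMissedPings = 6;          // Node 2 noticed 6 lost pings

        System.out.println("suspicion window (ms): "
                + suspectAfterMillis(pingIntervalMillis, maxMissedPings));
        System.out.println("26 s GC pause triggers suspicion: "
                + pauseTriggersSuspicion(26000L, pingIntervalMillis, maxMissedPings));
        System.out.println("35 s pstop pause triggers suspicion: "
                + pauseTriggersSuspicion(35000L, pingIntervalMillis, maxMissedPings));
    }
}
```

Under these assumptions any pause longer than 15 seconds is enough for Node 2 to suspect Node 1, which is consistent with both the production GC pause and the lab reproduction.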
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira