[jboss-jira] [JBoss JIRA] (JGRP-1670) Cluster doesn't heal after first discovery fails
Bela Ban (JIRA)
jira-events at lists.jboss.org
Fri Aug 2 09:26:26 EDT 2013
[ https://issues.jboss.org/browse/JGRP-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794776#comment-12794776 ]
Bela Ban commented on JGRP-1670:
--------------------------------
This should work with MERGE3 (instead of MERGE2), as it sends physical address info with each INFO message.
> Cluster doesn't heal after first discovery fails
> ------------------------------------------------
>
> Key: JGRP-1670
> URL: https://issues.jboss.org/browse/JGRP-1670
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 3.3.1
> Environment: Ubuntu 12.04 on EC2 using OpenJDK 1.6.0_27 from the APT repositories.
> Reporter: Andy Caldwell
> Assignee: Bela Ban
> Fix For: 3.3.5, 3.4
>
>
> When using TCPPING for discovery, if the very first attempt to discover the rest of the cluster fails (in my case, the connections are timing out due to a suspected EC2 issue), the new node decides that it is alone in the cluster and creates a new view of just itself.
> Later, when performing periodic discovery, the new node successfully connects to the existing cluster and sends a GET_MBRS_REQ but, since it's already in a view (the one where it's alone), it doesn't fill in its local IP address (see the logic or this at https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/Discovery.java#L254). This means that the old nodes in the cluster cannot send a reply or any other cluster messages to the new node. Thus the new node, which never get a response to its GET_MBRS_REQ, continues to think it's alone in the world and the old members, who got a record of a new cluster member but no address to communicate with it on, start logging that they're dropping messages to <UUID> because they have no physical address. The cluster never heals.
> If the physical address were included in all GET_MBRS_REQ messages (e.g. if the if statement were removed from the file I linked above), then, even if initial discovery fails, future discovery would succeed and the cluster would heal itself.
> This is a superficially similar issue to https://issues.jboss.org/browse/JGRP-1203, but, in that case, the cluster will heal once A performs a discovery later while in this one it never heals.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
More information about the jboss-jira
mailing list