[jboss-jira] [JBoss JIRA] (JGRP-1670) Cluster doesn't heal after first discovery fails

Fri Aug 2 10:55:26 EDT 2013

    [ https://issues.jboss.org/browse/JGRP-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794821#comment-12794821 ] 

Andy Caldwell commented on JGRP-1670:
-------------------------------------

An alternative to my original fix (though more complex) is to add the physical address information to discovery requests sent to nodes that are in TCPPING.initial_hosts or retrieved from PDC, but which aren't in TCPPING.dynamic_hosts (or aren't in the current view; I'm not sure which is better).

This would prevent the larger messages being sent within a stable cluster, but would still allow two stable clusters to join successfully even if there's only one device that bridges the two clusters (by having a device from each in its TCPPING.initial_hosts).

As I said, this is a little more complicated to code, but the Discovery stack element already has access to the current view's member list, so it could easily split the result of `fetchClusterMembers` into those that are already in the cluster and those that are not, but should be.
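
To make the proposed split concrete, here is a rough sketch. It is not JGroups code: plain strings stand in for member addresses, and `DiscoveryTargets`, `Split` and `splitByView` are hypothetical names invented for illustration. The idea is only that candidates missing from the current view would get the larger request carrying the physical address, while current members would get the normal one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in JGroups.
class DiscoveryTargets {

    /** Discovery candidates, partitioned by membership of the current view. */
    static final class Split {
        final List<String> inView = new ArrayList<String>();
        final List<String> notInView = new ArrayList<String>();
    }

    /**
     * Partitions the candidates (e.g. whatever the discovery lookup returned)
     * into those already in the current view and those that are not.
     */
    static Split splitByView(Collection<String> candidates, Collection<String> viewMembers) {
        Set<String> view = new HashSet<String>(viewMembers);
        Split split = new Split();
        for (String candidate : candidates) {
            if (view.contains(candidate)) {
                split.inView.add(candidate);     // normal request is enough
            } else {
                split.notInView.add(candidate);  // would carry the physical address
            }
        }
        return split;
    }

    public static void main(String[] args) {
        Split split = splitByView(Arrays.asList("A", "B", "C"), Arrays.asList("A"));
        System.out.println("in view:     " + split.inView);
        System.out.println("not in view: " + split.notInView);
    }
}
```

Whether the second set should be computed against the current view or against TCPPING.dynamic_hosts is the open question above; the sketch assumes the view.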

Thoughts?  If you agree in principle, I'm happy to go ahead and code this fix up instead and update the PR.

> Cluster doesn't heal after first discovery fails
> ------------------------------------------------
>
>                 Key: JGRP-1670
>                 URL: https://issues.jboss.org/browse/JGRP-1670
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.3.1
>         Environment: Ubuntu 12.04 on EC2 using OpenJDK 1.6.0_27 from the APT repositories.
>            Reporter: Andy Caldwell
>            Assignee: Bela Ban
>             Fix For: 3.3.5, 3.4
>
>
> When using TCPPING for discovery, if the very first attempt to discover the rest of the cluster fails (in my case, the connections are timing out due to a suspected EC2 issue), the new node decides that it is alone in the cluster and creates a new view of just itself.
> Later, when performing periodic discovery, the new node successfully connects to the existing cluster and sends a GET_MBRS_REQ but, since it's already in a view (the one where it's alone), it doesn't fill in its local IP address (see the logic for this at https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/Discovery.java#L254).  This means that the old nodes in the cluster cannot send a reply or any other cluster messages to the new node.  Thus the new node, which never gets a response to its GET_MBRS_REQ, continues to think it's alone in the world and the old members, who got a record of a new cluster member but no address to communicate with it on, start logging that they're dropping messages to <UUID> because they have no physical address.  The cluster never heals.
> If the physical address were included in all GET_MBRS_REQ messages (e.g. if the if statement were removed from the file I linked above), then, even if initial discovery fails, future discovery would succeed and the cluster would heal itself.
> This is a superficially similar issue to https://issues.jboss.org/browse/JGRP-1203, but, in that case, the cluster will heal once A performs a discovery later while in this one it never heals.
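
The failure mode in the quoted description can be reduced to a toy model, sketched below. This is an assumption-laden simplification, not the actual Discovery.java logic: `DiscoveryReplyModel`, `Request`, `buildRequest` and `canReply` are hypothetical names, the address `10.0.0.5:7800` is made up, and a plain map stands in for an existing member's logical-to-physical address cache. It shows only why a request without a physical address cannot be answered by a member that has never seen the sender.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the reported behaviour; hypothetical names, not JGroups classes.
class DiscoveryReplyModel {

    /** Stand-in for a discovery request; physicalAddr is null when omitted. */
    static final class Request {
        final String senderId;
        final String physicalAddr;

        Request(String senderId, String physicalAddr) {
            this.senderId = senderId;
            this.physicalAddr = physicalAddr;
        }
    }

    /**
     * Builds a request. With alwaysInclude == false the address is attached
     * only while the sender has no view yet (the behaviour the report describes);
     * with alwaysInclude == true it is attached unconditionally.
     */
    static Request buildRequest(String id, String addr, boolean hasView, boolean alwaysInclude) {
        boolean include = alwaysInclude || !hasView;
        return new Request(id, include ? addr : null);
    }

    /** Existing member: records the address if present, then checks it can reply. */
    static boolean canReply(Map<String, String> addressCache, Request request) {
        if (request.physicalAddr != null) {
            addressCache.put(request.senderId, request.physicalAddr);
        }
        return addressCache.containsKey(request.senderId);
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<String, String>();

        // New node already installed a singleton view, so the address is omitted.
        Request withoutAddr = buildRequest("new-node", "10.0.0.5:7800", true, false);
        System.out.println(canReply(cache, withoutAddr)); // false

        // Same node, but the address is always included.
        Request withAddr = buildRequest("new-node", "10.0.0.5:7800", true, true);
        System.out.println(canReply(cache, withAddr));    // true
    }
}
```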

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira