[jboss-jira] [JBoss JIRA] (JGRP-1670) Cluster doesn't heal after first discovery fails

Bela Ban (JIRA) jira-events at lists.jboss.org
Fri Aug 2 06:39:26 EDT 2013


     [ https://issues.jboss.org/browse/JGRP-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bela Ban updated JGRP-1670:
---------------------------

    Fix Version/s: 3.3.5
                   3.4

    
> Cluster doesn't heal after first discovery fails
> ------------------------------------------------
>
>                 Key: JGRP-1670
>                 URL: https://issues.jboss.org/browse/JGRP-1670
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.3.1
>         Environment: Ubuntu 12.04 on EC2 using OpenJDK 1.6.0_27 from the APT repositories.
>            Reporter: Andy Caldwell
>            Assignee: Bela Ban
>             Fix For: 3.3.5, 3.4
>
>
> When using TCPPING for discovery, if the very first attempt to discover the rest of the cluster fails (in my case, the connections are timing out due to a suspected EC2 issue), the new node decides that it is alone in the cluster and creates a new view of just itself.
> Later, when performing periodic discovery, the new node successfully connects to the existing cluster and sends a GET_MBRS_REQ but, since it's already in a view (the one where it's alone), it doesn't fill in its local IP address (see the logic for this at https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/Discovery.java#L254). This means that the old nodes in the cluster cannot send a reply or any other cluster messages to the new node. Thus the new node, which never gets a response to its GET_MBRS_REQ, continues to think it's alone in the world, and the old members, who got a record of a new cluster member but no address to communicate with it on, start logging that they're dropping messages to <UUID> because they have no physical address. The cluster never heals.
> If the physical address were included in all GET_MBRS_REQ messages (e.g. if the if statement were removed from the file I linked above), then, even if initial discovery fails, future discovery would succeed and the cluster would heal itself.
> This is superficially similar to https://issues.jboss.org/browse/JGRP-1203, but in that case the cluster heals once A performs a discovery later, whereas in this one it never heals.
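
The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch of the behaviour as the report describes it, not the JGroups implementation: the class DiscoverySketch, its nested types, the buildRequest method, and the alwaysIncludeAddress flag are all hypothetical names invented for illustration, and the UUID and address strings are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the discovery exchange described in the report.
// None of these types exist in JGroups; they only mirror the described logic.
class DiscoverySketch {

    /** Stand-in for a GET_MBRS_REQ: a sender UUID plus an optional physical address. */
    static final class DiscoveryRequest {
        final String senderUuid;
        final String physicalAddress; // null when the sender left it out

        DiscoveryRequest(String senderUuid, String physicalAddress) {
            this.senderUuid = senderUuid;
            this.physicalAddress = physicalAddress;
        }
    }

    /**
     * Builds a request. With alwaysIncludeAddress == false this mimics the
     * reported behaviour: a node that is already in a view omits its address.
     * With alwaysIncludeAddress == true it mimics the proposed change.
     */
    static DiscoveryRequest buildRequest(String uuid, String physicalAddress,
                                         boolean alreadyInView,
                                         boolean alwaysIncludeAddress) {
        boolean include = alwaysIncludeAddress || !alreadyInView;
        return new DiscoveryRequest(uuid, include ? physicalAddress : null);
    }

    /** Stand-in for an existing cluster member receiving discovery requests. */
    static final class Member {
        final Map<String, String> addressCache = new HashMap<String, String>();

        /** Returns true if this member can reply, i.e. it knows where to send to. */
        boolean handle(DiscoveryRequest request) {
            if (request.physicalAddress != null) {
                addressCache.put(request.senderUuid, request.physicalAddress);
            }
            return addressCache.containsKey(request.senderUuid);
        }
    }

    public static void main(String[] args) {
        // New node already formed a singleton view after its first discovery failed.
        DiscoveryRequest reported = buildRequest("new-node", "10.0.0.5:7800", true, false);
        DiscoveryRequest proposed = buildRequest("new-node", "10.0.0.5:7800", true, true);

        System.out.println("reported behaviour, can reply: " + new Member().handle(reported));
        System.out.println("proposed behaviour, can reply: " + new Member().handle(proposed));
    }
}
```

In this model the existing member can only reply once it has a physical address for the sender's UUID, so a request sent without one leaves both sides stuck, which matches the symptom in the report. Whether unconditionally including the address is the right fix for the real protocol is a question for the actual Discovery code.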

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

