[jboss-jira] [JBoss JIRA] (JGRP-1670) Cluster doesn't heal after first discovery fails

Fri Aug 2 09:47:26 EDT 2013

    [ https://issues.jboss.org/browse/JGRP-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794791#comment-12794791 ] 

Bela Ban commented on JGRP-1670:
--------------------------------

OK, here's the core of the problem: say we have nodes A and B.
* TCPPING.initial_hosts contains only A
* A starts up and forms a singleton cluster
* B starts up and sends a discovery request to A, but doesn't get a response because a firewall rule discards the request
* The firewall rule is removed
* MERGE2 on B now discovers A and sends a unicast to A
* A gets the unicast from B, but cannot send a response because the physical address for B is not in its cache
* A now (in TP.sendToSingleMember()) sends up a GET_PHYSICAL_ADDRESS event which triggers a discovery round *with view_id=null*
* However, as TCPPING.initial_hosts lists only A and not B, the discovery round doesn't return B's physical address, so A still cannot reply to B

SOLUTION: 
Simply add B to TCPPING.initial_hosts as well. I've always stated that TCPPING is for static discovery and therefore should always list the IP addresses and ports of *all* members.
You could look into using PDC [1] in combination with TCPPING.
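
The sequence above, and the effect of listing B as well, can be illustrated with a small toy model. This is only a sketch: the class, the methods and the addresses are hypothetical and are not JGroups classes or APIs. It assumes that a node which misses a physical address runs one static discovery round, and that only hosts named in the initial host list can report their address back.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these types or methods are JGroups classes. */
public class StaticDiscoveryModel {

    /** A node with a logical name, a physical address and a logical-to-physical cache. */
    static final class Node {
        final String name;
        final String physicalAddress;
        final Map<String, String> cache = new HashMap<>();

        Node(String name, String physicalAddress) {
            this.name = name;
            this.physicalAddress = physicalAddress;
            cache.put(name, physicalAddress); // a node always knows its own address
        }
    }

    /**
     * One static discovery round: only running nodes whose physical address
     * appears in initialHosts are asked, so only those end up in the cache.
     */
    static void discover(Node requester, List<String> initialHosts, List<Node> running) {
        for (Node candidate : running) {
            if (initialHosts.contains(candidate.physicalAddress)) {
                requester.cache.put(candidate.name, candidate.physicalAddress);
            }
        }
    }

    /** True if 'from' can address a reply to 'to', running discovery on a cache miss. */
    static boolean canReply(Node from, String to, List<String> initialHosts, List<Node> running) {
        if (!from.cache.containsKey(to)) {
            discover(from, initialHosts, running);
        }
        return from.cache.containsKey(to);
    }

    public static void main(String[] args) {
        String addrA = "10.0.0.1:7800"; // made-up addresses for illustration
        String addrB = "10.0.0.2:7800";

        // Case 1: the initial host list names only A, as in the report.
        Node a1 = new Node("A", addrA);
        Node b1 = new Node("B", addrB);
        System.out.println("initial hosts = A only: "
                + canReply(a1, "B", Arrays.asList(addrA), Arrays.asList(a1, b1))); // false

        // Case 2: the initial host list names both A and B.
        Node a2 = new Node("A", addrA);
        Node b2 = new Node("B", addrB);
        System.out.println("initial hosts = A and B: "
                + canReply(a2, "B", Arrays.asList(addrA, addrB), Arrays.asList(a2, b2))); // true
    }
}
```

In the first case A never learns B's address, no matter how often it retries, because B is never asked. In the second case the same lookup succeeds on the first discovery round.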

[1] http://belaban.blogspot.ch/2012/11/persisting-discovery-responses-with.html

> Cluster doesn't heal after first discovery fails
> ------------------------------------------------
>
>                 Key: JGRP-1670
>                 URL: https://issues.jboss.org/browse/JGRP-1670
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.3.1
>         Environment: Ubuntu 12.04 on EC2 using OpenJDK 1.6.0_27 from the APT repositories.
>            Reporter: Andy Caldwell
>            Assignee: Bela Ban
>             Fix For: 3.3.5, 3.4
>
>
> When using TCPPING for discovery, if the very first attempt to discover the rest of the cluster fails (in my case, the connections are timing out due to a suspected EC2 issue), the new node decides that it is alone in the cluster and creates a new view of just itself.
> Later, when performing periodic discovery, the new node successfully connects to the existing cluster and sends a GET_MBRS_REQ but, since it's already in a view (the one where it's alone), it doesn't fill in its local IP address (see the logic for this at https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/Discovery.java#L254). This means that the old nodes in the cluster cannot send a reply or any other cluster messages to the new node. Thus the new node, which never gets a response to its GET_MBRS_REQ, continues to think it's alone in the world, and the old members, who got a record of a new cluster member but no address to communicate with it on, start logging that they're dropping messages to <UUID> because they have no physical address. The cluster never heals.
> If the physical address were included in all GET_MBRS_REQ messages (e.g. if the if statement were removed from the file I linked above), then, even if initial discovery fails, future discovery would succeed and the cluster would heal itself.
> This issue is superficially similar to https://issues.jboss.org/browse/JGRP-1203, but in that case the cluster heals once A performs a discovery later, while in this one it never heals.
