[jboss-jira] [JBoss JIRA] (JGRP-1670) Cluster doesn't heal after first discovery fails
Bela Ban (JIRA)
jira-events at lists.jboss.org
Fri Aug 2 09:47:26 EDT 2013
[ https://issues.jboss.org/browse/JGRP-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794791#comment-12794791 ]
Bela Ban commented on JGRP-1670:
--------------------------------
OK, here's the core of the problem: say we have nodes A and B.
* TCPPING.initial_hosts contains only A
* A starts up and forms a singleton cluster
* B starts up, sends a discovery request to A, but doesn't get a response as the firewall rule discards the request
* The firewall rule is removed
* MERGE2 on B now discovers A and sends a unicast to A
* A gets the unicast from B, but cannot send a response because the physical address for B is not in its cache
* A now (in TP.sendToSingleMember()) sends up a GET_PHYSICAL_ADDRESS event which triggers a discovery round *with view_id=null*
* However, as TCPPING.initial_hosts only lists A, but not B, the discovery request doesn't return B's information
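The sequence above can be sketched as a toy model. This is not JGroups code: the class `DiscoverySketch`, its methods, and the addresses are hypothetical stand-ins for the physical address cache and a TCPPING-style discovery round, written only to show why A cannot reply to B when initial_hosts lists A alone.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; none of these types or methods are JGroups classes.
class DiscoverySketch {
    // Logical member name -> physical address, standing in for the transport's address cache.
    private final Map<String, String> cache = new HashMap<>();
    // Physical addresses this node would probe, standing in for TCPPING.initial_hosts.
    private final List<String> initialHosts;
    // Physical address -> member reachable at that address (the "network").
    private final Map<String, String> network;

    DiscoverySketch(List<String> initialHosts, Map<String, String> network) {
        this.initialHosts = initialHosts;
        this.network = network;
    }

    // A discovery round only asks the listed hosts, so only they can end up in the cache.
    void discover() {
        for (String addr : initialHosts) {
            String member = network.get(addr);
            if (member != null) {
                cache.put(member, addr);
            }
        }
    }

    // Sending needs a physical address; on a cache miss, run one discovery round and re-check.
    boolean sendTo(String member) {
        if (!cache.containsKey(member)) {
            discover();
        }
        return cache.containsKey(member);
    }

    // Can node A reply to B, given the initial_hosts configured on A?
    static boolean canReply(List<String> initialHostsOfA) {
        Map<String, String> network = new HashMap<>();
        network.put("10.0.0.1:7800", "A"); // hypothetical addresses
        network.put("10.0.0.2:7800", "B");
        return new DiscoverySketch(initialHostsOfA, network).sendTo("B");
    }

    public static void main(String[] args) {
        System.out.println(canReply(Arrays.asList("10.0.0.1:7800")));
        System.out.println(canReply(Arrays.asList("10.0.0.1:7800", "10.0.0.2:7800")));
    }
}
```

With only A's address listed, the discovery round can never learn B's physical address, so the reply is dropped. With both listed, the same round fills the cache and the reply goes out.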
SOLUTION:
Simply add B to TCPPING.initial_hosts as well. I've always stated that TCPPING is for static discovery and therefore should always list the IP addresses and ports of *all* members.
You could look into using PDC [1] in combination with TCPPING.
[1] http://belaban.blogspot.ch/2012/11/persisting-discovery-responses-with.html
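As a small illustration of listing all members: TCPPING.initial_hosts takes a comma-separated list of host[port] entries. The helper below is a sketch that builds such a value from a member list. The class name `InitialHosts`, the addresses, and the port 7800 are assumptions for the example, and how the resulting string reaches the configuration (XML attribute, system property, etc.) depends on the setup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; not part of JGroups.
class InitialHosts {
    // Formats host -> port pairs as "host1[port1],host2[port2]".
    static String format(Map<String, Integer> members) {
        StringJoiner joiner = new StringJoiner(",");
        for (Map.Entry<String, Integer> e : members.entrySet()) {
            joiner.add(e.getKey() + "[" + e.getValue() + "]");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so A is listed before B.
        Map<String, Integer> members = new LinkedHashMap<>();
        members.put("10.0.0.1", 7800); // node A (assumed address)
        members.put("10.0.0.2", 7800); // node B (assumed address)
        System.out.println(format(members)); // prints 10.0.0.1[7800],10.0.0.2[7800]
    }
}
```

The same value would be configured on every node, so that each member's discovery round can reach all the others.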
> Cluster doesn't heal after first discovery fails
> ------------------------------------------------
>
> Key: JGRP-1670
> URL: https://issues.jboss.org/browse/JGRP-1670
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 3.3.1
> Environment: Ubuntu 12.04 on EC2 using OpenJDK 1.6.0_27 from the APT repositories.
> Reporter: Andy Caldwell
> Assignee: Bela Ban
> Fix For: 3.3.5, 3.4
>
>
> When using TCPPING for discovery, if the very first attempt to discover the rest of the cluster fails (in my case, the connections are timing out due to a suspected EC2 issue), the new node decides that it is alone in the cluster and creates a new view of just itself.
> Later, when performing periodic discovery, the new node successfully connects to the existing cluster and sends a GET_MBRS_REQ but, since it's already in a view (the one where it's alone), it doesn't fill in its local IP address (see the logic for this at https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/Discovery.java#L254). This means that the old nodes in the cluster cannot send a reply or any other cluster messages to the new node. Thus the new node, which never gets a response to its GET_MBRS_REQ, continues to think it's alone in the world and the old members, who got a record of a new cluster member but no address to communicate with it on, start logging that they're dropping messages to <UUID> because they have no physical address. The cluster never heals.
> If the physical address were included in all GET_MBRS_REQ messages (e.g. if the if statement were removed from the file I linked above), then, even if initial discovery fails, future discovery would succeed and the cluster would heal itself.
> This is a superficially similar issue to https://issues.jboss.org/browse/JGRP-1203, but, in that case, the cluster will heal once A performs a discovery later while in this one it never heals.