[jboss-jira] [JBoss JIRA] (JGRP-1670) Cluster doesn't heal after first discovery fails

Fri Aug 2 20:49:26 EDT 2013

    [ https://issues.jboss.org/browse/JGRP-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794921#comment-12794921 ] 

Andy Caldwell commented on JGRP-1670:
-------------------------------------

Unfortunately, we're using JGroups underneath Infinispan, so we don't have the ability to modify TCPPING.initial_hosts programmatically.

Our project, [Project Clearwater|http://www.projectclearwater.org], strives to be as virtualized as possible, so we really don't want any dependencies on particular clouds and, where possible, we want to find solutions that work on any given network/hardware.  We're using EC2 as our testing playground, but we've also tested on Rackspace, OpenStack, VMware, VirtualBox and various other platforms.

The reason we saw this issue in the first place is that EC2 seems to add a rather large (> 1 second) delay to the very first TCP connection one instance makes to another instance (we hypothesise the delay is for checking the EC2 security group config).  Unfortunately, when spinning up new Infinispan nodes, the very first connection they make to the other cluster members is the initial discovery, which then times out and consistently leads to this issue.  This really hampers TCPPING on EC2 and, as we've discussed, none of the other discovery protocols are applicable to our needs.

Since TCPPING (with PDC) is so very close to what we are looking for, I hope we can come up with a fix for this behaviour that doesn't negatively impact the behaviour in the golden path.

> Cluster doesn't heal after first discovery fails
> ------------------------------------------------
>
>                 Key: JGRP-1670
>                 URL: https://issues.jboss.org/browse/JGRP-1670
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.3.1
>         Environment: Ubuntu 12.04 on EC2 using OpenJDK 1.6.0_27 from the APT repositories.
>            Reporter: Andy Caldwell
>            Assignee: Bela Ban
>             Fix For: 3.3.5, 3.4
>
>
> When using TCPPING for discovery, if the very first attempt to discover the rest of the cluster fails (in my case, the connections are timing out due to a suspected EC2 issue), the new node decides that it is alone in the cluster and creates a new view of just itself.
> Later, when performing periodic discovery, the new node successfully connects to the existing cluster and sends a GET_MBRS_REQ but, since it's already in a view (the one where it's alone), it doesn't fill in its local IP address (see the logic for this at https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/Discovery.java#L254).  This means that the old nodes in the cluster cannot send a reply or any other cluster messages to the new node.  Thus the new node, which never gets a response to its GET_MBRS_REQ, continues to think it's alone in the world, and the old members, who got a record of a new cluster member but no address to communicate with it on, start logging that they're dropping messages to <UUID> because they have no physical address.  The cluster never heals.
> If the physical address were included in all GET_MBRS_REQ messages (e.g. if the if statement were removed from the file I linked above), then, even if initial discovery fails, future discovery would succeed and the cluster would heal itself.
> This issue is superficially similar to https://issues.jboss.org/browse/JGRP-1203 but, in that case, the cluster heals once A performs a later discovery, whereas in this one it never heals.
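
To make the failure sequence in the description above concrete, here is a minimal, self-contained Java sketch.  It is a toy model and not JGroups code: the classes Request, Member and Joiner, the node name and the address 10.0.0.5:7800 are all invented for illustration, and the only behaviour modelled is the one described in the issue (the physical address is only attached to a discovery request while the sender has no view yet).  The real logic lives in the Discovery.java file linked in the description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class DiscoverySketch {

    /** Toy discovery request: the sender's logical name plus an optional physical address. */
    static final class Request {
        final String sender;
        final Optional<String> physicalAddr;

        Request(String sender, Optional<String> physicalAddr) {
            this.sender = sender;
            this.physicalAddr = physicalAddr;
        }
    }

    /** Toy model of a node that is already part of the cluster. */
    static final class Member {
        final Map<String, String> addressCache = new HashMap<>();

        /** Records the sender's address if present; returns true if a reply can be sent. */
        boolean handle(Request req) {
            req.physicalAddr.ifPresent(addr -> addressCache.put(req.sender, addr));
            return addressCache.containsKey(req.sender);
        }
    }

    /** Toy model of the newly started node. */
    static final class Joiner {
        final String name;
        final String addr;
        final boolean alwaysSendAddress; // hypothetical switch modelling the proposed change
        boolean hasView = false;

        Joiner(String name, String addr, boolean alwaysSendAddress) {
            this.name = name;
            this.addr = addr;
            this.alwaysSendAddress = alwaysSendAddress;
        }

        Request buildRequest() {
            // Behaviour described in the issue: the address is only attached while there is no view.
            boolean include = alwaysSendAddress || !hasView;
            return new Request(name, include ? Optional.of(addr) : Optional.empty());
        }
    }

    /** Replays the two discovery rounds; returns true if the existing member can reply in round 2. */
    public static boolean clusterHeals(boolean alwaysSendAddress) {
        Member existing = new Member();
        Joiner joiner = new Joiner("new-node", "10.0.0.5:7800", alwaysSendAddress);

        // Round 1: the connection times out, so this request never reaches the existing member.
        joiner.buildRequest();
        joiner.hasView = true; // the joiner installs a view containing only itself

        // Round 2: periodic discovery; this time the request is delivered.
        return existing.handle(joiner.buildRequest());
    }

    public static void main(String[] args) {
        System.out.println("address only without a view: heals=" + clusterHeals(false));
        System.out.println("address always included:     heals=" + clusterHeals(true));
    }
}
```

In this model the first line prints heals=false and the second heals=true, which matches the description: once the joiner is in its singleton view, the existing member never learns an address it could reply to unless the address is sent with every request.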
