[jboss-jira] [JBoss JIRA] (JGRP-1670) Cluster doesn't heal after first discovery fails

Fri Aug 2 12:11:26 EDT 2013

    [ https://issues.jboss.org/browse/JGRP-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794831#comment-12794831 ] 

Bela Ban commented on JGRP-1670:
--------------------------------

{quote}
Unfortunately we cannot afford to restart our jgroups cluster in order to add new devices, the addition needs to be seamless, so I can't see a way to update A's view of TCPPING.initial_hosts to include B.
{quote}

You could add a new element dynamically, e.g.
{code}
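// Look up the running TCPPING instance in the channel's protocol stack
TCPPING ping=(TCPPING)channel.getProtocolStack().findProtocol(TCPPING.class);
List<IpAddress> initial_hosts=ping.getInitialHosts();
initial_hosts.add(new IpAddress(host, port)); // host and port of the new member
ping.setInitialHosts(initial_hosts); // not really needed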
{code}

{quote}
We chose to use TCPPING as the discovery mechanism since it is the one that ties us the least into a particular network/cloud infrastructure:

PING/MPING/BPING require multicast/broadcast support which most clouds don't allow.
S3_PING/SWIFT_PING/AWS_PING/RACKSPACE_PING are all specific to a given cloud
TCPGOSSIP/JDBC_PING/FILE_PING require the addition of extra services to handle discovery, adding complexity to the deployment, extra costs and potential single points of failure (or even more complex orchestration)
{quote}

Didn't you say you'd run on EC2? So why can't this solution be EC2-specific? Or is EC2 just your testing ground, and this needs to work on different types of clouds?

{quote}
TCPPING was also suggested to us by the Infinispan developers as being a good solution on EC2.
{quote}

I know of at least one bigger system which uses TCPPING and updates individual nodes dynamically via the programmatic API above. They use JMS to disseminate changes, i.e. new members to be added.
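
A rough sketch of the receiving side of such a scheme, assuming each announcement arrives as a string in the "host[port]" notation that TCPPING.initial_hosts uses. The class InitialHostsUpdater and its methods are hypothetical, the JMS transport is left out, and plain InetSocketAddress stands in for IpAddress so the example has no JGroups dependency. In a real node, every newly seen address would be added to the list returned by ping.getInitialHosts(), as in the snippet above.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical helper: collects discovery hosts announced by other nodes. */
class InitialHostsUpdater {
    // Stand-in for the list returned by ping.getInitialHosts()
    private final CopyOnWriteArrayList<InetSocketAddress> hosts =
            new CopyOnWriteArrayList<InetSocketAddress>();

    /** Parses comma-separated "host[port]" entries, e.g. "10.0.0.1[7800],10.0.0.2[7800]". */
    static List<InetSocketAddress> parse(String announcement) {
        List<InetSocketAddress> result = new ArrayList<InetSocketAddress>();
        for (String entry : announcement.split(",")) {
            String s = entry.trim();
            if (s.isEmpty()) {
                continue;
            }
            int open = s.indexOf('[');
            int close = s.indexOf(']', open + 1);
            if (open <= 0 || close < 0) {
                throw new IllegalArgumentException("expected host[port]: " + s);
            }
            String host = s.substring(0, open);
            int port = Integer.parseInt(s.substring(open + 1, close));
            // No DNS lookup here; resolution is left to whoever opens the connection
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    /** Adds every address not yet known and returns how many were new. */
    int onAnnouncement(String announcement) {
        int added = 0;
        for (InetSocketAddress addr : parse(announcement)) {
            if (hosts.addIfAbsent(addr)) {
                added++;
            }
        }
        return added;
    }

    /** Snapshot of the hosts known so far. */
    List<InetSocketAddress> hosts() {
        return new ArrayList<InetSocketAddress>(hosts);
    }

    public static void main(String[] args) {
        InitialHostsUpdater updater = new InitialHostsUpdater();
        updater.onAnnouncement("10.0.0.1[7800],10.0.0.2[7800]");
        updater.onAnnouncement("10.0.0.2[7800], 10.0.0.3[7800]"); // the duplicate is skipped
        System.out.println(updater.hosts());
    }
}
```

Skipping duplicates keeps the host list from growing when the same member is announced more than once, which is likely if every node rebroadcasts what it knows.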

{quote}
Is there another discovery protocol we should be using? One that works across all IP networks, with no single points of failure and no external service requirements?
{quote}

No, there isn't really a protocol that works across all types of clouds (if that's what you want). However, perhaps as part of your installation routine, you could install and configure S3_PING or AWS_PING for EC2, and other discovery protocols for other clouds?
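
As an illustration only, an installer could pick the discovery protocol from a cloud identifier along these lines. The class DiscoveryChooser, the identifiers "ec2", "openstack" and "rackspace", and the choice of TCPPING as fallback are all assumptions made for this sketch; only the protocol names come from the discussion above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical install-time helper: maps a cloud identifier to a discovery protocol name. */
class DiscoveryChooser {
    private static final Map<String, String> BY_CLOUD = new HashMap<String, String>();
    static {
        BY_CLOUD.put("ec2", "S3_PING");              // AWS_PING would be the alternative
        BY_CLOUD.put("openstack", "SWIFT_PING");
        BY_CLOUD.put("rackspace", "RACKSPACE_PING");
    }

    /** Returns the protocol for the given cloud, or TCPPING when the cloud is unknown. */
    static String protocolFor(String cloud) {
        if (cloud == null) {
            return "TCPPING";
        }
        String protocol = BY_CLOUD.get(cloud.trim().toLowerCase(Locale.ROOT));
        return protocol != null ? protocol : "TCPPING";
    }

    public static void main(String[] args) {
        System.out.println(protocolFor("EC2"));
        System.out.println(protocolFor("some-private-cloud"));
    }
}
```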

> Cluster doesn't heal after first discovery fails
> ------------------------------------------------
>
>                 Key: JGRP-1670
>                 URL: https://issues.jboss.org/browse/JGRP-1670
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.3.1
>         Environment: Ubuntu 12.04 on EC2 using OpenJDK 1.6.0_27 from the APT repositories.
>            Reporter: Andy Caldwell
>            Assignee: Bela Ban
>             Fix For: 3.3.5, 3.4
>
>
> When using TCPPING for discovery, if the very first attempt to discover the rest of the cluster fails (in my case, the connections are timing out due to a suspected EC2 issue), the new node decides that it is alone in the cluster and creates a new view of just itself.
> Later, when performing periodic discovery, the new node successfully connects to the existing cluster and sends a GET_MBRS_REQ but, since it's already in a view (the one where it's alone), it doesn't fill in its local IP address (see the logic for this at https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/Discovery.java#L254).  This means that the old nodes in the cluster cannot send a reply or any other cluster messages to the new node.  Thus the new node, which never gets a response to its GET_MBRS_REQ, continues to think it's alone in the world, and the old members, who got a record of a new cluster member but no address to communicate with it on, start logging that they're dropping messages to <UUID> because they have no physical address.  The cluster never heals.
> If the physical address were included in all GET_MBRS_REQ messages (e.g. if the if statement were removed from the file I linked above), then, even if initial discovery fails, future discovery would succeed and the cluster would heal itself.
> This is a superficially similar issue to https://issues.jboss.org/browse/JGRP-1203, but, in that case, the cluster will heal once A performs a discovery later while in this one it never heals.
