[jboss-jira] [JBoss JIRA] (JGRP-1829) Failing to connect to an unavailable member blocks all message sending
Bela Ban (JIRA)
issues at jboss.org
Thu Apr 17 06:16:35 EDT 2014
[ https://issues.jboss.org/browse/JGRP-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12962477#comment-12962477 ]
Bela Ban commented on JGRP-1829:
--------------------------------
Hmm, I assume you cause the split by shutting down the switch between sites A and B? This means the TCP connection attempts will not return immediately, but will block until sock_conn_timeout kicks in.
When a split occurs, there will be traffic between the 2 partitions for some time, until the failure detection protocols and VERIFY_SUSPECT have run and we end up with 2 separate partitions. Note that even then there is some traffic: the MERGE protocols always try to reach the other side.
I don't think there's an easy fix for this. One thing you could try though is to replace FD with FD_ALL and remove VERIFY_SUSPECT. Both FD_SOCK and FD_ALL have only the coordinator doing the VERIFY_SUSPECT work, whereas in FD, everyone does this.
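For illustration, the swap might look roughly like this against the programmatic API used in the attached fragment (translated to Java here). This is only a sketch: the interval/timeout values are illustrative assumptions, not recommendations.

```java
// Sketch only: FD replaced by FD_ALL, VERIFY_SUSPECT dropped.
// The interval/timeout values below are illustrative assumptions.
stack.addProtocol(new FD_SOCK())
     .addProtocol(new FD_ALL()
         .setValue("interval", 3000)    // each member broadcasts a heartbeat every 3s
         .setValue("timeout", 12000));  // suspect a member after 12s without one
// note: no .addProtocol(new VERIFY_SUSPECT()) -- removed entirely
```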
The basic issue though is that TCP connects are blocking, and that's bad when a split occurs. With JGroups 4.0, we're moving to a non-blocking TCP implementation, and then this shouldn't be an issue anymore.
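The difference can be sketched with plain NIO, outside of JGroups: a blocking connect to an unreachable peer stalls the calling thread for up to the connect timeout, whereas a non-blocking SocketChannel.connect() returns immediately and the caller polls finishConnect() later. A minimal, self-contained sketch (connecting to a local server so it actually runs; the class name is made up):

```java
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NonBlockingConnectDemo {

    // Initiates a non-blocking connect and returns how long the
    // connect() call itself took, in milliseconds.
    static long timeConnectInitiation() throws Exception {
        // A local server so the sketch is self-contained.
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        InetSocketAddress target = (InetSocketAddress) server.getLocalAddress();

        SocketChannel client = SocketChannel.open();
        client.configureBlocking(false);            // non-blocking mode
        long start = System.nanoTime();
        boolean connected = client.connect(target); // returns immediately
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Completion is polled, instead of blocking the calling thread.
        while (!connected)
            connected = client.finishConnect();

        client.close();
        server.close();
        return elapsedMs;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("connect() returned in " + timeConnectInitiation() + " ms");
    }
}
```

With a blocking socket, the same connect() call to an unreachable address would not return until the timeout expired, which is exactly what ties up the sender thread.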
Another suggestion is to use UDP instead of TCP. If you cannot switch because IP multicasting is not provided (e.g. in a cloud environment), you could still use UDP datagrams (UDP.ip_mcast=false).
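The core problem you describe, one blocked connect stalling every other send behind it, comes down to a single thread draining the send queue. It can be reproduced in miniature with a plain single-thread executor; this is a simulation, not JGroups code, and the 200 ms sleep stands in for sock_conn_timeout:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BundlerSim {

    // One thread drains the queue, like the TransferQueueBundler.
    // Returns how long a "local" send had to wait, in milliseconds.
    static long simulate() throws Exception {
        ExecutorService bundler = Executors.newSingleThreadExecutor();

        // A send to an "unreachable" member: blocks for the full timeout
        // (200 ms here, standing in for sock_conn_timeout).
        bundler.submit(() -> {
            try { Thread.sleep(200); } catch (InterruptedException e) { }
        });

        long start = System.nanoTime();
        // A send to a perfectly reachable member in the same site...
        Future<?> localSend = bundler.submit(() -> { });
        localSend.get();
        bundler.shutdown();
        // ...is nevertheless delayed by the blocked send ahead of it.
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("local send delayed by ~" + simulate() + " ms");
    }
}
```

With several unreachable members queued up, these delays add up, which is how intra-site heartbeats end up timing out.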
> Failing to connect to an unavailable member blocks all message sending
> ----------------------------------------------------------------------
>
> Key: JGRP-1829
> URL: https://issues.jboss.org/browse/JGRP-1829
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 3.4.2
> Reporter: David Hotham
> Assignee: Bela Ban
>
> Hi,
> We're seeing a problem which appears to be caused by the TransferQueueBundler thread being blocked while it fails to connect to an unavailable member.
> The setup we have is a cluster split across two sites: say, members 0 through 4 in site A and members 5 through 9 in site B. Initially the cluster is complete: everyone has the same view. The case that we're testing is: what happens when connectivity is lost between the sites? NB we're using TCP transport.
> Obviously the expected result is that we'd get two sub-clusters, one in each site. But this doesn't always happen. Instead we sometimes see some members become singletons (that is, with only themselves in view).
> What seems to be happening is something like this:
> - When the cross-site link is cut, members in site A suspect members in site B (and vice versa).
> - So in each site there's a broadcast of SUSPECT messages.
> - Now each of the members in site A tries to VERIFY_SUSPECT each of the members in site B.
> - Each such attempt blocks the TransferQueueBundler for two seconds (TCP's default sock_conn_timeout), because we can't contact any member in the other site.
> - But that introduces a delay for _all_ messages, not only for messages to the 'other' site.
> - If there are enough members in the 'other' site, we can easily get a large enough delay that HEARTBEAT (and then VERIFY_SUSPECT) messages start timing out between members in the same site.
> - At this point, members that ought to be able to see one another start to report that they cannot do so.
> We've seen cases where a member becomes completely isolated - forming a singleton cluster - and does not recover. Unfortunately we don't have full trace from that run, so it's not clear why the cluster didn't eventually recover. I suspect that we're hitting something like JGRP-1493, in which delays sending messages (in that case, a delay when failing to get a physical address) caused the MergeKiller always to prevent merging.
> It is highly undesirable that when a cluster contains several unavailable members, as in a partition between two sites, this should cause problems for members that can see one another.
> Should all message sending really be blocked while failing to connect to an unavailable member?
> This issue seems related also to JGRP-1815 which raises a similar question: should all message sending really be blocked while failing to find a physical address?
> What do you think?
> - do you agree that blocking message sending while attempting to connect to an unavailable member is undesirable?
> - if so, what do you think the right fix is? If it's not too hard, we may be able to find time to take a look at implementing this ourselves.
> - is there anything else we can do to help progress this issue?
> We're using JGroups 3.4.2. The code fragment with which we configure the stack is included below.
> Thanks for your help
> David
> {noformat}
> stack.addProtocol((new TCP)
> .setValue("enable_diagnostics", false)
> .setValue("logical_addr_cache_max_size", 70)
> .setValue("logical_addr_cache_expiration", 10000)
> .setValue("physical_addr_max_fetch_attempts", 1)
> .setValue("bind_addr", localAddr)
> .setValue("bind_port", basePort)
> .setValue("port_range", 0))
> val tcpping = new TCPPING
> val jhosts = initialHosts map { addr => new IpAddress(addr.getHostAddress, basePort) }
> tcpping.setInitialHosts(jhosts)
> tcpping.setPortRange(0)
> tcpping.setValue("return_entire_cache", true)
> stack.addProtocol(tcpping)
> .addProtocol(new MERGE3)
> .addProtocol((new FD_SOCK)
> .setValue("bind_addr", localAddr)
> .setValue("client_bind_port", basePort + 1)
> .setValue("start_port", basePort + 101)
> .setValue("suspect_msg_interval", 1000))
> .addProtocol(new FD)
> .addProtocol((new VERIFY_SUSPECT)
> .setValue("timeout", 1000))
> .addProtocol((new NAKACK2)
> .setValue("use_mcast_xmit", false))
> .addProtocol(new UNICAST3)
> .addProtocol(new STABLE)
> .addProtocol(new MFC)
> .addProtocol(new SEQUENCER)
> .addProtocol((new GMS)
> .setValue("max_join_attempts", 3)
> .setValue("use_delta_views", false))
> .addProtocol(new FRAG2)
> {noformat}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira