[jboss-jira] [JBoss JIRA] (JGRP-1829) Failing to connect to an unavailable member blocks all message sending

Thu Apr 17 12:54:33 EDT 2014

    [ https://issues.jboss.org/browse/JGRP-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12962615#comment-12962615 ] 

David Hotham commented on JGRP-1829:
------------------------------------

What's the expected timetable for JGroups 4.0?  I suppose that it's behind JGroups 3.5 - of course! - and so some way off yet.

I had hoped that you'd want to make a fix to this one before that.  I guess I'd had in mind something like:

-  if we need to get a new connection while trying a send(), then put the message that we're sending onto a queue
-  continue opening the connection in a new thread (so the original thread is unblocked)
-  while the queue for some destination is non-empty, queue all messages to it so as to preserve ordering
-  when the connect completes (either successfully or not) process the queued messages

I understand that this is a bit vague and the devil is in the details - but something along these lines should be possible, shouldn't it?
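
As a rough illustration of the steps listed above, here is a minimal sketch in plain Java.  It is not JGroups code: the class QueueingSender, the connect and deliver callbacks, and the use of strings for destinations and messages are all hypothetical stand-ins.  It also assumes that messages queued behind a failed connect can simply be dropped, on the expectation that a retransmitting layer such as UNICAST3 or NAKACK2 would resend them.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

/** Hypothetical sketch (not JGroups code): queue per destination while a connect is pending. */
class QueueingSender {
    /** Per-destination state, guarded by the Dest object's own monitor. */
    private static final class Dest {
        boolean connected;
        Queue<String> queue; // non-null only while a connect attempt is in progress
    }

    private final Map<String, Dest> dests = new ConcurrentHashMap<>();
    private final Predicate<String> connect;          // blocking connect attempt; true on success
    private final BiConsumer<String, String> deliver; // writes a message to an established connection
    private final Executor connector;                 // runs connect attempts off the sender thread

    QueueingSender(Predicate<String> connect,
                   BiConsumer<String, String> deliver,
                   Executor connector) {
        this.connect = connect;
        this.deliver = deliver;
        this.connector = connector;
    }

    /** Never waits for a connect: sends directly, or queues behind the pending connect. */
    void send(String dest, String msg) {
        Dest d = dests.computeIfAbsent(dest, k -> new Dest());
        synchronized (d) {
            if (d.connected) {
                deliver.accept(dest, msg);
                return;
            }
            boolean first = (d.queue == null);
            if (first) {
                d.queue = new ArrayDeque<>();
            }
            // While the queue exists, every message to this destination is queued,
            // which preserves per-destination ordering.
            d.queue.add(msg);
            if (first) {
                connector.execute(() -> finishConnect(dest, d));
            }
        }
    }

    private void finishConnect(String dest, Dest d) {
        boolean ok = connect.test(dest); // may block for the connect timeout, outside the lock
        synchronized (d) {
            if (ok) {
                d.connected = true;
                for (String m : d.queue) {
                    deliver.accept(dest, m);
                }
            }
            // Assumption: on failure the queued messages are dropped and the next
            // send() to this destination starts a fresh connect attempt.
            d.queue = null;
        }
    }
}
```

With a thread pool as the connector, a send() to an unreachable member returns at once, and only messages to that member wait for the connect to finish or time out.  One of the details left open is bounding the queue while a connect is pending.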

Your suggestion to use UDP is interesting.  I think we're probably using TCP only because I'd assumed that TCPPING would need TCP transport.  But I can't immediately see why TCPPING wouldn't work fine with UDP transport.  Is that right?

Still, I'd prefer to explore a fix.  A change in transport feels like a much bigger change for us and would require quite a bit more testing than a targeted fix.  Also, a fix would make JGroups better - and that must be good, right?

> Failing to connect to an unavailable member blocks all message sending
> ----------------------------------------------------------------------
>
>                 Key: JGRP-1829
>                 URL: https://issues.jboss.org/browse/JGRP-1829
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.4.2
>            Reporter: David Hotham
>            Assignee: Bela Ban
>
> Hi,
> We're seeing a problem which appears to be caused by the TransferQueueBundler thread being blocked while it fails to connect to an unavailable member.
> The setup we have is a cluster split across two sites: say, members 0 through 4 in site A and members 5 through 9 in site B.  Initially the cluster is complete: everyone has the same view.  The case that we're testing is: what happens when connectivity is lost between the sites?  NB we're using TCP transport.
> Obviously the expected result is that we'd get two sub-clusters, one in each site.  But this doesn't always happen.  Instead we sometimes see some members become singletons (that is, with only themselves in view).
> What seems to be happening is something like this:
> -  When the cross-site link is cut, members in site A suspect members in site B (and vice versa).
> -  So in each site there's a broadcast of SUSPECT messages.
> -  Now each of the members in site A tries to VERIFY_SUSPECT each of the members in site B
> -  Each such attempt blocks the TransferQueueBundler for two seconds (TCP's default sock_conn_timeout), because we can't contact any member in the other site
> -  But that introduces a delay for _all_ messages, not only for messages to the 'other' site
> -  If there are enough members in the 'other' site, we can easily get a large enough delay that HEARTBEAT (and then VERIFY_SUSPECT) messages start timing out between members in the same site
> -  At this point, members that ought to be able to see one another start to report that they cannot do so.
> We've seen cases where a member becomes completely isolated - forming a singleton cluster - and does not recover.  Unfortunately we don't have a full trace from that run, so it's not clear why the cluster didn't eventually recover.  I suspect that we're hitting something like JGRP-1493, in which delays in sending messages (in that case, a delay when failing to get a physical address) caused the MergeKiller always to prevent merging.
> It is highly undesirable that when a cluster contains several unavailable members, as in a partition between two sites, this should cause problems for members that can see one another.
> Should all message sending really be blocked while failing to connect to an unavailable member?
> This issue seems related also to JGRP-1815 which raises a similar question: should all message sending really be blocked while failing to find a physical address?
> What do you think?
> -  do you agree that blocking message sending while attempting to connect to an unavailable member is undesirable?
> -  if so, what do you think the right fix is?  If it's not too hard, we may be able to find time to take a look at implementing this ourselves.
> -  is there anything else we can do to help progress this issue?
> We're using JGroups 3.4.2. The code fragment with which we configure the stack is below.
> Thanks for your help
> David
> {noformat}
>  stack.addProtocol((new TCP)
>    .setValue("enable_diagnostics", false)
>    .setValue("logical_addr_cache_max_size", 70)
>    .setValue("logical_addr_cache_expiration", 10000)
>    .setValue("physical_addr_max_fetch_attempts", 1)
>    .setValue("bind_addr", localAddr)
>    .setValue("bind_port", basePort)
>    .setValue("port_range", 0))
>  val tcpping = new TCPPING
>  val jhosts = initialHosts map { addr => new IpAddress(addr.getHostAddress, basePort) }
>  tcpping.setInitialHosts(jhosts)
>  tcpping.setPortRange(0)
>  tcpping.setValue("return_entire_cache", true)
>  stack.addProtocol(tcpping)
>    .addProtocol(new MERGE3)
>    .addProtocol((new FD_SOCK)
>      .setValue("bind_addr", localAddr)
>      .setValue("client_bind_port", basePort + 1)
>      .setValue("start_port", basePort + 101)
>      .setValue("suspect_msg_interval", 1000))
>    .addProtocol(new FD)
>    .addProtocol((new VERIFY_SUSPECT)
>      .setValue("timeout", 1000))
>    .addProtocol((new NAKACK2)
>      .setValue("use_mcast_xmit", false))
>    .addProtocol(new UNICAST3)
>    .addProtocol(new STABLE)
>    .addProtocol(new MFC)
>    .addProtocol(new SEQUENCER)
>    .addProtocol((new GMS)
>      .setValue("max_join_attempts", 3)
>      .setValue("use_delta_views", false))
>    .addProtocol(new FRAG2)
> {noformat}
