David Hotham commented on JGRP-1829:
------------------------------------
What's the expected timetable for JGroups 4.0? I suppose that it's behind JGroups
3.5 - of course! - and so some way off yet.
I had hoped that you'd want to make a fix to this one before that. I guess I'd
had in mind something like:
- if we need to open a new connection while trying a send(), put the message that
we're sending onto a queue
- continue opening the connection in a new thread (so the original thread is unblocked)
- while the queue for some destination is non-empty, queue all messages to it so as to
preserve ordering
- when the connect completes (either successfully or not) process the queued messages
I understand that this is a bit vague and the devil is in the details - but something
along these lines should be possible, shouldn't it?
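Something along these lines can be sketched in a few lines of Java. All names below are hypothetical (this is not the JGroups API), and the blocking connect() is stubbed out with a simple reachability lookup; a real implementation would also need to handle retries and bound the queues. The point is only that the sending thread never blocks: messages to a destination whose connection is still being opened are parked in a per-destination queue, and the queue is drained in order (delivered on success, dropped on failure) when the connect attempt completes.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch (not the JGroups API): while a connection to some
// destination is being opened on a separate thread, messages to that
// destination are parked in a per-destination queue; sends to other
// destinations proceed immediately.
class NonBlockingSender {
    final Map<String, Queue<String>> pending = new ConcurrentHashMap<>();
    final Set<String> connected = ConcurrentHashMap.newKeySet();
    final List<String> delivered = Collections.synchronizedList(new ArrayList<>());
    final List<String> dropped = Collections.synchronizedList(new ArrayList<>());
    final ExecutorService connector = Executors.newCachedThreadPool();
    final Set<String> reachable; // stand-in for which connect() calls succeed

    NonBlockingSender(Set<String> reachable) { this.reachable = reachable; }

    void send(String dest, String msg) {
        if (connected.contains(dest)) { delivered.add(dest + ":" + msg); return; }
        boolean[] fresh = {false};
        Queue<String> q = pending.computeIfAbsent(dest, d -> {
            fresh[0] = true;
            Queue<String> nq = new ConcurrentLinkedQueue<>();
            nq.add(msg); // enqueue before the connector thread can drain
            connector.submit(() -> {
                boolean ok = reachable.contains(d); // real code: blocking connect()
                synchronized (nq) {                 // drain exactly once, in order
                    pending.remove(d);
                    for (String m; (m = nq.poll()) != null; )
                        (ok ? delivered : dropped).add(d + ":" + m);
                    if (ok) connected.add(d);       // later sends bypass the queue
                }
            });
            return nq;
        });
        if (!fresh[0]) {
            synchronized (q) { // queue may have been drained since we fetched it
                if (pending.get(dest) == q) q.add(msg);
                else if (connected.contains(dest)) delivered.add(dest + ":" + msg);
                else dropped.add(dest + ":" + msg);
            }
        }
    }
}
```

With this shape, a send to an unreachable destination costs the caller nothing: the two-second connect timeout is paid on the connector thread, and only messages to that one destination wait for it.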
Your suggestion to use UDP is interesting. I think we're probably using TCP only
because I'd assumed that TCPPING would need TCP transport. But I can't
immediately see why TCPPING wouldn't work fine with UDP transport. Is that right?
Still, I'd prefer to explore a fix. A change in transport feels like a much bigger
change for us and would require quite a bit more testing than a targeted fix. Also, a fix
would make JGroups better - and that must be good, right?
Failing to connect to an unavailable member blocks all message sending
----------------------------------------------------------------------
Key: JGRP-1829
URL: https://issues.jboss.org/browse/JGRP-1829
Project: JGroups
Issue Type: Bug
Affects Versions: 3.4.2
Reporter: David Hotham
Assignee: Bela Ban
Hi,
We're seeing a problem which appears to be caused by the TransferQueueBundler thread
being blocked while it fails to connect to an unavailable member.
The setup we have is a cluster split across two sites: say, members 0 through 4 in site A
and members 5 through 9 in site B. Initially the cluster is complete: everyone has the
same view. The case that we're testing is: what happens when connectivity is lost
between the sites? NB we're using TCP transport.
Obviously the expected result is that we'd get two sub-clusters, one in each site.
But this doesn't always happen. Instead we sometimes see some members become
singletons (that is, with only themselves in view).
What seems to be happening is something like this:
- When the cross-site link is cut, members in site A suspect members in site B (and vice
versa).
- So in each site there's a broadcast of SUSPECT messages
- Now each of the members in site A tries to VERIFY_SUSPECT each of the members in site
B
- Each such attempt blocks the TransferQueueBundler for two seconds (TCP's default
sock_conn_timeout), because we can't contact any member in the other site
- But that introduces a delay for _all_ messages, not only for messages to the
'other' site
- If there are enough members in the 'other' site, we can easily get a large
enough delay that HEARTBEAT (and then VERIFY_SUSPECT) messages start timing out between
members in the same site
- At this point, members that ought to be able to see one another start to report that
they cannot do so.
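The head-of-line blocking in that sequence can be modelled in a toy example (not JGroups code; names and numbers here are made up for illustration). A single bundler-style thread drains one send queue, and each send to an unreachable destination costs a full connect timeout before the next message can go out, so messages to reachable peers queued behind it are delayed too:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.locks.LockSupport;

// Toy model of the failure mode: one thread drains a single send queue;
// a send to an unreachable destination blocks for connectTimeoutMs
// (like TCP's sock_conn_timeout), delaying every message queued behind it.
class BundlerModel {
    final Queue<String> queue = new ConcurrentLinkedQueue<>();
    final Set<String> unreachable;
    final long connectTimeoutMs;
    final Map<String, Long> deliveryDelayMs = new HashMap<>();

    BundlerModel(Set<String> unreachable, long connectTimeoutMs) {
        this.unreachable = unreachable;
        this.connectTimeoutMs = connectTimeoutMs;
    }

    // Drain the queue on one thread, as the TransferQueueBundler does.
    void drain() {
        long start = System.nanoTime();
        for (String dest; (dest = queue.poll()) != null; ) {
            if (unreachable.contains(dest)) {
                // Failed connect attempt eats the full timeout.
                long deadline = System.nanoTime() + connectTimeoutMs * 1_000_000L;
                while (System.nanoTime() < deadline)
                    LockSupport.parkNanos(deadline - System.nanoTime());
            }
            deliveryDelayMs.put(dest, (System.nanoTime() - start) / 1_000_000);
        }
    }
}
```

With five unreachable members ahead of it in the queue and a 50 ms timeout, a message to a reachable peer waits at least 250 ms; at TCP's default two-second sock_conn_timeout and nine remote members, the delay is long enough for failure detection between local members to fire.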
We've seen cases where a member becomes completely isolated - forming a singleton
cluster - and does not recover. Unfortunately we don't have full trace from that run,
so it's not clear why the cluster didn't eventually recover. I suspect that
we're hitting something like JGRP-1493, in which delays sending messages (in that
case, a delay when failing to get a physical address) caused the MergeKiller always to
prevent merging.
It is highly undesirable that the presence of several unreachable members, as in
a partition between two sites, should cause problems between members that can
still see one another.
Should all message sending really be blocked while failing to connect to an unavailable
member?
This issue seems related also to JGRP-1815 which raises a similar question: should all
message sending really be blocked while failing to find a physical address?
What do you think?
- do you agree that blocking message sending while attempting to connect to an
unavailable member is undesirable?
- if so, what do you think the right fix is? If it's not too hard, we may be able
to find time to take a look at implementing this ourselves.
- is there anything else we can do to help progress this issue?
We're using JGroups 3.4.2. I've attached the code fragment with which we
configure the stack below.
Thanks for your help
David
{noformat}
stack.addProtocol((new TCP)
.setValue("enable_diagnostics", false)
.setValue("logical_addr_cache_max_size", 70)
.setValue("logical_addr_cache_expiration", 10000)
.setValue("physical_addr_max_fetch_attempts", 1)
.setValue("bind_addr", localAddr)
.setValue("bind_port", basePort)
.setValue("port_range", 0))
val tcpping = new TCPPING
val jhosts = initialHosts map { addr => new IpAddress(addr.getHostAddress, basePort) }
tcpping.setInitialHosts(jhosts)
tcpping.setPortRange(0)
tcpping.setValue("return_entire_cache", true)
stack.addProtocol(tcpping)
.addProtocol(new MERGE3)
.addProtocol((new FD_SOCK)
.setValue("bind_addr", localAddr)
.setValue("client_bind_port", basePort + 1)
.setValue("start_port", basePort + 101)
.setValue("suspect_msg_interval", 1000))
.addProtocol(new FD)
.addProtocol((new VERIFY_SUSPECT)
.setValue("timeout", 1000))
.addProtocol((new NAKACK2)
.setValue("use_mcast_xmit", false))
.addProtocol(new UNICAST3)
.addProtocol(new STABLE)
.addProtocol(new MFC)
.addProtocol(new SEQUENCER)
.addProtocol((new GMS)
.setValue("max_join_attempts", 3)
.setValue("use_delta_views", false))
.addProtocol(new FRAG2)
{noformat}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: