[jboss-jira] [JBoss JIRA] (JGRP-1829) Failing to connect to an unavailable member blocks all message sending
David Hotham (JIRA)
issues at jboss.org
Thu Apr 17 12:54:33 EDT 2014
[ https://issues.jboss.org/browse/JGRP-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12962615#comment-12962615 ]
David Hotham commented on JGRP-1829:
------------------------------------
What's the expected timetable for JGroups 4.0? I suppose that it's behind JGroups 3.5 - of course! - and so some way off yet.
I had hoped that you'd want to make a fix to this one before that. I guess I'd had in mind something like:
- if we need to get a new connection while trying a send() then put the message that we're sending onto a queue
- continue opening the connection in a new thread (so the original thread is unblocked)
- while the queue for a destination is non-empty, queue all further messages to that destination too, so as to preserve ordering
- when the connect completes (either successfully or not) process the queued messages
I understand that this is a bit vague and the devil is in the details - but something along these lines should be possible, shouldn't it?
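To make that a little more concrete, here is a minimal sketch in Java of the per-destination queueing described above. It is not JGroups code and doesn't use any JGroups API: QueueingSender, Connection and Connector are hypothetical names, destinations and messages are plain strings, and dropping the queued messages when the connect fails is just my assumption about what "process the queued messages" should mean in the failure case.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.Executor;

/** Hypothetical sketch only; none of these types exist in JGroups. */
class QueueingSender {
    /** Stand-in for an established connection to one member. */
    interface Connection { void send(String msg); }

    /** Stand-in for the blocking connect; throws if the member is unreachable. */
    interface Connector { Connection connect(String dest) throws Exception; }

    private final Connector connector;
    private final Executor connectPool; // runs the (possibly slow) connects
    private final Map<String, Connection> connections = new HashMap<>();
    private final Map<String, Queue<String>> pending = new HashMap<>();

    QueueingSender(Connector connector, Executor connectPool) {
        this.connector = connector;
        this.connectPool = connectPool;
    }

    /** Never waits for a connect, provided connectPool runs tasks on another thread. */
    synchronized void send(String dest, String msg) {
        Queue<String> queue = pending.get(dest);
        if (queue != null) {           // a connect is in progress: queue behind
            queue.add(msg);            // earlier messages to preserve ordering
            return;
        }
        Connection conn = connections.get(dest);
        if (conn != null) {            // already connected: send directly
            conn.send(msg);            // (a real fix would not send under the lock)
            return;
        }
        queue = new ArrayDeque<>();    // first message to this destination:
        queue.add(msg);                // queue it and connect elsewhere
        pending.put(dest, queue);
        connectPool.execute(() -> connectAndFlush(dest));
    }

    /** Runs on the connect pool; may block for the whole connect timeout. */
    private void connectAndFlush(String dest) {
        Connection conn = null;
        try {
            conn = connector.connect(dest);
        } catch (Exception e) {
            // member unreachable: conn stays null
        }
        synchronized (this) {
            Queue<String> queue = pending.remove(dest);
            if (conn == null) {
                return;                // assumption: drop the queued messages and
            }                          // leave retransmission to the upper layers
            connections.put(dest, conn);
            for (String msg : queue) { // flush in the original send order
                conn.send(msg);
            }
        }
    }
}
```

The point of the sketch is only that the thread calling send() pays for a map lookup rather than for a connect timeout, so a slow connect to one destination delays messages to that destination only. Lock granularity, bounding the queues and retrying failed connects are among the details where I'd expect the devil to be.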
Your suggestion to use UDP is interesting. I think we're probably using TCP only because I'd assumed that TCPPING would need TCP transport. But I can't immediately see why TCPPING wouldn't work fine with UDP transport. Is that right?
Still, I'd prefer to explore a fix. A change in transport feels like a much bigger change for us and would require quite a bit more testing than a targeted fix. Also, a fix would make JGroups better - and that must be good, right?
> Failing to connect to an unavailable member blocks all message sending
> ----------------------------------------------------------------------
>
> Key: JGRP-1829
> URL: https://issues.jboss.org/browse/JGRP-1829
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 3.4.2
> Reporter: David Hotham
> Assignee: Bela Ban
>
> Hi,
> We're seeing a problem which appears to be caused by the TransferQueueBundler thread being blocked while it fails to connect to an unavailable member.
> The setup we have is a cluster split across two sites: say, members 0 through 4 in site A and members 5 through 9 in site B. Initially the cluster is complete: everyone has the same view. The case that we're testing is: what happens when connectivity is lost between the sites? NB we're using TCP transport.
> Obviously the expected result is that we'd get two sub-clusters, one in each site. But this doesn't always happen. Instead we sometimes see some members become singletons (that is, with only themselves in view).
> What seems to be happening is something like this:
> - When the cross-site link is cut, members in site A suspect members in site B (and vice versa).
> - So in each site there's a broadcast of SUSPECT messages
> - Now each of the members in site A tries to VERIFY_SUSPECT each of the members in site B
> - Each such attempt blocks the TransferQueueBundler for two seconds (TCP's default sock_conn_timeout), because we can't contact any member in the other site
> - But that introduces a delay for _all_ messages, not only for messages to the 'other' site
> - If there are enough members in the 'other' site, we can easily get a large enough delay that HEARTBEAT (and then VERIFY_SUSPECT) messages start timing out between members in the same site
> - At this point, members that ought to be able to see one another start to report that they cannot do so.
> We've seen cases where a member becomes completely isolated - forming a singleton cluster - and does not recover. Unfortunately we don't have a full trace from that run, so it's not clear why the cluster didn't eventually recover. I suspect that we're hitting something like JGRP-1493, in which delays in sending messages (in that case, a delay when failing to get a physical address) caused the MergeKiller always to prevent merging.
> It is highly undesirable that when a cluster contains several unavailable members, as in a partition between two sites, this should cause problems for members that can see one another.
> Should all message sending really be blocked while failing to connect to an unavailable member?
> This issue seems related also to JGRP-1815 which raises a similar question: should all message sending really be blocked while failing to find a physical address?
> What do you think?
> - do you agree that blocking message sending while attempting to connect to an unavailable member is undesirable?
> - if so, what do you think the right fix is? If it's not too hard, we may be able to find time to take a look at implementing this ourselves.
> - is there anything else we can do to help progress this issue?
> We're using JGroups 3.4.2. I've attached the code fragment with which we configure the stack below.
> Thanks for your help
> David
> {noformat}
> stack.addProtocol((new TCP)
>     .setValue("enable_diagnostics", false)
>     .setValue("logical_addr_cache_max_size", 70)
>     .setValue("logical_addr_cache_expiration", 10000)
>     .setValue("physical_addr_max_fetch_attempts", 1)
>     .setValue("bind_addr", localAddr)
>     .setValue("bind_port", basePort)
>     .setValue("port_range", 0))
> val tcpping = new TCPPING
> val jhosts = initialHosts map { addr => new IpAddress(addr.getHostAddress, basePort) }
> tcpping.setInitialHosts(jhosts)
> tcpping.setPortRange(0)
> tcpping.setValue("return_entire_cache", true)
> stack.addProtocol(tcpping)
>   .addProtocol(new MERGE3)
>   .addProtocol((new FD_SOCK)
>     .setValue("bind_addr", localAddr)
>     .setValue("client_bind_port", basePort + 1)
>     .setValue("start_port", basePort + 101)
>     .setValue("suspect_msg_interval", 1000))
>   .addProtocol(new FD)
>   .addProtocol((new VERIFY_SUSPECT)
>     .setValue("timeout", 1000))
>   .addProtocol((new NAKACK2)
>     .setValue("use_mcast_xmit", false))
>   .addProtocol(new UNICAST3)
>   .addProtocol(new STABLE)
>   .addProtocol(new MFC)
>   .addProtocol(new SEQUENCER)
>   .addProtocol((new GMS)
>     .setValue("max_join_attempts", 3)
>     .setValue("use_delta_views", false))
>   .addProtocol(new FRAG2)
> {noformat}