Bela Ban commented on JGRP-1829:
--------------------------------
Hmm, I assume you cause the split by shutting down the switch between sites A and B? This
means that TCP connection attempts will not return immediately, but will block until
sock_conn_timeout kicks in.
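For what it's worth, sock_conn_timeout is settable like any other TCP attribute. A minimal
sketch in the same style as the config fragment further down (the 500 ms value is purely
illustrative, not a recommendation):
{noformat}
// Sketch only: lower TCP's connect timeout so that connection attempts
// to unreachable members give up faster (the default is 2000 ms).
stack.addProtocol((new TCP)
    .setValue("sock_conn_timeout", 500)  // milliseconds; illustrative value
    .setValue("bind_addr", localAddr)
    .setValue("bind_port", basePort))
{noformat}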
In case of a split, there will still be traffic between the 2 partitions for some time,
until the failure detection protocols and VERIFY_SUSPECT have run and we end up with 2
separate partitions. Note that even then there is some amount of traffic: the MERGE
protocols always try to reach the other side.
I don't think there's an easy fix for this. One thing you could try, though, is to
replace FD with FD_ALL and remove VERIFY_SUSPECT. Both FD_SOCK and FD_ALL have only the
coordinator doing the VERIFY_SUSPECT work, whereas with FD, everyone does this.
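A sketch of that swap, written in the same style as your config fragment below; the
timeout and interval values are illustrative, not recommendations:
{noformat}
// Sketch only: FD_ALL in place of FD, with VERIFY_SUSPECT simply
// omitted from the stack.
stack.addProtocol((new FD_ALL)
    .setValue("timeout", 12000)   // suspect a member after 12s without a heartbeat
    .setValue("interval", 3000))  // each member broadcasts a heartbeat every 3s
{noformat}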
The basic issue, though, is that TCP connection establishment is blocking, and that's bad
when a split occurs. With JGroups 4.0 we're moving to a non-blocking TCP implementation,
and then this shouldn't be an issue anymore.
Another suggestion is to use UDP instead of TCP. If IP multicasting is not available
(e.g. in a cloud environment), you can still use plain UDP
datagrams (UDP.ip_mcast=false).
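A sketch of the UDP variant, again assuming your configuration style; with ip_mcast=false
the transport sends one datagram per member instead of multicasting:
{noformat}
// Sketch only: UDP transport with IP multicasting disabled, so
// discovery and group messages are sent as individual datagrams.
stack.addProtocol((new UDP)
    .setValue("ip_mcast", false)
    .setValue("bind_addr", localAddr)
    .setValue("bind_port", basePort))
{noformat}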
Failing to connect to an unavailable member blocks all message sending
----------------------------------------------------------------------
Key: JGRP-1829
URL: https://issues.jboss.org/browse/JGRP-1829
Project: JGroups
Issue Type: Bug
Affects Versions: 3.4.2
Reporter: David Hotham
Assignee: Bela Ban
Hi,
We're seeing a problem which appears to be caused by the TransferQueueBundler thread
being blocked while it fails to connect to an unavailable member.
The setup we have is a cluster split across two sites: say, members 0 through 4 in site A
and members 5 through 9 in site B. Initially the cluster is complete: everyone has the
same view. The case that we're testing is: what happens when connectivity is lost
between the sites? NB we're using TCP transport.
Obviously the expected result is that we'd get two sub-clusters, one in each site.
But this doesn't always happen. Instead we sometimes see some members become
singletons (that is, with only themselves in view).
What seems to be happening is something like this:
- When the cross-site link is cut, members in site A suspect members in site B (and vice
versa).
- So in each site there's a broadcast of SUSPECT messages
- Now each of the members in site A tries to VERIFY_SUSPECT each of the members in site
B
- Each such attempt blocks the TransferQueueBundler for two seconds (TCP's default
sock_conn_timeout), because we can't contact any member in the other site
- But that introduces a delay for _all_ messages, not only for messages to the
'other' site (a toy illustration of this queueing effect follows this list)
- If there are enough members in the 'other' site, we can easily get a large
enough delay that HEARTBEAT (and then VERIFY_SUSPECT) messages start timing out between
members in the same site
- At this point, members that ought to be able to see one another start to report that
they cannot do so.
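To make the queueing effect concrete, here is a toy model (explicitly not JGroups code;
all names are made up) of a single-threaded bundler draining a shared send queue:
{noformat}
import java.util.concurrent.LinkedBlockingQueue

// Toy model only, not JGroups code: a single bundler thread drains a
// shared send queue, so one destination that blocks for the connect
// timeout delays messages to every other destination queued behind it.
object BundlerToy extends App {
  case class Msg(dest: String)
  val queue = new LinkedBlockingQueue[Msg]()
  val start = System.nanoTime()

  val bundler = new Thread(new Runnable {
    def run(): Unit = for (_ <- 1 to 2) {
      val msg = queue.take()            // single consumer, like TransferQueueBundler
      if (msg.dest == "site-B-member")
        Thread.sleep(2000)              // stands in for the blocking connect (sock_conn_timeout)
      println(f"sent to ${msg.dest} after ${(System.nanoTime() - start) / 1e9}%.1fs")
    }
  })
  bundler.start()

  queue.put(Msg("site-B-member"))       // unreachable destination...
  queue.put(Msg("site-A-member"))       // ...delays this same-site message by ~2s too
  bundler.join()
}
{noformat}
Both messages come out after roughly 2 seconds, even though only the first destination is
unreachable.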
We've seen cases where a member becomes completely isolated - forming a singleton
cluster - and does not recover. Unfortunately we don't have a full trace from that run,
so it's not clear why the cluster didn't eventually recover. I suspect that
we're hitting something like JGRP-1493, in which delays in sending messages (in that
case, a delay when failing to get a physical address) caused the MergeKiller to always
prevent merging.
It is highly undesirable that a cluster containing several unavailable members, as in
a partition between two sites, should cause problems for members that can still see one
another.
Should all message sending really be blocked while failing to connect to an unavailable
member?
This issue also seems related to JGRP-1815, which raises a similar question: should all
message sending really be blocked while failing to find a physical address?
What do you think?
- do you agree that blocking message sending while attempting to connect to an
unavailable member is undesirable?
- if so, what do you think the right fix is? If it's not too hard, we may be able
to find time to take a look at implementing this ourselves.
- is there anything else we can do to help progress this issue?
We're using JGroups 3.4.2. I've included below the code fragment with which we
configure the stack.
Thanks for your help
David
{noformat}
// NB: stack, localAddr, basePort and initialHosts are defined elsewhere.
import org.jgroups.protocols._
import org.jgroups.protocols.pbcast.{GMS, NAKACK2, STABLE}
import org.jgroups.stack.IpAddress

stack.addProtocol((new TCP)
    .setValue("enable_diagnostics", false)
    .setValue("logical_addr_cache_max_size", 70)
    .setValue("logical_addr_cache_expiration", 10000)
    .setValue("physical_addr_max_fetch_attempts", 1)
    .setValue("bind_addr", localAddr)
    .setValue("bind_port", basePort)
    .setValue("port_range", 0))

val tcpping = new TCPPING
val jhosts = initialHosts map { addr => new IpAddress(addr.getHostAddress, basePort) }
tcpping.setInitialHosts(jhosts)
tcpping.setPortRange(0)
tcpping.setValue("return_entire_cache", true)

stack.addProtocol(tcpping)
    .addProtocol(new MERGE3)
    .addProtocol((new FD_SOCK)
        .setValue("bind_addr", localAddr)
        .setValue("client_bind_port", basePort + 1)
        .setValue("start_port", basePort + 101)
        .setValue("suspect_msg_interval", 1000))
    .addProtocol(new FD)
    .addProtocol((new VERIFY_SUSPECT)
        .setValue("timeout", 1000))
    .addProtocol((new NAKACK2)
        .setValue("use_mcast_xmit", false))
    .addProtocol(new UNICAST3)
    .addProtocol(new STABLE)
    .addProtocol(new MFC)
    .addProtocol(new SEQUENCER)
    .addProtocol((new GMS)
        .setValue("max_join_attempts", 3)
        .setValue("use_delta_views", false))
    .addProtocol(new FRAG2)
{noformat}