Bela Ban commented on JGRP-1493:
--------------------------------
First off, I think D should form a cluster of D and A even with FD (it will be faster with
FD_ALL):
- We have {B', D, A, C'}
- With FD, D pings A and A pings C'
- After not receiving acks from C' for a number of times, A excludes C' from its
ping set and starts sending out SUSPECT(C') messages
- A now starts pinging B'
- After some time, A excludes B' from its ping set and starts pinging D. It now starts
sending SUSPECT(C', B') messages
- D gets the SUSPECT(B') message, takes over as coordinator, creates view {D,A} and
installs it
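The sequence above can be sketched as a tiny simulation. The view, the dead members, and the resulting view {D,A} come from this scenario; the loop is an illustration of the exclusion order, not FD's actual implementation:

```java
import java.util.*;

// Sketch of the suspicion sequence from D's view {B', D, A, C'},
// where B' and C' are dead. Illustration only, not FD's real code.
public class FdExclusionSketch {
    static final List<String> VIEW = List.of("B'", "D", "A", "C'");
    static final Set<String> DEAD = Set.of("B'", "C'");

    // A pings the next member in the ring; after max_tries missed acks it
    // suspects that member and moves on to the one after it.
    static List<String> suspectOrder() {
        List<String> suspected = new ArrayList<>();
        int i = (VIEW.indexOf("A") + 1) % VIEW.size();
        while (DEAD.contains(VIEW.get(i))) {
            suspected.add(VIEW.get(i));      // SUSPECT(...) broadcast
            i = (i + 1) % VIEW.size();
        }
        return suspected;                    // C' first, then B'
    }

    // D receives SUSPECT(B'), takes over as coordinator, installs {D, A}.
    static List<String> newView() {
        List<String> v = new ArrayList<>(VIEW);
        v.removeAll(DEAD);
        return v;
    }

    public static void main(String[] args) {
        System.out.println(suspectOrder() + " -> new view " + newView());
    }
}
```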
So it'll take roughly 2 * (timeout * max_tries) ms to establish the new view, since the
two dead members (C' and B') are excluded one after the other.
With FD_ALL, this is faster: after timeout (+ (possibly) timeout_check_interval), B'
and C' will get excluded.
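The two estimates can be compared with some back-of-the-envelope arithmetic. The property values below are purely illustrative, not JGroups defaults:

```java
// Rough detection-time comparison for FD vs FD_ALL, using the formulas
// from this comment. All numbers are illustrative, not JGroups defaults.
public class DetectionTimeSketch {
    // FD excludes dead members one after another:
    // each costs roughly timeout * max_tries.
    static long fdMillis(long timeout, int maxTries, int deadMembers) {
        return (long) deadMembers * timeout * maxTries;
    }

    // FD_ALL excludes all dead members in the same pass:
    // roughly timeout (+ possibly timeout_check_interval).
    static long fdAllMillis(long timeout, long checkInterval) {
        return timeout + checkInterval;
    }

    public static void main(String[] args) {
        long fd = fdMillis(3000, 5, 2);       // two dead members: B' and C'
        long fdAll = fdAllMillis(8000, 2000);
        System.out.println("FD ~" + fd + " ms, FD_ALL ~" + fdAll + " ms");
    }
}
```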
So using FD_ALL instead of FD should simply be faster, but both protocols should end up
with view {D,A}. Once this happens, a merge should succeed.
If the view is not established, I'd rather investigate why that is than work around it!
WDYT?
I'll see if I can come up with a unit test for the root cause...
Merge fails because failing to get physical address takes too long
------------------------------------------------------------------
Key: JGRP-1493
URL: https://issues.jboss.org/browse/JGRP-1493
Project: JGroups
Issue Type: Feature Request
Affects Versions: 3.1
Reporter: David Hotham
Assignee: Bela Ban
Fix For: 3.2
Start with the following views:
- A, B and C all have {A,B,C}
- D has {B', D, A, C'}, where B' and C' are dead.
A decides to lead a merge (he's the only 'actual' coordinator). By the time
we've been through view-sanitization and so on and reached
getMergeDataFromSubgroupCoordinators(), coords are {D, C', A}.
Here A tries to send MERGE_REQ to those elements. However, A does not have a physical
address for C', and in fact nor does anyone else. So when trying to send the
MERGE_REQ to C', A will always spend a little over 5 seconds in
TP.sendToSingleMember() - trying and failing to discover that physical address.
Of course A won't get a response from C' either, so it will take another 5
seconds for merge_rsps.waitForAllResponses to time out.
But that means the MergeKiller is certain to kick in first, so the merge can never
progress.
(Presumably the situation would be even worse if D's view had contained further dead
members).
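The blocking time scales with the number of dead coordinators. Using the figures from the description (~5 s per failed physical-address discovery in TP.sendToSingleMember(), plus ~5 s for merge_rsps.waitForAllResponses to time out):

```java
// Back-of-the-envelope for how long the MergeTask stays blocked, using the
// figures from the issue description. Purely illustrative arithmetic.
public class MergeBlockingSketch {
    static long blockedSeconds(int deadCoords) {
        long discoveryPerMember = 5;   // failed physical-address discovery
        long waitForResponses = 5;     // response-collection timeout
        return deadCoords * discoveryPerMember + waitForResponses;
    }

    public static void main(String[] args) {
        // With coords {D, C', A}, only C' is dead: ~10 s blocked.
        System.out.println(blockedSeconds(1) + " s");
        // Each additional dead member adds another ~5 s.
        System.out.println(blockedSeconds(3) + " s");
    }
}
```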
I expect to work around this by tweaking the timings somewhere: probably in
startMergeKiller, so that the MergeKiller takes longer to be scheduled.
I'd think that the right fix would be to arrange that the MergeTask is not blocked by
TP having no physical address for a member.
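One way to sketch that direction: check for a cached physical address before handing the MERGE_REQ to the transport, and skip members that have none rather than blocking per member. All types and method names here are placeholders, not JGroups APIs:

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical sketch: send MERGE_REQ only to coordinators whose physical
// address is already known, instead of blocking ~5 s per unresolvable one.
// Names and signatures are invented for illustration, not JGroups APIs.
public class NonBlockingMergeReqSketch {
    static List<String> sendMergeRequests(List<String> coords,
                                          Function<String, Optional<String>> physicalAddr) {
        List<String> sentTo = new ArrayList<>();
        for (String coord : coords) {
            if (physicalAddr.apply(coord).isEmpty()) {
                // No cached physical address: treat the member as
                // unreachable for this merge round instead of blocking.
                continue;
            }
            sentTo.add(coord);   // hand off to the transport here
        }
        return sentTo;
    }

    public static void main(String[] args) {
        Map<String, String> cache = Map.of("D", "10.0.0.4:7800", "A", "10.0.0.1:7800");
        // C' is dead and nobody has a physical address for it.
        List<String> sent = sendMergeRequests(List.of("D", "C'", "A"),
                m -> Optional.ofNullable(cache.get(m)));
        System.out.println(sent);   // only D and A are contacted
    }
}
```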