[jboss-jira] [JBoss JIRA] (JGRP-1493) Merge fails because failing to get physical address takes too long

Fri Aug 31 06:29:32 EDT 2012

    [ https://issues.jboss.org/browse/JGRP-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715093#comment-12715093 ] 

Bela Ban commented on JGRP-1493:
--------------------------------

First off, I think D should end up forming a cluster of {D,A} even with FD (it will just be faster with FD_ALL):
- We have {B', D, A, C'}
- With FD, D pings A and A pings C'
- After max_tries pings to C' go unacknowledged, A excludes C' from its ping set and starts sending out SUSPECT(C') messages
- A now starts pinging B'
- After another max_tries unacknowledged pings, A excludes B' from its ping set as well and starts pinging D. It now sends SUSPECT(C', B') messages
- D gets the SUSPECT(B') message, takes over as coordinator, creates view {D,A} and installs it
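
The steps above can be replayed with a small model. This is only a sketch of ring-style neighbor pinging as described in this comment, not JGroups code: the function names are hypothetical, and "dead" simply means a member never acks.

```python
# Simplified model of FD-style neighbor pinging; not JGroups code.
# All function names are hypothetical and exist only for illustration.

def ping_target(view, me, suspected):
    """Return the first member to the right of `me` in `view` (wrapping
    around) that is not suspected, or None if there is no such member."""
    start = view.index(me)
    for offset in range(1, len(view)):
        candidate = view[(start + offset) % len(view)]
        if candidate not in suspected:
            return candidate
    return None


def simulate(view, me, dead):
    """Follow `me`'s ping targets until a live member answers.
    Returns (targets pinged in order, suspected members, resulting view)."""
    suspected = []
    pinged = []
    while True:
        target = ping_target(view, me, suspected)
        if target is None:
            break
        pinged.append(target)
        if target not in dead:
            break  # target acks, so `me` keeps pinging it
        suspected.append(target)  # no acks: exclude it and move on
    new_view = [m for m in view if m not in suspected]
    return pinged, suspected, new_view


view = ["B'", "D", "A", "C'"]
pinged, suspected, new_view = simulate(view, "A", dead={"B'", "C'"})
print(pinged)     # A pings C', then B', then D
print(suspected)  # C' and B' end up suspected
print(new_view)   # D and A remain; D is first, so it acts as coordinator
```

In this model A suspects C' first and B' second, and the surviving view has D in first position, which matches D taking over as coordinator.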

So it'll take approximately 2 * (timeout * max_tries) milliseconds to establish the new view: one timeout * max_tries period for A to suspect C', and a second one for A to suspect B'.
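
As a rough worked example of that estimate (the values below are illustrative only, not necessarily the defaults of any JGroups release, and the helper name is made up):

```python
def fd_view_change_estimate_ms(timeout_ms, max_tries, dead_neighbors):
    """Rough estimate: each dead neighbor costs one full
    timeout * max_tries period before it is suspected."""
    return dead_neighbors * timeout_ms * max_tries


# Illustrative values: 3 s timeout, 3 tries, two dead members (C' and B').
estimate = fd_view_change_estimate_ms(timeout_ms=3000, max_tries=3,
                                      dead_neighbors=2)
print(estimate)  # 18000
```

The estimate ignores message latency and the time needed to install the view, so it is only a lower bound on what would be observed.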

With FD_ALL, this is faster: B' and C' will both get excluded after timeout (plus possibly timeout_check_interval).
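
To see where the "plus possibly timeout_check_interval" comes from, here is a simplified model. It assumes that a periodic check runs at multiples of the check interval and suspects a member once it has been silent for strictly more than timeout; the real protocol may differ in such details, and the function name is hypothetical.

```python
def fd_all_detection_delay_ms(last_heartbeat_ms, timeout_ms, check_interval_ms):
    """Delay between a member's last heartbeat and the first periodic check
    that finds it silent for more than timeout_ms (simplified model)."""
    deadline = last_heartbeat_ms + timeout_ms
    # First check tick strictly after the deadline.
    tick = (deadline // check_interval_ms + 1) * check_interval_ms
    return tick - last_heartbeat_ms


# Illustrative values: 10 s timeout, checks every 2 s.
print(fd_all_detection_delay_ms(500, 10000, 2000))  # 11500
print(fd_all_detection_delay_ms(0, 10000, 2000))    # 12000
```

In this model the delay always falls between timeout and timeout + timeout_check_interval, and it applies to all dead members at once rather than one after the other as with FD.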

So using FD_ALL instead of FD should only make this faster; both protocols should end up with view {D,A}. Once this happens, a merge should succeed.

If the view is not established, I'd rather investigate why that is the case than work around it. WDYT?
I'll see if I can come up with a unit test for the root cause...

> Merge fails because failing to get physical address takes too long
> ------------------------------------------------------------------
>
>                 Key: JGRP-1493
>                 URL: https://issues.jboss.org/browse/JGRP-1493
>             Project: JGroups
>          Issue Type: Feature Request
>    Affects Versions: 3.1
>            Reporter: David Hotham
>            Assignee: Bela Ban
>             Fix For: 3.2
>
>
> Start with the following views:
> -  A, B and C all have {A,B,C}
> -  D has {B', D, A, C'}, where B' and C' are dead.
> A decides to lead a merge (he's the only 'actual' coordinator).  By the time we've been through view-sanitization and so on and reached getMergeDataFromSubgroupCoordinators(), coords are {D, C', A}.
> Here A tries to send MERGE_REQ to those elements.  However, A does not have a physical address for C', and in fact nor does anyone else.  So when trying to send the MERGE_REQ to C', A will always spend a little over 5 seconds in TP.sendToSingleMember() - trying and failing to discover that physical address.
> Of course A won't get a response from C' either, so it will take another 5 seconds for merge_rsps.waitForAllResponses to time out.
> But that means that it's a sure thing that the MergeKiller will kick in first.
> Therefore the merge can never progress.  
> (Presumably the situation would be even worse if D's view had contained further dead members).
> I expect to work around this by tweaking the timings somewhere: probably in startMergeKiller, so that the MergeKiller takes longer to be scheduled.
> I'd think that the right fix would be to arrange that the MergeTask is not blocked by TP having no physical address for a member.  
