[jboss-jira] [JBoss JIRA] (JGRP-1493) Merge fails because failing to get physical address takes too long

David Hotham (JIRA) jira-events at lists.jboss.org
Thu Aug 30 12:01:33 EDT 2012


    [ https://issues.jboss.org/browse/JGRP-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714915#comment-12714915 ] 

David Hotham commented on JGRP-1493:
------------------------------------

I have FD and FD_SOCK, but not FD_ALL.  D doesn't exclude B' and C', and the situation does not resolve itself.

I don't think that physical_addr_max_fetch_attempts is sufficient here:

-  I agree that this limits the time it takes to look up one member
-  But there may be more than one dead member.  You would then fail to find a physical address for each of them when sending MERGE_REQs, so the accumulated delay still almost surely trips the MergeKiller.
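
To make the arithmetic concrete, here is a rough timing sketch in Java.  It is not JGroups code: the class and method names are made up, the 10-second MergeKiller deadline is an assumption (the issue does not state it), and it assumes the lookups for dead members happen one after another.  The "little over 5 seconds" lookup and the 5-second response wait come from the issue description; the 1-second lookup is a hypothetical value standing in for a lower physical_addr_max_fetch_attempts.

```java
/** Back-of-the-envelope timing model. All names and numbers are illustrative only. */
public final class MergeTimingSketch {

    /**
     * Estimated time for the merge leader to finish collecting responses:
     * one blocking address lookup per dead coordinator (assumed sequential),
     * followed by a single wait for the merge responses.
     */
    public static long estimatedMergeMillis(int deadCoords, long lookupMillis,
                                            long responseWaitMillis) {
        return deadCoords * lookupMillis + responseWaitMillis;
    }

    /** True if the MergeKiller deadline passes before the responses are collected. */
    public static boolean mergeKillerWins(int deadCoords, long lookupMillis,
                                          long responseWaitMillis, long mergeKillerMillis) {
        return estimatedMergeMillis(deadCoords, lookupMillis, responseWaitMillis)
                > mergeKillerMillis;
    }

    public static void main(String[] args) {
        long responseWait = 5_000;   // "another 5 seconds", from the issue description
        long mergeKiller = 10_000;   // ASSUMPTION: deadline not stated in the issue
        long[] lookups = {5_100,     // "a little over 5 seconds", from the issue
                          1_000};    // HYPOTHETICAL: lookup capped via fewer fetch attempts

        for (long lookup : lookups) {
            for (int dead = 1; dead <= 6; dead++) {
                System.out.printf("%d dead, lookup %d ms: %d ms total, MergeKiller first: %b%n",
                        dead, lookup,
                        estimatedMergeMillis(dead, lookup, responseWait),
                        mergeKillerWins(dead, lookup, responseWait, mergeKiller));
            }
        }
    }
}
```

Under these assumptions a cap on the per-member lookup helps with one dead member, but the total still grows linearly with the number of dead members and eventually exceeds any fixed deadline.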

Perhaps the simplest answer would be to use FD_ALL.  Is there any reason I wouldn't want to include that in my stack (either in addition to or in place of the other FD protocols)?
                
> Merge fails because failing to get physical address takes too long
> ------------------------------------------------------------------
>
>                 Key: JGRP-1493
>                 URL: https://issues.jboss.org/browse/JGRP-1493
>             Project: JGroups
>          Issue Type: Feature Request
>    Affects Versions: 3.1
>            Reporter: David Hotham
>            Assignee: Bela Ban
>             Fix For: 3.2
>
>
> Start with the following views:
> -  A, B and C all have {A,B,C}
> -  D has {B', D, A, C'}, where B' and C' are dead.
> A decides to lead a merge (he's the only 'actual' coordinator).  By the time we've been through view-sanitization and so on and reached getMergeDataFromSubgroupCoordinators(), coords are {D, C', A}.
> Here A tries to send MERGE_REQ to those elements.  However, A does not have a physical address for C', and in fact neither does anyone else.  So when trying to send the MERGE_REQ to C', A will always spend a little over 5 seconds in TP.sendToSingleMember(), trying and failing to discover that physical address.
> Of course A won't get a response from C' either, so it will take another 5 seconds for merge_rsps.waitForAllResponses to time out.
> Together those two delays exceed the MergeKiller's timeout, so the MergeKiller is certain to kick in first.
> Therefore the merge can never progress.  
> (Presumably the situation would be even worse if D's view had contained further dead members).
> I expect to work around this by tweaking the timings somewhere: probably in startMergeKiller, so that the MergeKiller takes longer to be scheduled.
> I'd think that the right fix would be to arrange that the MergeTask is not blocked by TP having no physical address for a member.  
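
One way to picture the "not blocked" idea from the last line of the description is the plain-Java sketch below.  It does not use or reflect the JGroups API: the class name, the collectResponders method and the sendAndAwait callback are all hypothetical, and the callback merely stands in for "look up the address, send MERGE_REQ, wait for the reply".  The point is only that each coordinator is handled on its own thread under one shared deadline, so a stuck lookup for C' cannot delay D or A.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Illustrative only: contacts all coordinators in parallel under one deadline. */
public final class NonBlockingMergeRequests {

    /**
     * Runs sendAndAwait for every coordinator concurrently and returns the
     * coordinators whose call returned true before timeoutMillis elapsed.
     * Calls still running at the deadline are cancelled (interrupted).
     */
    public static <A> List<A> collectResponders(List<A> coords,
                                                Function<A, Boolean> sendAndAwait,
                                                long timeoutMillis) {
        List<A> responders = new ArrayList<>();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (A coord : coords) {
                tasks.add(() -> sendAndAwait.apply(coord));
            }
            // invokeAll returns futures in the same order as the tasks and
            // cancels every task that has not completed by the deadline.
            List<Future<Boolean>> results =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (int i = 0; i < results.size(); i++) {
                Future<Boolean> result = results.get(i);
                if (result.isCancelled()) {
                    continue; // timed out, e.g. stuck in an address lookup
                }
                try {
                    if (Boolean.TRUE.equals(result.get())) {
                        responders.add(coords.get(i));
                    }
                } catch (ExecutionException e) {
                    // The send failed outright; treat as "no response".
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            pool.shutdownNow();
        }
        return responders;
    }

    public static void main(String[] args) {
        // Simulated callback: C' blocks for 5 s (no physical address), others answer at once.
        Function<String, Boolean> simulated = coord -> {
            if (coord.equals("C'")) {
                try {
                    Thread.sleep(5_000);
                } catch (InterruptedException e) {
                    return false;
                }
            }
            return true;
        };
        System.out.println(collectResponders(Arrays.asList("D", "C'", "A"), simulated, 500));
    }
}
```

Whether this shape fits the real MergeTask and TP threading model is an open question; the sketch only shows that the wait can be bounded by a single deadline rather than by the sum of the per-member delays.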


