[jboss-jira] [JBoss JIRA] (JGRP-1485) JOIN attempts timing out indefinitely

Tue Jul 3 02:34:12 EDT 2012

     [ https://issues.jboss.org/browse/JGRP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bela Ban resolved JGRP-1485.
----------------------------

    Resolution: Done

GMS.max_join_attempts was added, but we should investigate why D didn't end up with a singleton view. I ran tests with both FD and FD_ALL, and this always worked.

> JOIN attempts timing out indefinitely
> -------------------------------------
>
>                 Key: JGRP-1485
>                 URL: https://issues.jboss.org/browse/JGRP-1485
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.0.10
>            Reporter: David Hotham
>            Assignee: Bela Ban
>             Fix For: 3.1
>
>
> The good news is that my testing is currently avoiding JGRP-1451-type issues. (I'm running with the latest master, plus my pull request 54.)
> The bad news is that this seems to have unblocked me to find the next problem...
> I'm running the usual stress test where I kill and restart members, and verify that the group heals itself.  I've managed to get into a situation where:
> - A, B, and C all have no view at all (they're all repeatedly sending JOINs that time out)
> - D has got stuck with a view {B,C,D,A,C} (in which every member except D is in fact a dead instance).
> So what's happening on each of A, B and C is: 
> -  perform discovery
> -  decide based on information from D that the long-dead B is coordinator
> -  send a JOIN to that dead B
> -  this times out
> -  repeat
> Meanwhile, D's FD is repeatedly broadcasting that A is suspect, but no one pays any attention.
> In an ideal world, I'd think that it ought to be up to D to spot that something has gone wrong. E.g. after a long enough period of reporting that A is suspect without seeing any change of view, it could deduce that there's a problem and become a singleton, or something like that. Then a merge should sort everything out in due course.
> I'm actually experimenting with a workaround in which we only allow JOIN attempts to time out some maximum number of times, and if they time out too often the member becomes a singleton. I.e. I'm making a fix that allows A, B and C to proceed. Then I again expect a merge to sort everything out. This looks a lot easier to code up, and seems a plausible thing to want to do anyway.
> I have the test running and will see how this goes overnight. If it looks like it works I'll submit a pull request; otherwise I'll think again.
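
The workaround described above (give up on JOIN after a bounded number of timeouts and become a singleton, leaving the rest to a later merge) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual GMS code: the class BoundedJoin, the Outcome enum and the BooleanSupplier standing in for "send a JOIN and wait for the response" are all hypothetical names, and treating a limit of 0 as "retry forever" is an assumption of this sketch.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a bounded JOIN loop; not taken from JGroups.
class BoundedJoin {

    // How the join loop ended.
    enum Outcome { JOINED, SINGLETON }

    /**
     * Repeats a join attempt until it succeeds or the limit is reached.
     *
     * @param attempt         stand-in for "discover the coordinator, send JOIN,
     *                        wait for the reply"; returns false on timeout
     * @param maxJoinAttempts maximum number of timed-out attempts before
     *                        giving up; 0 means never give up (assumption)
     */
    static Outcome join(BooleanSupplier attempt, int maxJoinAttempts) {
        int failures = 0;
        while (true) {
            if (attempt.getAsBoolean()) {
                return Outcome.JOINED;        // coordinator answered
            }
            failures++;                       // this JOIN timed out
            if (maxJoinAttempts > 0 && failures >= maxJoinAttempts) {
                // Too many timeouts: proceed alone and rely on a later
                // merge to reunite the group.
                return Outcome.SINGLETON;
            }
        }
    }
}
```

With a limit of 3 and a coordinator that never answers (the situation of A, B and C above), BoundedJoin.join(() -> false, 3) returns SINGLETON after three attempts instead of looping indefinitely.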
