[jboss-jira] [JBoss JIRA] (JGRP-1485) JOIN attempts timing out indefinitely

David Hotham (JIRA) jira-events at lists.jboss.org
Mon Jun 25 14:56:12 EDT 2012


David Hotham created JGRP-1485:
----------------------------------

             Summary: JOIN attempts timing out indefinitely
                 Key: JGRP-1485
                 URL: https://issues.jboss.org/browse/JGRP-1485
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 3.0.10
            Reporter: David Hotham
            Assignee: Bela Ban


The good news is that my testing is currently avoiding JGRP-1451-type issues. (I'm running with the latest master, plus my pull request 54.)

The bad news is that this seems to have unblocked me to find the next problem...

I'm running the usual stress test where I kill and restart members, and verify that the group heals itself.  I've managed to get into a situation where:

- A, B, and C all have no view at all (they're all repeatedly sending JOINs that time out)
- D has got stuck with a view {B,C,D,A,C} (in which every member except D is in fact a dead instance).

So what's happening on each of A, B and C is:
- perform discovery
- decide based on information from D that the long-dead B is coordinator
- send a JOIN to that dead B
- this times out
- repeat

Meanwhile D's FD is repeatedly broadcasting that A is suspect, but no-one pays any attention.

In an ideal world, I'd think it ought to be up to D to spot that something has gone wrong. E.g., after a long enough period of reporting that A is suspect without seeing any change of view, it could deduce that there's a problem and become a singleton, or something like that. Then a merge should sort everything out in due course.
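
A rough sketch of what such a check on D might look like, purely for illustration. StaleViewWatchdog and its methods are hypothetical names, not JGroups API, and the caller is assumed to supply timestamps in milliseconds and to act on the result (for instance by installing a singleton view).

```java
/**
 * Hypothetical sketch, not JGroups code: tracks how long a member has been
 * reporting suspects without any view change, and says when to give up.
 */
final class StaleViewWatchdog {
    private final long limitMillis;   // how long unresolved suspicion is tolerated
    private boolean suspecting;       // true while a suspicion is unresolved
    private long suspectingSince;     // time of the first unresolved suspicion

    StaleViewWatchdog(long limitMillis) {
        this.limitMillis = limitMillis;
    }

    /** Call each time failure detection reports a suspect. */
    void onSuspect(long nowMillis) {
        if (!suspecting) {
            suspecting = true;
            suspectingSince = nowMillis; // only the first report starts the clock
        }
    }

    /** Call whenever a new view is installed; this resolves the suspicion. */
    void onViewChange() {
        suspecting = false;
    }

    /** True once suspects have gone unanswered for at least the limit. */
    boolean shouldBecomeSingleton(long nowMillis) {
        return suspecting && nowMillis - suspectingSince >= limitMillis;
    }
}
```

Repeated suspect reports deliberately do not reset the clock, since in the scenario above D keeps rebroadcasting the same suspicion with no effect.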

I'm actually experimenting with a workaround in which we only allow JOIN attempts to time out some maximum number of times; if they time out too often, the member becomes a singleton. I.e., I'm making a fix that allows A, B and C to proceed. Then I again expect a merge to sort everything out. This looks a lot easier to code up, and seems a plausible thing to want to do anyway.
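
The shape of that workaround, as a minimal standalone sketch rather than the actual patch: BoundedJoin, Outcome and the tryJoin callback are hypothetical names, not JGroups API. Each tryJoin call stands in for one round of "perform discovery, send a JOIN to the reported coordinator, wait for the reply", and returns false if the attempt timed out.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch, not JGroups code: give up on JOIN after a fixed number of timeouts. */
final class BoundedJoin {
    enum Outcome { JOINED, SINGLETON }

    /**
     * Runs tryJoin up to maxAttempts times. Returns JOINED on the first
     * success; returns SINGLETON if every attempt timed out, leaving a
     * later merge to reunite the group.
     */
    static Outcome join(BooleanSupplier tryJoin, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryJoin.getAsBoolean()) {
                return Outcome.JOINED;
            }
        }
        return Outcome.SINGLETON;
    }

    public static void main(String[] args) {
        // The reported coordinator is dead, so every attempt times out.
        System.out.println(join(() -> false, 3));
        // The coordinator answers on the first attempt.
        System.out.println(join(() -> true, 3));
    }
}
```

With an unbounded loop the first call would never return, which is the situation A, B and C are stuck in above.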

I have the test running and will see how this goes overnight. If it appears to work I'll submit a pull request; otherwise I'll think again.

