[jboss-jira] [JBoss JIRA] (JGRP-1451) Group gets stuck with inconsistent views

David Hotham (JIRA) jira-events at lists.jboss.org
Sun Apr 15 14:28:17 EDT 2012


David Hotham created JGRP-1451:
----------------------------------

             Summary: Group gets stuck with inconsistent views
                 Key: JGRP-1451
                 URL: https://issues.jboss.org/browse/JGRP-1451
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 3.0.9
            Reporter: David Hotham
            Assignee: Bela Ban


This is the same stress test as in JGRP-1450 and related issues: in a group of four members, keep killing two (picked at random) and expect the group to heal itself eventually.

This one is a rather complicated sequence of events, if I've understood it correctly.  I'll do my best to explain, but do ask if something's not clear or you'd like to see more details.

*  start with everyone agreeing that the view is [C, D, B, A]
*  kill C and D
*  On seeing this, A's FD_SOCK pinger tries but fails to connect to B
**  I think this is a race where previously D was monitoring B, and now A wants to monitor B
**  B hasn't yet spotted that D has gone, and so is not ready to accept a new connection from A
**  This is a bit of a guess, but I don't think this detail is critical.
*  So now A suspects everyone else and forms a view [A].  
*  Meanwhile B only suspects C and D, so forms a view [B, A]
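
To make the FD_SOCK guess above concrete, here is a minimal sketch of ring-style neighbour selection, where each member monitors the next live member after itself in view order. It is plain Java, not JGroups code: the class and method names are hypothetical, and the "next member in the view, wrapping around" rule is my assumption about how the ping destination is chosen.

```java
import java.util.*;

// Hypothetical sketch, not JGroups code: picks the member to monitor
// under the ASSUMPTION that each member watches the next non-suspected
// member after itself in view order, wrapping around at the end.
class RingNeighborSketch {

    // Returns the member 'self' would monitor, or null if nobody is left.
    static String pingDest(List<String> view, String self, Set<String> suspected) {
        int start = view.indexOf(self);
        if (start < 0) {
            return null; // 'self' is not in this view
        }
        for (int i = 1; i < view.size(); i++) {
            String candidate = view.get((start + i) % view.size());
            if (!suspected.contains(candidate)) {
                return candidate;
            }
        }
        return null; // every other member is suspected
    }

    public static void main(String[] args) {
        List<String> view = Arrays.asList("C", "D", "B", "A");
        Set<String> none = new HashSet<>();
        Set<String> dead = new HashSet<>(Arrays.asList("C", "D"));

        System.out.println("D monitors " + pingDest(view, "D", none));
        System.out.println("A monitors " + pingDest(view, "A", none));
        System.out.println("A monitors " + pingDest(view, "A", dead) + " once C and D are suspected");
    }
}
```

Under that assumption D monitors B in the original view, and A switches from C to B once C and D are suspected, which matches the hand-over described in the race above.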

So far, I think, this is OK.  The two sub-groups have different coordinators, so I expect that if everything stayed static here then in due course we'd get a merge and all would be well.  

*  C and D restart.  They both join B's sub-group.
*  So now A has [A], and B, C and D all have [B, A, C, D]

Again, I think that this is still OK and should be resolved by a merge soon enough.

*  Now B and C are killed.
**  D sees that the new view would be [A, D] in which it would not be coordinator.  So it doesn't install any new view.
**  A doesn't care
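
D's decision here can be sketched as follows. This is an illustration in plain Java with hypothetical names, assuming that the first member of a view is its coordinator and that only the would-be coordinator installs the next view.

```java
import java.util.*;

// Hypothetical sketch, not JGroups code. ASSUMPTIONS: the first member
// of a view is its coordinator, and only the member that would be
// coordinator of the surviving view installs that view.
class NextViewSketch {

    // The current view with the dead members removed, order preserved.
    static List<String> survivingView(List<String> view, Set<String> dead) {
        List<String> survivors = new ArrayList<>();
        for (String member : view) {
            if (!dead.contains(member)) {
                survivors.add(member);
            }
        }
        return survivors;
    }

    // True if 'self' would be first (coordinator) in the surviving view.
    static boolean shouldInstall(String self, List<String> view, Set<String> dead) {
        List<String> survivors = survivingView(view, dead);
        return !survivors.isEmpty() && survivors.get(0).equals(self);
    }

    public static void main(String[] args) {
        List<String> viewAtD = Arrays.asList("B", "A", "C", "D");
        Set<String> dead = new HashSet<>(Arrays.asList("B", "C"));

        System.out.println("surviving view: " + survivingView(viewAtD, dead));
        System.out.println("D installs it:  " + shouldInstall("D", viewAtD, dead));
    }
}
```

From D's point of view the surviving view is [A, D], so D waits for A to act. But A is still in its own view [A] and never saw B, C or D leave, so nothing happens.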

I'm not sure what would happen if we left things alone now, i.e. whether the group would recover or not.  But in fact the stress test restarted B and C, so we go on...

*  B and C restart.  Now they both join A's subgroup (C first, as it happens).
*  So A, B and C all end up with the view [A, C, B]
*  Meanwhile D still thinks that the view is [B, A, C, D]

Now we seem to have a problem (and in my test, this is where things stopped happening):

*  A declines to lead a merge: it regularly logs "I (10.239.0.1) won't be the merge leader"
**  Presumably it is deciding that B would be a better merge leader
*  But B doesn't think that it's a coordinator, so it won't merge either.

So we're stuck, with two different views!
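
A minimal sketch of why nobody starts the merge, as I understand it. This is plain Java with hypothetical names, not JGroups code. It assumes that the merge leader is the first of the known coordinators under some fixed ordering, that B happens to sort ahead of A in that ordering (my guess from A's log message), and that a member only leads a merge if it is coordinator of its own view.

```java
import java.util.*;

// Hypothetical sketch, not JGroups code. Models the final state above.
class StuckMergeSketch {

    // The view each member currently has; first entry = coordinator.
    static final Map<String, List<String>> VIEWS = new LinkedHashMap<>();
    static {
        VIEWS.put("A", Arrays.asList("A", "C", "B"));
        VIEWS.put("B", Arrays.asList("A", "C", "B"));
        VIEWS.put("C", Arrays.asList("A", "C", "B"));
        VIEWS.put("D", Arrays.asList("B", "A", "C", "D"));
    }

    // ASSUMPTION: some total ordering of addresses in which B precedes A.
    static final List<String> ASSUMED_ORDER = Arrays.asList("B", "A", "C", "D");

    static String coordinatorSeenBy(String member) {
        return VIEWS.get(member).get(0);
    }

    static boolean isCoordinator(String member) {
        return member.equals(coordinatorSeenBy(member));
    }

    // ASSUMPTION: the leader is the first of all reported coordinators.
    static String preferredLeader() {
        Comparator<String> order =
            Comparator.comparingInt((String m) -> ASSUMED_ORDER.indexOf(m));
        SortedSet<String> coordinators = new TreeSet<>(order);
        for (String member : VIEWS.keySet()) {
            coordinators.add(coordinatorSeenBy(member));
        }
        return coordinators.first();
    }

    // Members that are both a coordinator and the preferred leader.
    static List<String> membersWillingToLead() {
        List<String> willing = new ArrayList<>();
        String leader = preferredLeader();
        for (String member : VIEWS.keySet()) {
            if (isCoordinator(member) && member.equals(leader)) {
                willing.add(member);
            }
        }
        return willing;
    }

    public static void main(String[] args) {
        System.out.println("preferred leader: " + preferredLeader());
        System.out.println("willing to lead:  " + membersWillingToLead());
    }
}
```

Under these assumptions D's stale view nominates B as a coordinator, B is preferred as leader over A, and B does not consider itself a coordinator, so the list of members willing to lead comes out empty.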

How is this situation expected to resolve itself?

Thanks

David
