[jboss-jira] [JBoss JIRA] (JGRP-1450) Views go wrong when two members leave simultaneously

David Hotham (JIRA) jira-events at lists.jboss.org
Sun Apr 15 08:38:17 EDT 2012


David Hotham created JGRP-1450:
----------------------------------

             Summary: Views go wrong when two members leave simultaneously
                 Key: JGRP-1450
                 URL: https://issues.jboss.org/browse/JGRP-1450
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 3.0.9
            Reporter: David Hotham
            Assignee: Bela Ban


The test case is essentially the same as in JGRP-1443 and JGRP-1449: i.e. a group of 4 members, where I simultaneously kill two at random and let them restart, and expect the group to heal itself.  In order to rule out SEQUENCER-related issues, I've removed SEQUENCER from the stack.

I've got into a situation where:
-  members A, B, C see the same sequence of views and end up in a group [A, B, C]
-  but member D believes that the latest view is [C, D, A].  

I _think_ I've identified the problem.  First, here's the relevant trace (from D):

2012-04-15 10:47:37.910 [ViewHandler,TestCluster,10.239.0.4] DEBUG org.jgroups.protocols.pbcast.GMS - suspected members=[10.239.0.3], suspected_mbrs=[10.239.0.3]
2012-04-15 10:47:37.961 [ViewHandler,TestCluster,10.239.0.4] DEBUG org.jgroups.protocols.pbcast.GMS - suspected members=[10.239.0.2], suspected_mbrs=[10.239.0.3, 10.239.0.2]
2012-04-15 10:47:37.961 [ViewHandler,TestCluster,10.239.0.4] DEBUG org.jgroups.protocols.pbcast.GMS - members are [10.239.0.3, 10.239.0.2, 10.239.0.4, 10.239.0.1], coord=10.239.0.4: I'm the new coord !
2012-04-15 10:47:38.011 [ViewHandler,TestCluster,10.239.0.4] TRACE org.jgroups.protocols.pbcast.GMS - 10.239.0.4: new members=[], suspected=[10.239.0.2], leaving=[], new view: [10.239.0.3|629] [10.239.0.3, 10.239.0.4, 10.239.0.1]
2012-04-15 10:47:38.012 [ViewHandler,TestCluster,10.239.0.4] TRACE org.jgroups.protocols.pbcast.GMS - 10.239.0.4: mcasting view [10.239.0.3|629] [10.239.0.3, 10.239.0.4, 10.239.0.1] (3 mbrs)

It looks to me as though D (10.239.0.4) received separate reports that C (10.239.0.3) and B (10.239.0.2) were suspected, and correctly spotted that in that case he'll be coordinator in a new group [D, A].  But then when he actually becomes coordinator, he only remembers that B is suspected, so he sends out a bogus view that still contains C.

If this is correct, I think that the bug is in ParticipantGmsImpl.java at the end of handleMembershipChange.  I think that the final loop should iterate over suspected_mbrs (before clearing this value) and not over suspectedMembers.
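
To make the suspected failure mode concrete, here is a minimal sketch in plain Java.  It is not the JGroups code: the class SuspectSketch and its methods suspect and survivors are hypothetical names invented for illustration, and the only things taken from the trace are the four addresses and the order of the two suspicion events.  It just shows how the resulting membership differs depending on whether the suspects from the latest event or the accumulated suspects are removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SuspectSketch {
    // Hypothetical stand-in for the accumulated set of suspects
    // (the role that suspected_mbrs appears to play in the trace).
    private final Set<String> accumulated = new LinkedHashSet<>();

    /** Records one suspicion event and returns a copy of everything suspected so far. */
    public Set<String> suspect(Collection<String> latestEvent) {
        accumulated.addAll(latestEvent);
        return new LinkedHashSet<>(accumulated);
    }

    /** Members of the current view that remain once the given suspects are removed. */
    public static List<String> survivors(List<String> members, Collection<String> suspects) {
        List<String> result = new ArrayList<>(members);
        result.removeAll(suspects);
        return result;
    }

    public static void main(String[] args) {
        // Membership as logged by D (10.239.0.4) just before becoming coordinator.
        List<String> members =
            Arrays.asList("10.239.0.3", "10.239.0.2", "10.239.0.4", "10.239.0.1");

        SuspectSketch d = new SuspectSketch();
        d.suspect(Arrays.asList("10.239.0.3"));           // first event: C suspected
        List<String> latest = Arrays.asList("10.239.0.2"); // second event: B suspected
        Set<String> all = d.suspect(latest);

        // Removing only the latest event's suspects keeps C in the view.
        System.out.println("latest event only: " + survivors(members, latest));
        // Removing the accumulated suspects leaves just D and A.
        System.out.println("accumulated:       " + survivors(members, all));
    }
}
```

With the addresses from the trace, the first line has the same membership as the bogus view D multicast ([10.239.0.3, 10.239.0.4, 10.239.0.1]), and the second is the [D, A] group that D should have formed.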

Perhaps this is a bit speculative - you'll be able to tell me if I'm on the wrong track!  

I'll keep the full trace so that we can do further analysis if required; and I'll try out a fix along the lines outlined above.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira