[jboss-jira] [JBoss JIRA] (JGRP-1450) Views go wrong when two members leave simultaneously

Bela Ban (JIRA) jira-events at lists.jboss.org
Mon Apr 16 01:42:17 EDT 2012


     [ https://issues.jboss.org/browse/JGRP-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bela Ban updated JGRP-1450:
---------------------------

    Fix Version/s: 3.1

    
> Views go wrong when two members leave simultaneously
> ----------------------------------------------------
>
>                 Key: JGRP-1450
>                 URL: https://issues.jboss.org/browse/JGRP-1450
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.0.9
>            Reporter: David Hotham
>            Assignee: Bela Ban
>             Fix For: 3.1
>
>
> The test case is essentially the same as in JGRP-1443 and JGRP-1449: a group of 4 members, where I simultaneously kill two at random, let them restart, and expect the group to heal itself.  To rule out SEQUENCER-related issues, I've removed SEQUENCER from the stack.
> I've got into a situation where:
> -  members A, B, C see the same sequence of views and end up in a group [A, B, C]
> -  but member D believes that the latest view is [C, D, A].  
> I _think_ I've identified the problem.  First, here's the relevant trace (from D):
> 2012-04-15 10:47:37.910 [ViewHandler,TestCluster,10.239.0.4] DEBUG org.jgroups.protocols.pbcast.GMS - suspected members=[10.239.0.3], suspected_mbrs=[10.239.0.3]
> 2012-04-15 10:47:37.961 [ViewHandler,TestCluster,10.239.0.4] DEBUG org.jgroups.protocols.pbcast.GMS - suspected members=[10.239.0.2], suspected_mbrs=[10.239.0.3, 10.239.0.2]
> 2012-04-15 10:47:37.961 [ViewHandler,TestCluster,10.239.0.4] DEBUG org.jgroups.protocols.pbcast.GMS - members are [10.239.0.3, 10.239.0.2, 10.239.0.4, 10.239.0.1], coord=10.239.0.4: I'm the new coord !
> 2012-04-15 10:47:38.011 [ViewHandler,TestCluster,10.239.0.4] TRACE org.jgroups.protocols.pbcast.GMS - 10.239.0.4: new members=[], suspected=[10.239.0.2], leaving=[], new view: [10.239.0.3|629] [10.239.0.3, 10.239.0.4, 10.239.0.1]
> 2012-04-15 10:47:38.012 [ViewHandler,TestCluster,10.239.0.4] TRACE org.jgroups.protocols.pbcast.GMS - 10.239.0.4: mcasting view [10.239.0.3|629] [10.239.0.3, 10.239.0.4, 10.239.0.1] (3 mbrs)
> It looks to me as though D (10.239.0.4) received separate reports that C (10.239.0.3) and B (10.239.0.2) are suspected, and correctly worked out that in that case he'll be coordinator of a new group [D, A].  But when he actually becomes coordinator, he only remembers that B is suspected, so he sends out a bogus view that still contains C.
> If this is correct, I think that the bug is in ParticipantGmsImpl.java at the end of handleMembershipChange.  I think that the final loop should iterate over suspected_mbrs (before clearing this value) and not over suspectedMembers.
> Perhaps this is a bit speculative - you'll be able to tell me if I'm on the wrong track!  
> I'll keep the full trace so that we can do further analysis if required; and I'll try out a fix along the lines outlined above.
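
A minimal sketch of the failure mode the reporter hypothesises, replaying the
addresses from the trace. This is a simplified model and not the JGroups
source: the class SuspectViewSketch and its methods are hypothetical names,
addresses are plain strings, and the view id/creator is ignored. It only
contrasts building the view from the latest suspicion event with building it
from every suspect accumulated so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model for illustration only; not taken from ParticipantGmsImpl.
final class SuspectViewSketch {
    // Suspects accumulated across separate events (the role the report
    // attributes to suspected_mbrs).
    private final Set<String> accumulated = new LinkedHashSet<>();

    /** Records one suspicion event as it arrives. */
    void suspect(Collection<String> latestEvent) {
        accumulated.addAll(latestEvent);
    }

    /** Behaviour matching the trace: only the latest event is excluded. */
    static List<String> viewFromLatestEventOnly(List<String> members,
                                                Collection<String> latestEvent) {
        List<String> view = new ArrayList<>(members);
        view.removeAll(latestEvent);
        return view;
    }

    /** Proposed behaviour: exclude every accumulated suspect, then clear. */
    List<String> viewFromAccumulated(List<String> members) {
        List<String> view = new ArrayList<>(members);
        view.removeAll(accumulated);
        accumulated.clear();
        return view;
    }

    public static void main(String[] args) {
        List<String> members = Arrays.asList(
                "10.239.0.3", "10.239.0.2", "10.239.0.4", "10.239.0.1");
        SuspectViewSketch d = new SuspectViewSketch();
        d.suspect(Arrays.asList("10.239.0.3")); // first report (C)
        d.suspect(Arrays.asList("10.239.0.2")); // second report (B)

        // Prints [10.239.0.3, 10.239.0.4, 10.239.0.1]: same members as the bogus view
        System.out.println(viewFromLatestEventOnly(members,
                Arrays.asList("10.239.0.2")));
        // Prints [10.239.0.4, 10.239.0.1]: the expected [D, A]
        System.out.println(d.viewFromAccumulated(members));
    }
}
```

If the hypothesis holds, the first result corresponds to the three-member view
seen in the trace and the second to the view D should have sent.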


        

