[jboss-jira] [JBoss JIRA] (JGRP-1451) Group gets stuck with inconsistent views
David Hotham (JIRA)
jira-events at lists.jboss.org
Wed May 30 10:51:18 EDT 2012
[ https://issues.jboss.org/browse/JGRP-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697216#comment-12697216 ]
David Hotham commented on JGRP-1451:
------------------------------------
Hi,
I'm using programmatic rather than XML configuration. In this test case my stack was:
TCP
TCPPING
MERGE3
FD_SOCK
FD
NAKACK2
UNICAST2
STABLE
SEQUENCER
GMS
UFC
MFC
FRAG2
(though I've since reinstated VERIFY_SUSPECT, which I think helps to avoid the asymmetry at the start of the sequence described above so probably makes this harder to hit).
My initial thinking on this one is that D ought to be running a timer and doing some sort of disaster recovery. It knows perfectly well that the view that it has can't be right, and is expecting a new view. After that doesn't happen for some time, I suggest that it ought to be taking some sort of action: for instance becoming a singleton and starting over.
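A minimal sketch of the recovery timer suggested above, assuming a member can tell when its view is stale. `ViewWatchdog`, its methods, and the 30-second timeout are hypothetical names and values for illustration, not part of JGroups. Time is passed in explicitly so the logic can be driven by any periodic timer and checked deterministically.

```java
import java.util.List;

/** Hypothetical watchdog, not JGroups code: fall back to a singleton view on timeout. */
final class ViewWatchdog {
    private final String self;
    private final long timeoutMs;
    private List<String> view;
    private long deadlineMs = -1;   // -1 means no new view is currently awaited

    ViewWatchdog(String self, List<String> initialView, long timeoutMs) {
        this.self = self;
        this.view = List.copyOf(initialView);
        this.timeoutMs = timeoutMs;
    }

    /** Call when the member knows its view cannot be right and expects a new one. */
    synchronized void expectNewView(long nowMs) {
        deadlineMs = nowMs + timeoutMs;
    }

    /** Call when a new view is actually installed; disarms the watchdog. */
    synchronized void viewInstalled(List<String> newView) {
        view = List.copyOf(newView);
        deadlineMs = -1;
    }

    /** Call periodically from a timer. Returns true if it gave up and became a singleton. */
    synchronized boolean tick(long nowMs) {
        if (deadlineMs >= 0 && nowMs >= deadlineMs) {
            view = List.of(self);   // start over alone; a later merge can rejoin the group
            deadlineMs = -1;
            return true;
        }
        return false;
    }

    synchronized List<String> currentView() {
        return view;
    }

    public static void main(String[] args) {
        ViewWatchdog d = new ViewWatchdog("D", List.of("B", "A", "C", "D"), 30_000);
        d.expectNewView(0);
        d.tick(10_000);
        System.out.println(d.currentView()); // still [B, A, C, D]
        d.tick(30_000);
        System.out.println(d.currentView()); // [D]
    }
}
```

Whether becoming a singleton is the right recovery action, and what timeout is safe, are open design questions; the sketch only shows the shape of the idea.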
However... may I request that JGRP-1449 be treated as higher priority than this one?
- I find JGRP-1449 much easier to hit. Of course I'd like this one fixed too, but I consider that JGRP-1451 is really quite an obscure case.
- I've submitted a pull request for JGRP-1449 so (I hope) it should be easier to fix too
- That pull request is currently the only fix that I'm running with that's not in your codebase; I'd very much like to re-join the official stream.
> Group gets stuck with inconsistent views
> ----------------------------------------
>
> Key: JGRP-1451
> URL: https://issues.jboss.org/browse/JGRP-1451
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 3.0.9
> Reporter: David Hotham
> Assignee: Bela Ban
> Fix For: 3.1
>
>
> Same stress test as in JGRP-1450 etc: a group of four members, keep killing two (picked at random), expect that the group will eventually heal itself.
> This one's rather a complicated sequence of events, if I've understood it correctly. I'll do my best to explain - but do ask if something's not clear or you'd like to see more details.
> * start with everyone agreeing that the view is [C, D, B, A]
> * kill C and D
> * On seeing this, A's FD_SOCK pinger tries but fails to connect to B
> ** I think this is a race where previously D was monitoring B, and now A wants to monitor B
> ** B hasn't yet spotted that D has gone, and so is not ready to accept a new connection from A
> ** This is a bit of a guess, but I don't think this detail is critical.
> * So now A suspects everyone else and forms a view [A].
> * Meanwhile B only suspects C and D, so forms a view [B, A]
> So far, I think, this is OK. The two sub-groups have different coordinators, so I expect that if everything stayed static here then in due course we'd get a merge and all would be well.
> * C and D restart. They both join B's sub-group.
> * So now A has [A], and B, C and D all have [B, A, C, D]
> Again, I think that this is still OK and should be resolved by a merge soon enough.
> * Now B and C are killed.
> ** D sees that the new view would be [A, D] in which it would not be coordinator. So it doesn't install any new view.
> ** A doesn't care
> I'm not sure what would happen if we left things alone now, i.e. whether the group would recover or not. But in fact the stress test restarted B and C, so we go on...
> * B and C restart. Now they both join A's subgroup (C first, as it happens).
> * So A, B and C all end up with the view [A, C, B]
> * Meanwhile D still thinks that the view is [B, A, C, D]
> Now we seem to have a problem (and in my test, this is where things stopped happening):
> * A declines to lead a merge: it regularly logs "I (10.239.0.1) won't be the merge leader"
> ** Presumably it is deciding that B would be a better merge leader
> * But B doesn't think that it's a coordinator, so it won't merge either.
> So we're stuck, with two different views!
> How is this situation expected to resolve itself?
> Thanks
> David
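The stuck state at the end of the quoted description can be reproduced in a toy model. This is a sketch of the reporter's hypothesis, not JGroups code: `MergeStallModel` and its election rule are assumptions. It treats the first member of each view as that view's coordinator, elects the first candidate under some ordering, and lets the elected member lead only if it is coordinator of its own view. The ordering that prefers B over A is also an assumption, since the report only says that A "presumably" prefers B.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model of the final state in the report; the election rule is hypothetical. */
final class MergeStallModel {

    /** By convention the coordinator is the first member of a view. */
    static String coordinator(List<String> view) {
        return view.get(0);
    }

    /** The views each member holds at the point where the test stopped making progress. */
    static Map<String, List<String>> finalViews() {
        Map<String, List<String>> views = new LinkedHashMap<>();
        views.put("A", List.of("A", "C", "B"));
        views.put("B", List.of("A", "C", "B"));
        views.put("C", List.of("A", "C", "B"));
        views.put("D", List.of("B", "A", "C", "D"));   // stale view, names B as coordinator
        return views;
    }

    /**
     * Elects the first coordinator candidate under the given ordering. The result is
     * empty when the elected member does not consider itself a coordinator, which is
     * the stall described in the report.
     */
    static Optional<String> mergeLeader(Map<String, List<String>> views, Comparator<String> order) {
        SortedSet<String> candidates = new TreeSet<>(order);
        for (List<String> view : views.values()) {
            candidates.add(coordinator(view));
        }
        String elected = candidates.first();
        boolean leads = coordinator(views.get(elected)).equals(elected);
        return leads ? Optional.of(elected) : Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, List<String>> views = finalViews();
        // Assumed ordering in which B is preferred over A: nobody leads the merge.
        System.out.println(mergeLeader(views, Comparator.<String>reverseOrder()));
        // Ordering in which A is preferred: A is a real coordinator and can lead.
        System.out.println(mergeLeader(views, Comparator.<String>naturalOrder()));
    }
}
```

In this model the stall comes from D's stale view nominating B as a coordinator while B's own view says otherwise. Electing only among members that confirm they are coordinators would avoid it in the model; whether that matches the actual merge protocol is not established here.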