David Hotham updated JGRP-1451:
-------------------------------
Attachment: 1451Repro.zip
I don't have a nice simple unit test repro. But I'm attaching the stress test
that I used to discover this issue.
It's very much in the spirit of previous repros that I've submitted: a Python
script running four JGroups nodes, and killing some of them every now and again.
The code has gotten a little more complicated since I last submitted one of these, mostly
because I've added proper tracing to the nodes.
Let me know if you have any trouble getting this running.
Group gets stuck with inconsistent views
----------------------------------------
Key: JGRP-1451
URL: https://issues.jboss.org/browse/JGRP-1451
Project: JGroups
Issue Type: Bug
Affects Versions: 3.0.9
Reporter: David Hotham
Assignee: Bela Ban
Fix For: 3.1
Attachments: 1451Repro.zip
Same stress test as in JGRP-1450 etc: a group of four members, keep killing two (picked
at random), expect that the group will eventually heal itself.
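
The shape of such a stress loop can be sketched roughly as follows. This is not the attached script: the names pick_victims and views_agree, and the idea of representing each node's view as a list of member names, are assumptions made purely for illustration.

```python
import random


def pick_victims(members, count=2, rng=random):
    """Choose `count` distinct members to kill, uniformly at random."""
    return rng.sample(sorted(members), count)


def views_agree(views):
    """True when every live node reports the same view and that view
    contains exactly the live nodes.

    `views` maps a live node's name to the view it currently reports
    (hypothetical representation, not the format used by the real test).
    """
    if not views:
        return True
    reported = list(views.values())
    first = reported[0]
    return all(v == first for v in reported) and set(first) == set(views)
```

A harness along these lines would kill the chosen victims, restart them, and then poll views_agree until it returns True or a timeout expires. The stall described below corresponds to the timeout case.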
This one's a rather complicated sequence of events, if I've understood it
correctly. I'll do my best to explain - but do ask if something's not clear or
you'd like to see more details.
* start with everyone agreeing that the view is [C, D, B, A]
* kill C and D
* On seeing this, A's FD_SOCK pinger tries but fails to connect to B
** I think this is a race where previously D was monitoring B, and now A wants to
monitor B
** B hasn't yet spotted that D has gone, and so is not ready to accept a new
connection from A
** This is a bit of a guess, but I don't think this detail is critical.
* So now A suspects everyone else and forms a view [A].
* Meanwhile B only suspects C and D, so forms a view [B, A]
So far, I think, this is OK. The two sub-groups have different coordinators, so I expect
that if everything stayed static here then in due course we'd get a merge and all
would be well.
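
The split above can be reproduced with a very small model. It assumes, as a simplification, that a node's new view is just the old view with its suspects removed and the remaining order preserved. The helper name surviving_view is made up for this sketch.

```python
def surviving_view(view, suspects):
    """Drop suspected members, keeping the original order of the rest.

    Assumption: this is a simplified stand-in for view calculation,
    not the actual JGroups logic.
    """
    return [m for m in view if m not in suspects]


old_view = ["C", "D", "B", "A"]

# A failed to connect to B, so it suspects everyone else.
view_at_a = surviving_view(old_view, {"C", "D", "B"})

# B only noticed that C and D are gone.
view_at_b = surviving_view(old_view, {"C", "D"})
```

Under this model view_at_a is ["A"] and view_at_b is ["B", "A"], so the first member differs between the two views, which matches the two sub-groups having different coordinators.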
* C and D restart. They both join B's sub-group.
* So now A has [A], and B, C and D all have [B, A, C, D]
Again, I think that this is still OK and should be resolved by a merge soon enough.
* Now B and C are killed.
** D sees that the new view would be [A, D] in which it would not be coordinator. So it
doesn't install any new view.
** A doesn't care
I'm not sure what would happen if we left things alone now, i.e. whether the group
would recover or not. But in fact the stress test restarted B and C, so we go on...
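
D's decision in this step can be written down as a small check. The rule used here (only the first member of the would-be view installs it) is an assumption inferred from the behaviour described above, and installs_new_view is a hypothetical name.

```python
def installs_new_view(member, view, dead):
    """Return True if `member` would install the view left after `dead`
    members are removed.

    Assumption: only the first member of the candidate view, i.e. its
    would-be coordinator, installs it.
    """
    candidate = [m for m in view if m not in dead]
    return bool(candidate) and candidate[0] == member


# D's sub-group view before B and C are killed.
view_at_d = ["B", "A", "C", "D"]
d_installs = installs_new_view("D", view_at_d, {"B", "C"})
```

Here the candidate view is [A, D], so d_installs is False. A is first in that candidate view but is running its own view [A], so under this model nobody installs a new view for D's sub-group.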
* B and C restart. Now they both join A's subgroup (C first, as it happens).
* So A, B and C all end up with the view [A, C, B]
* Meanwhile D still thinks that the view is [B, A, C, D]
Now we seem to have a problem (and in my test, this is where things stopped happening):
* A declines to lead a merge: it regularly logs "I (10.239.0.1) won't be the
merge leader"
** Presumably it is deciding that B would be a better merge leader
* But B doesn't think that it's a coordinator, so it won't merge either.
So we're stuck, with two different views!
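
The suspected deadlock can be modelled as follows. This is a sketch of the hypothesis above, not of the real merge code: the ordering that picks a merge leader is passed in as pick_leader because the actual rule is not established here, and all function names are invented for the example.

```python
def coordinator(view):
    """The first member of a view is treated as its coordinator."""
    return view[0]


# Views at the point where the test stalled, as described above.
views = {
    "A": ["A", "C", "B"],
    "B": ["A", "C", "B"],
    "C": ["A", "C", "B"],
    "D": ["B", "A", "C", "D"],
}


def claimed_coordinators(views):
    """Every member that some node's view names as coordinator."""
    return {coordinator(v) for v in views.values()}


def believes_it_leads(member, views):
    """A node acts as coordinator only if it is first in its own view."""
    return coordinator(views[member]) == member


def merge_attempt(views, pick_leader):
    """Return (leader, will_merge).

    `pick_leader` is a hypothetical stand-in for whatever ordering the
    real merge protocol applies to the claimed coordinators.
    """
    leader = pick_leader(claimed_coordinators(views))
    return leader, believes_it_leads(leader, views)


# If the ordering happens to favour B, nobody starts the merge.
stuck = merge_attempt(
    views, pick_leader=lambda coords: "B" if "B" in coords else min(coords)
)
```

In this model the claimed coordinators are A and B, and stuck is ("B", False): the chosen leader does not regard itself as a coordinator. With an ordering that picks A instead, the same function reports that a merge would start, which suggests the stall depends on A deferring to a member that does not believe it leads anything.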
How is this situation expected to resolve itself?
Thanks
David