[jboss-jira] [JBoss JIRA] (JGRP-1485) JOIN attempts timing out indefinitely
David Hotham (JIRA)
jira-events at lists.jboss.org
Tue Jun 26 06:33:12 EDT 2012
[ https://issues.jboss.org/browse/JGRP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703288#comment-12703288 ]
David Hotham commented on JGRP-1485:
------------------------------------
This looks to be working, so I've submitted the pull request. I am seeing a follow-up issue, but I think it'll be cleaner if I raise a fresh ticket for that... watch this space.
> JOIN attempts timing out indefinitely
> -------------------------------------
>
> Key: JGRP-1485
> URL: https://issues.jboss.org/browse/JGRP-1485
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 3.0.10
> Reporter: David Hotham
> Assignee: Bela Ban
>
> The good news is that my testing is currently avoiding JGRP-1451-type issues (I'm running the latest master plus my pull request 54).
> The bad news is that this seems to have unblocked me to find the next problem...
> I'm running the usual stress test where I kill and restart members, and verify that the group heals itself. I've managed to get into a situation where:
> - A, B, and C all have no view at all (they're all repeatedly sending JOINs that time out)
> - D has got stuck with a view {B,C,D,A,C} (in which every member except D is in fact a dead instance).
> So what's happening on each of A, B and C is:
> - perform discovery
> - decide based on information from D that the long-dead B is coordinator
> - send a JOIN to that dead B
> - this times out
> - repeat
> Meanwhile, D's FD is repeatedly broadcasting that A is suspect, but no one pays any attention.
> In an ideal world, I'd think that it ought to be up to D to spot that something has gone wrong. E.g., after a long enough period of reporting that A is suspect without seeing any change of view, it could deduce that there's a problem and become a singleton, or something like that. Then a merge should sort everything out in due course.
> I'm actually experimenting with a workaround in which we only allow JOIN attempts to time out some maximum number of times; if they time out too often, the member becomes a singleton. I.e., I'm making a fix that allows A, B and C to proceed. Then I again expect a merge to sort everything out. This looks a lot easier to code up, and seems a plausible thing to want to do anyway.
> I have the test running and will see how this goes overnight. If it looks to work I'll submit a pull request; else I'll think again.
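
The workaround described in the issue (cap the number of JOIN timeouts, then become a singleton and let a later merge repair the group) might look roughly like the sketch below. This is a standalone illustration, not the code from the pull request and not the JGroups API: the class `BoundedJoin`, the `Outcome` enum, the `joinAttempt` callback and the `maxJoinAttempts` parameter are all hypothetical names chosen for the example. Each call to `joinAttempt` stands in for one round of "perform discovery, send a JOIN to the presumed coordinator, wait for a response or a timeout".

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a bounded JOIN retry loop; not JGroups code. */
final class BoundedJoin {

    /** Result of the join procedure in this sketch. */
    enum Outcome { JOINED, BECAME_SINGLETON }

    /**
     * Tries to join up to maxJoinAttempts times.
     *
     * @param joinAttempt     assumed callback: runs discovery, sends one JOIN and
     *                        returns true if the coordinator answered in time,
     *                        false if the attempt timed out
     * @param maxJoinAttempts assumed limit on timed-out attempts (must be >= 1)
     */
    static Outcome join(BooleanSupplier joinAttempt, int maxJoinAttempts) {
        if (maxJoinAttempts < 1) {
            throw new IllegalArgumentException("maxJoinAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxJoinAttempts; attempt++) {
            if (joinAttempt.getAsBoolean()) {
                return Outcome.JOINED;
            }
            // Timed out: the presumed coordinator may be a dead instance.
        }
        // Give up on the stale coordinator and proceed alone;
        // a later merge is expected to reunite the group.
        return Outcome.BECAME_SINGLETON;
    }

    public static void main(String[] args) {
        // Simulates the reported situation: the coordinator never answers.
        int[] calls = {0};
        Outcome outcome = join(() -> { calls[0]++; return false; }, 3);
        System.out.println(outcome + " after " + calls[0] + " attempts");
    }
}
```

The point of the design is that a joiner no longer depends on the stale view held by D: after the limit is reached, A, B and C each make progress on their own, and the existing merge handling is left to bring the singletons back together.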