[jboss-jira] [JBoss JIRA] (JGRP-1485) JOIN attempts timing out indefinitely

Tue Jul 3 03:59:12 EDT 2012

    [ https://issues.jboss.org/browse/JGRP-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704281#comment-12704281 ] 

David Hotham commented on JGRP-1485:
------------------------------------

No, D is the only member on its address.

I only ever see D suspecting A.  Here's what FD is doing (every three seconds):

{noformat}
2012-06-25 17:24:15.079 [Timer-5,TestCluster,10.239.0.4] DEBUG org.jgroups.protocols.FD - sending are-you-alive msg to 10.239.0.3 (own address=10.239.0.4)
2012-06-25 17:24:15.079 [Timer-5,TestCluster,10.239.0.4] DEBUG org.jgroups.protocols.FD - broadcasting SUSPECT message [suspected_mbrs=[10.239.0.1]] to group
2012-06-25 17:24:15.079 [OOB-1,TestCluster,10.239.0.4] TRACE org.jgroups.protocols.FD - [SUSPECT] suspect hdr is SUSPECT (suspected_mbrs=[10.239.0.1], from=10.239.0.4)
{noformat}

Could you point me at the bit of code where it should move on to suspecting other members, so I can take a look?

> JOIN attempts timing out indefinitely
> -------------------------------------
>
>                 Key: JGRP-1485
>                 URL: https://issues.jboss.org/browse/JGRP-1485
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.0.10
>            Reporter: David Hotham
>            Assignee: Bela Ban
>             Fix For: 3.1
>
>
> The good news is that my testing is no longer hitting JGRP-1451-type issues (I'm running with the latest master, plus my pull request 54).
> The bad news is that this seems to have unblocked me to find the next problem...
> I'm running the usual stress test where I kill and restart members, and verify that the group heals itself.  I've managed to get into a situation where:
> - A, B, and C all have no view at all (they're all repeatedly sending JOINs that time out)
> - D has got stuck with a view {B,C,D,A,C} (in which every member except D is in fact a dead instance).
> So what's happening on each of A, B and C is: 
> -  perform discovery
> -  decide based on information from D that the long-dead B is coordinator
> -  send a JOIN to that dead B
> -  this times out
> -  repeat
> Meanwhile, D's FD is repeatedly broadcasting that A is suspect, but no one pays any attention.
> In an ideal world, I'd think that it ought to be up to D to spot that something has gone wrong.  E.g. after a long enough period of reporting that A is suspect without seeing any change of view, it could deduce that there's a problem and become a singleton, or something like that.  Then a merge should sort everything out in due course.
> I'm actually experimenting with a workaround in which we only allow JOIN attempts to time out some maximum number of times; if they time out too often, the member becomes a singleton.  I.e. I'm making a fix that allows A, B and C to proceed.  Then I again expect a merge to sort everything out.  This looks a lot easier to code up, and seems a plausible thing to want to do anyway.
> I have the test running and will see how this goes overnight.  If it appears to work I'll submit a pull request; otherwise I'll think again.
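
The workaround described in the issue above (cap the number of timed-out JOIN attempts, then fall back to being a singleton) can be sketched roughly as follows. This is only an illustration of the control flow, not the JGroups implementation: the class BoundedJoin, the method joinOrBecomeSingleton, the Outcome enum and the maxAttempts parameter are all hypothetical names, and the BooleanSupplier stands in for "perform discovery, send a JOIN to the presumed coordinator, wait for a reply or a timeout".

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: none of these names are JGroups APIs.
final class BoundedJoin {

    enum Outcome { JOINED, SINGLETON }

    /**
     * Tries to join up to maxAttempts times.
     * attemptJoin is an assumed stand-in for one discovery + JOIN round;
     * it returns true if the JOIN was answered and false if it timed out.
     */
    static Outcome joinOrBecomeSingleton(BooleanSupplier attemptJoin, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attemptJoin.getAsBoolean()) {
                return Outcome.JOINED;
            }
            // Timed out: loop round and rediscover, as in the steps listed above.
        }
        // Too many timeouts: stop waiting on a possibly dead coordinator
        // and proceed alone, leaving it to a later merge to reunite the group.
        return Outcome.SINGLETON;
    }

    public static void main(String[] args) {
        // Simulates the stuck case: the presumed coordinator never answers.
        System.out.println(joinOrBecomeSingleton(() -> false, 3));
    }
}
```

In the stuck scenario above, every attempt by A, B and C would time out, so each would give up after maxAttempts rounds and become a singleton instead of retrying forever.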
