[jboss-jira] [JBoss JIRA] Commented: (JGRP-985) Admin Join problem -GMS flush by coordinator failed

Wed Jun 17 22:21:56 EDT 2009

    [ https://jira.jboss.org/jira/browse/JGRP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12472620#action_12472620 ] 

Ronn C commented on JGRP-985:
-----------------------------

I'm now using 2.6.10.merge. I will be deploying it to prod soon, but I'm rather nervous about it.

> having three nodes fail at the same time.

Yes, I think your guess is right: something in the coordinator must be corrupted, and once the failure happens for one node, it will happen for any other node that wants to join.

I tried to debug the problem and got as far as this: the coordinator received the JOIN request and, as part of processing it, tried to do a flush. The flush returned false, so the coordinator did not generate a JOIN response or add the member.
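
To make the sequence above concrete, here is a minimal sketch of that join path as a toy model. It is not JGroups code: the names JoinFlushSketch, Coordinator, handleJoin and join are hypothetical, and the flush round is reduced to a boolean supplier. It only illustrates why a flush that keeps returning false leaves the joiner retrying forever.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Toy model of the join path described above; hypothetical names, not JGroups code. */
public class JoinFlushSketch {

    /** Hypothetical coordinator: a member is admitted only if the flush succeeds. */
    static final class Coordinator {
        final List<String> members = new ArrayList<>();
        final BooleanSupplier flush; // stands in for the real flush round (assumption)

        Coordinator(BooleanSupplier flush) {
            this.flush = flush;
        }

        /** Returns true if a JOIN response would be sent, false if the request is dropped. */
        boolean handleJoin(String joiner) {
            if (!flush.getAsBoolean()) {
                return false; // flush failed: member not added, no JOIN response
            }
            members.add(joiner);
            return true;
        }
    }

    /** Joiner side: retry until admitted or maxAttempts is used up. Returns attempts used, or -1. */
    static int join(Coordinator coord, String joiner, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (coord.handleJoin(joiner)) {
                return attempt;
            }
            // A real joiner would wait for its join timeout here before retrying.
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Flush fails twice, then succeeds: the joiner gets in after a few retries.
        Coordinator flaky = new Coordinator(() -> ++calls[0] >= 3);
        System.out.println("flaky coordinator: " + join(flaky, "node-B", 10));

        // Flush never succeeds: every JOIN request is dropped.
        Coordinator stuck = new Coordinator(() -> false);
        System.out.println("stuck coordinator: " + join(stuck, "node-C", 10));
    }
}
```

In this model the second case corresponds to what I see: the coordinator is alive and receives every request, but nothing ever reaches the joiner.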

The code that implements flush is rather hard to follow, and I was hoping there would be a document explaining how flush works in algorithmic form, to give me some context on what to look for and where. Is there such a thing? If not, could you explain it to me? I'll then try to debug it again.

> Admin  Join problem -GMS flush by coordinator failed
> ----------------------------------------------------
>
>                 Key: JGRP-985
>                 URL: https://jira.jboss.org/jira/browse/JGRP-985
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.6.3, 2.6.4, 2.6.5, 2.7
>         Environment: linux redhat 2.9.6, jdk 1.5
>            Reporter: Ronn C
>            Assignee: Vladimir Blagojevic
>             Fix For: 2.6.11, 2.8
>
>         Attachments: jgroup.tar.gz
>
>
> I am experiencing a problem with JGroups when a node tries to join an existing cluster.
>
> Occasionally, a new node joining an existing cluster can experience this problem.
>  
> 2009-05-21 12:04:02,568 [main] WARN org.jgroups.protocols.pbcast.GMS:144 - join(callisto.tmca.com.au-18715) sent to callisto.tmca.com.au-8185 timed out (after 3000 ms), retrying 
>  
> The number of retries can vary from a couple to retrying indefinitely.
>  
> Debugging the code, I've discovered that before a join the coordinator performs a GMS flush, and unless that flush succeeds it won't reply with a join response.
>  
> So sure enough at the coordinator, I see this log. 
> 2009-05-21 12:05:25,902 [ViewHandler,callisto.tmca.com.au-8185] WARN org.jgroups.protocols.pbcast.GMS:749 - GMS flush by coordinator at callisto.tmca.com.au-8185 failed 
>  
> I originally came across this problem in our prod environment with 2.6.3, and I have been able to replicate it reliably with that version. I have also tested with 2.7.0 and 2.8.0.alpha3: retries still occur, but generally the join sorts itself out within a minute. However, I've found that retries can still go on indefinitely on 2.8.0 if you repeat the test often enough.
