[jboss-jira] [JBoss JIRA] Commented: (JGRP-985) Admin Join problem -GMS flush by coordinator failed
Bulat Nigmatullin (JIRA)
jira-events at lists.jboss.org
Fri Jun 26 05:08:08 EDT 2009
[ https://jira.jboss.org/jira/browse/JGRP-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12473892#action_12473892 ]
Bulat Nigmatullin commented on JGRP-985:
----------------------------------------
I have the same issue with 2.8.0 Beta1. These are the logs of two servers (don't pay attention to the log timestamps; there is a small discrepancy between the server clocks):
Joining server:
2009.06.26 13:03:17 join(jabber01.rambler.ru-50481) sent to jabber02.rambler.ru-42629 timed out (after 7000 ms), retrying
2009.06.26 13:03:24 join(jabber01.rambler.ru-50481) sent to jabber02.rambler.ru-42629 timed out (after 7000 ms), retrying
2009.06.26 13:03:31 join(jabber01.rambler.ru-50481) sent to jabber02.rambler.ru-42629 timed out (after 7000 ms), retrying
2009.06.26 13:03:38 join(jabber01.rambler.ru-50481) sent to jabber02.rambler.ru-42629 timed out (after 7000 ms), retrying
2009.06.26 13:03:45 join(jabber01.rambler.ru-50481) sent to jabber02.rambler.ru-42629 timed out (after 7000 ms), retrying
2009.06.26 13:03:52 join(jabber01.rambler.ru-50481) sent to jabber02.rambler.ru-42629 timed out (after 7000 ms), retrying
2009.06.26 13:03:59 join(jabber01.rambler.ru-50481) sent to jabber02.rambler.ru-42629 timed out (after 7000 ms), retrying
2009.06.26 13:04:06 join(jabber01.rambler.ru-50481) sent to jabber02.rambler.ru-42629 timed out (after 7000 ms), retrying
2009.06.26 13:04:13 join(jabber01.rambler.ru-50481) sent to jabber02.rambler.ru-42629 timed out (after 7000 ms), retrying
2009.06.26 13:04:20 join(jabber01.rambler.ru-50481) sent to jabber02.rambler.ru-42629 timed out (after 7000 ms), retrying
2009.06.26 13:04:27 join(jabber01.rambler.ru-50481) sent to jabber02.rambler.ru-42629 timed out (after 7000 ms), retrying
Coordinator server:
2009.06.26 13:10:32 jabber02.rambler.ru-42629: GMS flush by coordinator failed
2009.06.26 13:10:47 jabber02.rambler.ru-42629: GMS flush by coordinator failed
2009.06.26 13:11:03 jabber02.rambler.ru-42629: GMS flush by coordinator failed
2009.06.26 13:11:20 jabber02.rambler.ru-42629: GMS flush by coordinator failed
2009.06.26 13:11:35 jabber02.rambler.ru-42629: GMS flush by coordinator failed
2009.06.26 13:11:37 jabber02.rambler.ru-42629: failed to collect all ACKs (expected=5) for view [jabber02.rambler.ru-42629|38] [jabber02.rambler.ru-42629, jabber15.rambler.ru-14938, jabber16.rambler.ru-24876, jabber17.rambler.ru-30400, jabber01.rambler.ru-48589, jabber01.rambler.ru-50481] after 2000ms, missing ACKs from [jabber15.rambler.ru-14938, jabber01.rambler.ru-48589] (received=[jabber17.rambler.ru-30400, jabber16.rambler.ru-24876, jabber02.rambler.ru-42629])
2009.06.26 13:12:58 jabber02.rambler.ru-42629: GMS flush by coordinator failed
2009.06.26 13:13:00 jabber02.rambler.ru-42629: failed to collect all ACKs (expected=5) for view [jabber02.rambler.ru-42629|39] [jabber02.rambler.ru-42629, jabber15.rambler.ru-14938, jabber16.rambler.ru-24876, jabber17.rambler.ru-30400, jabber01.rambler.ru-50481] after 2000ms, missing ACKs from [jabber15.rambler.ru-14938] (received=[jabber17.rambler.ru-30400, jabber16.rambler.ru-24876, jabber02.rambler.ru-42629, jabber01.rambler.ru-50481])
About two minutes after the joining server started, GMS was OK, but other flushes were failing:
Joining server:
2009.06.26 13:12:08 Flush failed at jabber01.rambler.ru-50481:Sessions by Hostname
2009.06.26 13:12:10 failed to collect all service ACKs (5) for [dst: <null>, src: jabber01.rambler.ru-50481 (4 headers), size=0 bytes] after 2000ms, missing ACKs from [jabber15.rambler.ru-14938] (received=[jabber17.rambler.ru-30400, jabber16.rambler.ru-24876, jabber02.rambler.ru-42629, jabber01.rambler.ru-50481]), local_addr=jabber01.rambler.ru-50481
2009.06.26 13:12:36 Flush failed at jabber01.rambler.ru-50481:Remote Users Existence
2009.06.26 13:12:38 failed to collect all service ACKs (5) for [dst: <null>, src: jabber01.rambler.ru-50481 (4 headers), size=0 bytes] after 2000ms, missing ACKs from [jabber15.rambler.ru-14938] (received=[jabber17.rambler.ru-30400, jabber16.rambler.ru-24876, jabber02.rambler.ru-42629, jabber01.rambler.ru-50481]), local_addr=jabber01.rambler.ru-50481
Member server:
2009.06.26 13:06:34 jabber15.rambler.ru-14938: discarded message from non-member jabber01.rambler.ru-50481, my view is [jabber02.rambler.ru-42629|17] [jabber02.rambler.ru-42629, jabber15.rambler.ru-14938, jabber16.rambler.ru-24876, jabber17.rambler.ru-30400]
2009.06.26 13:06:43 jabber15.rambler.ru-14938: discarded message from non-member jabber01.rambler.ru-50481, my view is [jabber02.rambler.ru-42629|17] [jabber02.rambler.ru-42629, jabber15.rambler.ru-14938, jabber16.rambler.ru-24876, jabber17.rambler.ru-30400]
2009.06.26 13:07:01 jabber15.rambler.ru-14938: discarded message from non-member jabber01.rambler.ru-50481, my view is [jabber02.rambler.ru-42629|17] [jabber02.rambler.ru-42629, jabber15.rambler.ru-14938, jabber16.rambler.ru-24876, jabber17.rambler.ru-30400]
2009.06.26 13:07:03 jabber15.rambler.ru-14938: discarded message from non-member jabber01.rambler.ru-50481, my view is [jabber02.rambler.ru-42629|17] [jabber02.rambler.ru-42629, jabber15.rambler.ru-14938, jabber16.rambler.ru-24876, jabber17.rambler.ru-30400]
2009.06.26 13:07:03 jabber15.rambler.ru-14938: discarded message from non-member jabber01.rambler.ru-50481, my view is [jabber02.rambler.ru-42629|17] [jabber02.rambler.ru-42629, jabber15.rambler.ru-14938, jabber16.rambler.ru-24876, jabber17.rambler.ru-30400]
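To see at a glance which member keeps holding up the ACK collection, the warnings above can be tallied with a small script. This is only a sketch and not part of JGroups: the function name `tally_missing_acks` and the regular expression are assumptions based on the "missing ACKs from [...]" format that appears in these logs, and the log file name in the comment is hypothetical.

```python
import re
from collections import Counter

# Assumed pattern: matches the "missing ACKs from [...]" part that appears in
# both the view-ACK warnings and the service-ACK (flush) warnings above.
MISSING_RE = re.compile(r"missing ACKs from \[([^\]]*)\]")


def tally_missing_acks(lines):
    """Count how often each member is listed as missing an ACK."""
    counts = Counter()
    for line in lines:
        match = MISSING_RE.search(line)
        if not match:
            continue  # not an ACK warning, skip it
        for member in match.group(1).split(","):
            member = member.strip()
            if member:
                counts[member] += 1
    return counts


# Hypothetical usage against a saved log file:
#   with open("coordinator.log") as f:
#       for member, n in tally_missing_acks(f).most_common():
#           print(n, member)
```

Applied to the four ACK warnings quoted above, this counts jabber15.rambler.ru-14938 four times and jabber01.rambler.ru-48589 once, which fits the member log where jabber15 is still on view 17.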
> Admin Join problem -GMS flush by coordinator failed
> ----------------------------------------------------
>
> Key: JGRP-985
> URL: https://jira.jboss.org/jira/browse/JGRP-985
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.6.3, 2.6.4, 2.6.5, 2.7
> Environment: linux redhat 2.9.6, jdk 1.5
> Reporter: Ronn C
> Assignee: Vladimir Blagojevic
> Fix For: 2.6.11, 2.8
>
> Attachments: jgroup.tar.gz
>
>
> I am experiencing a problem with JGroups when a node tries to join an existing cluster.
>
> Occasionally, a new node joining an existing cluster runs into this problem.
>
> 2009-05-21 12:04:02,568 [main] WARN org.jgroups.protocols.pbcast.GMS:144 - join(callisto.tmca.com.au-18715) sent to callisto.tmca.com.au-8185 timed out (after 3000 ms), retrying
>
> The number of retries varies from a couple to retrying indefinitely.
>
> Debugging the code, I've discovered that before a join the coordinator performs a GMS flush, and unless that flush succeeds it won't reply with a join response.
>
> Sure enough, at the coordinator I see this log:
> 2009-05-21 12:05:25,902 [ViewHandler,callisto.tmca.com.au-8185] WARN org.jgroups.protocols.pbcast.GMS:749 - GMS flush by coordinator at callisto.tmca.com.au-8185 failed
>
> I originally came across this problem in our prod environment with 2.6.3, and I have been able to replicate it reliably with that version. I have tested with 2.7.0 and 2.8.0.alpha3; retries still occur, but generally the join sorts itself out within a minute. However, I've found that retries can still go on indefinitely on 2.8.0 if you repeat the test often enough.
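The dependency described in the report (the coordinator answers a join request only after its flush succeeds, otherwise the joiner times out and retries) can be illustrated with a toy model. This is a sketch under stated assumptions, not JGroups code: `simulate_join`, `flush_outcomes`, and `max_attempts` are hypothetical names, and the flush results are supplied by the caller instead of coming from a real cluster.

```python
def simulate_join(flush_outcomes, max_attempts):
    """Toy model of the join handshake described in the report.

    flush_outcomes: iterable of booleans, one per join attempt
                    (True means the coordinator's flush succeeded).
    Returns the 1-based attempt that received a join response,
    or None if every attempt timed out.
    """
    outcomes = iter(flush_outcomes)
    for attempt in range(1, max_attempts + 1):
        # If no outcome is supplied for this attempt, treat the flush as failed.
        flush_ok = next(outcomes, False)
        if flush_ok:
            # Flush succeeded, so the coordinator sends the join response.
            return attempt
        # No response: the joiner's timeout expires and it retries.
    return None


# Two failed flushes followed by a successful one: the third attempt joins.
print(simulate_join([False, False, True], max_attempts=5))
```

In this model a flush that never succeeds means the joiner never gets a response, which corresponds to the endless "timed out ... retrying" lines in the logs.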