[jboss-jira] [JBoss JIRA] (JGRP-1426) Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out

Sat Feb 25 06:48:37 EST 2012

    [ https://issues.jboss.org/browse/JGRP-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671023#comment-12671023 ] 

David Hotham commented on JGRP-1426:
------------------------------------

Hi, 

Have you had a chance to look at this?  Does my repro help?

When I initially described this problem in the mailing list thread "Member cannot join group due to stuck FLUSH" I actually reported two issues, though I've been concentrating on the second.

I've now been taking another look at the first, in which the restarted D initially fails to join the group and forms a group [D].  Obviously if we don't hit partitioned groups in the first place, we're much less likely to hit this bug in merging groups.

The formation of [D] happens because D pays attention to a delayed JOIN_RSP, even though it has already timed out the JOIN_REQ and embarked on a new round of discovery.

This looks very much like [JGRP-687|https://issues.jboss.org/browse/JGRP-687], which seems to have been fixed at [aa1908f8|https://github.com/belaban/JGroups/commits/aa1908f8], but then (partially) undone at [e9c6e5e9|https://github.com/belaban/JGroups/commits/e9c6e5e9].

It's not obvious to me what that second commit was doing in ClientGmsImpl.  I don't say that it was wrong - only that it's not obvious to me!

I can confirm that undoing that change (in ClientGmsImpl only) significantly improves the behaviour of the system in the repro submitted above.

Any idea why [e9c6e5e9|https://github.com/belaban/JGroups/commits/e9c6e5e9] made that change?  Can the commented out code be safely reinstated?

> Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out
> -------------------------------------------------------------------------
>
>                 Key: JGRP-1426
>                 URL: https://issues.jboss.org/browse/JGRP-1426
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.1
>            Reporter: David Hotham
>            Assignee: Vladimir Blagojevic
>             Fix For: 3.1
>
>
> We have two sub-groups, [B, C, A] and [D].
> (1) D sends a MERGE_REQ to B.
> 2012-02-18 22:15:03.888 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-B: sending MERGE_REQ to [Member-D, Member-B]
> (2) B receives this.  There's some delay in processing it (I think because there's another merge or flush going on; but the exact reason doesn't matter for this issue).  When processing does start, B begins a flush.
> 2012-02-18 22:15:03.889 [OOB-2,Clumpy Test Cluster,Member-B] TRACE org.jgroups.protocols.TCP - received [dst: Member-B, src: Member-D (3 headers), size=0 bytes, flags=OOB], headers are GMS: GmsHeader[MERGE_REQ]: merge_id=Member-D::1, mbrs=[Member-B, Member-C, Member-A, Member-D], UNICAST2: DATA, seqno=1, conn_id=2, first, TCP: [channel_name=Clumpy Test Cluster]
> 2012-02-18 22:15:08.811 [OOB-2,Clumpy Test Cluster,Member-B] TRACE org.jgroups.protocols.pbcast.Merger - Member-B: got merge request from Member-D, merge_id=Member-D::1, mbrs=[Member-B, Member-A, Member-C, Member-D]
> (3) D times out waiting for the MERGE_RSP from B:
> 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-D: collected 1 merge response(s) in 5001 ms
> 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG org.jgroups.protocols.pbcast.Merger - merge leader Member-D did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> (4) D completes the (failed) merge and broadcasts STOP_FLUSH:
> 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-D: received all ACKs (1) for merge view MergeView::[Member-D|1] [Member-D], subgroups=[Member-D|0] [Member-D] in 7ms
> 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.GMS - Member-D: sending RESUME event
> 2012-02-18 22:15:08.897 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG org.jgroups.protocols.pbcast.FLUSH - Member-D: received RESUME, sending STOP_FLUSH to all
> But, since B is not a member of D's view, B does not receive this message.  
> (5) Now all future merge attempts fail, because B is stuck in a flush:
> 2012-02-18 22:15:31.186 [OOB-1,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> 2012-02-18 22:15:54.380 [OOB-2,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> 2012-02-18 22:16:13.705 [OOB-1,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> Note that I have implemented a workaround in my application where I:
> -  start a long-ish timer in the block() callback; and stop that timer in unblock()
> -  if the timer is allowed to pop, call channel.stopFlush()
> This seems to be allowing the group to recover.  Any comments on whether this is a good or bad idea would be appreciated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira