[jboss-jira] [JBoss JIRA] (JGRP-1426) Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out

Thu Apr 5 09:39:50 EDT 2012

    [ https://issues.jboss.org/browse/JGRP-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682062#comment-12682062 ] 

Bela Ban commented on JGRP-1426:
--------------------------------

Hi David,

Re your comment on 04/Apr/12 4:15 PM:

If you need all messages to be delivered in the same order everywhere, and views as well, then all you need is SEQUENCER !

A view is genererated by GMS and then broadcast to the current cluster (and unicast to the joiner), and that broadcast is totally ordered with respect to the other broadcasts in the cluster.

So IMO you don't need FLUSH at all ! I suggest try this out and see whether you get the results you want.

Re FLUSH: I'dstill like to fix this, as it seems to be a bug, so Vladimir can you look into that state FLUSH gets into when the STOP-FLUSH is missing ?

David, would you like to discuss this on the phone ?

> Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out
> -------------------------------------------------------------------------
>
>                 Key: JGRP-1426
>                 URL: https://issues.jboss.org/browse/JGRP-1426
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.1
>            Reporter: David Hotham
>            Assignee: Vladimir Blagojevic
>             Fix For: 3.1
>
>         Attachments: Hotham.java, Repro23March2012.zip
>
>
> We have two sub-groups, [B, C, A] and [D].
> (1) D sends a MERGE_REQ to B.
> 2012-02-18 22:15:03.888 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-B: sending MERGE_REQ to [Member-D, Member-B]
> (2) B receives this.  There's some delay in processing it (I think because there's another merge or flush going on; but the exact reason doesn't matter for this issue).  When processing does start, B begins a flush.
> 2012-02-18 22:15:03.889 [OOB-2,Clumpy Test Cluster,Member-B] TRACE org.jgroups.protocols.TCP - received [dst: Member-B, src: Member-D (3 headers), size=0 bytes, flags=OOB], headers are GMS: GmsHeader[MERGE_REQ]: merge_id=Member-D::1, mbrs=[Member-B, Member-C, Member-A, Member-D], UNICAST2: DATA, seqno=1, conn_id=2, first, TCP: [channel_name=Clumpy Test Cluster]
> 2012-02-18 22:15:08.811 [OOB-2,Clumpy Test Cluster,Member-B] TRACE org.jgroups.protocols.pbcast.Merger - Member-B: got merge request from Member-D, merge_id=Member-D::1, mbrs=[Member-B, Member-A, Member-C, Member-D]
> (3) D times out waiting for the MERGE_RSP from B:
> 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-D: collected 1 merge response(s) in 5001 ms
> 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG org.jgroups.protocols.pbcast.Merger - merge leader Member-D did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> (4) D completes the (failed) merge and broadcasts STOP_FLUSH:
> 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-D: received all ACKs (1) for merge view MergeView::[Member-D|1] [Member-D], subgroups=[Member-D|0] [Member-D] in 7ms
> 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.GMS - Member-D: sending RESUME event
> 2012-02-18 22:15:08.897 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG org.jgroups.protocols.pbcast.FLUSH - Member-D: received RESUME, sending STOP_FLUSH to all
> But, since B is not a member of D's view, B does not receive this message.  
> (5) Now all future merge attempts fail, because B is stuck in a flush:
> 2012-02-18 22:15:31.186 [OOB-1,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> 2012-02-18 22:15:54.380 [OOB-2,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> 2012-02-18 22:16:13.705 [OOB-1,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> Note that I have implemented a workaround in my application where I:
> -  start a long-ish timer in the block() callback; and stop that timer in unblock()
> -  if the timer is allowed to pop, call channel.stopFlush()
> This seems to be allowing the group to recover.  Any comments on whether this is a good or bad idea would be appreciated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira