[jboss-jira] [JBoss JIRA] (JGRP-1426) Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out

Fri Mar 23 21:14:47 EDT 2012

    [ https://issues.jboss.org/browse/JGRP-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679081#comment-12679081 ] 

David Hotham commented on JGRP-1426:
------------------------------------

Hi, 

I think you have misunderstood what my script does.  You say that I "sleep randomly between 0 an 10 before you span the same member" - but that's not what I'm doing at all.  I'm killing D (or E) and sleeping between 0 and 10 seconds before starting the _different_ member E (or D).  
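
To make the sequence concrete, here is a minimal sketch of that loop in Java. It is not the attached script: the `Cluster` interface, the `sleeper` hook and the class name are hypothetical stand-ins for however the member processes are really stopped and started.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.LongConsumer;

class AlternatingRestart {
    /** Hypothetical hooks for stopping and starting a member process. */
    interface Cluster {
        void kill(String member);
        void start(String member);
    }

    /**
     * Each round: kill the running member (D or E), pause for a random
     * 0-10 seconds, then start the *other* member (E or D).
     * Returns the ordered list of actions taken, for inspection.
     */
    static List<String> run(Cluster cluster, Random random, int rounds, LongConsumer sleeper) {
        List<String> events = new ArrayList<>();
        String running = "D"; // assumption: D is the member running at the start
        for (int i = 0; i < rounds; i++) {
            String other = running.equals("D") ? "E" : "D";
            cluster.kill(running);
            events.add("kill " + running);
            sleeper.accept(random.nextInt(10_001)); // pause of 0..10000 ms
            cluster.start(other);
            events.add("start " + other);
            running = other;
        }
        return events;
    }
}
```

The member that is started is never the one that was just killed, which is the point of the paragraph above.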

Please be clear that the script I have submitted is a stress test, designed to demonstrate that there is a bug, and to reproduce that bug.  It is not supposed to show the exact behaviour of my actual application; it is supposed to show that applications can hit this issue.  (And I have hit it, in real life, which is why I went to the trouble of opening the issue and writing the repro.)

If it had been true that the only way to reproduce the bug was to kill D and then recreate D, then there would have been a possible workaround (i.e. to make sure that, once dead, D did not restart for some period of time).  However, it's surely not reasonable to suggest that a member E should not attempt to join the group too soon after some other member D has failed, is it?

Frankly, I don't understand why you're trying to change the test script to avoid the issue.  The whole point of the script is to reproduce the issue!  Wouldn't it be better to try to fix the bug?

Would it help to talk about this?  If there's a way to exchange contact details, I'd be more than happy to give you a call to discuss.

Thanks

David

> Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out
> -------------------------------------------------------------------------
>
>                 Key: JGRP-1426
>                 URL: https://issues.jboss.org/browse/JGRP-1426
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.1
>            Reporter: David Hotham
>            Assignee: Vladimir Blagojevic
>             Fix For: 3.1
>
>         Attachments: Repro23March2012.zip
>
>
> We have two sub-groups, [B, C, A] and [D].
> (1) D sends a MERGE_REQ to B.
> 2012-02-18 22:15:03.888 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-B: sending MERGE_REQ to [Member-D, Member-B]
> (2) B receives this.  There's some delay in processing it (I think because there's another merge or flush going on; but the exact reason doesn't matter for this issue).  When processing does start, B begins a flush.
> 2012-02-18 22:15:03.889 [OOB-2,Clumpy Test Cluster,Member-B] TRACE org.jgroups.protocols.TCP - received [dst: Member-B, src: Member-D (3 headers), size=0 bytes, flags=OOB], headers are GMS: GmsHeader[MERGE_REQ]: merge_id=Member-D::1, mbrs=[Member-B, Member-C, Member-A, Member-D], UNICAST2: DATA, seqno=1, conn_id=2, first, TCP: [channel_name=Clumpy Test Cluster]
> 2012-02-18 22:15:08.811 [OOB-2,Clumpy Test Cluster,Member-B] TRACE org.jgroups.protocols.pbcast.Merger - Member-B: got merge request from Member-D, merge_id=Member-D::1, mbrs=[Member-B, Member-A, Member-C, Member-D]
> (3) D times out waiting for the MERGE_RSP from B:
> 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-D: collected 1 merge response(s) in 5001 ms
> 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG org.jgroups.protocols.pbcast.Merger - merge leader Member-D did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> (4) D completes the (failed) merge and broadcasts STOP_FLUSH:
> 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-D: received all ACKs (1) for merge view MergeView::[Member-D|1] [Member-D], subgroups=[Member-D|0] [Member-D] in 7ms
> 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.GMS - Member-D: sending RESUME event
> 2012-02-18 22:15:08.897 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG org.jgroups.protocols.pbcast.FLUSH - Member-D: received RESUME, sending STOP_FLUSH to all
> But, since B is not a member of D's view, B does not receive this message.  
> (5) Now all future merge attempts fail, because B is stuck in a flush:
> 2012-02-18 22:15:31.186 [OOB-1,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> 2012-02-18 22:15:54.380 [OOB-2,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> 2012-02-18 22:16:13.705 [OOB-1,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> Note that I have implemented a workaround in my application where I:
> -  start a long-ish timer in the block() callback; and stop that timer in unblock()
> -  if the timer is allowed to pop, call channel.stopFlush()
> This seems to be allowing the group to recover.  Any comments on whether this is a good or bad idea would be appreciated.
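
The workaround described in the issue text above can be sketched roughly as follows. This is a minimal, JGroups-free illustration rather than the application's actual code: `FlushWatchdog`, `onBlock()`, `onUnblock()` and the timeout value are hypothetical names, and the action to run when the timer pops is passed in as a plain `Runnable`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class FlushWatchdog {
    // Single daemon thread, so the watchdog never keeps the JVM alive.
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "flush-watchdog");
                t.setDaemon(true);
                return t;
            });
    private final Runnable stopFlush; // assumption: wraps channel.stopFlush()
    private final long timeoutMs;     // the "long-ish" timeout
    private ScheduledFuture<?> pending;

    FlushWatchdog(Runnable stopFlush, long timeoutMs) {
        this.stopFlush = stopFlush;
        this.timeoutMs = timeoutMs;
    }

    /** Call from the block() callback: (re)start the timer. */
    synchronized void onBlock() {
        cancelPending();
        pending = timer.schedule(stopFlush, timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Call from the unblock() callback: the flush ended normally, stop the timer. */
    synchronized void onUnblock() {
        cancelPending();
    }

    private void cancelPending() {
        if (pending != null) {
            pending.cancel(false);
            pending = null;
        }
    }

    void shutdown() {
        timer.shutdownNow();
    }
}
```

In an application, the receiver's `block()` and `unblock()` callbacks would delegate to `onBlock()` and `onUnblock()`, and the `Runnable` would be expected to call `channel.stopFlush()` and handle any failure itself. That wiring is an assumption about the application and is not shown here.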
