[jboss-jira] [JBoss JIRA] (JGRP-1426) Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out

Wed Mar 21 16:06:49 EDT 2012

    [ https://issues.jboss.org/browse/JGRP-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678448#comment-12678448 ] 

David Hotham edited comment on JGRP-1426 at 3/21/12 4:05 PM:
-------------------------------------------------------------

Hi, 

The python code looks to be failing trying to execute the java program.  In my repro I'd built the code into a jar, so the command in that subprocess.Popen() line is appropriate.  You should just need to change the string so that it contains the right command line to execute the java code in your environment.

I can add a longer delay to the repro if you like, and I expect that it will make the problem be hit less frequently.  I'm not sure what this would prove, though.  I admit that my repro script is a little artificial - but that's because it's designed to demonstrate the window condition that causes this bug.  Certainly if I don't try to hit the bug, I'm less likely to hit it...!

Anyway, probably the best thing would be for you to be able to reproduce this yourself.  Let me know if you have any more trouble with the script.

David

PS I have submitted a pull request reinstating the JGRP-687 fix which, as above, definitely does make this bug (JGRP-1426) harder to hit.  So far it has neither been accepted nor rejected.  Do you have any insight as to what the intentions are for that pull request?

      was (Author: dimbleby):
    Hi, 

The python code looks to be failing trying to execute the java program.  In my repro I'd built the code into a jar, so the command in that subprocess.Popen() line is appropriate.  You should just need to change the string so that it contains the right command line to execute the java code in your environment.

I can add a longer delay to the repro if you like, and I expect that it will make the problem be hit less frequently.  I'm not sure what this would prove, though.  I admit that my repro script is a little artificial - but that's because it's designed to demonstrate the window condition that causes this bug.  Certainly if I don't try to it the bug, I'm less likely to hit it...!

Anyway, probably the best thing would be for you to be able to reproduce this yourself.  Let me know if you have any more trouble with the script.

David

PS I have submitted a pull request reinstating the JGRP-687 fix which, as above, definitely does make this bug (JGRP-1426) harder to hit.  So far it has neither been accepted nor rejected.  Do you have any insight as to what the intentions are for that pull request?

> Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out
> -------------------------------------------------------------------------
>
>                 Key: JGRP-1426
>                 URL: https://issues.jboss.org/browse/JGRP-1426
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.1
>            Reporter: David Hotham
>            Assignee: Vladimir Blagojevic
>             Fix For: 3.1
>
>
> We have two sub-groups, [B, C, A] and [D].
> (1) D sends a MERGE_REQ to B.
> 2012-02-18 22:15:03.888 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-B: sending MERGE_REQ to [Member-D, Member-B]
> (2) B receives this.  There's some delay in processing it (I think because there's another merge or flush going on; but the exact reason doesn't matter for this issue).  When processing does start, B begins a flush.
> 2012-02-18 22:15:03.889 [OOB-2,Clumpy Test Cluster,Member-B] TRACE org.jgroups.protocols.TCP - received [dst: Member-B, src: Member-D (3 headers), size=0 bytes, flags=OOB], headers are GMS: GmsHeader[MERGE_REQ]: merge_id=Member-D::1, mbrs=[Member-B, Member-C, Member-A, Member-D], UNICAST2: DATA, seqno=1, conn_id=2, first, TCP: [channel_name=Clumpy Test Cluster]
> 2012-02-18 22:15:08.811 [OOB-2,Clumpy Test Cluster,Member-B] TRACE org.jgroups.protocols.pbcast.Merger - Member-B: got merge request from Member-D, merge_id=Member-D::1, mbrs=[Member-B, Member-A, Member-C, Member-D]
> (3) D times out waiting for the MERGE_RSP from B:
> 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-D: collected 1 merge response(s) in 5001 ms
> 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG org.jgroups.protocols.pbcast.Merger - merge leader Member-D did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> (4) D completes the (failed) merge and broadcasts STOP_FLUSH:
> 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.Merger - Member-D: received all ACKs (1) for merge view MergeView::[Member-D|1] [Member-D], subgroups=[Member-D|0] [Member-D] in 7ms
> 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE org.jgroups.protocols.pbcast.GMS - Member-D: sending RESUME event
> 2012-02-18 22:15:08.897 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG org.jgroups.protocols.pbcast.FLUSH - Member-D: received RESUME, sending STOP_FLUSH to all
> But, since B is not a member of D's view, B does not receive this message.  
> (5) Now all future merge attempts fail, because B is stuck in a flush:
> 2012-02-18 22:15:31.186 [OOB-1,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> 2012-02-18 22:15:54.380 [OOB-2,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> 2012-02-18 22:16:13.705 [OOB-1,Clumpy Test Cluster,Member-B] WARN  org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
> Note that I have implemented a workaround in my application where I:
> -  start a long-ish timer in the block() callback; and stop that timer in unblock()
> -  if the timer is allowed to pop, call channel.stopFlush()
> This seems to be allowing the group to recover.  Any comments on whether this is a good or bad idea would be appreciated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira