[jboss-jira] [JBoss JIRA] (JGRP-1426) Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out

Tuesday, 21 February 2012

    [ https://issues.jboss.org/browse/JGRP-1426?page=com.atlassian.jira.plugin.... ]

David Hotham commented on JGRP-1426:
------------------------------------

As I reported on the mailing list, I'm reproducing this using my full application -
which I'm afraid I'm reluctant to send you.  (Besides, it's written in Scala,
which you "don't do" ;-) 

If it's really essential then I can try to set up a more self-contained test, but
I'm hoping that my description is clear enough that this won't be needed. 
Certainly I feel as though I understand what has gone wrong in the flows (if not how to
fix it!) so if I haven't convinced you then that's a failure of my communication
(or I'm completely wrong).

I can certainly send you more trace from my own repro, if that would help; and I'd be
happy to run with a version of JGroups that included any extra trace you want to add.

Having said that, my test case really isn't anything more than: set up a group of four
members, kill one, and restart it.  The application also broadcasts some messages
when it sees a change of view - I guess this probably makes it more likely
that there's non-trivial flushing going on.

It seems to be quite a tough timing window to hit.  I speculate that we need to be
relatively slow to recognise that old-D is dead, so that when you get the MERGE_REQ from
new-D you're still doing a flush associated with losing old-D.  Certainly - as in the
trace above - you need to be slow to deal with new-D's MERGE_REQ at B, so that it is
timed out.

Does that all make sense?  Do let me know if I can help out any more.

...
 Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out
 -------------------------------------------------------------------------

                 Key: JGRP-1426
                 URL: https://issues.jboss.org/browse/JGRP-1426
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 3.1
            Reporter: David Hotham
            Assignee: Vladimir Blagojevic
             Fix For: 3.1

 We have two sub-groups, [B, C, A] and [D].
 (1) D sends a MERGE_REQ to B.
 2012-02-18 22:15:03.888 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-B: sending MERGE_REQ to [Member-D, Member-B]
 (2) B receives this.  There's some delay in processing it (I think because
there's another merge or flush going on; but the exact reason doesn't matter for
this issue).  When processing does start, B begins a flush.
 2012-02-18 22:15:03.889 [OOB-2,Clumpy Test Cluster,Member-B] TRACE
org.jgroups.protocols.TCP - received [dst: Member-B, src: Member-D (3 headers), size=0
bytes, flags=OOB], headers are GMS: GmsHeader[MERGE_REQ]: merge_id=Member-D::1,
mbrs=[Member-B, Member-C, Member-A, Member-D], UNICAST2: DATA, seqno=1, conn_id=2, first,
TCP: [channel_name=Clumpy Test Cluster]
 2012-02-18 22:15:08.811 [OOB-2,Clumpy Test Cluster,Member-B] TRACE
org.jgroups.protocols.pbcast.Merger - Member-B: got merge request from Member-D,
merge_id=Member-D::1, mbrs=[Member-B, Member-A, Member-C, Member-D]
 (3) D times out waiting for the MERGE_RSP from B:
 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-D: collected 1 merge response(s) in 5001 ms
 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG
org.jgroups.protocols.pbcast.Merger - merge leader Member-D did not get responses from all
2 partition coordinators; missing responses from 1 members, removing them from the merge
 (4) D completes the (failed) merge and broadcasts STOP_FLUSH:
 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-D: received all ACKs (1) for merge view
MergeView::[Member-D|1] [Member-D], subgroups=[Member-D|0] [Member-D] in 7ms
 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.GMS - Member-D: sending RESUME event
 2012-02-18 22:15:08.897 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG
org.jgroups.protocols.pbcast.FLUSH - Member-D: received RESUME, sending STOP_FLUSH to all
 But, since B is not a member of D's view, B does not receive this message.  
 (5) Now all future merge attempts fail, because B is stuck in a flush:
 2012-02-18 22:15:31.186 [OOB-1,Clumpy Test Cluster,Member-B] WARN 
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
 2012-02-18 22:15:54.380 [OOB-2,Clumpy Test Cluster,Member-B] WARN 
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
 2012-02-18 22:16:13.705 [OOB-1,Clumpy Test Cluster,Member-B] WARN 
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
 Note that I have implemented a workaround in my application where I:
 -  start a long-ish timer in the block() callback; and stop that timer in unblock()
 -  if the timer is allowed to pop, call channel.stopFlush()
 This seems to be allowing the group to recover.  Any comments on whether this is a good
or bad idea would be appreciated. 
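
 A minimal sketch of the workaround described above, in Java rather than the
 application's Scala.  The class name FlushWatchdog and its method names are
 hypothetical, and the timeout value is left to the caller; this is not the
 reporter's actual code.  It assumes the application calls onBlock() from its
 block() callback and onUnblock() from its unblock() callback, and passes
 channel::stopFlush as the action to run when the timer pops.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper: runs an action if a flush stays blocked for too long.
 * Intended wiring (assumption): onBlock() from the receiver's block(),
 * onUnblock() from unblock(), and onStuck = channel::stopFlush.
 */
final class FlushWatchdog {
    // Single daemon thread so the watchdog never keeps the JVM alive.
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "flush-watchdog");
                t.setDaemon(true);
                return t;
            });
    private final long timeout;
    private final TimeUnit unit;
    private final Runnable onStuck;
    private ScheduledFuture<?> pending; // guarded by "this"

    FlushWatchdog(long timeout, TimeUnit unit, Runnable onStuck) {
        this.timeout = timeout;
        this.unit = unit;
        this.onStuck = onStuck;
    }

    /** Start (or restart) the timer when the flush blocks the channel. */
    synchronized void onBlock() {
        cancelPending();
        pending = timer.schedule(onStuck, timeout, unit);
    }

    /** Stop the timer when the flush completes normally. */
    synchronized void onUnblock() {
        cancelPending();
    }

    /** Release the timer thread, e.g. when the channel is closed. */
    void shutdown() {
        timer.shutdownNow();
    }

    // Note: if the timer fires at the same instant as unblock(), onStuck may
    // still run once; the action should tolerate being called after unblock.
    private void cancelPending() {
        if (pending != null) {
            pending.cancel(false);
            pending = null;
        }
    }
}
```

 The timeout needs to be comfortably longer than any legitimate flush, since
 popping early would cut short a flush that was still making progress.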
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
