[jboss-jira] [JBoss JIRA] (JGRP-1426) Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out

Wednesday, 4 April 2012

    [
https://issues.jboss.org/browse/JGRP-1426?page=com.atlassian.jira.plugin....
] 

David Hotham commented on JGRP-1426:
------------------------------------

Hi, 

I'm sorry you're not keen on Python.  I've been working on Windows, where
Python definitely seemed a lesser evil than a batch job; but you're right that a shell
script should do the trick just fine.

As best I understand it, it's not true that SEQUENCER / FLUSH are redundant.  (I think
we have discussed this at some point, possibly on the mailing list).  I need both for all
members to see the same sequence of (broadcast) messages and view changes, right?

-  without SEQUENCER, messages will arrive in the same view everywhere but not necessarily
in the same order.  e.g. View1, Msg1, Msg2 might become View1, Msg2, Msg1 somewhere else
-  without FLUSH, messages might not arrive in the same view everywhere.  e.g. Msg1, View1,
Msg2 might become View1, Msg1, Msg2 somewhere else

(For what it's worth, I think that if views were also sequenced by SEQUENCER, I could
do without FLUSH).
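
To make the two guarantees concrete, here is a minimal sketch that compares two members' delivery logs. It is not JGroups code: the class DeliveryCheck and its methods are hypothetical, and it assumes that any event whose name starts with "View" is a view change and everything else is a broadcast message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: compares the delivery logs of two group members. */
final class DeliveryCheck {

    /** Maps each message to the view that was installed when it was delivered. */
    static Map<String, String> viewOfEachMessage(List<String> log) {
        Map<String, String> placement = new HashMap<>();
        String currentView = "(earlier view)"; // assumption: label for events before the first view
        for (String event : log) {
            if (event.startsWith("View")) {
                currentView = event;
            } else {
                placement.put(event, currentView);
            }
        }
        return placement;
    }

    /** Returns the log with view changes removed, preserving message order. */
    static List<String> messagesOnly(List<String> log) {
        List<String> messages = new ArrayList<>();
        for (String event : log) {
            if (!event.startsWith("View")) {
                messages.add(event);
            }
        }
        return messages;
    }

    /** True if every message was delivered in the same view in both logs. */
    static boolean sameViewPlacement(List<String> a, List<String> b) {
        return viewOfEachMessage(a).equals(viewOfEachMessage(b));
    }

    /** True if both logs deliver the messages in the same relative order. */
    static boolean sameMessageOrder(List<String> a, List<String> b) {
        return messagesOnly(a).equals(messagesOnly(b));
    }
}
```

With the examples above, [View1, Msg1, Msg2] against [View1, Msg2, Msg1] passes sameViewPlacement but fails sameMessageOrder, while [Msg1, View1, Msg2] against [View1, Msg1, Msg2] passes sameMessageOrder but fails sameViewPlacement.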

However, for the purposes of this ticket the discussion is academic, as I'm pretty sure
that the bug doesn't require SEQUENCER.  Per my explanations on 21/2, 22/3, and 26/3: all
that the bug needs is for a STOP_FLUSH to go missing, and SEQUENCER doesn't need to be
involved to trigger that.

I'll set off another run with SEQUENCER taken out and verify this.

Thanks!

David

...
 Group unable to accept new members: FLUSH stuck after MERGE_RSP timed
out
 -------------------------------------------------------------------------

                 Key: JGRP-1426
                 URL: https://issues.jboss.org/browse/JGRP-1426
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 3.1
            Reporter: David Hotham
            Assignee: Vladimir Blagojevic
             Fix For: 3.1

         Attachments: Hotham.java, Repro23March2012.zip

 We have two sub-groups, [B, C, A] and [D].
 (1) D sends a MERGE_REQ to B.
 2012-02-18 22:15:03.888 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-B: sending MERGE_REQ to [Member-D, Member-B]
 (2) B receives this.  There's some delay in processing it (I think because
there's another merge or flush going on; but the exact reason doesn't matter for
this issue).  When processing does start, B begins a flush.
 2012-02-18 22:15:03.889 [OOB-2,Clumpy Test Cluster,Member-B] TRACE
org.jgroups.protocols.TCP - received [dst: Member-B, src: Member-D (3 headers), size=0
bytes, flags=OOB], headers are GMS: GmsHeader[MERGE_REQ]: merge_id=Member-D::1,
mbrs=[Member-B, Member-C, Member-A, Member-D], UNICAST2: DATA, seqno=1, conn_id=2, first,
TCP: [channel_name=Clumpy Test Cluster]
 2012-02-18 22:15:08.811 [OOB-2,Clumpy Test Cluster,Member-B] TRACE
org.jgroups.protocols.pbcast.Merger - Member-B: got merge request from Member-D,
merge_id=Member-D::1, mbrs=[Member-B, Member-A, Member-C, Member-D]
 (3) D times out waiting for the MERGE_RSP from B:
 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-D: collected 1 merge response(s) in 5001 ms
 2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG
org.jgroups.protocols.pbcast.Merger - merge leader Member-D did not get responses from all
2 partition coordinators; missing responses from 1 members, removing them from the merge
 (4) D completes the (failed) merge and broadcasts STOP_FLUSH:
 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-D: received all ACKs (1) for merge view
MergeView::[Member-D|1] [Member-D], subgroups=[Member-D|0] [Member-D] in 7ms
 2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.GMS - Member-D: sending RESUME event
 2012-02-18 22:15:08.897 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG
org.jgroups.protocols.pbcast.FLUSH - Member-D: received RESUME, sending STOP_FLUSH to all
 But, since B is not a member of D's view, B does not receive this message.  
 (5) Now all future merge attempts fail, because B is stuck in a flush:
 2012-02-18 22:15:31.186 [OOB-1,Clumpy Test Cluster,Member-B] WARN 
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
 2012-02-18 22:15:54.380 [OOB-2,Clumpy Test Cluster,Member-B] WARN 
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
 2012-02-18 22:16:13.705 [OOB-1,Clumpy Test Cluster,Member-B] WARN 
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
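
 Steps (2) to (5) can be reduced to a toy model. This is only a sketch of the reasoning, not JGroups code: the class StuckFlushModel and its methods are hypothetical, and it assumes that STOP_FLUSH reaches exactly the members of the sender's view and nobody else.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical toy model of which members are currently in a flush. */
final class StuckFlushModel {
    private final Set<String> flushing = new HashSet<>();

    /** A member starts a flush, e.g. B when it processes the MERGE_REQ. */
    void startFlush(String member) {
        flushing.add(member);
    }

    /** Assumption: STOP_FLUSH is delivered only to members of the sender's view. */
    void broadcastStopFlush(Collection<String> senderView) {
        flushing.removeAll(senderView);
    }

    boolean isFlushing(String member) {
        return flushing.contains(member);
    }
}
```

 If Member-B and Member-D both start a flush and Member-D then broadcasts STOP_FLUSH to its own view [Member-D], isFlushing("Member-B") stays true. Nothing in the model ever clears it, which matches the repeated "GMS flush by coordinator failed" warnings.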
 Note that I have implemented a workaround in my application where I:
 -  start a long-ish timer in the block() callback; and stop that timer in unblock()
 -  if the timer is allowed to pop, call channel.stopFlush()
 This seems to be allowing the group to recover.  Any comments on whether this is a good
or bad idea would be appreciated. 
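
 A minimal sketch of that workaround follows. The class FlushWatchdog and its method names are hypothetical, and it deliberately has no JGroups dependency: the assumption is that the application calls onBlock() from its block() callback and onUnblock() from unblock(), and passes a recovery action that calls channel.stopFlush(). That wiring is not shown here.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a recovery action if unblock() does not follow block() in time. */
final class FlushWatchdog {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "flush-watchdog");
                t.setDaemon(true); // do not keep the JVM alive just for the watchdog
                return t;
            });
    private final long timeoutMillis;
    private final Runnable recovery;
    private ScheduledFuture<?> pending;

    FlushWatchdog(long timeoutMillis, Runnable recovery) {
        this.timeoutMillis = timeoutMillis;
        this.recovery = recovery;
    }

    /** Call from block(): start (or restart) the long-ish timer. */
    synchronized void onBlock() {
        cancelPending();
        pending = timer.schedule(recovery, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Call from unblock(): the flush ended normally, so stop the timer. */
    synchronized void onUnblock() {
        cancelPending();
    }

    private void cancelPending() {
        if (pending != null) {
            pending.cancel(false);
            pending = null;
        }
    }

    void shutdown() {
        timer.shutdownNow();
    }
}
```

 The recovery action runs on the watchdog's own thread, so it does not hold up the thread that delivered block(). How long the timeout needs to be depends on how long a legitimate flush can take in the group, which this sketch does not try to estimate.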
