Vladimir Blagojevic commented on JGRP-1426:
-------------------------------------------
OK, let's go that route. I like your idea of a final flush timeout as a second line of
defense when things go wrong for whatever reason. I'll make this change on my fork of
JGroups, and you can pick it up from there and give it a go!
Group unable to accept new members: FLUSH stuck after MERGE_RSP timed out
-------------------------------------------------------------------------
Key: JGRP-1426
URL: https://issues.jboss.org/browse/JGRP-1426
Project: JGroups
Issue Type: Bug
Affects Versions: 3.1
Reporter: David Hotham
Assignee: Vladimir Blagojevic
Fix For: 3.1
Attachments: Repro23March2012.zip
We have two sub-groups, [B, C, A] and [D].
(1) D sends a MERGE_REQ to B.
2012-02-18 22:15:03.888 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-B: sending MERGE_REQ to [Member-D, Member-B]
(2) B receives this. There's some delay in processing it (I think because
there's another merge or flush going on; but the exact reason doesn't matter for
this issue). When processing does start, B begins a flush.
2012-02-18 22:15:03.889 [OOB-2,Clumpy Test Cluster,Member-B] TRACE
org.jgroups.protocols.TCP - received [dst: Member-B, src: Member-D (3 headers), size=0
bytes, flags=OOB], headers are GMS: GmsHeader[MERGE_REQ]: merge_id=Member-D::1,
mbrs=[Member-B, Member-C, Member-A, Member-D], UNICAST2: DATA, seqno=1, conn_id=2, first,
TCP: [channel_name=Clumpy Test Cluster]
2012-02-18 22:15:08.811 [OOB-2,Clumpy Test Cluster,Member-B] TRACE
org.jgroups.protocols.pbcast.Merger - Member-B: got merge request from Member-D,
merge_id=Member-D::1, mbrs=[Member-B, Member-A, Member-C, Member-D]
(3) D times out waiting for the MERGE_RSP from B:
2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-D: collected 1 merge response(s) in 5001 ms
2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG
org.jgroups.protocols.pbcast.Merger - merge leader Member-D did not get responses from all
2 partition coordinators; missing responses from 1 members, removing them from the merge
(4) D completes the (failed) merge and broadcasts STOP_FLUSH:
2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-D: received all ACKs (1) for merge view
MergeView::[Member-D|1] [Member-D], subgroups=[Member-D|0] [Member-D] in 7ms
2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.GMS - Member-D: sending RESUME event
2012-02-18 22:15:08.897 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG
org.jgroups.protocols.pbcast.FLUSH - Member-D: received RESUME, sending STOP_FLUSH to all
But, since B is not a member of D's view, B does not receive this message.
(5) Now all future merge attempts fail, because B is stuck in a flush:
2012-02-18 22:15:31.186 [OOB-1,Clumpy Test Cluster,Member-B] WARN
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
2012-02-18 22:15:54.380 [OOB-2,Clumpy Test Cluster,Member-B] WARN
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
2012-02-18 22:16:13.705 [OOB-1,Clumpy Test Cluster,Member-B] WARN
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
Note that I have implemented a workaround in my application, where I:
- start a long-ish timer in the block() callback, and stop that timer in unblock();
- call channel.stopFlush() if the timer is allowed to expire.
This seems to allow the group to recover. Any comments on whether this is a good
or bad idea would be appreciated.
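
For illustration, the workaround described above could be sketched roughly as follows. This is not the actual application code: FlushWatchdog is a hypothetical helper (not a JGroups class), the Runnable stands in for a call such as channel.stopFlush(), and the timeout is whatever "long-ish" value the application chooses. Its block() and unblock() methods are meant to be invoked from the corresponding receiver callbacks.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of JGroups: runs a recovery action if
// unblock() does not follow block() within the configured timeout.
class FlushWatchdog {
    private final ScheduledExecutorService scheduler;
    private final Runnable stopFlush;   // assumed to wrap channel.stopFlush()
    private final long timeout;
    private final TimeUnit unit;
    private ScheduledFuture<?> pending; // guarded by "this"

    FlushWatchdog(Runnable stopFlush, long timeout, TimeUnit unit) {
        this.stopFlush = stopFlush;
        this.timeout = timeout;
        this.unit = unit;
        // Daemon thread so the watchdog never keeps the JVM alive.
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "flush-watchdog");
            t.setDaemon(true);
            return t;
        });
    }

    // Call from the block() callback: (re)start the timer.
    synchronized void block() {
        cancelPending();
        pending = scheduler.schedule(stopFlush, timeout, unit);
    }

    // Call from the unblock() callback: the flush ended normally.
    synchronized void unblock() {
        cancelPending();
    }

    // Call when the channel is closed.
    void shutdown() {
        scheduler.shutdownNow();
    }

    private void cancelPending() {
        if (pending != null) {
            pending.cancel(false);
            pending = null;
        }
    }
}
```

In an application this might be wired up as new FlushWatchdog(channel::stopFlush, 60, TimeUnit.SECONDS), with the 60-second value being only an assumption; it should comfortably exceed the duration of a normal flush so that healthy flushes are never interrupted.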