]
David Hotham commented on JGRP-1426:
------------------------------------
Hi,
Hmm, I wasn't really expecting you to get to that bit of the script. That would seem
to imply that:
- you don't have Popen.kill() (introduced at Python 2.6, apparently)
- and also something has gone wrong with the os.kill. Ah, it looks like I should have
gone 'import signal' at the top, and the os.kill line should read
{noformat}
os.kill(self.subProc.pid, signal.SIGKILL)
{noformat}
My apologies for the error.
Yes, it's true that it can take quite a while for the problem to reveal itself -
timing issues indeed. It probably would be best to get the script running and then let it
go for a while until you hit the issue.
While I'd very much like for you to be able to reproduce this yourself... I wonder
whether I haven't already given you enough information to start thinking about what
the right way to fix this would be. Per my previous comments, the issue is that a flush
started by B on receiving a MERGE_REQ sometimes never ends. Suggestions that come to
mind:
- after a failed merge, perhaps D (or E) should send the STOP_FLUSH to the same elements
that it sent the MERGE_REQ to. (Not, as today, broadcast that STOP_FLUSH - since that
does not go to the members of the group that D failed to join).
- That would address the explicit case that I have reported here. But I suspect there
may be other ways for the STOP_FLUSH to go missing. So perhaps it would also make sense
for B to start a long-ish timer (say 90 seconds) when it starts the flush; and if it does
not receive a STOP_FLUSH in that time then it should stop the flush anyway.
- (I see that JGRP-1025 contains some related ideas, especially its 'second line of
defense').
What do you think?
I'll look into skype but I confess I've never used it! If you'd be willing to
email me a phone number (david dot hotham at blueyonder dot co dot uk) I'd be happy to
call you the old-fashioned way.
Group unable to accept new members: FLUSH stuck after MERGE_RSP timed
out
-------------------------------------------------------------------------
Key: JGRP-1426
URL:
https://issues.jboss.org/browse/JGRP-1426
Project: JGroups
Issue Type: Bug
Affects Versions: 3.1
Reporter: David Hotham
Assignee: Vladimir Blagojevic
Fix For: 3.1
Attachments: Repro23March2012.zip
We have two sub-groups, [B, C, A] and [D].
(1) D sends a MERGE_REQ to B.
2012-02-18 22:15:03.888 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-B: sending MERGE_REQ to [Member-D, Member-B]
(2) B receives this. There's some delay in processing it (I think because
there's another merge or flush going on; but the exact reason doesn't matter for
this issue). When processing does start, B begins a flush.
2012-02-18 22:15:03.889 [OOB-2,Clumpy Test Cluster,Member-B] TRACE
org.jgroups.protocols.TCP - received [dst: Member-B, src: Member-D (3 headers), size=0
bytes, flags=OOB], headers are GMS: GmsHeader[MERGE_REQ]: merge_id=Member-D::1,
mbrs=[Member-B, Member-C, Member-A, Member-D], UNICAST2: DATA, seqno=1, conn_id=2, first,
TCP: [channel_name=Clumpy Test Cluster]
2012-02-18 22:15:08.811 [OOB-2,Clumpy Test Cluster,Member-B] TRACE
org.jgroups.protocols.pbcast.Merger - Member-B: got merge request from Member-D,
merge_id=Member-D::1, mbrs=[Member-B, Member-A, Member-C, Member-D]
(3) D times out waiting for the MERGE_RSP from B:
2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-D: collected 1 merge response(s) in 5001 ms
2012-02-18 22:15:08.889 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG
org.jgroups.protocols.pbcast.Merger - merge leader Member-D did not get responses from all
2 partition coordinators; missing responses from 1 members, removing them from the merge
(4) D completes the (failed) merge and broadcasts STOP_FLUSH:
2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.Merger - Member-D: received all ACKs (1) for merge view
MergeView::[Member-D|1] [Member-D], subgroups=[Member-D|0] [Member-D] in 7ms
2012-02-18 22:15:08.896 [MergeTask,Clumpy Test Cluster,Member-D] TRACE
org.jgroups.protocols.pbcast.GMS - Member-D: sending RESUME event
2012-02-18 22:15:08.897 [MergeTask,Clumpy Test Cluster,Member-D] DEBUG
org.jgroups.protocols.pbcast.FLUSH - Member-D: received RESUME, sending STOP_FLUSH to all
But, since B is not a member of D's view, B does not receive this message.
(5) Now all future merge attempts fail, because B is stuck in a flush:
2012-02-18 22:15:31.186 [OOB-1,Clumpy Test Cluster,Member-B] WARN
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
2012-02-18 22:15:54.380 [OOB-2,Clumpy Test Cluster,Member-B] WARN
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
2012-02-18 22:16:13.705 [OOB-1,Clumpy Test Cluster,Member-B] WARN
org.jgroups.protocols.pbcast.GMS - Member-B: GMS flush by coordinator failed
Note that I have implemented a workaround in my application where I:
- start a long-ish timer in the block() callback; and stop that timer in unblock()
- if the timer is allowed to pop, call channel.stopFlush()
This seems to be allowing the group to recover. Any comments on whether this is a good
or bad idea would be appreciated.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: