Bela Ban commented on JGRP-1455:
--------------------------------
No, this doesn't work (I *knew* there was a reason) !
If C joins {A,B}, then JChannel.connect() on C returns as soon as C has received the
JoinRsp and installed view {A,B,C}.
However, if C checks the views of A and B right after returning from connect(), they might
still be {A,B} rather than {A,B,C}, because the view multicast might still be in transit !
Also, if we rely on the viewChange() callback on C to be able to invoke an RPC on {A,B,C}
(which we shouldn't !), this won't work.
Aaaah, I'll trash my (nice) change, back to square one !
Message lost in NAKACK2 due to digest error
-------------------------------------------
Key: JGRP-1455
URL: https://issues.jboss.org/browse/JGRP-1455
Project: JGroups
Issue Type: Bug
Affects Versions: 3.0.9
Reporter: David Hotham
Assignee: Bela Ban
Fix For: 3.0.10, 3.1
Hello,
In this issue an application-level message broadcast to the cluster is being discarded by
NAKACK2, on a new joiner.
I think I understand roughly what's going on - skip to the end for a suggested fix!
I'll keep all my trace so that I can investigate further details if needed.
So, let's start with trace from the new joiner (CFS-B-chucklebrothers), showing that:
- it sets a digest claiming that the sequence numbers for CFS-A-tinkywinky are 26 (26)
- CFS-A-tinkywinky then sends messages with sequence numbers 26 and 27
- Only message 27 is passed upwards
{noformat}
2012-04-18 19:26:49.133 [ForkJoinPool-1-worker-3] DEBUG
org.jgroups.protocols.pbcast.NAKACK2 -
[CFS-B-chucklebrothers setDigest()]
existing digest: []
new digest: CFS-A-tinkywinky: [26 (26)], CFS-A-chucklebrothers: [0 (0)],
CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)]
resulting digest: CFS-A-tinkywinky: [26 (26)], CFS-A-chucklebrothers: [0 (0)],
CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)]
2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE
org.jgroups.protocols.TCP - received [dst: <null>, src: CFS-A-tinkywinky (3
headers), size=39 bytes], headers are SEQUENCER: WRAPPED_BCAST
(tag=[CFS-B-chucklebrothers|0]), NAKACK2: [MSG, seqno=26], TCP: [channel_name=Clumpy Test
Cluster]
2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE
org.jgroups.protocols.TCP - received [dst: <null>, src: CFS-A-tinkywinky (3
headers), size=51 bytes], headers are SEQUENCER: BCAST (tag=[CFS-A-tinkywinky|11]),
NAKACK2: [MSG, seqno=27], TCP: [channel_name=Clumpy Test Cluster]
2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE
org.jgroups.protocols.pbcast.NAKACK2 - CFS-B-chucklebrothers: received
CFS-A-tinkywinky#27
{noformat}
And here's the trace from CFS-A-tinkywinky showing that:
- the digest that it sent only claimed sequence numbers 25 (25)
{noformat}
2012-04-18 19:26:49.132 [OOB-1,Clumpy Test Cluster,CFS-A-tinkywinky] TRACE
org.jgroups.protocols.TCP - sending msg to CFS-B-chucklebrothers, src=CFS-A-tinkywinky,
headers are GMS: GmsHeader[JOIN_RSP]: join_rsp=view: [CFS-A-tinkywinky|3]
[CFS-A-tinkywinky, CFS-A-chucklebrothers, CFS-B-tinkywinky, CFS-B-chucklebrothers],
digest: CFS-A-tinkywinky: [25 (25)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0
(0)], CFS-B-chucklebrothers: [0 (0)], UNICAST2: DATA, seqno=2, conn_id=3, TCP:
[channel_name=Clumpy Test Cluster]
{noformat}
Looking at the trace from the other members that received message 26, I can see that it is
an application-level message.
I think that the incrementing of the received sequence number is deliberate, per
ClientGmsImpl ("see doc/design/varia2.txt for details"). If I understand
correctly, it's intended to compensate for the fact that the digest doesn't
include the broadcast VIEW message.
However, CFS-A-tinkywinky shows this:
{noformat}
2012-04-18 19:26:49.125 [ViewHandler,Clumpy Test Cluster,CFS-A-tinkywinky] WARN
org.jgroups.protocols.pbcast.GMS - CFS-B-chucklebrothers already present; returning
existing view [CFS-A-tinkywinky|3] [CFS-A-tinkywinky, CFS-A-chucklebrothers,
CFS-B-tinkywinky, CFS-B-chucklebrothers]
2012-04-18 19:26:49.126 [ViewHandler,Clumpy Test Cluster,CFS-A-tinkywinky] TRACE
org.jgroups.protocols.pbcast.GMS - found no members to add or remove, will not create new
view
{noformat}
My thinking is that since the coordinator does not broadcast a VIEW message, it's a
mistake for CFS-B-chucklebrothers to have fixed up the digest.
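To make the off-by-one concrete, here is a minimal toy model of what I believe is happening. It is not JGroups code: ReceiverWindow and simulate() are hypothetical names made up for this illustration, and it assumes messages arrive in order and that a receiver drops any sequence number at or below its highest delivered one.

```java
import java.util.ArrayList;
import java.util.List;

public class DigestFixupSketch {

    /** Toy receiver state for one sender (hypothetical, not the NAKACK2 class). */
    static final class ReceiverWindow {
        private long highestDelivered;
        private final List<Long> delivered = new ArrayList<>();

        ReceiverWindow(long highestDelivered) {
            this.highestDelivered = highestDelivered;
        }

        /** Assumes in-order arrival; anything at or below highestDelivered is dropped. */
        void receive(long seqno) {
            if (seqno <= highestDelivered) {
                return; // treated as already seen
            }
            highestDelivered = seqno;
            delivered.add(seqno);
        }

        List<Long> delivered() {
            return delivered;
        }
    }

    /**
     * digestSeqno is the sender's seqno in the digest carried by the JOIN_RSP.
     * joinerBumps models the joiner adding 1 to account for an expected VIEW multicast.
     */
    static List<Long> simulate(long digestSeqno, boolean joinerBumps, long... incoming) {
        long start = joinerBumps ? digestSeqno + 1 : digestSeqno;
        ReceiverWindow window = new ReceiverWindow(start);
        for (long seqno : incoming) {
            window.receive(seqno);
        }
        return window.delivered();
    }

    public static void main(String[] args) {
        // Digest says 25, joiner bumps to 26, but no VIEW is sent: seqno 26 is app data.
        System.out.println(simulate(25, true, 26, 27));  // prints [27]
        // Without the bump, both messages are delivered.
        System.out.println(simulate(25, false, 26, 27)); // prints [26, 27]
    }
}
```

With the bump applied and no VIEW actually multicast, sequence number 26 is used by an application message that the joiner silently skips, which matches the trace above where only #27 is passed up.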
Possibly the fix is simply to remove the block of code in
CoordGmsImpl.handleMembershipChange() for "found no members to add or remove",
and send out a view anyway?
Thanks!
David