[jboss-jira] [JBoss JIRA] (JGRP-1455) Message lost in NAKACK2 due to digest error
Bela Ban (JIRA)
jira-events at lists.jboss.org
Mon Apr 23 03:38:29 EDT 2012
[ https://issues.jboss.org/browse/JGRP-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12686445#comment-12686445 ]
Bela Ban commented on JGRP-1455:
--------------------------------
The kludge with incrementing the coord's seqno probably doesn't work anyway:
- A (the coord) gets its digest (A:5); A's next seqno to be expected is #6
- A multicasts application messages A:6 and A:7
- A multicasts the new view for D (message A:8)
- A unicasts the join response to D (with digest A:5)
- D increments A:5 to A:6
--> A:6 was an application message, and not the view, so this is clearly wrong !
--> Perhaps returning A:5 in the digest, *not* incrementing it at the joiner and silently discarding the duplicate view (as suggested above) is the solution to this.
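
To make the sequence above concrete, here is a toy model of the joiner's receive window for a single sender. It is only a sketch, not JGroups code: the class JoinerWindowSketch and its methods are hypothetical names, and it assumes messages arrive in order with no gaps (no retransmission is modelled). Anything at or below the window's starting seqno is treated as already delivered and dropped.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a joiner's receive window for one sender (hypothetical, not JGroups code). */
final class JoinerWindowSketch {
    // Highest seqno the joiner believes it has already delivered for the sender.
    private long highestDelivered;
    private final List<Long> delivered = new ArrayList<>();

    JoinerWindowSketch(long digestSeqno, boolean incrementAtJoiner) {
        // The "kludge": bump the digest by one to account for the view multicast.
        this.highestDelivered = incrementAtJoiner ? digestSeqno + 1 : digestSeqno;
    }

    /** Returns true if the message is passed up, false if it is dropped as already seen. */
    boolean receive(long seqno) {
        if (seqno <= highestDelivered) {
            return false;
        }
        // Assumption: in-order arrival without gaps, so no buffering is needed here.
        highestDelivered = seqno;
        delivered.add(seqno);
        return true;
    }

    /** Feeds the given seqnos through a fresh window and returns those passed up. */
    static List<Long> deliver(long digestSeqno, boolean incrementAtJoiner, long... seqnos) {
        JoinerWindowSketch window = new JoinerWindowSketch(digestSeqno, incrementAtJoiner);
        for (long seqno : seqnos) {
            window.receive(seqno);
        }
        return window.delivered;
    }

    public static void main(String[] args) {
        // Digest A:5, then A:6 and A:7 (application messages) and A:8 (the view).
        System.out.println(deliver(5, true, 6, 7, 8));   // prints [7, 8]: A:6 is lost
        System.out.println(deliver(5, false, 6, 7, 8));  // prints [6, 7, 8]
        // Numbers from the reported trace: digest 25 sent, messages 26 and 27.
        System.out.println(deliver(25, true, 26, 27));   // prints [27]
    }
}
```

With the increment, the joiner drops A:6 even though it is an application message. Without it, all three messages are passed up, so the view carried in A:8 duplicates the one in the join response and would have to be discarded separately, as suggested above.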
> Message lost in NAKACK2 due to digest error
> -------------------------------------------
>
> Key: JGRP-1455
> URL: https://issues.jboss.org/browse/JGRP-1455
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 3.0.9
> Reporter: David Hotham
> Assignee: Bela Ban
> Fix For: 3.0.10, 3.1
>
>
> Hello,
> In this issue an application-level message broadcast to the cluster is being discarded by NAKACK2, on a new joiner.
> I think I understand roughly what's going on - skip to the end for a suggested fix!
> I'll keep all my trace so that I can investigate further details if needed.
> So, let's start with trace from the new joiner (CFS-B-chucklebrothers), showing that:
> - it sets a digest claiming that the sequence numbers for CFS-A-tinkywinky are 26 (26)
> - CFS-A-tinkywinky then sends messages with sequence numbers 26 and 27
> - Only message 27 is passed upwards
> {noformat}
> 2012-04-18 19:26:49.133 [ForkJoinPool-1-worker-3] DEBUG org.jgroups.protocols.pbcast.NAKACK2 -
> [CFS-B-chucklebrothers setDigest()]
> existing digest: []
> new digest: CFS-A-tinkywinky: [26 (26)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)]
> resulting digest: CFS-A-tinkywinky: [26 (26)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.TCP - received [dst: <null>, src: CFS-A-tinkywinky (3 headers), size=39 bytes], headers are SEQUENCER: WRAPPED_BCAST (tag=[CFS-B-chucklebrothers|0]), NAKACK2: [MSG, seqno=26], TCP: [channel_name=Clumpy Test Cluster]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.TCP - received [dst: <null>, src: CFS-A-tinkywinky (3 headers), size=51 bytes], headers are SEQUENCER: BCAST (tag=[CFS-A-tinkywinky|11]), NAKACK2: [MSG, seqno=27], TCP: [channel_name=Clumpy Test Cluster]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.pbcast.NAKACK2 - CFS-B-chucklebrothers: received CFS-A-tinkywinky#27
> {noformat}
> And here's the trace from CFS-A-tinkywinky showing that:
> - the digest that it sent only claimed sequence numbers 25 (25)
> {noformat}
> 2012-04-18 19:26:49.132 [OOB-1,Clumpy Test Cluster,CFS-A-tinkywinky] TRACE org.jgroups.protocols.TCP - sending msg to CFS-B-chucklebrothers, src=CFS-A-tinkywinky, headers are GMS: GmsHeader[JOIN_RSP]: join_rsp=view: [CFS-A-tinkywinky|3] [CFS-A-tinkywinky, CFS-A-chucklebrothers, CFS-B-tinkywinky, CFS-B-chucklebrothers], digest: CFS-A-tinkywinky: [25 (25)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)], UNICAST2: DATA, seqno=2, conn_id=3, TCP: [channel_name=Clumpy Test Cluster]
> {noformat}
> By looking at trace from the other members receiving message 26, I can see that this is an application level message.
> I think that the incrementing of the received sequence number is deliberate, per ClientGmsImpl ("see doc/design/varia2.txt for details"). If I understand correctly, it's intended to compensate for the fact that the digest doesn't include the broadcast VIEW message.
> However, CFS-A-tinkywinky shows this:
> {noformat}
> 2012-04-18 19:26:49.125 [ViewHandler,Clumpy Test Cluster,CFS-A-tinkywinky] WARN org.jgroups.protocols.pbcast.GMS - CFS-B-chucklebrothers already present; returning existing view [CFS-A-tinkywinky|3] [CFS-A-tinkywinky, CFS-A-chucklebrothers, CFS-B-tinkywinky, CFS-B-chucklebrothers]
> 2012-04-18 19:26:49.126 [ViewHandler,Clumpy Test Cluster,CFS-A-tinkywinky] TRACE org.jgroups.protocols.pbcast.GMS - found no members to add or remove, will not create new view
> {noformat}
> My thinking is that since the coordinator does not broadcast a VIEW message, it's a mistake for CFS-B-chucklebrothers to have fixed up the digest.
> Possibly the fix is simply to remove the block of code in CoordGmsImpl.handleMembershipChange() for "found no members to add or remove", and send out a view anyway?
> Thanks!
> David