[jboss-jira] [JBoss JIRA] (JGRP-1455) Message lost in NAKACK2 due to digest error

Wed May 9 10:53:18 EDT 2012

    [ https://issues.jboss.org/browse/JGRP-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12691442#comment-12691442 ] 

Bela Ban commented on JGRP-1455:
--------------------------------

I 'fixed' this as follows:
- A has digest A:5,B:0 (B is the joiner)
- A broadcasts view V, now its seqno is A:6
- A sends the JoinRsp and digest to B
- B installs A:5
- B immediately requests retransmission of A:6 // FIX (or kludge if you want to call it that ! :-))
- A broadcasts A:7
- B receives A:7, stores it
- B receives (the retransmitted) A:6
- B delivers A:6 and A:7
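
The steps above can be sketched as a toy model of B's receive window for sender A. This is only an illustration under assumptions: `JoinerWindow`, `retransmitRequests` and `receive` are hypothetical names, not NAKACK2 code, and a real retransmit request would go over the network instead of into a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical model of a joiner's receive window for one sender; not JGroups code. */
class JoinerWindow {
    private long highestDelivered;                                 // last seqno delivered in order
    private final TreeMap<Long, String> buffer = new TreeMap<>();  // messages waiting for a gap to close
    final List<Long> retransmitRequests = new ArrayList<>();       // seqnos B has asked A to resend
    final List<String> delivered = new ArrayList<>();              // payloads passed up, in order

    /** Install the digest seqno, then immediately request the next seqno (the fix). */
    JoinerWindow(long digestSeqno) {
        this.highestDelivered = digestSeqno;
        retransmitRequests.add(digestSeqno + 1);
    }

    /** Store a message and deliver every message that is now in sequence. */
    void receive(long seqno, String payload) {
        if (seqno <= highestDelivered) {
            return;                                                // already covered by the digest
        }
        buffer.put(seqno, payload);
        while (buffer.containsKey(highestDelivered + 1)) {
            delivered.add(buffer.remove(highestDelivered + 1));
            highestDelivered++;
        }
    }
}

class JoinerWindowDemo {
    public static void main(String[] args) {
        JoinerWindow b = new JoinerWindow(5);  // B installs A:5 and requests A:6
        b.receive(7, "A:7");                   // arrives first and is only stored
        b.receive(6, "A:6");                   // the retransmitted view message
        System.out.println("requested=" + b.retransmitRequests + " delivered=" + b.delivered);
    }
}
```

In this model A:7 stays buffered until the retransmitted A:6 arrives, after which both are delivered in order.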

Hmm, this covers only the very basic use case. If A had broadcast A:6, A:7 and V (A:8) *before* sending the JoinRsp to B, B would request A:6, but would still have to wait for the retransmission task to request A:7 and A:8!

Perhaps A should send its *actual* highest received seqno (e.g. A:8) with the Digest, so that B could request retransmission of A:6-8. However, there is always a window (between A reading its actual seqno and B installing it and starting to receive messages) in which A can continue sending messages, and B can only get those missed messages by retransmission...
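
A minimal sketch of that idea, assuming the JoinRsp carried both the digest seqno and A's highest seqno at the time the response was built. `RetransmitRange` and `missing` are made-up names for illustration and do not exist in JGroups.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; the names are illustrative and not part of JGroups. */
class RetransmitRange {
    /**
     * Seqnos the joiner would request right after installing the digest:
     * everything above the digest seqno, up to the sender's highest seqno
     * known when the JoinRsp was built.
     */
    static List<Long> missing(long digestSeqno, long highestAtJoinRsp) {
        List<Long> range = new ArrayList<>();
        for (long s = digestSeqno + 1; s <= highestAtJoinRsp; s++) {
            range.add(s);
        }
        return range;
    }

    public static void main(String[] args) {
        // Digest says A:5, but A had already sent A:6, A:7 and V (A:8).
        System.out.println(missing(5, 8));
        // Digest is already up to date, so there is nothing to request.
        System.out.println(missing(5, 5));
    }
}
```

Anything A sends after building the JoinRsp (A:9 and later in this example) falls outside the computed range, so it would still depend on the regular retransmission task.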

> Message lost in NAKACK2 due to digest error
> -------------------------------------------
>
>                 Key: JGRP-1455
>                 URL: https://issues.jboss.org/browse/JGRP-1455
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.0.9
>            Reporter: David Hotham
>            Assignee: Bela Ban
>             Fix For: 3.0.10, 3.1
>
>
> Hello,
> In this issue an application-level message broadcast to the cluster is being discarded by NAKACK2, on a new joiner.
> I think I understand roughly what's going on - skip to the end for a suggested fix!
> I'll keep all my trace so that I can investigate further details if needed.
> So, let's start with trace from the new joiner (CFS-B-chucklebrothers), showing that:
> -  it sets a digest claiming that the sequence numbers for CFS-A-tinkywinky are 26 (26)
> -  CFS-A-tinkywinky then sends messages with sequence numbers 26 and 27
> -  Only message 27 is passed upwards
> {noformat}
> 2012-04-18 19:26:49.133 [ForkJoinPool-1-worker-3] DEBUG org.jgroups.protocols.pbcast.NAKACK2 -
> [CFS-B-chucklebrothers setDigest()]
> existing digest:  []
> new digest:       CFS-A-tinkywinky: [26 (26)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)]
> resulting digest: CFS-A-tinkywinky: [26 (26)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.TCP - received [dst: <null>, src: CFS-A-tinkywinky (3 headers), size=39 bytes], headers are SEQUENCER: WRAPPED_BCAST (tag=[CFS-B-chucklebrothers|0]), NAKACK2: [MSG, seqno=26], TCP: [channel_name=Clumpy Test Cluster]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.TCP - received [dst: <null>, src: CFS-A-tinkywinky (3 headers), size=51 bytes], headers are SEQUENCER: BCAST (tag=[CFS-A-tinkywinky|11]), NAKACK2: [MSG, seqno=27], TCP: [channel_name=Clumpy Test Cluster]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.pbcast.NAKACK2 - CFS-B-chucklebrothers: received CFS-A-tinkywinky#27
> {noformat}
> And here's the trace from CFS-A-tinkywinky showing that:
> -  the digest that it sent only claimed sequence numbers 25 (25)
> {noformat}
> 2012-04-18 19:26:49.132 [OOB-1,Clumpy Test Cluster,CFS-A-tinkywinky] TRACE org.jgroups.protocols.TCP - sending msg to CFS-B-chucklebrothers, src=CFS-A-tinkywinky, headers are GMS: GmsHeader[JOIN_RSP]: join_rsp=view: [CFS-A-tinkywinky|3] [CFS-A-tinkywinky, CFS-A-chucklebrothers, CFS-B-tinkywinky, CFS-B-chucklebrothers], digest: CFS-A-tinkywinky: [25 (25)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)], UNICAST2: DATA, seqno=2, conn_id=3, TCP: [channel_name=Clumpy Test Cluster]
> {noformat}
> By looking at trace from the other members receiving message 26, I can see that this is an application level message.
> I think that the incrementing of the received sequence number is deliberate, per ClientGmsImpl ("see doc/design/varia2.txt for details"). If I understand correctly, it's intended to compensate for the fact that the digest doesn't include the broadcast VIEW message.
> However, CFS-A-tinkywinky shows this:
> {noformat}
> 2012-04-18 19:26:49.125 [ViewHandler,Clumpy Test Cluster,CFS-A-tinkywinky] WARN  org.jgroups.protocols.pbcast.GMS - CFS-B-chucklebrothers already present; returning existing view [CFS-A-tinkywinky|3] [CFS-A-tinkywinky, CFS-A-chucklebrothers, CFS-B-tinkywinky, CFS-B-chucklebrothers]
> 2012-04-18 19:26:49.126 [ViewHandler,Clumpy Test Cluster,CFS-A-tinkywinky] TRACE org.jgroups.protocols.pbcast.GMS - found no members to add or remove, will not create new view
> {noformat}
> My thinking is that since the coordinator does not broadcast a VIEW message, it's a mistake for CFS-B-chucklebrothers to have fixed up the digest.
> Possibly the fix is simply to remove the block of code in CoordGmsImpl.handleMembershipChange() for "found no members to add or remove", and send out a view anyway?
> Thanks!
> David

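The off-by-one described in the quoted report can be illustrated with a small model that uses the seqnos from the trace (digest 25, messages 26 and 27). This is a sketch under assumptions: `DigestFixup`, `installedSeqno` and `accepted` are hypothetical names, and the acceptance rule is a simplification, not the ClientGmsImpl or NAKACK2 code.

```java
/** Hypothetical model of the reported off-by-one; not the ClientGmsImpl code. */
class DigestFixup {
    /** Seqno the joiner installs for the coordinator after the +1 adjustment. */
    static long installedSeqno(long digestSeqno) {
        return digestSeqno + 1;  // assumes the next seqno was used by the VIEW broadcast
    }

    /** Simplified rule: a message is passed up only if it is newer than the installed seqno. */
    static boolean accepted(long installedSeqno, long msgSeqno) {
        return msgSeqno > installedSeqno;
    }

    public static void main(String[] args) {
        long installed = installedSeqno(25);  // joiner ends up with 26 (26)
        System.out.println("26 accepted: " + accepted(installed, 26));
        System.out.println("27 accepted: " + accepted(installed, 27));
    }
}
```

When the coordinator does not actually broadcast a VIEW, seqno 26 belongs to an application message, and in this model the joiner treats it as already received.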