[jboss-jira] [JBoss JIRA] (JGRP-1455) Message lost in NAKACK2 due to digest error

Bela Ban (JIRA) jira-events at lists.jboss.org
Wed May 9 06:53:17 EDT 2012


    [ https://issues.jboss.org/browse/JGRP-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12691322#comment-12691322 ] 

Bela Ban commented on JGRP-1455:
--------------------------------

I think the cleaner approach is to send the JoinRsp to the joiners *before* broadcasting the view. The joiners *will* then receive the view and install it, so no retransmission is needed.
I recall, though, that there was a reason why I broadcast the view *before* returning the JoinRsp...
                
> Message lost in NAKACK2 due to digest error
> -------------------------------------------
>
>                 Key: JGRP-1455
>                 URL: https://issues.jboss.org/browse/JGRP-1455
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 3.0.9
>            Reporter: David Hotham
>            Assignee: Bela Ban
>             Fix For: 3.0.10, 3.1
>
>
> Hello,
> In this issue an application-level message broadcast to the cluster is being discarded by NAKACK2, on a new joiner.
> I think I understand roughly what's going on - skip to the end for a suggested fix!
> I'll keep all my trace so that I can investigate further details if needed.
> So, let's start with trace from the new joiner (CFS-B-chucklebrothers), showing that:
> -  it sets a digest claiming that the sequence numbers for CFS-A-tinkywinky are 26 (26)
> -  CFS-A-tinkywinky then sends messages with sequence numbers 26 and 27
> -  Only message 27 is passed upwards
> {noformat}
> 2012-04-18 19:26:49.133 [ForkJoinPool-1-worker-3] DEBUG org.jgroups.protocols.pbcast.NAKACK2 -
> [CFS-B-chucklebrothers setDigest()]
> existing digest:  []
> new digest:       CFS-A-tinkywinky: [26 (26)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)]
> resulting digest: CFS-A-tinkywinky: [26 (26)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.TCP - received [dst: <null>, src: CFS-A-tinkywinky (3 headers), size=39 bytes], headers are SEQUENCER: WRAPPED_BCAST (tag=[CFS-B-chucklebrothers|0]), NAKACK2: [MSG, seqno=26], TCP: [channel_name=Clumpy Test Cluster]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.TCP - received [dst: <null>, src: CFS-A-tinkywinky (3 headers), size=51 bytes], headers are SEQUENCER: BCAST (tag=[CFS-A-tinkywinky|11]), NAKACK2: [MSG, seqno=27], TCP: [channel_name=Clumpy Test Cluster]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.pbcast.NAKACK2 - CFS-B-chucklebrothers: received CFS-A-tinkywinky#27
> {noformat}
> And here's the trace from CFS-A-tinkywinky showing that:
> -  the digest that it sent only claimed sequence numbers 25 (25)
> {noformat}
> 2012-04-18 19:26:49.132 [OOB-1,Clumpy Test Cluster,CFS-A-tinkywinky] TRACE org.jgroups.protocols.TCP - sending msg to CFS-B-chucklebrothers, src=CFS-A-tinkywinky, headers are GMS: GmsHeader[JOIN_RSP]: join_rsp=view: [CFS-A-tinkywinky|3] [CFS-A-tinkywinky, CFS-A-chucklebrothers, CFS-B-tinkywinky, CFS-B-chucklebrothers], digest: CFS-A-tinkywinky: [25 (25)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)], UNICAST2: DATA, seqno=2, conn_id=3, TCP: [channel_name=Clumpy Test Cluster]
> {noformat}
> Looking at the trace from the other members that received message 26, I can see that it is an application-level message.
> I think that the incrementing of the received sequence number is deliberate, per ClientGmsImpl ("see doc/design/varia2.txt for details"). If I understand correctly, it's intended to compensate for the fact that the digest doesn't include the broadcast VIEW message.
> However, CFS-A-tinkywinky shows this:
> {noformat}
> 2012-04-18 19:26:49.125 [ViewHandler,Clumpy Test Cluster,CFS-A-tinkywinky] WARN  org.jgroups.protocols.pbcast.GMS - CFS-B-chucklebrothers already present; returning existing view [CFS-A-tinkywinky|3] [CFS-A-tinkywinky, CFS-A-chucklebrothers, CFS-B-tinkywinky, CFS-B-chucklebrothers]
> 2012-04-18 19:26:49.126 [ViewHandler,Clumpy Test Cluster,CFS-A-tinkywinky] TRACE org.jgroups.protocols.pbcast.GMS - found no members to add or remove, will not create new view
> {noformat}
> My thinking is that since the coordinator does not broadcast a VIEW message, it's a mistake for CFS-B-chucklebrothers to have fixed up the digest.
> Possibly the fix is simply to remove the block of code in CoordGmsImpl.handleMembershipChange() for "found no members to add or remove", and send out a view anyway?
> Thanks!
> David
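
The off-by-one described in the report can be reproduced with a toy model. The sketch below is not JGroups code: `DigestFixupSketch`, `ReceiveWindow` and `deliver` are hypothetical names, and the window is simplified to in-order delivery with no retransmission. It assumes only what the traces show: the coordinator's digest says 25, the joiner installs 26 on the expectation that seqno 26 will be the VIEW broadcast, and the coordinator then sends an application message as 26 instead.

```java
import java.util.ArrayList;
import java.util.List;

public class DigestFixupSketch {

    /** Toy receive window: delivers only seqnos above the highest delivered one. */
    static final class ReceiveWindow {
        private long highestDelivered;

        ReceiveWindow(long highestDelivered) {
            this.highestDelivered = highestDelivered;
        }

        /** Returns true if the message is delivered, false if it is discarded as already seen. */
        boolean receive(long seqno) {
            if (seqno <= highestDelivered) {
                return false;
            }
            highestDelivered = seqno;
            return true;
        }
    }

    /**
     * Simulates a joiner that installs a digest and then receives messages.
     * joinerBumpsDigest models the assumed fix-up: the joiner adds 1 to the
     * coordinator's seqno because it expects the next message to be the VIEW.
     */
    static List<Long> deliver(long digestSeqno, boolean joinerBumpsDigest, long... incoming) {
        long start = joinerBumpsDigest ? digestSeqno + 1 : digestSeqno;
        ReceiveWindow window = new ReceiveWindow(start);
        List<Long> delivered = new ArrayList<>();
        for (long seqno : incoming) {
            if (window.receive(seqno)) {
                delivered.add(seqno);
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        // No VIEW was broadcast, so seqno 26 is an application message and is lost.
        System.out.println(deliver(25, true, 26, 27));   // prints [27]
        // Without the fix-up, both messages are delivered.
        System.out.println(deliver(25, false, 26, 27));  // prints [26, 27]
    }
}
```

In this model the fix-up is only safe when the message at digest + 1 really is the VIEW, which is the condition that fails when the coordinator logs "found no members to add or remove, will not create new view".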


        

