[jboss-jira] [JBoss JIRA] (JGRP-1455) Message lost in NAKACK2 due to digest error
David Hotham (JIRA)
jira-events at lists.jboss.org
Wed May 9 04:39:18 EDT 2012
[ https://issues.jboss.org/browse/JGRP-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12691254#comment-12691254 ]
David Hotham commented on JGRP-1455:
------------------------------------
Hi,
Have you been able to convince yourself that your proposed fix is good yet? It has been holding up well in my testing.
While I have local fixes for nearly all of the issues that I've opened (per the series of pull requests), I would very much prefer to be running with a version of JGroups that has been formally released. Can you say anything about if / when you expect to take a look at JGRP-1449, JGRP-1451, JGRP-1452, this issue, and JGRP-1458?
I do understand that a man who buys free software has no special claim on your time - and I'm very happy with the help you've been able to give me so far! It would just be helpful to my own planning to know what your intentions are.
Thanks!
> Message lost in NAKACK2 due to digest error
> -------------------------------------------
>
> Key: JGRP-1455
> URL: https://issues.jboss.org/browse/JGRP-1455
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 3.0.9
> Reporter: David Hotham
> Assignee: Bela Ban
> Fix For: 3.0.10, 3.1
>
>
> Hello,
> In this issue an application-level message broadcast to the cluster is being discarded by NAKACK2, on a new joiner.
> I think I understand roughly what's going on - skip to the end for a suggested fix!
> I'll keep all my trace so that I can investigate further details if needed.
> So, let's start with trace from the new joiner (CFS-B-chucklebrothers), showing that:
> - it sets a digest claiming that the sequence numbers for CFS-A-tinkywinky are 26 (26)
> - CFS-A-tinkywinky then sends messages with sequence numbers 26 and 27
> - Only message 27 is passed upwards
> {noformat}
> 2012-04-18 19:26:49.133 [ForkJoinPool-1-worker-3] DEBUG org.jgroups.protocols.pbcast.NAKACK2 -
> [CFS-B-chucklebrothers setDigest()]
> existing digest: []
> new digest: CFS-A-tinkywinky: [26 (26)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)]
> resulting digest: CFS-A-tinkywinky: [26 (26)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.TCP - received [dst: <null>, src: CFS-A-tinkywinky (3 headers), size=39 bytes], headers are SEQUENCER: WRAPPED_BCAST (tag=[CFS-B-chucklebrothers|0]), NAKACK2: [MSG, seqno=26], TCP: [channel_name=Clumpy Test Cluster]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.TCP - received [dst: <null>, src: CFS-A-tinkywinky (3 headers), size=51 bytes], headers are SEQUENCER: BCAST (tag=[CFS-A-tinkywinky|11]), NAKACK2: [MSG, seqno=27], TCP: [channel_name=Clumpy Test Cluster]
> 2012-04-18 19:26:49.200 [Incoming-2,Clumpy Test Cluster,CFS-B-chucklebrothers] TRACE org.jgroups.protocols.pbcast.NAKACK2 - CFS-B-chucklebrothers: received CFS-A-tinkywinky#27
> {noformat}
> And here's the trace from CFS-A-tinkywinky showing that:
> - the digest that it sent only claimed sequence numbers 25 (25)
> {noformat}
> 2012-04-18 19:26:49.132 [OOB-1,Clumpy Test Cluster,CFS-A-tinkywinky] TRACE org.jgroups.protocols.TCP - sending msg to CFS-B-chucklebrothers, src=CFS-A-tinkywinky, headers are GMS: GmsHeader[JOIN_RSP]: join_rsp=view: [CFS-A-tinkywinky|3] [CFS-A-tinkywinky, CFS-A-chucklebrothers, CFS-B-tinkywinky, CFS-B-chucklebrothers], digest: CFS-A-tinkywinky: [25 (25)], CFS-A-chucklebrothers: [0 (0)], CFS-B-tinkywinky: [0 (0)], CFS-B-chucklebrothers: [0 (0)], UNICAST2: DATA, seqno=2, conn_id=3, TCP: [channel_name=Clumpy Test Cluster]
> {noformat}
> By looking at trace from the other members receiving message 26, I can see that this is an application-level message.
> I think that the incrementing of the received sequence number is deliberate, per ClientGmsImpl ("see doc/design/varia2.txt for details"). If I understand correctly, it's intended to compensate for the fact that the digest doesn't include the broadcast VIEW message.
> However, CFS-A-tinkywinky shows this:
> {noformat}
> 2012-04-18 19:26:49.125 [ViewHandler,Clumpy Test Cluster,CFS-A-tinkywinky] WARN org.jgroups.protocols.pbcast.GMS - CFS-B-chucklebrothers already present; returning existing view [CFS-A-tinkywinky|3] [CFS-A-tinkywinky, CFS-A-chucklebrothers, CFS-B-tinkywinky, CFS-B-chucklebrothers]
> 2012-04-18 19:26:49.126 [ViewHandler,Clumpy Test Cluster,CFS-A-tinkywinky] TRACE org.jgroups.protocols.pbcast.GMS - found no members to add or remove, will not create new view
> {noformat}
> My thinking is that since the coordinator does not broadcast a VIEW message, it's a mistake for CFS-B-chucklebrothers to have fixed up the digest.
> Possibly the fix is simply to remove the block of code in CoordGmsImpl.handleMembershipChange() for "found no members to add or remove", and send out a view anyway?
> Thanks!
> David
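
The off-by-one described in the report can be illustrated with a toy model. This is a minimal sketch, not the NAKACK2 or GMS code: `deliver` and `joiner_start` are hypothetical helpers, the payload names are made up, and the only values taken from the trace are the digest seqno 25 and the message seqnos 26 and 27. It assumes a receiver that treats any seqno at or below its highest delivered seqno as already seen, and a joiner that bumps the digest by one to skip the expected VIEW broadcast.

```python
# Toy model only: hypothetical helpers, not the JGroups implementation.

def joiner_start(digest_seqno):
    """Joiner-side fix-up: assume the sender's next broadcast is the VIEW
    (already received in the JOIN_RSP) and skip over it."""
    return digest_seqno + 1


def deliver(highest_delivered, incoming):
    """Return (payloads passed up, seqnos discarded) for one sender.

    incoming is a list of (seqno, payload) pairs. A seqno at or below
    highest_delivered is treated as already seen and discarded.
    Gaps and retransmission are deliberately left out of this sketch.
    """
    delivered, discarded = [], []
    for seqno, payload in sorted(incoming):
        if seqno <= highest_delivered:
            discarded.append(seqno)
        elif seqno == highest_delivered + 1:
            delivered.append(payload)
            highest_delivered = seqno
    return delivered, discarded


start = joiner_start(25)  # digest sent by the coordinator said 25 (25)

# Case 1: the coordinator broadcasts a VIEW as seqno 26.
# The joiner discards only the VIEW, which it already has.
with_view = deliver(start, [(26, "VIEW"), (27, "app-1")])

# Case 2: no VIEW is broadcast ("found no members to add or remove"),
# so seqno 26 carries an application message, which is discarded.
without_view = deliver(start, [(26, "app-1"), (27, "app-2")])

print(with_view)
print(without_view)
```

In the second case the first application message is dropped while the second is passed up, which matches the trace where only seqno 27 reaches the application. Under this model, either sending the view anyway (as suggested in the report) or skipping the fix-up when no view is broadcast would keep the joiner's starting seqno and the sender's stream in step.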