[jboss-jira] [JBoss JIRA] (JGRP-2167) Highest seqno is not resent nor recorded on receivers

Wed May 10 03:03:00 EDT 2017

    [ https://issues.jboss.org/browse/JGRP-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403998#comment-13403998 ] 

Bela Ban commented on JGRP-2167:
--------------------------------

The problem with setting {{resend_last_seqno_max_times}} to a high value is that it optimizes for a case that almost never happens and, in most cases, causes unnecessary traffic and thread activity.

The lost view will eventually be delivered when either (1) the view sender sends another multicast or (2) STABLE kicks in. However, (1) might never happen and (2) takes time, depending on STABLE's configuration.
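
Case (2) can be pictured with a minimal sketch. It assumes a simplified model in which a stability digest carries the highest seqno known for the sender and the receiver compares it with its own highest received seqno. The names {{StableGapCheck}} and {{missingRange}} are hypothetical and are not JGroups APIs.

```java
// Hypothetical, simplified model; this is not the actual NAKACK2/STABLE code.
final class StableGapCheck {

    // Returns {first, last} seqnos to request from the sender,
    // or an empty array if the receiver is not missing anything.
    static long[] missingRange(long highestReceived, long digestHighest) {
        if (digestHighest <= highestReceived)
            return new long[0];
        return new long[] {highestReceived + 1, digestHighest};
    }

    public static void main(String[] args) {
        // The receiver has seen up to seqno 41; the digest reports 42 (the lost view).
        long[] range = missingRange(41, 42);
        System.out.println(range.length == 0
                ? "nothing missing"
                : "request " + range[0] + ".." + range[1]);
    }
}
```

In this model the gap is only noticed once a digest arrives, which is why the delay depends on how STABLE is configured.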

There are ways to improve this, but I'm not sure I like any of them:
1. Have the last message sender task get acks for its highest seqno from all cluster members
2. Let the receiver continue asking the sender for retransmission until it gets that last seqno, or until higher seqnos from the sender are seen

#1 causes additional traffic that's a function of the cluster size and the frequency of sending. E.g. if a sender sends a multicast every 2 seconds, this most likely (depending on the xmit_interval config) causes another multicast to be sent (last-seqno), plus N unicast acks to be received.
This also duplicates part of the functionality of STABLE.
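
As a rough back-of-the-envelope sketch of the cost of #1, assume the worst case in which every multicast triggers one last-seqno multicast plus one unicast ack per member. The class and method names are hypothetical, and the cluster size of 100 is only an illustrative value.

```java
// Hypothetical estimate of the extra traffic caused by option 1; not JGroups code.
final class AckTrafficEstimate {

    // Extra messages per minute: for each send, one last-seqno multicast
    // plus one unicast ack from each of the clusterSize members.
    static long extraMessagesPerMinute(int clusterSize, double sendIntervalSeconds) {
        double sendsPerMinute = 60.0 / sendIntervalSeconds;
        return Math.round(sendsPerMinute * (1 + clusterSize));
    }

    public static void main(String[] args) {
        // A sender multicasting every 2 seconds in an (assumed) 100-member cluster.
        System.out.println(extraMessagesPerMinute(100, 2.0)); // prints 3030
    }
}
```

The overhead grows linearly with both the cluster size and the send frequency, and that is for a single sender only.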

#2 does not help if the last-seqno message itself is lost. It also leads to unnecessary (unicast) traffic.
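
A minimal sketch of the receiver-side logic in #2, assuming a periodic task that checks whether another retransmit request is needed. {{LastSeqnoRequester}} and its methods are hypothetical names, and gap tracking is deliberately reduced to two counters.

```java
// Hypothetical, simplified receiver state for option 2; not the NAKACK2 implementation.
final class LastSeqnoRequester {
    private long highestReceived;  // highest seqno actually received from the sender
    private long highestAnnounced; // highest seqno announced by a last-seqno message

    LastSeqnoRequester(long highestReceived) {
        this.highestReceived = highestReceived;
        this.highestAnnounced = highestReceived;
    }

    // The sender's last-seqno message arrived.
    void onHighestSeqno(long seqno) {
        highestAnnounced = Math.max(highestAnnounced, seqno);
    }

    // A regular (or retransmitted) message arrived. Seeing a higher seqno also
    // ends the retry loop; normal gap detection is assumed to take over from there.
    void onMessage(long seqno) {
        highestReceived = Math.max(highestReceived, seqno);
    }

    // Called periodically: true means "send another retransmit request to the sender".
    boolean shouldRequestRetransmit() {
        return highestReceived < highestAnnounced;
    }

    public static void main(String[] args) {
        LastSeqnoRequester r = new LastSeqnoRequester(41);
        r.onHighestSeqno(42);
        System.out.println(r.shouldRequestRetransmit()); // true: keep asking for 42
        r.onMessage(42);
        System.out.println(r.shouldRequestRetransmit()); // false: 42 has arrived
    }
}
```

If {{onHighestSeqno}} is never called because the last-seqno message was lost, {{shouldRequestRetransmit}} stays false, which is the weakness noted above.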

I think the best solution in such an edge case is to reduce the timeouts in STABLE itself and let it run its course.

> Highest seqno is not resent nor recorded on receivers
> -----------------------------------------------------
>
>                 Key: JGRP-2167
>                 URL: https://issues.jboss.org/browse/JGRP-2167
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 4.0.1
>            Reporter: Radim Vansa
>            Assignee: Bela Ban
>            Priority: Minor
>             Fix For: 4.0.3
>
>
> I am investigating an issue in a stress test which leads me to a situation where in a TCP-based configuration a {{GMS[VIEW]}} is broadcast to all nodes, but it is not received by some of them. Soon after that there's a {{NAKACK2.HIGHEST_SEQNO}} that causes the node that is missing the last seqno to resend it, but the retransmit is not received either. There are no further retries, and generally no NAKACK2 activity until about 30 seconds later (when another node leaves after some timeout in the test).
> The receiver does not keep asking for retransmissions until it gets them, but it seems that {{NAKACK2.handleHighestSeqno}} doesn't update {{Table.hr}} (not sure if having highest received set to non-received msg would be legal, though).
> The sender uses the default value {{NAKACK2.resend_last_seqno_max_times=1}}, and as there are no further mcast messages, the highest sent seqno does not change on the sender.
