NAKACK: second line of defense for requested retransmissions that are not found
-------------------------------------------------------------------------------
Key: JGRP-1362
URL: https://issues.jboss.org/browse/JGRP-1362
Project: JGroups
Issue Type: Enhancement
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 2.12.2, 3.1
When the original sender B is asked by A to retransmit message M, but no longer has M
in its retransmission table, it should tell A. Otherwise A will keep sending retransmission
requests to B until A or B leaves.
This problem should have been fixed by JGRP-1251. If it turns out it wasn't, this JIRA
(1) provides a second line of defense to stop the endless retransmission requests and
(2) gives us valuable diagnostic information to fix the underlying problem (should
there still be one).
Problem:
- A's NakReceiverWindow (NRW) for B has a highest_delivered seqno of 50
- B's own NRW, however, is at 200. B garbage collected messages up to 150.
- When B sends message 201, A will ask B for retransmission of [51-200]
- B will retransmit messages [150-200], but it cannot send messages [51-149] because it
  no longer has them
- A will add messages [150-200], but its NRW is still at 50 (highest_delivered)
- A will continue asking B for messages [51-149] (it does have [150-201])
- This will go on forever, or until B or A leaves
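The scenario above can be reproduced with a small model. This is only a sketch under stated assumptions: StuckRetransmissionSketch, requestMissing and senderLowestRetained are hypothetical names made up for illustration and are not part of JGroups. The sender is reduced to a single number, the lowest seqno it still retains.

```java
import java.util.TreeMap;

// Hypothetical model of A's view of sender B. Not JGroups code.
class StuckRetransmissionSketch {
    // Messages received from B that cannot be delivered yet because of a gap below them.
    final TreeMap<Long, String> received = new TreeMap<>();
    long highestDelivered;

    StuckRetransmissionSketch(long highestDelivered) {
        this.highestDelivered = highestDelivered;
    }

    // Delivers messages for as long as the next expected seqno is present.
    void deliverContiguous() {
        while (received.containsKey(highestDelivered + 1)) {
            received.remove(++highestDelivered);
        }
    }

    // One retransmission round: A asks for every seqno it is missing up to highestSeen.
    // The sender can only answer for seqnos >= senderLowestRetained (assumed parameter).
    // Returns the number of seqnos that are still missing after the round.
    int requestMissing(long highestSeen, long senderLowestRetained) {
        int stillMissing = 0;
        for (long seqno = highestDelivered + 1; seqno <= highestSeen; seqno++) {
            if (received.containsKey(seqno)) {
                continue; // already have it, nothing to request
            }
            if (seqno >= senderLowestRetained) {
                received.put(seqno, "msg-" + seqno); // sender retransmits it
            } else {
                stillMissing++; // sender silently has nothing to send
            }
        }
        deliverContiguous();
        return stillMissing;
    }
}
```

With highestDelivered = 50, message 201 already received, and senderLowestRetained = 150, every call to requestMissing(201, 150) reports the same 99 missing seqnos and highestDelivered never moves, which is the endless loop described in the list.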
Solution:
- When the *original sender* B of message M receives a retransmission request for M (from
A), and it doesn't have M in its retransmission table, it should send back a
MSG_NOT_FOUND message to A including B's digest
- When A receives the MSG_NOT_FOUND message, it does the following:
  - It logs its own NRW for B
  - It logs B's digest
  - It logs its digest history
  (This information is valuable for investigating the underlying issue)
  - Then A's NRW for B is adjusted:
    - The highest_delivered seqno is set to B.digest.highest_delivered
    - All messages in xmit_table below B.digest.highest_delivered are removed
    - All retransmission tasks in the retransmitter <= B.digest.highest_delivered are
      cancelled and removed
    (This will stop the retransmission)
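The two sides of the proposal can be sketched as follows. This is a minimal illustration, not the actual NAKACK change: MsgNotFoundSketch, senderMustReplyNotFound, onMsgNotFound and all field names are hypothetical, the digest is reduced to a single highest_delivered value, and logging is modelled as a list of strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of A's window for B. Names are illustrative, not JGroups API.
class MsgNotFoundSketch {
    long highestDelivered;
    final TreeMap<Long, String> xmitTable = new TreeMap<>();   // seqno -> message
    final TreeSet<Long> retransmitTasks = new TreeSet<>();     // seqnos with a pending request
    final List<String> log = new ArrayList<>();                // stand-in for real logging

    MsgNotFoundSketch(long highestDelivered) {
        this.highestDelivered = highestDelivered;
    }

    // Sender side (B): a MSG_NOT_FOUND reply is due when the requested seqno
    // is no longer in B's retransmission table.
    static boolean senderMustReplyNotFound(TreeMap<Long, String> senderTable, long requestedSeqno) {
        return !senderTable.containsKey(requestedSeqno);
    }

    // Receiver side (A): handle MSG_NOT_FOUND carrying B's highest_delivered seqno.
    void onMsgNotFound(String sender, long senderHighestDelivered, List<String> digestHistory) {
        // Record diagnostics first, before any state is changed.
        log.add("own window for " + sender + ": highest_delivered=" + highestDelivered
                + ", xmit_table=" + xmitTable.keySet());
        log.add("digest from " + sender + ": highest_delivered=" + senderHighestDelivered);
        log.add("digest history: " + digestHistory);

        // Adjust the window as described in the list above.
        highestDelivered = senderHighestDelivered;
        xmitTable.headMap(senderHighestDelivered, false).clear();      // strictly below
        retransmitTasks.headSet(senderHighestDelivered, true).clear(); // less than or equal
    }
}
```

For example, if A is at highest_delivered = 50 with pending tasks for 51 to 149 and B reports 200, a single call to onMsgNotFound leaves no pending tasks, so A stops asking for messages B can no longer provide.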
Again, this is a second line of defense, which should never be used. If the underlying
problem does occur, however, we'll have valuable information in the logs to diagnose
what went wrong.