[
https://issues.redhat.com/browse/JGRP-2486?page=com.atlassian.jira.plugin...
]
lukas brandl commented on JGRP-2486:
------------------------------------
Thank you for the quick reply and the recommendations.
To clarify: the dead node is never suspected by the surviving node in this case.
The thread getting stuck on the bundler (or the TCP connection) isn’t a problem in itself,
but it appears to be the reason why the node is never suspected.
We are aware that there are newer alternatives to FD, but we can’t easily change the
protocol stack: if the change is incompatible with previous versions, the cluster can’t
be upgraded in a rolling fashion.
FD Monitor gets stuck on TransferQueueBundler
---------------------------------------------
Key: JGRP-2486
URL:
https://issues.redhat.com/browse/JGRP-2486
Project: JGroups
Issue Type: Bug
Affects Versions: 4.0.22
Reporter: lukas brandl
Assignee: Bela Ban
Priority: Major
Attachments: Main.java, stack-trace.txt
Apparently there is an issue in the FD protocol. When a cluster node is disconnected and
the disconnect isn't handled by FD_SOCK, FD stops sending heartbeats after a while.
This only happens when the queue of the TransferQueueBundler fills up before the node is
suspected.
The stack trace shows that the FD$Monitor is blocked by the bundler. This is probably the
reason why the heartbeat timeouts are not noticed.
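The suspected mechanism can be illustrated with plain java.util.concurrent classes. The
sketch below is not JGroups code: the class name, the queue size and the "timeout check"
flag are hypothetical stand-ins. It assumes that the heartbeat send and the timeout check
run on the same thread, which is what the stack trace suggests, and that nothing drains
the queue because the sender is stuck on the dead connection.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class BlockedMonitorSketch {

    /**
     * Returns true if the monitor thread is still blocked in put() after a
     * grace period and never reached its (hypothetical) timeout check.
     */
    static boolean monitorBlocksWhenQueueFull() throws InterruptedException {
        // Hypothetical stand-in for the bundler's bounded queue (capacity 2).
        // Nothing takes from it, mimicking a sender stuck on a dead connection.
        BlockingQueue<String> bundlerQueue = new ArrayBlockingQueue<>(2);
        bundlerQueue.put("msg-1");
        bundlerQueue.put("msg-2"); // the queue is now full

        AtomicBoolean timeoutChecked = new AtomicBoolean(false);

        Thread monitor = new Thread(() -> {
            try {
                // Blocks indefinitely, because put() waits for free space.
                bundlerQueue.put("heartbeat");
                // Stand-in for the timeout/suspicion logic; not reached
                // while the thread is blocked above.
                timeoutChecked.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "monitor-sketch");
        monitor.setDaemon(true);
        monitor.start();

        monitor.join(500);                 // give the monitor time to run
        boolean stuck = monitor.isAlive(); // still inside put()
        boolean neverChecked = !timeoutChecked.get();
        monitor.interrupt();               // release the thread before returning
        return stuck && neverChecked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("monitor blocked: " + monitorBlocksWhenQueueFull());
    }
}
```

In this sketch, replacing put() with a timed offer() would let the thread continue to the
timeout check. Whether something equivalent is possible in FD or the bundler is a
question for the maintainers; it is not claimed here.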
--
This message was sent by Atlassian Jira
(v7.13.8#713008)