[JBoss JIRA] (JGRP-2463) TransferQueueBundler: Message to stopped node blocks the bundler thread

Wednesday, 1 April 2020

    [
https://issues.redhat.com/browse/JGRP-2463?page=com.atlassian.jira.plugin...
] 

Dan Berindei commented on JGRP-2463:
------------------------------------

...
 The log snippet below shows that the connection attempt to the left
member takes 4ms, so this should not be an issue: 
Oops, I should have checked the logs! I was convinced there was no ICMP error message in
their test because they're killing the container, not just the server process.

I now have another theory: each {{TransferQueueBundler.run()}} iteration drains the entire
contents of the queue into {{remove_queue}}, then tries to send the messages one by one.
If there's an exception (e.g. {{java.net.ConnectException}}) sending any of those
messages, it's only caught at the end of the iteration, and the next iteration drops
all the unsent messages with {{remove_queue.clear()}}.
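
A minimal sketch of the suspected behaviour, under stated assumptions: this is not the JGroups code, only a simplified loop with the same drain / send / clear shape. The class {{DrainAndDropSketch}}, the {{send()}} stand-in for the transport, and the destination names are all hypothetical.

```java
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DrainAndDropSketch {
    /** Hypothetical stand-in for the transport: fails for the destination named "stopped". */
    static void send(String dest, List<String> delivered) throws ConnectException {
        if (dest.equals("stopped"))
            throw new ConnectException("cannot connect to " + dest);
        delivered.add(dest);
    }

    /** Runs bundler-like iterations until the queue is empty; returns what was actually sent. */
    public static List<String> run(List<String> destinations) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(destinations);
        List<String> removeQueue = new ArrayList<>();
        List<String> delivered = new ArrayList<>();
        while (!queue.isEmpty()) {
            try {
                removeQueue.clear();        // discards anything the previous iteration left unsent
                queue.drainTo(removeQueue); // takes the entire queue contents in one go
                for (String dest : removeQueue)
                    send(dest, delivered);
            } catch (ConnectException e) {
                // Caught only after the loop was aborted: the messages that came
                // after the failing one stay in removeQueue and are never sent.
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        // Messages to B and C are lost because they were drained together with "stopped".
        System.out.println(run(List.of("A", "stopped", "B", "C"))); // prints [A]
    }
}
```

In this sketch, catching the exception per message inside the {{for}} loop would let B and C go out; whether that is the right fix for the real bundler is a separate question.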

Since {{UNICAST3}} resends the last message to the missing node every
{{UNICAST3.xmit_interval}} ms, some messages could be dropped more than once, leading to
total latencies much higher than {{UNICAST3.xmit_interval}}.

...
 Can this be reproduced? 
I assume the failure can be reproduced by the Keycloak team, although they haven't
added any more comments to KEYCLOAK-13310.

...
 We could experiment with a bundler that has 1 queue per destination
(and 1 associated thread dequeuing), and RED dropping messages before/when the queue gets
full. However, this is too complicated a change... 
That's what I had in mind, in fact adding a comment to JGRP-2462 was my main
motivation to open this JIRA :)
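
A rough sketch of the one-queue-per-destination idea, to make the discussion concrete. Assumptions: messages and destinations are plain strings, the transport is a hypothetical {{BiConsumer}}, and the drop policy is a simple tail drop when the queue is full rather than RED. None of the names below exist in JGroups.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

public class PerDestinationBundlerSketch {
    private final int capacity;
    private final BiConsumer<String, String> transport; // (destination, message), hypothetical
    private final Map<String, BlockingQueue<String>> queues = new ConcurrentHashMap<>();
    private volatile boolean running = true;

    public PerDestinationBundlerSketch(int capacity, BiConsumer<String, String> transport) {
        this.capacity = capacity;
        this.transport = transport;
    }

    /** Enqueues without blocking; returns false if the destination's queue is full (message dropped). */
    public boolean send(String dest, String msg) {
        return queues.computeIfAbsent(dest, this::startQueue).offer(msg);
    }

    /** Creates the bounded queue for one destination and starts its dedicated sender thread. */
    private BlockingQueue<String> startQueue(String dest) {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(capacity);
        Thread t = new Thread(() -> drain(dest, q), "bundler-" + dest);
        t.setDaemon(true);
        t.start();
        return q;
    }

    private void drain(String dest, BlockingQueue<String> q) {
        while (running) {
            try {
                String msg = q.poll(100, TimeUnit.MILLISECONDS);
                if (msg != null)
                    transport.accept(dest, msg); // may block, but only this destination's thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } catch (RuntimeException e) {
                // A failed send loses only this message; the rest of the queue is kept.
            }
        }
    }

    public void stop() { running = false; }

    public static void main(String[] args) throws InterruptedException {
        PerDestinationBundlerSketch bundler = new PerDestinationBundlerSketch(2, (dest, msg) -> {
            if (dest.equals("stopped")) {
                // Simulates a connect attempt that blocks for 2s.
                try { Thread.sleep(2000); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            } else {
                System.out.println("sent " + msg + " to " + dest);
            }
        });
        bundler.send("stopped", "LEAVE_RSP");
        bundler.send("B", "hello"); // not delayed by the blocked "stopped" sender
        Thread.sleep(200);
        bundler.stop();
    }
}
```

The cost is one thread and one bounded queue per destination, which is part of why this is a larger change than it first looks.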

...
 I think we should use TCP_NIO2 for scenarios in which TCP writes can
block. I guess I should move JGRP-2108 up... wdyt? 
+100

...
 TransferQueueBundler: Message to stopped node blocks the bundler
thread
 -----------------------------------------------------------------------

                 Key: JGRP-2463
                 URL: https://issues.redhat.com/browse/JGRP-2463
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 4.2.1
            Reporter: Dan Berindei
            Assignee: Bela Ban
            Priority: Major
             Fix For: 4.2.2, 5.0.0.Alpha4

 {{TransferQueueBundler}} sends all the messages from a single thread. When one of the
{{TP.doSend()}} calls blocks, the bundler thread no longer makes any progress, and it
doesn't send messages to any destination, even if {{TP.doSend()}} is only slow for one
particular destination.
 One example is when sending a message to a stopped node, e.g. the coordinator sending a
{{LEAVE_RSP}} after the leaver has already stopped. The bundler thread calls
{{TP.doSend()}}, the connection no longer exists, so it ends up calling
{{BaseServer.createConnection()}}. If the stopped node's machine is no longer up, or it
is configured to drop messages to closed ports, opening the connection blocks the bundler
thread for {{TCP.sock_conn_timeout}} (default: 2s).
 {{UNICAST3}} also retransmits the highest sent message every {{UNICAST3.xmit_interval}}
(default: 500ms), for {{UNICAST3.max_retransmit_time}} (default: 1 min), so the bundler
thread will block more than once for the same message.
 I assume the bundler thread will also block if the transport is {{TCP}}, one of the
destinations is overloaded, and the TCP connection's send buffer is full. Normally
applications try to spread the workload evenly among members, but e.g. with RELAY2 not all
the members will be site masters. 
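
To illustrate the head-of-line blocking described above, here is a small worked example rather than real transport code. It assumes a single sender thread that processes messages strictly in order, a 2000 ms cost for the stopped node (the {{TCP.sock_conn_timeout}} default quoted above) and a made-up 1 ms cost for every other send. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;

public class SingleThreadSendSketch {
    /**
     * Returns, for each message, the time in ms at which a single sender thread
     * finishes sending it, given a per-destination send cost. Destinations that
     * are not in the map are assumed to cost 1 ms.
     */
    public static long[] completionTimesMs(String[] dests, Map<String, Long> costMs) {
        long[] done = new long[dests.length];
        long clock = 0;
        for (int i = 0; i < dests.length; i++) {
            clock += costMs.getOrDefault(dests[i], 1L); // sends are strictly sequential
            done[i] = clock;
        }
        return done;
    }

    public static void main(String[] args) {
        String[] dests = {"A", "stopped", "B"};
        long[] done = completionTimesMs(dests, Map.of("stopped", 2000L));
        // B is healthy, but its message waits behind the blocked connect to "stopped".
        System.out.println(Arrays.toString(done)); // prints [1, 2001, 2002]
    }
}
```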

--
This message was sent by Atlassian Jira
(v7.13.8#713008)
