[jboss-jira] [JBoss JIRA] (JGRP-2162) Failed to send broadcast when opening the connection

Tuesday, 9 May 2017

    [ https://issues.jboss.org/browse/JGRP-2162?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-2162:
--------------------------------

This looks like 2 separate issues. Let me address the first one first.

When {{TCPPING.initial_hosts}} lists only A, the address caches look like this after B and C have joined:
* A: ABC
* B: AB
* C: AC

When sending a multicast, A would succeed as it has the addresses of all other members,
but B would fail to send the message to C, and C would fail to send it to B, because
neither has the other's address in its cache.

There are 3 ways to resolve this:
1. Include all hosts (or as many as possible) in {{TCPPING.initial_hosts}}
2. Set {{TCPPING.send_cache_on_join}} to {{true}}
3. Use a dynamic discovery protocol
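
As a rough sketch, options 1 and 2 could look like the fragments below. They are alternatives, not meant to be used together, and the additional ports 7801 and 7802 are assumptions for illustration only; the remaining attributes are copied from the reporter's configuration.
{code:xml}
<!-- Option 1: list all members in initial_hosts.
     Ports 7801 and 7802 are assumed here; use the real member addresses. -->
<TCPPING async_discovery="true"
         initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800],localhost[7801],localhost[7802]}"
         port_range="0"/>

<!-- Option 2: keep A as the only initial host, but have the
     discovery cache sent out when a new member joins. -->
<TCPPING async_discovery="true"
         initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800]}"
         port_range="0"
         send_cache_on_join="true"/>
{code}
With either variant, B and C should end up with each other's addresses in their caches, so each member can reach all the others.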

Note that this is not an issue in {{UDP}}, as a (group) multicast results in a single IP
multicast, whereas in {{TCP_NIO2}} we have to send the same message once to each member,
which requires that member's address.

...
 Failed to send broadcast when opening the connection
 ----------------------------------------------------

                 Key: JGRP-2162
                 URL: https://issues.jboss.org/browse/JGRP-2162
             Project: JGroups
          Issue Type: Bug
            Reporter: Radim Vansa
            Assignee: Bela Ban
             Fix For: 4.0.3

         Attachments: TcpNio2McastTest.java, infinispan_2.log.gz

 IRC discussion:
 {quote}
 <rvansa> bela_: Hi Bela, I have a weird failure in one test that seems to be rooted in JGroups.
TCP_NIO2 is in charge, and there's a broadcast message to all nodes, but it seems
it's not received on the other side.
 <bela_> rvansa: reproducible?
 <rvansa> bela_: it happens when the connection to a node is just being opened: I
have added some trace logs and just a moment before writing to the NioConnection.send_buf
it was in state "connection pending"
 <rvansa> bela_: sort of, after tens of runs of that test (on my machine) - and
I've seen it first time in CI, so it could be
 <bela_> rvansa: NioConnection buffers writes up to a certain extent, then discards
anything over the buffer limit
 <bela_> rvansa: max_send_buffers (default: 10). But retransmission should fix this,
unless you don’t wait long enough
 <rvansa> bela_: I don't think it should go over the limit
 <rvansa> bela_: the test is not doing anything else, just sending CommitCommand
(that should be couple hundred bytes at most) and then waiting
 <rvansa> bela_: according to the traces I've added, Buffers.write returned
false when writing the local address, and then true when writing the actual message
 {quote}
 I have been trying to write a reproducer and found that the problem is related to the fact
that the failing test uses a custom (fake) discovery protocol that doesn't open the
connection during startup. In my ~reproducer I had to modify tcp-nio.xml to use TCPPING
with only the first node in the hosts list (localhost[7800]):
 {code:xml}
 <TCPPING async_discovery="true"
initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800]}"
port_range="0"/>
 {code}
 As a result, the physical connection is not opened by discovery. However, the
reproducer has an (always reproducible) flaw: it does not send the message to the third
node at all, so the test fails.
 Note that increasing the timeout in the request options does not help.

--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
