Bela Ban commented on JGRP-2162:
--------------------------------
This looks like 2 separate issues. Let me address the first one first.
When {{initial_hosts}} lists only A, the caches will look like this after B and C join:
* A: ABC
* B: AB
* C: AC
When sending a multicast, A would succeed because it has the addresses of all other members,
but B would fail to send the message to C, and C would fail to send it to B.
There are 3 ways to resolve this:
1. Include all hosts (or as many as possible) in {{TCPPING.initial_hosts}}
2. Set {{TCPPING.send_cache_on_join}} to {{true}}
3. Use a dynamic discovery protocol
Note that this is not an issue in {{UDP}}, as a (group) multicast results in a single IP
multicast, whereas in {{TCP_NIO2}} we have to send the same message once to each member.
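The cache contents above can be reproduced with a small toy model. This is only a sketch and not JGroups code: the class {{AddressCacheModel}} and its methods are hypothetical, and it assumes that a joiner learns only the addresses of the already running members listed in {{initial_hosts}}, and that those members learn the joiner's address in return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of per-member address caches (hypothetical, not part of JGroups). */
class AddressCacheModel {

    /** Members join in order; each joiner contacts only the running hosts in initialHosts. */
    static Map<String, Set<String>> cachesAfterJoin(List<String> joinOrder, Set<String> initialHosts) {
        Map<String, Set<String>> caches = new LinkedHashMap<>();
        for (String joiner : joinOrder) {
            Set<String> own = new LinkedHashSet<>();
            own.add(joiner); // every member knows its own address
            caches.put(joiner, own);
            for (String host : initialHosts) {
                Set<String> hostCache = caches.get(host);
                if (hostCache == null || host.equals(joiner)) {
                    continue; // host is not running yet, or is the joiner itself
                }
                hostCache.add(joiner); // assumption: the contacted host learns the joiner
                own.add(host);         // assumption: the joiner learns the contacted host
            }
        }
        return caches;
    }

    /** A multicast is modelled as one send per member; a send fails if the target is not cached. */
    static List<String> failedSends(Map<String, Set<String>> caches) {
        List<String> failures = new ArrayList<>();
        for (String sender : caches.keySet()) {
            for (String target : caches.keySet()) {
                if (!target.equals(sender) && !caches.get(sender).contains(target)) {
                    failures.add(sender + "->" + target);
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> joinOrder = Arrays.asList("A", "B", "C");
        // initial_hosts contains only A
        Map<String, Set<String>> onlyA =
                cachesAfterJoin(joinOrder, new LinkedHashSet<>(Arrays.asList("A")));
        System.out.println(failedSends(onlyA)); // prints [B->C, C->B]
        // initial_hosts contains all members (resolution 1)
        Map<String, Set<String>> all =
                cachesAfterJoin(joinOrder, new LinkedHashSet<>(joinOrder));
        System.out.println(failedSends(all)); // prints []
    }
}
```

Under these assumptions, listing all members in {{initial_hosts}} removes both failed sends. The model does not cover {{send_cache_on_join}} or dynamic discovery.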
Failed to send broadcast when opening the connection
----------------------------------------------------
Key: JGRP-2162
URL: https://issues.jboss.org/browse/JGRP-2162
Project: JGroups
Issue Type: Bug
Reporter: Radim Vansa
Assignee: Bela Ban
Fix For: 4.0.3
Attachments: TcpNio2McastTest.java, infinispan_2.log.gz
IRC discussion:
{quote}
bela_: Hi Bela, I have a weird failure in one test that seems to be rooted in JGroups.
TCP_NIO2 is in charge, and there's a broadcast message to all nodes, but it seems
it's not received on the other side.
<bela_> rvansa: reproducible?
<rvansa> bela_: it happens when the connection to a node is just being opened: I
have added some trace logs and just a moment before writing to the NioConnection.send_buf
it was in state "connection pending"
<rvansa> bela_: sort of, after tens of runs of that test (on my machine) - and
I've seen it first time in CI, so it could be
<bela_> rvansa: NioConnection buffers writes up to a certain extent, then discards
anything over the buffer limit
<bela_> rvansa: max_send_buffers (default: 10). But retransmission should fix this,
unless you don’t wait long enough
<rvansa> bela_: I don't think it should go over the limit
<rvansa> bela_: the test is not doing anything else, just sending CommitCommand
(that should be couple hundred bytes at most) and then waiting
<rvansa> bela_: according to the traces I've added, Buffers.write returned
false when writing the local address, and then true when writing the actual message
{quote}
I have been trying to write a reproducer, and found that the problem is related to the
failing test using a custom (fake) discovery protocol that doesn't open the
connection during startup. In my ~reproducer I had to modify tcp-nio.xml to use TCPPING
with only the first node in the hosts list (localhost[7800]):
{code:xml}
<TCPPING async_discovery="true"
initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800]}"
port_range="0"/>
{code}
As a result, discovery does not open the physical connection. However, the
reproducer has an (always reproducible) flaw: it does not send the message to the third
node at all, and therefore the test fails.
Note that increasing the timeout in request options does not help.