[jboss-jira] [JBoss JIRA] (JGRP-2162) Failed to send broadcast when opening the connection

Wed May 10 02:22:00 EDT 2017

    [ https://issues.jboss.org/browse/JGRP-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403771#comment-13403771 ] 

Bela Ban edited comment on JGRP-2162 at 5/10/17 2:21 AM:
---------------------------------------------------------

This looks like 2 separate issues. Let me address the first one first.

When {{initial_hosts}} contains only A, the address caches look as follows after B and C have joined:
* A: ABC
* B: AB
* C: AC

When sending a multicast, A would succeed as it has all addresses of the other members, but B would fail sending the message to C and C would fail sending the message to B.

Also, NAKACK2 won't retransmit, as the receivers (B or C) never receive C's or B's message, so they won't ask the sender for retransmission.

There are 3 ways to resolve this:
1. Include all hosts (or as many as possible) in {{TCPPING.initial_hosts}}
2. Set {{TCPPING.send_cache_on_join}} to {{true}}
3. Use a dynamic discovery protocol
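
As a rough illustration, options 1 and 2 could be combined in the {{TCPPING}} element along these lines. This is only a sketch: the host names {{hostA}}, {{hostB}}, {{hostC}} and port 7800 are placeholders, not values from this issue, and the remaining attributes should be taken from the existing configuration.

{code:xml}
<!-- Sketch only: hostA/hostB/hostC and port 7800 are placeholder values. -->
<!-- Option 1: list all (or as many as possible) members in initial_hosts. -->
<!-- Option 2: send_cache_on_join propagates the discovery cache on join. -->
<TCPPING initial_hosts="${jgroups.tcpping.initial_hosts:hostA[7800],hostB[7800],hostC[7800]}"
         send_cache_on_join="true"
         port_range="0"/>
<!-- Option 3 would instead replace TCPPING with a dynamic discovery
     protocol (for example MPING, where IP multicast is available). -->
{code}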

Note that this is not an issue in {{UDP}}, as a (group) multicast results in a single IP multicast, whereas in {{TCP_NIO2}} we have to send the same message separately to each member, which requires that member's address to be in the cache.

> Failed to send broadcast when opening the connection
> ----------------------------------------------------
>
>                 Key: JGRP-2162
>                 URL: https://issues.jboss.org/browse/JGRP-2162
>             Project: JGroups
>          Issue Type: Bug
>            Reporter: Radim Vansa
>            Assignee: Bela Ban
>             Fix For: 4.0.3
>
>         Attachments: TcpNio2McastTest.java, infinispan_2.log.gz
>
>
> IRC discussion:
> {quote}
> <rvansa> bela_: Hi Bela, I have a weird failure in one test that seems to be rooted in JGroups. TCP_NIO2 is in charge, and there's a broadcast message to all nodes, but it seems it's not received on the other side.
> <bela_> rvansa: reproducible?
> <rvansa> bela_: it happens when the connection to a node is just being opened: I have added some trace logs and just a moment before writing to the NioConnection.send_buf it was in state "connection pending"
> <rvansa> bela_: sort of, after tens of runs of that test (on my machine) - and I've seen it first time in CI, so it could be
> <bela_> rvansa: NioConnection buffers writes up to a certain extent, then  discards anything over the buffer limit
> <bela_> rvansa: max_send_buffers (default: 10). But retransmission should fix this, unless you don’t wait long enough
> <rvansa> bela_: I don't think it should go over the limit
> <rvansa> bela_: the test is not doing anything else, just sending CommitCommand (that should be couple hundred bytes at most) and then waiting
> <rvansa> bela_: according to the traces I've added, Buffers.write returned false when writing the local address, and then true when writing the actual message
> {quote}
> I have been trying to write a reproducer, and found that the problem is related to the fact that the failing test uses a custom (fake) discovery protocol that doesn't open the connection during startup. In my ~reproducer I had to modify tcp-nio.xml to use TCPPING with only the first node in the hosts list (localhost[7800]):
> {code:xml}
> <TCPPING async_discovery="true" initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800]}" port_range="0"/>
> {code}
> As a result, the physical connection is not opened by discovery. However, the reproducer suffers from an (always reproducible) flaw: it does not send the message to the third node at all, and therefore the test fails.
> Note that increasing the timeout in request options does not help.

--
This message was sent by Atlassian JIRA
(v7.2.3#72005)