[
https://issues.jboss.org/browse/JGRP-1549?page=com.atlassian.jira.plugin....
]
Dan Berindei commented on JGRP-1549:
------------------------------------
I've written an Infinispan test that reproduces the problem pretty reliably (but with
a pretty invasive modification in ClusterTopologyManagerImpl, delaying the rebalance
confirmation). The sources are here:
https://github.com/danberindei/infinispan/tree/t_jgrp-1549_m
I tried to reproduce it with plain JGroups, but I wasn't successful. I think the
initial discovery phase changed the way connections were created (the Infinispan test suite
uses our custom TEST_PING protocol, so discovery doesn't create any connections).
I've also looked at the code in TCPConnectionMap and I think I see two problems:
1. After creating a connection, a node should check the just-created connection against
any existing connection in the map (only if it's open, obviously) and only replace it
if it satisfies the same check that's in ConnectionAcceptor.
2. On a send exception, the sender should only close the connection that it used to send
the message. The acceptor might have replaced the connection in the map while sending was
in progress.
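The two points above can be sketched roughly as follows. This is a minimal illustration, not the actual TCPConnectionMap code: SketchConnection, SketchConnectionMap and all method names are hypothetical, addresses are plain strings, and the tie-break (the connection initiated by the higher address wins) is an assumption based on the issue description; the real check in ConnectionAcceptor may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a TCP connection; not a JGroups class.
class SketchConnection {
    final String peer;
    private volatile boolean open = true;

    SketchConnection(String peer) { this.peer = peer; }
    boolean isOpen() { return open; }
    void close() { open = false; }
}

// Hypothetical connection table illustrating points 1 and 2.
class SketchConnectionMap {
    private final Map<String, SketchConnection> conns = new ConcurrentHashMap<>();
    private final String localAddress;

    SketchConnectionMap(String localAddress) { this.localAddress = localAddress; }

    // Simulates the acceptor storing a connection initiated by the peer.
    void acceptIncoming(String peer, SketchConnection accepted) {
        conns.put(peer, accepted);
    }

    // Point 1: after creating a connection, compare it with any open
    // connection already in the map and replace it only if the tie-break
    // favors us. Assumed rule: the higher address keeps its own connection.
    SketchConnection registerOutgoing(String peer, SketchConnection created) {
        return conns.compute(peer, (key, existing) -> {
            if (existing == null || !existing.isOpen())
                return created;                 // nothing to compete with
            if (localAddress.compareTo(peer) > 0) {
                existing.close();               // we win: replace the old one
                return created;
            }
            created.close();                    // peer wins: keep the existing one
            return existing;
        });
    }

    // Point 2: on a send failure, close only the connection that was used,
    // and remove the map entry only if it still points at that connection.
    void onSendFailure(String peer, SketchConnection used) {
        used.close();
        conns.remove(peer, used);               // no-op if the acceptor replaced it
    }

    SketchConnection get(String peer) { return conns.get(peer); }
}
```

The two-argument remove(key, value) is what keeps a failed sender from evicting a newer connection that the acceptor installed in the meantime.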
I think either one of these could explain why the initial message was dropped in the test.
I'm not sure why UNICAST2:STABLE doesn't kick in and force the re-transmission of
those messages for 15 seconds though...
TCP: handle concurrent connections more gracefully
--------------------------------------------------
Key: JGRP-1549
URL: https://issues.jboss.org/browse/JGRP-1549
Project: JGroups
Issue Type: Enhancement
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 3.3
Attachments: cft.log.gz
When A connects to B and B connects to A *concurrently*, and no existing connections are
present, then one member (with the higher address) will prevail, and the other one will
close its connection and drop the message.
This is not usually an issue, as higher-up layers will retransmit the message, thus
re-establishing the connection.
However, if we have a protocol based on negative acks, such as UNICAST2, the
retransmission might take a while if that message was the last one.
SOLUTION:
The end that closes the connection should simply resend the message *once*, thus
re-creating the connection.
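A rough sketch of the "resend once" idea, under stated assumptions: SketchTransport and ResendOnceSender are hypothetical names, not JGroups classes, and a closed connection is assumed to surface as an IOException from send(). The second attempt is expected to re-create the connection as a side effect of sending.

```java
import java.io.IOException;

// Hypothetical transport: send() is assumed to (re-)create the connection
// to dest if none exists, and to throw IOException if the connection was
// closed underneath the sender.
interface SketchTransport {
    void send(String dest, byte[] data) throws IOException;
}

// Hypothetical sender that retries a failed send exactly once.
class ResendOnceSender {
    private final SketchTransport transport;

    ResendOnceSender(SketchTransport transport) { this.transport = transport; }

    // Returns true if the message was handed to the transport, either on
    // the first attempt or on the single retry; false if both attempts failed.
    boolean send(String dest, byte[] data) {
        try {
            transport.send(dest, data);
            return true;
        } catch (IOException firstFailure) {
            try {
                transport.send(dest, data);     // one retry only, no loop
                return true;
            } catch (IOException secondFailure) {
                return false;                   // leave it to higher-level retransmission
            }
        }
    }
}
```

Limiting the retry to a single attempt avoids a resend loop if the peer is really gone; in that case retransmission by the higher layers still applies.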