Bela Ban commented on JGRP-2260:
--------------------------------
I created ForkChannelTest.testSimpleSend(). Take a look at it, especially around line
104ff. I observed that the retransmission to the non-existent B ceases after ~1 minute.
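Roughly, the scenario the test exercises looks like this (a simplified sketch, not the
actual test code; the config file name and class names are placeholders, and it assumes
the main stack contains a FORK protocol):
{code:java}
import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.fork.ForkChannel;
import org.jgroups.util.UUID;

public class ForkChannelRetransmitSketch {
    public static void main(String[] args) throws Exception {
        // Main channel; "udp-with-fork.xml" is a placeholder for a stack that includes FORK
        JChannel a = new JChannel("udp-with-fork.xml").name("A");
        a.connect("demo-cluster");

        // Fork channel layered on top of the main channel
        ForkChannel fork = new ForkChannel(a, "fork-stack", "fork-ch");
        fork.connect("ignored"); // the cluster name is ignored by ForkChannel

        // Address of a member ("B") that is not in the view
        Address b = UUID.randomUUID();
        fork.send(new Message(b, "hello"));

        // UNICAST3 in the main stack now retransmits to the non-existent B;
        // per the observation above, this should cease after roughly a minute
        Thread.sleep(90_000);

        fork.close();
        a.close();
    }
}
{code}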
UNICAST3 doesn't remove dead nodes from its tables
--------------------------------------------------
Key: JGRP-2260
URL: https://issues.jboss.org/browse/JGRP-2260
Project: JGroups
Issue Type: Bug
Affects Versions: 4.0.10
Environment: WildFly 12.0.0.Final
Reporter: Rich DiCroce
Assignee: Bela Ban
Scenario: 2 WildFly instances clustered together. A ForkChannel is defined, with a
MessageDispatcher on top. I start both nodes, then stop the second one. 6-7 minutes after
stopping the second node, I start getting log spam on the first node:
{quote}
12:47:04,519 WARN [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0),
site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0),
site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS
(flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
12:47:06,522 WARN [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0),
site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0),
site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS
(flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
12:47:08,524 WARN [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0),
site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0),
site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS
(flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
{quote}
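For context, the ForkChannel/MessageDispatcher wiring on each node looks roughly like
this (a simplified, standalone sketch with made-up names, not the actual application
code; in WildFly the "ee" channel is managed by the subsystem configured below):
{code:java}
import org.jgroups.JChannel;
import org.jgroups.blocks.MessageDispatcher;
import org.jgroups.blocks.RequestHandler;
import org.jgroups.fork.ForkChannel;

public class DispatcherSetupSketch {
    public static void main(String[] args) throws Exception {
        // Main channel; "ee.xml" stands in for the subsystem-managed "ee" channel/stack
        JChannel ee = new JChannel("ee.xml").name("RCD_GP");
        ee.connect("ee");

        // Fork channel corresponding to <fork name="shared-dispatcher"/> in the config
        ForkChannel fork = new ForkChannel(ee, "shared-dispatcher", "dispatcher-ch");
        fork.connect("ignored");

        // MessageDispatcher layered on the fork channel
        RequestHandler handler = msg -> "pong";   // trivial request handler
        MessageDispatcher dispatcher = new MessageDispatcher(fork, handler);

        // ... the application sends requests through the dispatcher; after the second
        // node is stopped, UNICAST3 on this node keeps retransmitting to it
    }
}
{code}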
After some debugging, I discovered the reason: UNICAST3 is still trying to retransmit to
the dead node. Its send_table still contains an entry for the dead node with state OPEN.
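The stale entry can be seen by dumping UNICAST3's connections, e.g. programmatically (a
small sketch; the same information is exposed as a JMX operation):
{code:java}
import org.jgroups.JChannel;
import org.jgroups.protocols.UNICAST3;

public class DumpConnections {
    // Prints UNICAST3's connection tables for the given (already connected) channel
    static void dump(JChannel channel) {
        UNICAST3 unicast = channel.getProtocolStack().findProtocol(UNICAST3.class);
        // Shows entries such as the dead node still listed with state OPEN
        System.out.println(unicast.printConnections());
    }
}
{code}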
After looking at the source code for UNICAST3, I have a theory about what's
happening.
* When a node leaves the cluster, down(Event) gets invoked with a view change, which
calls closeConnection(Address) for each node that left. That sets the connection state to
CLOSING.
* Suppose that immediately after the view change is handled, a message with the dead node
as its destination gets passed to down(Message). That invokes getSenderEntry(Address),
which finds the connection... and sets the state back to OPEN.
Consequently, the connection is never closed or removed from the table, so retransmission
attempts continue forever even though they will never succeed (see the simplified sketch
of this interleaving below).
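A simplified illustration of the suspected interleaving (this is not the actual UNICAST3
source; Address is replaced by String and everything else is stripped away):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the suspected race, not the real UNICAST3 code
public class SendTableSketch {
    enum State { OPEN, CLOSING, CLOSED }

    static class SenderEntry {
        volatile State state = State.OPEN;
    }

    final Map<String, SenderEntry> send_table = new ConcurrentHashMap<>();

    // View-change path: down(Event) -> closeConnection() for each member that left
    void closeConnection(String member) {
        SenderEntry entry = send_table.get(member);
        if (entry != null)
            entry.state = State.CLOSING;      // marked, to be reaped later
    }

    // Message path: down(Message) -> getSenderEntry() for the destination
    SenderEntry getSenderEntry(String dest) {
        SenderEntry entry = send_table.computeIfAbsent(dest, k -> new SenderEntry());
        entry.state = State.OPEN;             // suspected bug: a CLOSING entry is reopened
        return entry;
    }
}
{code}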
This issue is easily reproducible for me, although unfortunately I can't give you the
application in question. But if you have fixes you want to try, I'm happy to drop in a
patched JAR and see if the issue still happens.
This is my JGroups subsystem configuration:
{code:xml}
<subsystem xmlns="urn:jboss:domain:jgroups:6.0">
    <channels default="ee">
        <channel name="ee" stack="main">
            <fork name="shared-dispatcher"/>
            <fork name="group-topology"/>
        </channel>
    </channels>
    <stacks>
        <stack name="main">
            <transport type="UDP" socket-binding="jgroups" site="${gp.site:DEFAULT}"/>
            <protocol type="PING"/>
            <protocol type="MERGE3">
                <property name="min_interval">1000</property>
                <property name="max_interval">5000</property>
            </protocol>
            <protocol type="FD_SOCK"/>
            <protocol type="FD_ALL2">
                <property name="interval">3000</property>
                <property name="timeout">8000</property>
            </protocol>
            <protocol type="VERIFY_SUSPECT"/>
            <protocol type="pbcast.NAKACK2"/>
            <protocol type="UNICAST3"/>
            <protocol type="pbcast.STABLE"/>
            <protocol type="pbcast.GMS">
                <property name="join_timeout">100</property>
            </protocol>
            <protocol type="UFC"/>
            <protocol type="MFC"/>
            <protocol type="FRAG3"/>
        </stack>
    </stacks>
</subsystem>
{code}