[jboss-jira] [JBoss JIRA] (JGRP-2260) UNICAST3 doesn't remove dead nodes from its tables

Wed Apr 4 09:41:01 EDT 2018

    [ https://issues.jboss.org/browse/JGRP-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555928#comment-13555928 ] 

Rich DiCroce commented on JGRP-2260:
------------------------------------

Created ISPN-9038 and WFLY-10171. Will close this issue.

> UNICAST3 doesn't remove dead nodes from its tables
> --------------------------------------------------
>
>                 Key: JGRP-2260
>                 URL: https://issues.jboss.org/browse/JGRP-2260
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 4.0.10
>         Environment: WildFly 12.0.0.Final
>            Reporter: Rich DiCroce
>            Assignee: Bela Ban
>
> Scenario: 2 WildFly instances clustered together. A ForkChannel is defined, with a MessageDispatcher on top. I start both nodes, then stop the second one. 6-7 minutes after stopping the second node, I start getting log spam on the first node:
> {quote}
> 12:47:04,519 WARN  [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
> 12:47:06,522 WARN  [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
> 12:47:08,524 WARN  [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
> {quote}
> After some debugging, I discovered that UNICAST3 is still trying to retransmit to the dead node: its send_table still contains an entry for the dead node with state OPEN.
> After looking at the source code for UNICAST3, I have a theory about what's happening.
> * When a node leaves the cluster, down(Event) gets invoked with a view change, which calls closeConnection(Address) for each node that left. That sets the connection state to CLOSING.
> * Suppose that immediately after the view change is handled, a message with the dead node as its destination gets passed to down(Message). That invokes getSenderEntry(Address), which finds the connection... and sets the state back to OPEN.
> Consequently, the connection is never closed or removed from the table, so retransmit attempts continue forever even though they will never succeed.
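
The sequence in the theory above can be sketched as a small stand-alone model. This is not the UNICAST3 source: the class `SendTableRace`, the `Entry` type, and the `reapClosing()` helper are hypothetical names used only for illustration, and the assumption that a separate cleanup step removes only entries still in state CLOSING is an inference from the report, not something it confirms. Only `closeConnection`, `getSenderEntry`, `send_table`, and the states OPEN and CLOSING come from the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SendTableRace {
    public enum State { OPEN, CLOSING }

    // Hypothetical stand-in for one send_table entry.
    public static final class Entry {
        public volatile State state = State.OPEN;
    }

    // Hypothetical stand-in for send_table, keyed by member name.
    public static final Map<String, Entry> sendTable = new ConcurrentHashMap<>();

    // View-change path: mark the connection to a departed member as CLOSING.
    public static void closeConnection(String member) {
        Entry e = sendTable.get(member);
        if (e != null) {
            e.state = State.CLOSING;
        }
    }

    // Send path as described in the report: the lookup flips CLOSING back to OPEN.
    public static Entry getSenderEntry(String member) {
        Entry e = sendTable.computeIfAbsent(member, k -> new Entry());
        if (e.state == State.CLOSING) {
            e.state = State.OPEN;
        }
        return e;
    }

    // Assumed cleanup step: removes only entries that are still CLOSING.
    // Returns the number of entries removed.
    public static int reapClosing() {
        int before = sendTable.size();
        sendTable.values().removeIf(e -> e.state == State.CLOSING);
        return before - sendTable.size();
    }

    public static void main(String[] args) {
        String dead = "RCD_NMS";
        getSenderEntry(dead);   // connection created while the node is alive
        closeConnection(dead);  // view change: the node has left
        System.out.println("after view change: " + sendTable.get(dead).state);
        getSenderEntry(dead);   // late message addressed to the departed node
        System.out.println("after late send: " + sendTable.get(dead).state);
        int removed = reapClosing();
        System.out.println("reaped=" + removed + " stillPresent=" + sendTable.containsKey(dead));
    }
}
```

In this model the late send leaves the entry in state OPEN, so the cleanup step skips it and the entry stays in the table, which matches the retransmissions seen in the log above.
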
> This issue is easily reproducible for me, although unfortunately I can't give you the application in question. But if you have fixes you want to try, I'm happy to drop in a patched JAR and see if the issue still happens.
> This is my JGroups subsystem configuration:
> {code:xml}
>         <subsystem xmlns="urn:jboss:domain:jgroups:6.0">
>             <channels default="ee">
>                 <channel name="ee" stack="main">
>                     <fork name="shared-dispatcher"/>
>                     <fork name="group-topology"/>
>                 </channel>
>             </channels>
>             <stacks>
>                 <stack name="main">
>                     <transport type="UDP" socket-binding="jgroups" site="${gp.site:DEFAULT}"/>
>                     <protocol type="PING"/>
>                     <protocol type="MERGE3">
>                         <property name="min_interval">
>                             1000
>                         </property>
>                         <property name="max_interval">
>                             5000
>                         </property>
>                     </protocol>
>                     <protocol type="FD_SOCK"/>
>                     <protocol type="FD_ALL2">
>                         <property name="interval">
>                             3000
>                         </property>
>                         <property name="timeout">
>                             8000
>                         </property>
>                     </protocol>
>                     <protocol type="VERIFY_SUSPECT"/>
>                     <protocol type="pbcast.NAKACK2"/>
>                     <protocol type="UNICAST3"/>
>                     <protocol type="pbcast.STABLE"/>
>                     <protocol type="pbcast.GMS">
>                         <property name="join_timeout">
>                             100
>                         </property>
>                     </protocol>
>                     <protocol type="UFC"/>
>                     <protocol type="MFC"/>
>                     <protocol type="FRAG3"/>
>                 </stack>
>             </stacks>
>         </subsystem>
> {code}
