[jboss-jira] [JBoss JIRA] (JGRP-2260) UNICAST3 doesn't remove dead nodes from its tables
Bela Ban (JIRA)
issues at jboss.org
Wed Apr 4 06:25:29 EDT 2018
[ https://issues.jboss.org/browse/JGRP-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555716#comment-13555716 ]
Bela Ban commented on JGRP-2260:
--------------------------------
Can you close this issue?
> UNICAST3 doesn't remove dead nodes from its tables
> --------------------------------------------------
>
> Key: JGRP-2260
> URL: https://issues.jboss.org/browse/JGRP-2260
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 4.0.10
> Environment: WildFly 12.0.0.Final
> Reporter: Rich DiCroce
> Assignee: Bela Ban
>
> Scenario: 2 WildFly instances clustered together. A ForkChannel is defined, with a MessageDispatcher on top. I start both nodes, then stop the second one. 6-7 minutes after stopping the second node, I start getting log spam on the first node:
> {quote}
> 12:47:04,519 WARN [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
> 12:47:06,522 WARN [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
> 12:47:08,524 WARN [org.jgroups.protocols.UDP] (TQ-Bundler-4,ee,RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null)) JGRP000032: RCD_GP (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null): no physical address for RCD_NMS (flags=0), site-id=DEFAULT, rack-id=null, machine-id=null), dropping message
> {quote}
> After some debugging, I discovered that the reason is because UNICAST3 is still trying to retransmit to the dead node. Its send_table still contains an entry for the dead node with state OPEN.
> After looking at the source code for UNICAST3, I have a theory about what's happening.
> * When a node leaves the cluster, down(Event) gets invoked with a view change, which calls closeConnection(Address) for each node that left. That sets the connection state to CLOSING.
> * Suppose that immediately after the view change is handled, a message with the dead node as its destination gets passed to down(Message). That invokes getSenderEntry(Address), which finds the connection... and sets the state back to OPEN.
> Consequently, the connection is never closed or removed from the table, so retransmit attempts continue forever even though they will never succeed.
> This issue is easily reproducible for me, although unfortunately I can't give you the application in question. But if you have fixes you want to try, I'm happy to drop in a patched JAR and see if the issue still happens.
> This is my JGroups subsystem configuration:
> {code:xml}
> <subsystem xmlns="urn:jboss:domain:jgroups:6.0">
> <channels default="ee">
> <channel name="ee" stack="main">
> <fork name="shared-dispatcher"/>
> <fork name="group-topology"/>
> </channel>
> </channels>
> <stacks>
> <stack name="main">
> <transport type="UDP" socket-binding="jgroups" site="${gp.site:DEFAULT}"/>
> <protocol type="PING"/>
> <protocol type="MERGE3">
> <property name="min_interval">
> 1000
> </property>
> <property name="max_interval">
> 5000
> </property>
> </protocol>
> <protocol type="FD_SOCK"/>
> <protocol type="FD_ALL2">
> <property name="interval">
> 3000
> </property>
> <property name="timeout">
> 8000
> </property>
> </protocol>
> <protocol type="VERIFY_SUSPECT"/>
> <protocol type="pbcast.NAKACK2"/>
> <protocol type="UNICAST3"/>
> <protocol type="pbcast.STABLE"/>
> <protocol type="pbcast.GMS">
> <property name="join_timeout">
> 100
> </property>
> </protocol>
> <protocol type="UFC"/>
> <protocol type="MFC"/>
> <protocol type="FRAG3"/>
> </stack>
> </stacks>
> </subsystem>
> {code}
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
More information about the jboss-jira
mailing list