[jboss-jira] [JBoss JIRA] (JGRP-1326) Gossip Router dropping message for node that is in its routing table list

Tuesday, 9 October 2012

    [ https://issues.jboss.org/browse/JGRP-1326?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1326:
--------------------------------

So the basic problem is what happens when we have an old (Entry.removable=true) entry E in
the logical_addr_cache of TP. This can happen in the following scenario:

* We have an entry for B (B is the logical address) in the logical_addr_cache
* Now B leaves the cluster and the next view excludes it.
** This does *not* remove the entry for B, but marks it as 'removable'
* Say member B is started again
* HOWEVER, we don't get the new physical address for B, so the entry for B in
logical_addr_cache is still the previous (stale) one
* Now a unicast to B will look up the old (removable) entry

SOLUTION:
The simplest solution is probably to do a lookup (discovery) when sending a unicast to an
entry which is removable. Currently we only do this when the entry for a given logical
address is absent (null). The change would be to send the message when entry.removable is
true, but after that, to trigger a discovery request, so that when the discovery response
is received, the old and stale entry is replaced with the new information.
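
The scenario and the proposed change can be sketched as follows. This is a minimal, hypothetical model and not the JGroups TP code: the class StaleEntrySketch, its methods and the string addresses are assumptions made for illustration. The sketch also assumes that a unicast to an absent entry is simply not sent until discovery completes, which the comment above does not specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of the cache described above; not the actual TP implementation. */
class StaleEntrySketch {

    /** One cache entry: a physical address plus the 'removable' mark. */
    static final class Entry {
        final String physicalAddr;
        boolean removable;
        Entry(String physicalAddr) { this.physicalAddr = physicalAddr; }
    }

    final Map<String, Entry> cache = new HashMap<>();          // logical address -> entry
    final List<String> sent = new ArrayList<>();               // "logical@physical" per unicast sent
    final List<String> discoveryRequests = new ArrayList<>();  // logical addresses we asked about

    void add(String logical, String physical) { cache.put(logical, new Entry(physical)); }

    /** A view that excludes 'logical' only marks the entry as removable; it is not removed. */
    void memberLeft(String logical) {
        Entry e = cache.get(logical);
        if (e != null) e.removable = true;
    }

    /** Proposed behaviour: trigger discovery when the entry is absent OR removable. */
    void sendUnicast(String logical) {
        Entry e = cache.get(logical);
        if (e == null) {                          // existing case: nothing known, so discover
            discoveryRequests.add(logical);       // (assumption: the message is not sent yet)
            return;
        }
        sent.add(logical + "@" + e.physicalAddr); // send with the address we have
        if (e.removable)                          // proposed change: the entry may be stale
            discoveryRequests.add(logical);
    }

    /** A discovery response replaces the old entry with a fresh, non-removable one. */
    void onDiscoveryResponse(String logical, String physical) { add(logical, physical); }

    public static void main(String[] args) {
        StaleEntrySketch tp = new StaleEntrySketch();
        tp.add("B", "10.0.0.2:7800");
        tp.memberLeft("B");                       // B leaves, entry is kept but marked removable
        tp.sendUnicast("B");                      // B restarted elsewhere; we still use the stale entry
        tp.onDiscoveryResponse("B", "10.0.0.9:7800");
        tp.sendUnicast("B");                      // now uses the refreshed entry
        System.out.println(tp.sent + " " + tp.discoveryRequests);
    }
}
```

In this model the first unicast after B's restart still goes to the stale address, but it also triggers a discovery request, so later unicasts use the refreshed entry.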

...
 Gossip Router dropping message for node that is in its routing table list
 -------------------------------------------------------------------------

                 Key: JGRP-1326
                 URL: https://issues.jboss.org/browse/JGRP-1326
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 2.10
         Environment: Linux, Windows
            Reporter: vivek v
            Assignee: Vladimir Blagojevic
             Fix For: 3.3

         Attachments: pktmtntunnel.xml

 We are using the Tunnel protocol with two Gossip Routers. For some reason we started seeing
lots of suspect messages in all the nodes (there are 7 nodes in the group). Six of the
nodes (including the coordinator) were suspecting node A (manager_172.27.75.11), and node A
was suspecting the coordinator, but no new view was being created. After turning on
trace logging on both gossip routers (GR1 and GR2), I see the following for every message
that's sent to node A (manager_172.27.75.11):
 {noformat}
    2011-05-20 15:56:21,186 TRACE [gossip-handlers-6] GossipRouter - cannot find manager_172.27.75.11:4576 in the routing table,
 routing table=
 172.27.75.11_group: probe_172.27.75.13:4576, collector_172.27.75.12:4576, probe_172.27.75.15:4576, manager_172.27.75.11:4576, probe_172.27.75.16:4576, probe_172.27.75.14:4576
 {noformat}
 Now, the issue is that the routing table does indeed show an entry for
"manager_172.27.75.11", so why is the GR dropping messages for that node? I
suspect that the Gossip Router has somehow kept an old entry which has not been cleaned
up: a different UUID with the same logical address. I went through the GossipRouter.java
code, but couldn't find how this would be possible.
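
 To illustrate the suspicion above (it is only a hypothesis, not confirmed behaviour of GossipRouter.java), here is a minimal sketch of a table that is keyed by UUID but printed by logical name. The class RoutingTableSketch and its methods are hypothetical and are not taken from the GossipRouter code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical model: lookups use the UUID, while the trace output shows only logical names. */
class RoutingTableSketch {
    final Map<UUID, String> table = new LinkedHashMap<>(); // UUID -> logical name

    void register(UUID id, String logicalName) { table.put(id, logicalName); }

    /** Routing succeeds only if this exact UUID is registered. */
    boolean canRoute(UUID dest) { return table.containsKey(dest); }

    /** What a trace line would print: logical names only, so UUIDs are indistinguishable. */
    String dump() { return String.join(", ", table.values()); }

    public static void main(String[] args) {
        RoutingTableSketch gr = new RoutingTableSketch();
        UUID oldId = UUID.randomUUID();  // UUID of the member before it restarted
        UUID newId = UUID.randomUUID();  // UUID of the restarted member, never registered here
        gr.register(oldId, "manager_172.27.75.11:4576");
        System.out.println("can route to new UUID: " + gr.canRoute(newId));
        System.out.println("routing table=" + gr.dump());
    }
}
```

 In such a model the router would report that it cannot find the destination even though the printed table lists the same logical name, which matches the trace above.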
 As I understand it, a node randomly chooses one GR for its communication if there are
multiple of them. Each GR would keep a separate list of physical addresses for each node, so
is it possible that it somehow uses the physical address instead of the UUID for
cleaning/retrieving the node list?
 This seems to be creating a big issue, and the only workaround is to restart the Gossip
Routers.
