[jboss-jira] [JBoss JIRA] (JGRP-1326) Gossip Router dropping message for node that is in its routing table list
Bela Ban (JIRA)
jira-events at lists.jboss.org
Tue Oct 9 10:27:03 EDT 2012
[ https://issues.jboss.org/browse/JGRP-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725012#comment-12725012 ]
Bela Ban commented on JGRP-1326:
--------------------------------
So the basic problem is: what happens when we have an old entry E (Entry.removable=true) in the logical_addr_cache of TP? This can happen in the following scenario:
* We have an entry for B (B is the logical address) in the logical_addr_cache
* Now B leaves the cluster and the next view excludes it.
** This does *not* remove the entry for B, but marks it as 'removable'
* Say member B is started again
* HOWEVER, we don't get the physical address for B, so the entry for B in logical_addr_cache is still the previous (stale) one
* Now a unicast to B will look up the old (removable) entry
SOLUTION:
The simplest solution is probably to do a lookup (discovery) when sending a unicast to an entry which is removable. Currently we only do this when the entry for a given logical address is absent (null). The change would be to still send the message when entry.removable is true, but then trigger a discovery request, so that when the discovery response is received, the stale entry is replaced with the new information.
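The proposed change can be sketched roughly as follows. This is a minimal illustration with hypothetical, simplified types (the class name, the String-keyed cache and the triggerDiscovery flag are assumptions for the sketch; the real logic lives in TP and its logical_addr_cache): on a cache miss we discover and cannot send, while for a removable entry we still send to the possibly stale address but kick off a discovery so the entry gets refreshed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified sketch of the proposed lookup-on-removable logic.
public class StaleEntryLookup {
    static class Entry {
        final String physicalAddr;
        volatile boolean removable; // set when a view excludes the member
        Entry(String physicalAddr) { this.physicalAddr = physicalAddr; }
    }

    final Map<String, Entry> logicalAddrCache = new ConcurrentHashMap<>();
    boolean discoveryTriggered; // for illustration only

    /** Returns the physical address to send to, or null if unknown. */
    String resolve(String logicalAddr) {
        Entry e = logicalAddrCache.get(logicalAddr);
        if (e == null) {               // current behavior: discover only on a miss
            triggerDiscovery(logicalAddr);
            return null;
        }
        if (e.removable)               // proposed: still send to the (possibly
            triggerDiscovery(logicalAddr); // stale) address, but refresh it async
        return e.physicalAddr;
    }

    void triggerDiscovery(String logicalAddr) {
        // In JGroups this would send a discovery request; the response
        // would replace the stale cache entry with the fresh physical address.
        discoveryTriggered = true;
    }
}
```

The point of the sketch is that sending and refreshing are decoupled: the message is not held back waiting for discovery, so at worst one message goes to a stale address while the cache is being corrected.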
> Gossip Router dropping message for node that is in its routing table list
> -------------------------------------------------------------------------
>
> Key: JGRP-1326
> URL: https://issues.jboss.org/browse/JGRP-1326
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.10
> Environment: Linux, Windows
> Reporter: vivek v
> Assignee: Vladimir Blagojevic
> Fix For: 3.3
>
> Attachments: pktmtntunnel.xml
>
>
> We are using the TUNNEL protocol with two Gossip Routers. For some reason we start seeing lots of suspect messages in all the nodes - there are 7 nodes in the group. Six of the nodes (including the coordinator) were suspecting node A (manager_172.27.75.11) and node A was suspecting the coordinator, but no new view was being created. After turning on trace logging on both Gossip Routers (GR1 and GR2), I see the following for every message sent to node A (manager_172.27.75.11):
> {noformat}
> 2011-05-20 15:56:21,186 TRACE [gossip-handlers-6] GossipRouter - cannot find manager_172.27.75.11:4576 in the routing table,
> routing table=
> 172.27.75.11_group: probe_172.27.75.13:4576, collector_172.27.75.12:4576, probe_172.27.75.15:4576, manager_172.27.75.11:4576, probe_172.27.75.16:4576, probe_172.27.75.14:4576
> {noformat}
> Now, the issue is that the routing table does indeed show that there is a "manager_172.27.75.11" entry - so why is the GR dropping messages for that node? I suspect that somehow the Gossip Router has an old entry which has not been cleaned up - a different UUID with the same logical address. I tried going through the GossipRouter.java code, but couldn't find how this could be possible.
> As I understand it, a node randomly chooses a GR for its communication if there are multiple of them. Each GR keeps a separate list of physical addresses for each node - so is it possible that it somehow uses the physical address instead of the UUID when cleaning/retrieving the node list?
> This seems to be creating a big issue, and the only workaround is to restart the Gossip Routers.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira