[jboss-jira] [JBoss JIRA] Commented: (JGRP-1326) Gossip Router dropping message for node that is in its routing table list

vivek v (JIRA) jira-events at lists.jboss.org
Wed May 25 17:15:01 EDT 2011


    [ https://issues.jboss.org/browse/JGRP-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604463#comment-12604463 ] 

vivek v commented on JGRP-1326:
-------------------------------

We saw this happen at one more customer site (running 6 nodes). It looks like the view's UUIDs and the Gossip Router's routing-table UUIDs are not in sync. I suspect there is an old cached UUID in the membership list (in the view) while the GR holds the latest one - it could also be the other way round, but the GR seems to get each node's UUID at connect time (and it gets overwritten with the new UUID on every reconnect), whereas the view can carry a UUID cached by the coordinator (received from PING) - and if for some reason the old UUID is not removed, it can linger. This is just a wild guess, but it certainly looks like the GR and the group membership hold different UUIDs for the same logical name.
{noformat}
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - ConnectionHandler[peer: /10.0.38.148, logical_addrs: probe_10.0.38.148:4576] received MESSAGE(group=10.0.19.249_group, addr=probe_10.0.38.148:4576, buffer: 90 bytes)
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - cannot find probe_10.0.38.148:4576 in the routing table,
routing table=
10.0.19.249_group: manager_10.0.19.249:4576, probe_10.0.38.182:4576, collector_10.0.19.127:4576, probe_10.0.41.103:4576, probe_10.0.38.33:4576, probe_10.0.38.81:4576, probe_10.0.38.148:4576

{noformat}
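To illustrate the guess above, here is a minimal, hypothetical sketch (plain Java, not the actual GossipRouter code) of how a UUID-keyed routing table could log "cannot find X" even though X's logical name shows up in the table dump:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class StaleUuidDemo {
    public static void main(String[] args) {
        // Hypothetical stand-in for one group's routing table in the GR:
        // keyed by the member's UUID; the logical name is only used for logging.
        Map<UUID, String> routingTable = new HashMap<>();

        UUID oldUuid = UUID.randomUUID(); // UUID peers cached before the reconnect
        UUID newUuid = UUID.randomUUID(); // UUID registered at the latest connect

        // The reconnect overwrote the entry: new UUID, same logical name.
        routingTable.put(newUuid, "probe_10.0.38.148:4576");

        // A peer still addresses the node by the old, cached UUID. The lookup
        // is by UUID, so it misses, even though the logical name the GR prints
        // for the table makes the node look present.
        String dest = routingTable.get(oldUuid);
        System.out.println(dest == null
                ? "cannot find probe_10.0.38.148:4576 in the routing table"
                : "routing to " + dest);
        // prints: cannot find probe_10.0.38.148:4576 in the routing table
    }
}
```

If something like this is going on, the table dump would look correct by logical name while the UUID lookup still fails.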

In a related case at a different site, we also sometimes see the following log on the GR when there are quite a few SUSPECT messages going around. This one is slightly different: the destination appears only as the raw UUID hex address (no logical name), and that UUID is not in the routing table. Again, the view and the GR list are out of sync.

{noformat}
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - ConnectionHandler[peer: /10.0.38.148, logical_addrs: probe_10.0.38.148:4576] received MESSAGE(group=10.0.19.249_group, addr=20ce7385-7330-10de-7dd8-d6ec8ac774d8, buffer: 90 bytes)
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - cannot find 20ce7385-7330-10de-7dd8-d6ec8ac774d8 in the routing table,
routing table=
10.0.19.249_group: manager_10.0.19.249:4576, probe_10.0.38.182:4576, collector_10.0.19.127:4576, probe_10.0.41.103:4576, probe_10.0.38.33:4576, probe_10.0.38.81:4576, probe_10.0.38.148:4576   
{noformat}

Note, we are using TUNNEL with PING. Would using TCPGOSSIP with TUNNEL ensure that the two lists (in the GR and in the membership) stay in sync? I doubt it, but it's just a thought. I'm also not sure whether having two GRs has anything to do with this. I'm still trying to figure out the different scenarios in which the GR list can get out of sync with the membership list on the nodes.
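For context, the configuration change I'm wondering about would look roughly like the fragment below (hostnames and ports are placeholders, property names as I understand the 2.x docs - not tested):

```xml
<!-- TUNNEL routes all traffic through the Gossip Routers -->
<TUNNEL gossip_router_hosts="gr1[12001],gr2[12001]" />
<!-- TCPGOSSIP asks the same GRs for the initial membership, instead of PING -->
<TCPGOSSIP initial_hosts="gr1[12001],gr2[12001]"
           timeout="3000"
           num_initial_members="3" />
```

The idea would be that discovery and routing then consult the same GR state, though I'm not sure that actually prevents the stale-UUID situation.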

> Gossip Router dropping message for node that is in its routing table list
> -------------------------------------------------------------------------
>
>                 Key: JGRP-1326
>                 URL: https://issues.jboss.org/browse/JGRP-1326
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.10
>         Environment: Linux, Windows
>            Reporter: vivek v
>            Assignee: Bela Ban
>             Fix For: 2.12.2, 3.1
>
>         Attachments: pktmtntunnel.xml
>
>
> We are using the Tunnel protocol with two Gossip Routers. For some reason we start seeing lots of suspect messages on all the nodes - there are 7 nodes in the group. Six of the nodes (including the coordinator) were suspecting node A (manager_172.27.75.11), and node A was suspecting the coordinator, but no new view was being created. After turning on trace logging on both gossip routers (GR1 and GR2), I see the following for every message that's sent to node A (manager_172.27.75.11):
> {noformat}
>    2011-05-20 15:56:21,186 TRACE [gossip-handlers-6] GossipRouter - cannot find manager_172.27.75.11:4576 in the routing table,
> routing table=
> 172.27.75.11_group: probe_172.27.75.13:4576, collector_172.27.75.12:4576, probe_172.27.75.15:4576, manager_172.27.75.11:4576, probe_172.27.75.16:4576, probe_172.27.75.14:4576    
> {noformat}
> Now, the issue is that the routing table does indeed show "manager_172.27.75.11" - so why is the GR dropping messages for that node? I suspect that the Gossip Router somehow holds an old entry that has not been cleaned up - a different UUID with the same logical address. I tried going through the GossipRouter.java code, but couldn't see how this would be possible.
> As I understand it, a node randomly chooses a GR for its communication if there are multiple of them. Each GR would keep a separate list of physical addresses for each node - so is it possible that it somehow uses the physical address instead of the UUID when cleaning or retrieving the node list?
> This is creating a big issue, and the only workaround is to restart the Gossip Routers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

