Vladimir Blagojevic reassigned JGRP-1326:
-----------------------------------------
Assignee: Vladimir Blagojevic (was: Bela Ban)
Gossip Router dropping message for node that is in its routing table list
-------------------------------------------------------------------------
Key: JGRP-1326
URL: https://issues.jboss.org/browse/JGRP-1326
Project: JGroups
Issue Type: Bug
Affects Versions: 2.10
Environment: Linux, Windows
Reporter: vivek v
Assignee: Vladimir Blagojevic
Fix For: 2.12.3, 3.1
Attachments: pktmtntunnel.xml
We are using the Tunnel protocol with two Gossip Routers. For some reason we started seeing
lots of suspect messages on all the nodes; there are 7 nodes in the group. Six of the
nodes (including the coordinator) were suspecting node A (manager_172.27.75.11), and node A
was suspecting the coordinator, but no new view was being created. After turning on
trace logging on both gossip routers (GR1 and GR2), I see the following for every message
that's sent to node A (manager_172.27.75.11):
{noformat}
2011-05-20 15:56:21,186 TRACE [gossip-handlers-6] GossipRouter - cannot find
manager_172.27.75.11:4576 in the routing table,
routing table=
172.27.75.11_group: probe_172.27.75.13:4576, collector_172.27.75.12:4576,
probe_172.27.75.15:4576, manager_172.27.75.11:4576, probe_172.27.75.16:4576,
probe_172.27.75.14:4576
{noformat}
Now, the issue is that the routing table does indeed show an entry for
"manager_172.27.75.11", so why is the GR dropping messages for that node? I
suspect that the Gossip Router has somehow kept an old entry that has not been cleaned
up: a different UUID with the same logical address. I went through the GossipRouter.java
code, but couldn't find how this would be possible.
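To illustrate the suspicion, here is a minimal sketch of how a table keyed by UUID could print a matching logical name and still fail the lookup. This is not the GossipRouter code: the Addr class, the canRoute helper, and the connection strings are hypothetical, and the sketch assumes that address equality is based on the UUID only while the logical name is used just for printing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class StaleEntrySketch {

    /** Hypothetical address: identity is the UUID, the logical name is display-only. */
    static final class Addr {
        final UUID id;
        final String logicalName;

        Addr(UUID id, String logicalName) {
            this.id = id;
            this.logicalName = logicalName;
        }

        @Override
        public boolean equals(Object o) {
            // Assumption: only the UUID takes part in equality.
            return o instanceof Addr && ((Addr) o).id.equals(id);
        }

        @Override
        public int hashCode() {
            return id.hashCode();
        }

        @Override
        public String toString() {
            // Only the logical name is printed, so two UUIDs can look identical in a log.
            return logicalName;
        }
    }

    /** Returns true if the table holds an entry whose UUID matches dest. */
    static boolean canRoute(Map<Addr, String> table, Addr dest) {
        return table.containsKey(dest);
    }

    public static void main(String[] args) {
        Map<Addr, String> table = new HashMap<>();

        // Entry left over from an earlier incarnation of the node (never removed).
        Addr stale = new Addr(UUID.randomUUID(), "manager_172.27.75.11:4576");
        table.put(stale, "old-connection");

        // Same logical name after a reconnect, but a new UUID.
        Addr current = new Addr(UUID.randomUUID(), "manager_172.27.75.11:4576");

        System.out.println("routing table=" + table.keySet());
        System.out.println("can route to current node: " + canRoute(table, current));
    }
}
```

In this sketch the printed table contains manager_172.27.75.11:4576, yet the lookup for the current address fails because its UUID differs from the stale one. That would match the trace above, if the router really does behave this way.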
As I understand it, when there are multiple GRs, a node randomly chooses one of them for
its communication. Each GR would keep a separate list of physical addresses for each node,
so is it possible that it somehow uses the physical address instead of the UUID for
cleaning up or retrieving the node list?
This is creating a big issue for us, and the only workaround is to restart the Gossip
Routers.