[ https://issues.jboss.org/browse/JGRP-1326?page=com.atlassian.jira.plugin.... ]
vivek v commented on JGRP-1326:
-------------------------------
We saw this happening at one more customer site (running 6 nodes). It looks like the view
UUIDs and the Gossip Router routing-table UUID list are not in sync. I suspect there is an
old cached UUID in the membership list (in the view), whereas the GR contains the latest
UUID - it could be vice versa, but the GR seems to get its UUIDs at connect time (an entry
gets overwritten with the new UUID every time a node connects), while the view can be
cached by the coordinator (received from PING) - and if for some reason the old UUID is
not removed, it could linger. This is just a wild guess, but it sure looks like there are
different UUIDs for the same logical name in the GR and in the group membership.
{noformat}
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - ConnectionHandler[peer:
/10.0.38.148, logical_addrs: probe_10.0.38.148:4576] received
MESSAGE(group=10.0.19.249_group, addr=probe_10.0.38.148:4576, buffer: 90 bytes)
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - cannot find
probe_10.0.38.148:4576 in the routing table,
routing table=
10.0.19.249_group: manager_10.0.19.249:4576, probe_10.0.38.182:4576,
collector_10.0.19.127:4576, probe_10.0.41.103:4576, probe_10.0.38.33:4576,
probe_10.0.38.81:4576, probe_10.0.38.148:4576
{noformat}
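To make the hypothesis above concrete, here is a minimal sketch of the suspected failure mode. All names here are illustrative, not the actual JGroups API: the map stands in for a GR routing table for one group, keyed by UUID but printed by logical name, so a stale UUID in a cached view misses the lookup even though the logical name still appears in the table dump.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative sketch only (not GossipRouter.java): routing table keyed by
// UUID, while log dumps print just the logical names.
public class StaleUuidDemo {
    static final Map<UUID, String> table = new LinkedHashMap<>();

    static void connect(UUID uuid, String logicalName) {
        table.put(uuid, logicalName);     // connect installs/overwrites the mapping
    }

    static boolean canRoute(UUID dest) {
        return table.containsKey(dest);   // the GR drops the message when false
    }

    public static void main(String[] args) {
        UUID oldUuid = UUID.randomUUID(); // UUID from the first connect
        connect(oldUuid, "probe_10.0.38.148:4576");

        // node restarts: the reconnect installs a fresh UUID for the same name
        table.remove(oldUuid);
        UUID newUuid = UUID.randomUUID();
        connect(newUuid, "probe_10.0.38.148:4576");

        // a coordinator's cached view may still address the node by oldUuid,
        // so the lookup misses even though the table dump shows the name:
        System.out.println("table dump: " + table.values());
        System.out.println("routable by cached UUID: " + canRoute(oldUuid)); // false
    }
}
```

This is only a model of the guess above, but it matches the symptom: the table dump contains the logical name while the message is still dropped.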
In a related case at a different site we also sometimes see the following log on the GR
when quite a few suspect messages are going around. This one is a little different: the
node (no logical name, just the long UUID hex address) is not in the routing table at all.
Again, the view and the GR list are out of sync.
{noformat}
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - ConnectionHandler[peer:
/10.0.38.148, logical_addrs: probe_10.0.38.148:4576] received
MESSAGE(group=10.0.19.249_group, addr=20ce7385-7330-10de-7dd8-d6ec8ac774d8, buffer: 90
bytes)
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - cannot find
20ce7385-7330-10de-7dd8-d6ec8ac774d8 in the routing table,
routing table=
10.0.19.249_group: manager_10.0.19.249:4576, probe_10.0.38.182:4576,
collector_10.0.19.127:4576, probe_10.0.41.103:4576, probe_10.0.38.33:4576,
probe_10.0.38.81:4576, probe_10.0.38.148:4576
{noformat}
Note, we are using TUNNEL with PING. Would using TCPGOSSIP with TUNNEL ensure that the two
lists (in the GR and in the membership) stay in sync? I doubt it, but it's just a thought.
I'm also not sure whether having two GRs has anything to do with this. I'm still trying to
figure out the different scenarios in which the GR list can get out of sync with the
membership list on the nodes.
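To confirm the out-of-sync theory empirically, one could dump both sides and diff them. The sketch below is a hypothetical diagnostic (not a JGroups API): given the view's uuid-to-logical-name mapping and the GR's, it lists the logical names whose UUIDs differ between the two.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;

// Hypothetical diagnostic: which logical names map to different UUIDs in the
// view vs. the GR routing table? A non-empty result is the suspected bug.
public class SyncCheck {
    // invert uuid -> name into name -> uuid for comparison by logical name
    static Map<String, UUID> byName(Map<UUID, String> uuidToName) {
        Map<String, UUID> out = new HashMap<>();
        uuidToName.forEach((uuid, name) -> out.put(name, uuid));
        return out;
    }

    static Set<String> mismatches(Map<UUID, String> view, Map<UUID, String> gr) {
        Map<String, UUID> v = byName(view), g = byName(gr);
        Set<String> bad = new TreeSet<>();
        for (String name : v.keySet())
            if (!Objects.equals(v.get(name), g.get(name)))
                bad.add(name);            // stale or missing UUID on one side
        return bad;
    }
}
```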
Gossip Router dropping message for node that is in its routing table list
-------------------------------------------------------------------------
Key: JGRP-1326
URL: https://issues.jboss.org/browse/JGRP-1326
Project: JGroups
Issue Type: Bug
Affects Versions: 2.10
Environment: Linux, Windows
Reporter: vivek v
Assignee: Bela Ban
Fix For: 2.12.2, 3.1
Attachments: pktmtntunnel.xml
We are using the TUNNEL protocol with two Gossip Routers. For some reason we start seeing
lots of suspect messages on all the nodes - there are 7 nodes in the group. Six of the
nodes (including the coordinator) were suspecting node A (manager_172.27.75.11) and node A
was suspecting the coordinator, but no new view was being created. After turning on trace
on both gossip routers (GR1 and GR2) I see the following for every message that's sent to
node A (manager_172.27.75.11):
{noformat}
2011-05-20 15:56:21,186 TRACE [gossip-handlers-6] GossipRouter - cannot find
manager_172.27.75.11:4576 in the routing table,
routing table=
172.27.75.11_group: probe_172.27.75.13:4576, collector_172.27.75.12:4576,
probe_172.27.75.15:4576, manager_172.27.75.11:4576, probe_172.27.75.16:4576,
probe_172.27.75.14:4576
{noformat}
Now, the issue is that the routing table does indeed show
"manager_172.27.75.11" - so why is the GR dropping messages for that node? I
suspect that the Gossip Router has somehow got an old entry which has not been cleaned
up - a different UUID with the same logical address. I tried going through the
GossipRouter.java code, but couldn't find how this would be possible.
As I understand it, a node randomly chooses a GR for its communication if there are
multiple of them. Each GR keeps a separate list of physical addresses for each node - so
is it possible that it somehow uses the physical address instead of the UUID for
cleaning/retrieving the node list?
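If cleanup really were keyed by physical address, it could explain the symptom. The sketch below uses made-up names (not GossipRouter.java) to show why: after a reconnect, a stale entry shares the node's host:port with the live entry, so cleanup by physical address would take the live entry down with it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Illustration of the question above, with invented names: cleanup keyed by
// physical address evicts every entry sharing that host:port, live or stale.
public class PhysicalAddrCleanup {
    static final class Entry {
        final String physicalAddr, logicalName;
        Entry(String physicalAddr, String logicalName) {
            this.physicalAddr = physicalAddr;
            this.logicalName  = logicalName;
        }
    }

    static final Map<UUID, Entry> table = new LinkedHashMap<>();

    // the suspected-buggy variant: remove by physical address, not by UUID
    static void removeByPhysicalAddr(String physicalAddr) {
        table.values().removeIf(e -> e.physicalAddr.equals(physicalAddr));
    }

    public static void main(String[] args) {
        UUID stale = UUID.randomUUID(), live = UUID.randomUUID();
        // same node after a reconnect: same host:port, new UUID
        table.put(stale, new Entry("172.27.75.11:4576", "manager_172.27.75.11"));
        table.put(live,  new Entry("172.27.75.11:4576", "manager_172.27.75.11"));

        removeByPhysicalAddr("172.27.75.11:4576");
        // both entries are gone, so messages for the live UUID get dropped too
        System.out.println("live entry still routable: " + table.containsKey(live));
    }
}
```

Again, this is only a guess at the mechanism; I couldn't confirm it from the code.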
This is creating a big issue for us, and the only workaround is to restart the Gossip
Routers.
--
This message is automatically generated by JIRA.
For more information on JIRA, see:
http://www.atlassian.com/software/jira