[
https://issues.jboss.org/browse/JGRP-1326?page=com.atlassian.jira.plugin....
]
Bela Ban commented on JGRP-1326:
--------------------------------
So the basic problem is: what happens when we have an old (Entry.removable=true) entry E in
the logical_address_cache of TP? This can happen in the following scenario:
* We have an entry for B (B is the logical address) in the logical_addr_cache
* Now B leaves the cluster and the next view excludes it.
** This does *not* remove the entry for B, but marks it as 'removable'
* Say member B is started again
* HOWEVER, we don't get the physical address for B, so the entry for B in
logical_address_cache is still the previous (stale) one
* Now a unicast to B will look up the old (removable) entry
SOLUTION:
The simplest solution is probably to do a lookup (discovery) when sending a unicast to an
entry which is removable. Currently we only do this when the entry for a given logical
address is absent (null). The change would be to still send the message when
entry.removable is true, but then to trigger a discovery request, so that when the
discovery response is received, the stale entry is replaced with the new information.
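The proposed behaviour can be sketched roughly as follows. This is only an illustration, not the actual TP code: the class StaleEntrySketch, its methods, and the physical addresses 10.0.0.2:7800 and 10.0.0.2:7900 are all hypothetical, and what happens to the message itself when no entry exists is not modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a transport with a logical-to-physical address cache. */
final class StaleEntrySketch {

    /** Cache entry: a physical address plus the 'removable' mark set when the member leaves the view. */
    static final class Entry {
        final String physicalAddr;
        boolean removable;
        Entry(String physicalAddr) { this.physicalAddr = physicalAddr; }
    }

    final Map<String, Entry> cache = new HashMap<>();
    final List<String> sent = new ArrayList<>();               // "logical@physical" for each message sent
    final List<String> discoveryRequests = new ArrayList<>();  // logical addresses we asked discovery about

    void add(String logical, String physical) {
        cache.put(logical, new Entry(physical));
    }

    /** A view change that excludes the member marks its entry as removable instead of deleting it. */
    void markRemovable(String logical) {
        Entry e = cache.get(logical);
        if (e != null) e.removable = true;
    }

    void sendUnicast(String logical) {
        Entry e = cache.get(logical);
        if (e == null) {
            // Behaviour described as current: discovery only when the entry is absent.
            discoveryRequests.add(logical);
            return;
        }
        // Send to whatever address we have, even if it may be stale.
        sent.add(logical + "@" + e.physicalAddr);
        if (e.removable) {
            // Proposed change: a removable entry also triggers a discovery request.
            discoveryRequests.add(logical);
        }
    }

    /** The discovery response replaces the stale entry with a fresh, non-removable one. */
    void onDiscoveryResponse(String logical, String physical) {
        cache.put(logical, new Entry(physical));
    }

    public static void main(String[] args) {
        StaleEntrySketch tp = new StaleEntrySketch();
        tp.add("B", "10.0.0.2:7800");                 // B's original (hypothetical) address
        tp.markRemovable("B");                        // B left; the entry stays but is removable
        tp.sendUnicast("B");                          // uses the stale entry, triggers discovery
        tp.onDiscoveryResponse("B", "10.0.0.2:7900"); // restarted B answers with its new address
        tp.sendUnicast("B");                          // now uses the fresh entry
        System.out.println("sent=" + tp.sent);
        System.out.println("discoveryRequests=" + tp.discoveryRequests);
    }
}
```

In this sketch the first unicast still goes to the stale address, which matches the proposal: the message is sent anyway, and only later unicasts benefit from the refreshed entry.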
Gossip Router dropping message for node that is in its routing table list
-------------------------------------------------------------------------
Key: JGRP-1326
URL: https://issues.jboss.org/browse/JGRP-1326
Project: JGroups
Issue Type: Bug
Affects Versions: 2.10
Environment: Linux, Windows
Reporter: vivek v
Assignee: Vladimir Blagojevic
Fix For: 3.3
Attachments: pktmtntunnel.xml
We are using the Tunnel protocol with two Gossip Routers. For some reason we started seeing
lots of suspect messages on all the nodes - there are 7 nodes in the group. Six of the
nodes (including the coordinator) were suspecting node A (manager_172.27.75.11) and node A
was suspecting the coordinator, but no new view was being created. After turning on
trace logging on both gossip routers (GR1 and GR2), I see the following for every message
that's sent to node A (manager_172.27.75.11):
{noformat}
2011-05-20 15:56:21,186 TRACE [gossip-handlers-6] GossipRouter - cannot find
manager_172.27.75.11:4576 in the routing table,
routing table=
172.27.75.11_group: probe_172.27.75.13:4576, collector_172.27.75.12:4576,
probe_172.27.75.15:4576, manager_172.27.75.11:4576, probe_172.27.75.16:4576,
probe_172.27.75.14:4576
{noformat}
Now, the issue is that the routing table does indeed show that there is a
"manager_172.27.75.11" entry - so why is the GR dropping messages for that node? I
suspect that the Gossip Router has somehow kept an old entry which has not been cleaned
up - a different UUID with the same logical address. I tried going through the
GossipRouter.java code, but couldn't find how this would be possible.
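To make the suspicion concrete, here is a minimal sketch of how a table keyed by UUID could print a logical name while a lookup for the same name fails. It is an assumption for illustration only and does not reflect how GossipRouter.java actually stores its routing table; the class RoutingTableSketch and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical routing table for one group: keyed by UUID, printed by logical name. */
final class RoutingTableSketch {

    // UUID -> logical name; the logical name is used only when dumping the table
    final Map<UUID, String> table = new LinkedHashMap<>();

    void register(UUID id, String logicalName) {
        table.put(id, logicalName);
    }

    /** Routing looks the destination up by UUID, not by logical name. */
    boolean canRoute(UUID dest) {
        return table.containsKey(dest);
    }

    /** The dump shows logical names only, so two different UUIDs can look identical. */
    String dump() {
        return String.join(", ", table.values());
    }

    public static void main(String[] args) {
        RoutingTableSketch gr = new RoutingTableSketch();
        UUID oldId = UUID.randomUUID(); // identity of the manager before a restart
        UUID newId = UUID.randomUUID(); // identity of the manager after a restart

        // Only the old identity is present; the stale entry was never cleaned up.
        gr.register(oldId, "manager_172.27.75.11:4576");

        System.out.println("can route to new UUID: " + gr.canRoute(newId));
        System.out.println("routing table=" + gr.dump());
    }
}
```

Under these assumptions the lookup for the new UUID fails even though the dumped table contains "manager_172.27.75.11:4576", which is the same pattern as in the trace above.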
As I understand it, when there are multiple GRs, a node randomly chooses one of them for
its communication. Each GR would keep a separate list of physical addresses for each node
- so is it possible that it somehow uses the physical address instead of the UUID for
cleaning/retrieving the node list?
This seems to be creating a big issue, and the only workaround is to restart the Gossip
Routers.