[ https://issues.jboss.org/browse/JGRP-1326?page=com.atlassian.jira.plugin.... ]
vivek v commented on JGRP-1326:
-------------------------------
We saw this happening at one more customer site (running 6 nodes). It looks like the view
UUIDs and the Gossip Router routing-table UUID list are not in sync. I suspect there is an
old cached UUID in the membership list (in the view), whereas the GR contains the latest
UUID - it could be vice versa, but the GR seems to get its UUIDs at connect time (an entry
gets overwritten with the new UUID every time a node connects), while the view can be
cached by the coordinator (received from PING) - and if for some reason the old UUID is
not removed, it could linger. This is just a wild guess, but it sure looks like there are
different UUIDs for the same logical name in the GR and in the group membership.
{noformat}
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - ConnectionHandler[peer:
/10.0.38.148, logical_addrs: probe_10.0.38.148:4576] received
MESSAGE(group=10.0.19.249_group, addr=probe_10.0.38.148:4576, buffer: 90 bytes)
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - cannot find
probe_10.0.38.148:4576 in the routing table,
routing table=
10.0.19.249_group: manager_10.0.19.249:4576, probe_10.0.38.182:4576,
collector_10.0.19.127:4576, probe_10.0.41.103:4576, probe_10.0.38.33:4576,
probe_10.0.38.81:4576, probe_10.0.38.148:4576
{noformat}
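To make the hypothesis above concrete, here is a minimal sketch of the suspected failure mode. All names here are illustrative, not the actual JGroups API: the map stands in for a GR routing table for one group, keyed by UUID but printed by logical name, so a stale UUID in a cached view misses the lookup even though the logical name still appears in the table dump.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative sketch only (not GossipRouter.java): routing table keyed by
// UUID, while log dumps print just the logical names.
public class StaleUuidDemo {
    static final Map<UUID, String> table = new LinkedHashMap<>();

    static void connect(UUID uuid, String logicalName) {
        table.put(uuid, logicalName);     // connect installs/overwrites the mapping
    }

    static boolean canRoute(UUID dest) {
        return table.containsKey(dest);   // the GR drops the message when false
    }

    public static void main(String[] args) {
        UUID oldUuid = UUID.randomUUID(); // UUID from the first connect
        connect(oldUuid, "probe_10.0.38.148:4576");

        // node restarts: the reconnect installs a fresh UUID for the same name
        table.remove(oldUuid);
        UUID newUuid = UUID.randomUUID();
        connect(newUuid, "probe_10.0.38.148:4576");

        // a coordinator's cached view may still address the node by oldUuid,
        // so the lookup misses even though the table dump shows the name:
        System.out.println("table dump: " + table.values());
        System.out.println("routable by cached UUID: " + canRoute(oldUuid)); // false
    }
}
```

This is only a model of the guess above, but it matches the symptom: the table dump contains the logical name while the message is still dropped.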
In a related case at a different site we also sometimes see the following log on the GR
when quite a few suspect messages are going around. This one is a little different: the
node (no logical name, just the long UUID hex address) is not in the routing table at all.
Again, the view and the GR list are out of sync.
{noformat}
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - ConnectionHandler[peer:
/10.0.38.148, logical_addrs: probe_10.0.38.148:4576] received
MESSAGE(group=10.0.19.249_group, addr=20ce7385-7330-10de-7dd8-d6ec8ac774d8, buffer: 90
bytes)
2011-05-25 16:27:36,205 TRACE [gossip-handlers-8] GossipRouter - cannot find
20ce7385-7330-10de-7dd8-d6ec8ac774d8 in the routing table,
routing table=
10.0.19.249_group: manager_10.0.19.249:4576, probe_10.0.38.182:4576,
collector_10.0.19.127:4576, probe_10.0.41.103:4576, probe_10.0.38.33:4576,
probe_10.0.38.81:4576, probe_10.0.38.148:4576
{noformat}
Note, we are using TUNNEL with PING. Would using TCPGOSSIP with TUNNEL ensure that the two
lists (in the GR and in the membership) stay in sync? I doubt it, but it's just a thought.
I'm also not sure whether having two GRs has anything to do with this. I'm still trying to
figure out the different scenarios in which the GR list can get out of sync with the
membership list on the nodes.
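To confirm the out-of-sync theory empirically, one could dump both sides and diff them. The sketch below is a hypothetical diagnostic (not a JGroups API): given the view's uuid-to-logical-name mapping and the GR's, it lists the logical names whose UUIDs differ between the two.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;

// Hypothetical diagnostic: which logical names map to different UUIDs in the
// view vs. the GR routing table? A non-empty result is the suspected bug.
public class SyncCheck {
    // invert uuid -> name into name -> uuid for comparison by logical name
    static Map<String, UUID> byName(Map<UUID, String> uuidToName) {
        Map<String, UUID> out = new HashMap<>();
        uuidToName.forEach((uuid, name) -> out.put(name, uuid));
        return out;
    }

    static Set<String> mismatches(Map<UUID, String> view, Map<UUID, String> gr) {
        Map<String, UUID> v = byName(view), g = byName(gr);
        Set<String> bad = new TreeSet<>();
        for (String name : v.keySet())
            if (!Objects.equals(v.get(name), g.get(name)))
                bad.add(name);            // stale or missing UUID on one side
        return bad;
    }
}
```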
Gossip Router dropping message for node that is in its routing table list
-------------------------------------------------------------------------
Key: JGRP-1326
URL: https://issues.jboss.org/browse/JGRP-1326
Project: JGroups
Issue Type: Bug
Affects Versions: 2.10
Environment: Linux, Windows
Reporter: vivek v
Assignee: Bela Ban
Fix For: 2.12.2, 3.1
Attachments: pktmtntunnel.xml
We are using the TUNNEL protocol with two Gossip Routers. For some reason we start seeing
lots of suspect messages on all the nodes - there are 7 nodes in the group. Six of the
nodes (including the coordinator) were suspecting node A (manager_172.27.75.11) and node A
was suspecting the coordinator, but no new view was being created. After turning on trace
on both gossip routers (GR1 and GR2) I see the following for every message that's sent to
node A (manager_172.27.75.11):
{noformat}
2011-05-20 15:56:21,186 TRACE [gossip-handlers-6] GossipRouter - cannot find
manager_172.27.75.11:4576 in the routing table,
routing table=
172.27.75.11_group: probe_172.27.75.13:4576, collector_172.27.75.12:4576,
probe_172.27.75.15:4576, manager_172.27.75.11:4576, probe_172.27.75.16:4576,
probe_172.27.75.14:4576
{noformat}
Now, the issue is that the routing table does indeed show
"manager_172.27.75.11" - so why is the GR dropping messages for that node? I
suspect that the Gossip Router has somehow got an old entry which has not been cleaned
up - a different UUID with the same logical address. I tried going through the
GossipRouter.java code, but couldn't find how this would be possible.
As I understand it, a node randomly chooses a GR for its communication if there are
multiple of them. Each GR keeps a separate list of physical addresses for each node - so
is it possible that it somehow uses the physical address instead of the UUID for
cleaning/retrieving the node list?
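If cleanup really were keyed by physical address, it could explain the symptom. The sketch below uses made-up names (not GossipRouter.java) to show why: after a reconnect, a stale entry shares the node's host:port with the live entry, so cleanup by physical address would take the live entry down with it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Illustration of the question above, with invented names: cleanup keyed by
// physical address evicts every entry sharing that host:port, live or stale.
public class PhysicalAddrCleanup {
    static final class Entry {
        final String physicalAddr, logicalName;
        Entry(String physicalAddr, String logicalName) {
            this.physicalAddr = physicalAddr;
            this.logicalName  = logicalName;
        }
    }

    static final Map<UUID, Entry> table = new LinkedHashMap<>();

    // the suspected-buggy variant: remove by physical address, not by UUID
    static void removeByPhysicalAddr(String physicalAddr) {
        table.values().removeIf(e -> e.physicalAddr.equals(physicalAddr));
    }

    public static void main(String[] args) {
        UUID stale = UUID.randomUUID(), live = UUID.randomUUID();
        // same node after a reconnect: same host:port, new UUID
        table.put(stale, new Entry("172.27.75.11:4576", "manager_172.27.75.11"));
        table.put(live,  new Entry("172.27.75.11:4576", "manager_172.27.75.11"));

        removeByPhysicalAddr("172.27.75.11:4576");
        // both entries are gone, so messages for the live UUID get dropped too
        System.out.println("live entry still routable: " + table.containsKey(live));
    }
}
```

Again, this is only a guess at the mechanism; I couldn't confirm it from the code.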
This is creating a big issue for us, and the only workaround is to restart the Gossip
Routers.
--
This message is automatically generated by JIRA.
For more information on JIRA, see:
http://www.atlassian.com/software/jira