[JBoss JIRA] Created: (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer


vivek v (JIRA)

Tuesday, 9 March 2010

9:02 p.m.

Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer
-------------------------------------------------------------------------------------------------

Key: JGRP-1165
URL: https://jira.jboss.org/jira/browse/JGRP-1165
Project: JGroups
Issue Type: Bug
Affects Versions: 2.9, 2.8
Reporter: vivek v
Assignee: Bela Ban

GMS has logic (in the installView(..) method) that checks whether the node itself is in the view; if it is not, the view is simply discarded:

    if(checkSelfInclusion(mbrs) == false) {
        if(log.isWarnEnabled())
            log.warn(local_addr + ": not member of view " + new_view + "; discarding it");
        return;
    }

The problem with this logic is that the node keeps the old view. When it sends messages to the members of the old view, those messages are discarded by NAKACK, because this node is not in their new view. Here is an example:

1) 3 nodes all with the same view - V1 {n1, n2, n3}
2) n1 (coordinator) suspects n2 (due to a missing heartbeat) and publishes a new view - V2 {n1, n3}. n2 discards the suspect message from n1 as FD_SOCK is still connected
3) n2 receives this view, but discards it due to the logic in GMS
4) n2 still keeps the old view V1 and continues to send messages to n1 and n3. n1 and n3 discard messages from n2 with NAKACK as it's not in their view (V2)
5) After a few minutes (could be 10-15 minutes or more) n1 publishes a merge view V3 {n1, n2, n3}, joining V1 and V2. Now all nodes have the same view

The problem is that on n2 the application layer never learns that it can't talk to n1 and n3; thus, RPC calls fail during the time the nodes have different views. I would assume that if a node gets a view which doesn't contain itself, it should drop all the nodes that are in that new view. So, basically, we would create two new subgroups. This way we won't discard messages from each other. The application layer needs to know at all times which nodes it can talk to.
-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
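
The scenario in the report above can be replayed with a small stand-alone simulation. This is only a sketch: the Node class, installView and accepts below are hypothetical stand-ins written for illustration, not JGroups classes, and the "drop messages from non-members" rule is a simplified reading of the NAKACK behaviour the reporter describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ViewDivergenceSketch {

    /** A toy cluster member (hypothetical, not a JGroups class). */
    public static final class Node {
        public final String name;
        public List<String> view;

        Node(String name, List<String> initialView) {
            this.name = name;
            this.view = new ArrayList<>(initialView);
        }

        /** Mirrors the quoted check: a view that does not contain this node is discarded. */
        public boolean installView(List<String> newView) {
            if (!newView.contains(name)) {
                return false; // the old view stays in place
            }
            view = new ArrayList<>(newView);
            return true;
        }

        /** Simplified stand-in for NAKACK: only messages from current view members are accepted. */
        public boolean accepts(String sender) {
            return view.contains(sender);
        }
    }

    /** Replays steps 1-4 of the report and returns the nodes keyed by name. */
    public static Map<String, Node> run() {
        List<String> v1 = Arrays.asList("n1", "n2", "n3");
        List<String> v2 = Arrays.asList("n1", "n3");
        Map<String, Node> nodes = new LinkedHashMap<>();
        for (String name : v1) {
            nodes.put(name, new Node(name, v1)); // step 1: everyone starts with V1
        }
        for (Node node : nodes.values()) {
            node.installView(v2); // steps 2-3: V2 is published, n2 discards it
        }
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, Node> nodes = run();
        for (Node node : nodes.values()) {
            System.out.println(node.name + " view: " + node.view);
        }
        // Step 4: n2 still addresses n1, but n1 no longer considers n2 a member.
        System.out.println("n1 accepts message from n2: " + nodes.get("n1").accepts("n2"));
    }
}
```

In this toy model n2 keeps V1 while n1 and n3 move to V2, so n2 continues to address peers that reject its messages, which is the window during which the reporter sees failing RPC calls.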


Bela Ban (JIRA)

Wednesday, 10 March

1:47 a.m.

New subject: [JBoss JIRA] Commented: (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

[ https://jira.jboss.org/jira/browse/JGRP-1165?page=com.atlassian.jira.plug... ]

Bela Ban commented on JGRP-1165:
--------------------------------

The merge should not take 10 minutes, unless you configured MERGE2 with a (too) high timeout! I'd also configure the timeout in FD to be higher...

I'll take a look at your suggestion for N2 to drop N1 and N3 if it gets a view in which it isn't a member.


Bela Ban (JIRA)

1:50 a.m.

New subject: [JBoss JIRA] Updated: (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

[ https://jira.jboss.org/jira/browse/JGRP-1165?page=com.atlassian.jira.plug... ]

Bela Ban updated JGRP-1165:
---------------------------

Fix Version/s: 2.10


Bela Ban (JIRA)

1:50 a.m.

New subject: [JBoss JIRA] Commented: (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

[ https://jira.jboss.org/jira/browse/JGRP-1165?page=com.atlassian.jira.plug... ]

Bela Ban commented on JGRP-1165:
--------------------------------

If you have asymmetric merges as described above, I recommend bounding RPCs with a timeout, e.g. 5 seconds, so the calls always return after 5 s at most.
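
The idea of bounding a call with a timeout can be sketched in plain Java. This is not the JGroups RPC API: callWithTimeout and its fallback parameter are hypothetical names, and the sketch uses java.util.concurrent only to show a caller that always returns within a fixed time even when no response ever arrives.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedCallSketch {

    /**
     * Runs the call on a worker thread and waits at most timeoutMs for its result.
     * Returns the fallback value if the call does not complete in time.
     */
    public static <T> T callWithTimeout(Callable<T> call, long timeoutMs, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(call);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the call that is still waiting
                return fallback;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return fallback;
            } catch (ExecutionException e) {
                throw new IllegalStateException("call failed", e.getCause());
            }
        } finally {
            executor.shutdownNow(); // do not leave the worker thread behind
        }
    }

    public static void main(String[] args) {
        // Simulates a request whose response never arrives, e.g. because peers drop our messages.
        String result = callWithTimeout(() -> {
            Thread.sleep(60_000);
            return "response";
        }, 200, "timed out");
        System.out.println(result);
    }
}
```

With a bound like this, a caller on the isolated node sees a timeout result instead of blocking until the views converge again.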


Bela Ban (JIRA)

Wednesday, 31 March

7:30 a.m.

New subject: [JBoss JIRA] Commented: (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

[ https://jira.jboss.org/jira/browse/JGRP-1165?page=com.atlassian.jira.plug... ]

Bela Ban commented on JGRP-1165:
--------------------------------

I'm hesitant to do this: any spurious view can wreak havoc on the cluster. E.g. after a merge we have {A,B,C,D,E,F,G} and receive a spurious view {G} (e.g. due to simple retransmission, with a timer stopped a few ms too late), and now everyone but G removes all members not in the view!

A view is something that's installed as the result of an agreement process (GMS), and we should not take decisions regarding views locally, without consensus!

In your case above:

N1: V2 {N1, N3}
N2: V1 {N1, N2, N3}
N3: V2 {N1, N3}

MERGE2 should merge V1 and V2 back into V3 within a short amount of time, typically between MERGE2.min_interval and MERGE2.max_interval milliseconds. It should definitely *not* take 10 minutes! We should therefore fix the cause of why this is taking 10 minutes, rather than unilaterally install a view.

Knowing you use GossipRouter, I suspect this is an issue in the discovery protocol rather than in MERGE2. Fixing the former should render this case moot.

I'm closing this issue; please feel free to re-open it if you have new information pertaining to it.
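
The hazard of acting on a view locally can be illustrated with a toy model. The sketch below assumes one reading of the proposal (a node that receives a view excluding itself drops every member listed in that view); applyProposedRule is a hypothetical helper, not JGroups code, and the member names come from the example in the comment above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SpuriousViewSketch {

    /**
     * Assumed local rule: if the received view contains this node, install it;
     * otherwise keep the current view minus every member listed in the received view.
     */
    public static Set<String> applyProposedRule(String self, Set<String> current, Set<String> received) {
        if (received.contains(self)) {
            return new TreeSet<>(received);
        }
        Set<String> result = new TreeSet<>(current);
        result.removeAll(received);
        return result;
    }

    /** Delivers the spurious view {G} to every member of {A..G} and returns each member's resulting view. */
    public static Map<String, Set<String>> run() {
        Set<String> merged = new TreeSet<>(Arrays.asList("A", "B", "C", "D", "E", "F", "G"));
        Set<String> spurious = new TreeSet<>(Collections.singletonList("G"));
        Map<String, Set<String>> views = new TreeMap<>();
        for (String member : merged) {
            views.put(member, applyProposedRule(member, merged, spurious));
        }
        return views;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Set<String>> entry : run().entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

In this model a single stale view splits a healthy seven-member cluster into {A,B,C,D,E,F} and {G} without any agreement round, which is the kind of unilateral change the comment argues against.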


Bela Ban (JIRA)

7:32 a.m.

New subject: [JBoss JIRA] Closed: (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

[ https://jira.jboss.org/jira/browse/JGRP-1165?page=com.atlassian.jira.plug... ]

Bela Ban closed JGRP-1165.
--------------------------

Resolution: Rejected


vivek v (JIRA)

Thursday, 8 September

6:54 p.m.

New subject: [JBoss JIRA] Reopened: (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

vivek v reopened JGRP-1165:
---------------------------

Bela,

We are currently on 2.12.1 and I still see this issue, where a node gets isolated and is never able to join back. Here is what's happening:

1) All nodes are on view 1, V1 {n1, n2, n3}; everything is fine here

2) Due to some network issues a new view is created on n1 and n3, V2 {n3, n1}

3) n2 gets a merge view, but rejects it, saying it's not in the merge view:

{noformat}
2011-09-04 10:07:08,399 WARN [Incoming-9,204.99.64.103_group,probe_10.112.1.130:4576] GMS - probe_10.112.1.130:4576: not member of view MergeView::[probe_10.24.1.135:4576|17] [probe_10.24.1.135:4576, probe_10.84.9.149:4576, probe_204.99.64.105:4576, probe_204.99.64.108:4576, probe_10.36.12.137:4576, probe_10.103.1.130:4576, probe_10.126.1.106:4576, collector_204.99.64.104:4576, probe_10.36.12.138:4576, probe_204.99.64.107:4576, manager_204.99.64.103:4576, probe_204.99.64.106:4576, probe_10.84.15.15:4576], subgroups=[[probe_10.24.1.135:4576|16] [probe_10.24.1.135:4576, probe_10.84.9.149:4576, probe_204.99.64.105:4576, probe_204.99.64.108:4576, probe_10.36.12.137:4576, probe_10.103.1.130:4576, probe_10.126.1.106:4576, collector_204.99.64.104:4576, probe_10.36.12.138:4576, probe_204.99.64.107:4576, manager_204.99.64.103:4576, probe_204.99.64.106:4576, probe_10.84.15.15:4576], [probe_10.24.1.135:4576|16] [manager_204.99.64.103:4576, probe_204.99.64.105:4576, probe_204.99.64.106:4576, probe_204.99.64.108:4576, probe_10.103.1.130:4576, probe_10.126.1.106:4576, probe_10.84.9.149:4576, probe_10.84.15.15:4576, probe_10.36.12.137:4576, probe_10.36.12.138:4576, probe_10.24.1.135:4576, probe_204.99.64.107:4576]]; discarding it
{noformat}

4) n1 discards messages from n2 with NAKACK:

{noformat}
2011-09-08 16:29:07,122 WARN [OOB-5,204.99.64.103_group,manager_204.99.64.103:4576] NAKACK - manager_204.99.64.103:4576: dropped message from probe_10.112.1.130:4576 (not in table [probe_10.24.1.135:4576, probe_10.36.12.137:4576, probe_10.84.9.149:4576, probe_10.84.15.15:4576, collector_204.99.64.104:4576, probe_10.103.1.130:4576, probe_204.99.64.105:4576, probe_204.99.64.108:4576, manager_204.99.64.103:4576, probe_204.99.64.106:4576, probe_10.126.1.106:4576, probe_204.99.64.107:4576, probe_10.36.12.138:4576]), view=MergeView::[probe_10.24.1.135:4576|17] [probe_10.24.1.135:4576, probe_10.84.9.149:4576, probe_204.99.64.105:4576, probe_204.99.64.108:4576, probe_10.36.12.137:4576, probe_10.103.1.130:4576, probe_10.126.1.106:4576, collector_204.99.64.104:4576, probe_10.36.12.138:4576, probe_204.99.64.107:4576, manager_204.99.64.103:4576, probe_204.99.64.106:4576, probe_10.84.15.15:4576], subgroups=[[probe_10.24.1.135:4576|16] [probe_10.24.1.135:4576, probe_10.84.9.149:4576, probe_204.99.64.105:4576, probe_204.99.64.108:4576, probe_10.36.12.137:4576, probe_10.103.1.130:4576, probe_10.126.1.106:4576, collector_204.99.64.104:4576, probe_10.36.12.138:4576, probe_204.99.64.107:4576, manager_204.99.64.103:4576, probe_204.99.64.106:4576, probe_10.84.15.15:4576], [probe_10.24.1.135:4576|16] [manager_204.99.64.103:4576, probe_204.99.64.105:4576, probe_204.99.64.106:4576, probe_204.99.64.108:4576, probe_10.103.1.130:4576, probe_10.126.1.106:4576, probe_10.84.9.149:4576, probe_10.84.15.15:4576, probe_10.36.12.137:4576, probe_10.36.12.138:4576, probe_10.24.1.135:4576, probe_204.99.64.107:4576]]
{noformat}

5) Now n2 remains isolated and is never able to join back.

Attached is our protocol stack (we use tunneling with two Gossip Routers for load balancing and redundancy). Re-opening for Bela to look at this logic again.


vivek v (JIRA)

6:56 p.m.

New subject: [JBoss JIRA] Updated: (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

vivek v updated JGRP-1165:
--------------------------

Fix Version/s: 2.12.2
Affects Version/s: 2.12.1


vivek v (JIRA)

6:59 p.m.

New subject: [JBoss JIRA] Updated: (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

vivek v updated JGRP-1165:
--------------------------

Attachment: pktmtntunnel.xml


Bela Ban (Updated) (JIRA)

Tuesday, 18 October

6:18 a.m.

New subject: [JBoss JIRA] (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban updated JGRP-1165:
---------------------------

Fix Version/s: 2.12.3, 3.1 (was: 2.10, 2.12.2)


Bela Ban (Commented) (JIRA)

Wednesday, 16 November

10:22 a.m.

New subject: [JBoss JIRA] (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1165:
--------------------------------

Sorry it took me so long to look at this issue. Is it still relevant?

...

Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer
-------------------------------------------------------------------------------------------------

Key: JGRP-1165
URL: https://issues.jboss.org/browse/JGRP-1165
Project: JGroups
Issue Type: Bug
Affects Versions: 2.8, 2.9, 2.12.1
Reporter: vivek v
Assignee: Bela Ban
Fix For: 2.12.3, 3.1
Attachments: pktmtntunnel.xml

There is logic in GMS (in the installView(..) method) that checks whether the node itself is in the view and, if it is not, simply discards the view:

    if(checkSelfInclusion(mbrs) == false) {
        if(log.isWarnEnabled())
            log.warn(local_addr + ": not member of view " + new_view + "; discarding it");
        return;
    }

The problem with this logic is that the node keeps its old view. When it sends messages to the members of the old view, those messages are discarded by NAKACK on the receivers, because this node is no longer in their new view. Here is an example:

1) 3 nodes all have the same view - V1 {n1, n2, n3}
2) n1 (coordinator) suspects n2 (due to a missing heartbeat) and publishes a new view - V2 {n1, n3} - n2 discards the suspect message from n1 as FD_SOCK is still connected
3) n2 receives this view, but discards it due to the logic in GMS
4) n2 still keeps the old view V1 and continues to send messages to n1 and n3. n1 and n3 discard messages from n2 in NAKACK as it is not in their view (V2).
5) After a few minutes (could be 10-15 minutes or more) n1 publishes a merge view V3 {n1, n2, n3} - joining V1 and V2. Now all nodes have the same view.

The problem is that on n2 the application layer never learns that it can't talk to n1 and n3, so RPC calls fail for as long as the nodes have different views. I would assume that if a node gets a view that doesn't contain the node itself, it should drop all the nodes that are in that new view. Basically, we would create two new subgroups. This way we won't discard messages from each other.

The application layer needs to know at all times which nodes it can talk to.
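
The five-step scenario in the report can be played through with a small stand-alone simulation. This is only a sketch under stated assumptions: the class ViewDivergenceSketch, its nested Node class and the accepts() method are hypothetical names made up for illustration, none of this is JGroups code, and the receiver-side filter is a simplification of what the report attributes to NAKACK. The only part taken from the report is the rule that a node discards a view it is not a member of.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ViewDivergenceSketch {

    /** Hypothetical stand-in for a cluster member; not a JGroups class. */
    public static final class Node {
        public final String name;
        public List<String> view;

        public Node(String name, List<String> initialView) {
            this.name = name;
            this.view = new ArrayList<>(initialView);
        }

        /**
         * Mirrors the self-inclusion check quoted in the report: a view that
         * does not contain this node is discarded and the old view is kept.
         * Returns true if the view was installed.
         */
        public boolean installView(List<String> newView) {
            if (!newView.contains(name)) {
                return false; // discarded, node stays on its old view
            }
            view = new ArrayList<>(newView);
            return true;
        }

        /** Simplified receiver-side filter: only senders in the current view are accepted. */
        public boolean accepts(Node sender) {
            return view.contains(sender.name);
        }
    }

    public static void main(String[] args) {
        // Step 1: all nodes share V1 {n1, n2, n3}
        List<String> v1 = Arrays.asList("n1", "n2", "n3");
        Node n1 = new Node("n1", v1);
        Node n2 = new Node("n2", v1);
        Node n3 = new Node("n3", v1);

        // Steps 2 and 3: the coordinator publishes V2 {n1, n3}; n2 discards it
        List<String> v2 = Arrays.asList("n1", "n3");
        for (Node n : Arrays.asList(n1, n2, n3)) {
            n.installView(v2);
        }

        // Step 4: n2 still believes n1 and n3 are reachable, but they drop its messages
        System.out.println("n2 view: " + n2.view);
        System.out.println("n1 accepts messages from n2: " + n1.accepts(n2));
        System.out.println("n3 accepts messages from n2: " + n3.accepts(n2));
    }
}
```

In this model n2 keeps listing n1 and n3 as members while both of them reject n2 as a sender, which is the window in which the report says RPC calls fail.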

Bela Ban (Commented) (JIRA)

10:26 a.m.

New subject: [JBoss JIRA] (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1165:
--------------------------------

OK, here's what I did to reproduce this (only on 3.0!):
- Commented out ENCRYPT and AUTH; I didn't want to deal with those issues, only to see whether there's a problem in the underlying merging code
- Reduced the timeouts in FD_ALL, MERGE and PING
- Added <DISCARD use_gui="true"/> over TUNNEL
- Started a GossipRouter on port 12001
- Started 3 instances of Draw: java -Djava.net.preferIPv4Stack=true org.jgroups.demos.Draw -props ./pktmux.xml -name A | B | C
- The 3 members should form a cluster
- Then selected "discard traffic from B" in A and C
- A and C install view {A,C} after a few (ca. 10) seconds
- B still has the same view
- After a few seconds, {A,C} and {B} merge into {A,B,C}

Bela Ban (Commented) (JIRA)

10:30 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1165:
--------------------------------

You can re-create this if you want. The ZIP for the latest JGroups (will be 3.0.0.Final) can be downloaded at https://github.com/belaban/JGroups/zipball/master. I'm attaching the changed config file for reference as well.

Bela Ban (Updated) (JIRA)

10:30 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban updated JGRP-1165:
---------------------------

Attachment: pktmtntunnel.xml

Bela Ban (Commented) (JIRA)

10:30 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1165:
--------------------------------

Perhaps [1] fixed this issue... Note that I haven't looked at adding ENCRYPT and AUTH back to your config...

[1] https://issues.jboss.org/browse/JGRP-1379

Bela Ban (Updated) (JIRA)

10:32 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban updated JGRP-1165:
---------------------------

Fix Version/s: (was: 2.12.3)

Bela Ban (Commented) (JIRA)

Friday, 25 November Fri, 25 Nov

2:24 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban commented on JGRP-1165:
--------------------------------

I'm going to close this case soon unless I hear from you, Vivek. Can you try to reproduce it under 3.0?

Cheers,

Bela Ban (JIRA)

Monday, 16 January Mon, 16 Jan

6:41 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban updated JGRP-1165:
---------------------------

Fix Version/s: 3.2 (was: 3.1)

Bela Ban (JIRA)

Tuesday, 28 August Tue, 28 Aug

5:55 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban updated JGRP-1165:
---------------------------

Fix Version/s: 3.3 (was: 3.2)

Bela Ban (JIRA)

Tuesday, 9 October Tue, 9 Oct

5:38 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban updated JGRP-1165:
---------------------------

Attachment: (was: pktmtntunnel.xml)

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira

Bela Ban (JIRA)

5:42 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban resolved JGRP-1165.
----------------------------

Resolution: Won't Fix

No comment after Nov 2011, and I wasn't able to reproduce this. Feel free to re-open if you can reproduce it under 3.x.

Bela Ban (JIRA)

5:44 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban updated JGRP-1165:
---------------------------

Fix Version/s: 3.2 (was: 3.3)

Bela Ban (JIRA)

5:44 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban reopened JGRP-1165:
----------------------------

Bela Ban (JIRA)

5:44 a.m.

[ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.... ]

Bela Ban resolved JGRP-1165.
----------------------------

Resolution: Cannot Reproduce Bug
