[
https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin....
]
vivek v reopened JGRP-1165:
---------------------------
Bela,
We are currently on 2.12.1 and I see this issue, where a node gets isolated and is
never able to join back. Here is what's happening:
1) All nodes are on view V1 {n1, n2, n3}; everything is fine here.
2) Due to some network issues, a new view is created on n1 and n3: V2 {n3, n1}.
3) n2 gets a merge view, but rejects it, saying it's not in the merge view:
{noformat}
2011-09-04 10:07:08,399 WARN [Incoming-9,204.99.64.103_group,probe_10.112.1.130:4576] GMS
- probe_10.112.1.130:4576: not member of view MergeView::[probe_10.24.1.135:4576|17]
[probe_10.24.1.135:4576, probe_10.84.9.149:4576, probe_204.99.64.105:4576,
probe_204.99.64.108:4576, probe_10.36.12.137:4576, probe_10.103.1.130:4576,
probe_10.126.1.106:4576, collector_204.99.64.104:4576, probe_10.36.12.138:4576,
probe_204.99.64.107:4576, manager_204.99.64.103:4576, probe_204.99.64.106:4576,
probe_10.84.15.15:4576], subgroups=[[probe_10.24.1.135:4576|16] [probe_10.24.1.135:4576,
probe_10.84.9.149:4576, probe_204.99.64.105:4576, probe_204.99.64.108:4576,
probe_10.36.12.137:4576, probe_10.103.1.130:4576, probe_10.126.1.106:4576,
collector_204.99.64.104:4576, probe_10.36.12.138:4576, probe_204.99.64.107:4576,
manager_204.99.64.103:4576, probe_204.99.64.106:4576, probe_10.84.15.15:4576],
[probe_10.24.1.135:4576|16] [manager_204.99.64.103:4576, probe_204.99.64.105:4576,
probe_204.99.64.106:4576, probe_204.99.64.108:4576, probe_10.103.1.130:4576,
probe_10.126.1.106:4576, probe_10.84.9.149:4576, probe_10.84.15.15:4576,
probe_10.36.12.137:4576, probe_10.36.12.138:4576, probe_10.24.1.135:4576,
probe_204.99.64.107:4576]]; discarding it
{noformat}
4) NAKACK on n1 discards messages from n2:
{noformat}
2011-09-08 16:29:07,122 WARN [OOB-5,204.99.64.103_group,manager_204.99.64.103:4576]
NAKACK - manager_204.99.64.103:4576: dropped message from probe_10.112.1.130:4576 (not in
table [probe_10.24.1.135:4576, probe_10.36.12.137:4576, probe_10.84.9.149:4576,
probe_10.84.15.15:4576, collector_204.99.64.104:4576, probe_10.103.1.130:4576,
probe_204.99.64.105:4576, probe_204.99.64.108:4576, manager_204.99.64.103:4576,
probe_204.99.64.106:4576, probe_10.126.1.106:4576, probe_204.99.64.107:4576,
probe_10.36.12.138:4576]), view=MergeView::[probe_10.24.1.135:4576|17]
[probe_10.24.1.135:4576, probe_10.84.9.149:4576, probe_204.99.64.105:4576,
probe_204.99.64.108:4576, probe_10.36.12.137:4576, probe_10.103.1.130:4576,
probe_10.126.1.106:4576, collector_204.99.64.104:4576, probe_10.36.12.138:4576,
probe_204.99.64.107:4576, manager_204.99.64.103:4576, probe_204.99.64.106:4576,
probe_10.84.15.15:4576], subgroups=[[probe_10.24.1.135:4576|16] [probe_10.24.1.135:4576,
probe_10.84.9.149:4576, probe_204.99.64.105:4576, probe_204.99.64.108:4576,
probe_10.36.12.137:4576, probe_10.103.1.130:4576, probe_10.126.1.106:4576,
collector_204.99.64.104:4576, probe_10.36.12.138:4576, probe_204.99.64.107:4576,
manager_204.99.64.103:4576, probe_204.99.64.106:4576, probe_10.84.15.15:4576],
[probe_10.24.1.135:4576|16] [manager_204.99.64.103:4576, probe_204.99.64.105:4576,
probe_204.99.64.106:4576, probe_204.99.64.108:4576, probe_10.103.1.130:4576,
probe_10.126.1.106:4576, probe_10.84.9.149:4576, probe_10.84.15.15:4576,
probe_10.36.12.137:4576, probe_10.36.12.138:4576, probe_10.24.1.135:4576,
probe_204.99.64.107:4576]]
{noformat}
5) Now n2 remains isolated and is never able to join back.
Attached is our protocol stack (we use tunneling with two Gossip Routers for load
balancing and redundancy).
Re-opening for Bela to look at this logic again.
Out-of-sync views in the cluster causes NAKACK issues and invalid
node list at application layer
-------------------------------------------------------------------------------------------------
Key: JGRP-1165
URL:
https://issues.jboss.org/browse/JGRP-1165
Project: JGroups
Issue Type: Bug
Affects Versions: 2.8, 2.9
Reporter: vivek v
Assignee: Bela Ban
Fix For: 2.10
There is logic in GMS (in the installView(..) method) that checks whether the node
itself is in the view; if it is not, the view is simply discarded:
    if(checkSelfInclusion(mbrs) == false) {
        if(log.isWarnEnabled())
            log.warn(local_addr + ": not member of view " + new_view + "; discarding it");
        return;
    }
The problem with this logic is that the node keeps the old view, and when it tries to
send messages to the members of the old view, those messages are discarded by NAKACK
because this node is not in their new view. Here is an example:
1) 3 nodes all with same view - V1 {n1, n2, n3}
2) n1 (coordinator) suspects (due to missing heartbeat) n2 and publishes new view - V2
{n1, n3}
- n2 discards the suspect message from n1 as FD_SOCK is still connected
3) n2 receives this view, but discards it due to the logic in GMS
4) n2 still keeps the old view V1 and continues to send messages to n1 and n3. n1 and n3
discard messages from n2 in NAKACK, as n2 is not in their view (V2).
5) After a few minutes (could be 10-15 minutes or more) n1 publishes a merge view V3 {n1,
n2, n3}, joining V1 and V2. Now all nodes have the same view.
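To make the window between steps 2 and 5 concrete, here is a small toy model of it. It is not JGroups code: addresses are plain strings, and the accepts(..) helper is a hypothetical stand-in for the membership check that NAKACK applies to incoming messages.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the scenario above; none of these names come from JGroups. */
class StaleViewScenario {

    /** Assumed stand-in for the NAKACK check: accept only senders in the receiver's view. */
    static boolean accepts(List<String> receiverView, String sender) {
        return receiverView.contains(sender);
    }

    public static void main(String[] args) {
        Map<String, List<String>> views = new LinkedHashMap<>();
        views.put("n1", Arrays.asList("n1", "n3"));       // V2
        views.put("n3", Arrays.asList("n1", "n3"));       // V2
        views.put("n2", Arrays.asList("n1", "n2", "n3")); // still V1, because V2 was discarded

        // n2 keeps sending to everybody it believes is a member.
        for (String target : views.get("n2")) {
            if (target.equals("n2")) continue;
            System.out.println("n2 -> " + target + " accepted: "
                    + accepts(views.get(target), "n2"));
        }
    }
}
```

In this model both sends from n2 are rejected, while n2's own view still lists n1 and n3 as reachable.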
The problem is that on n2 the application layer never learns that it can't talk to n1
and n3; thus, RPC calls fail for as long as the nodes have different views.
I would expect that a node which gets a view that doesn't contain itself should
drop all the nodes in that new view from its own view. So, basically, we would create
two subgroups. This way we won't discard messages from each other. The application layer
needs to know at all times which nodes it can talk to.
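A minimal sketch of the behaviour proposed here, assuming views can be modelled as lists of string addresses. The class and the onView(..) method are hypothetical and only illustrate the idea; this is not how GMS is implemented.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustration of the proposal only; these names do not exist in JGroups. */
class SelfExclusionSketch {

    /**
     * Returns the membership the local node should use after receiving newView.
     * If the local node is in newView, the new view is adopted as is.
     * Otherwise the members of newView are removed from the current view,
     * so the local node ends up in its own subgroup.
     */
    static Set<String> onView(String localAddr, List<String> currentView, List<String> newView) {
        if (newView.contains(localAddr)) {
            return new LinkedHashSet<>(newView);
        }
        Set<String> remaining = new LinkedHashSet<>(currentView);
        remaining.removeAll(newView);
        return remaining;
    }

    public static void main(String[] args) {
        List<String> v1 = Arrays.asList("n1", "n2", "n3");
        List<String> v2 = Arrays.asList("n1", "n3");
        System.out.println(onView("n2", v1, v2)); // prints [n2]
        System.out.println(onView("n3", v1, v2)); // prints [n1, n3]
    }
}
```

With this rule n2 would report {n2} to the application as soon as it sees V2, so the application would stop sending RPCs to n1 and n3 until a merge brings the subgroups back together.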