Bela Ban updated JGRP-1165:
---------------------------
Attachment: (was: pktmtntunnel.xml)
Out-of-sync views in the cluster causes NAKACK issues and invalid
node list at application layer
-------------------------------------------------------------------------------------------------
Key: JGRP-1165
URL: https://issues.jboss.org/browse/JGRP-1165
Project: JGroups
Issue Type: Bug
Affects Versions: 2.8, 2.9, 2.12.1
Reporter: vivek v
Assignee: Bela Ban
Fix For: 3.3
Attachments: pktmtntunnel.xml
GMS contains logic (in the installView(..) method) that checks whether the node
itself is in the new view and, if it is not, simply discards the view:
    if(checkSelfInclusion(mbrs) == false) {
        if(log.isWarnEnabled())
            log.warn(local_addr + ": not member of view " + new_view + "; discarding it");
        return;
    }
The problem with this logic is that the node keeps the old view. When it then sends
messages to the members of the old view, those members discard the messages in NAKACK,
because the sender is no longer in their new view. Here is an example:
1) 3 nodes, all with the same view: V1 {n1, n2, n3}
2) n1 (coordinator) suspects n2 (due to a missing heartbeat) and publishes a new view:
V2 {n1, n3}
- n2 discards the suspect message from n1 as FD_SOCK is still connected
3) n2 receives V2, but discards it due to the logic in GMS
4) n2 keeps the old view V1 and continues to send messages to n1 and n3. n1 and n3
discard the messages from n2 in NAKACK as n2 is not in their view (V2).
5) After a few minutes (could be 10-15 minutes or more) n1 publishes a merge view
V3 {n1, n2, n3}, joining V1 and V2. Now all nodes have the same view again.
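The steps above can be replayed with a small toy model. This is only a sketch: ViewDivergenceDemo, Node, installView and accepts are hypothetical names built on plain Java collections, not JGroups classes, and the "receiver drops messages from non-members" rule is a simplification of what NAKACK does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these types are part of JGroups.
final class ViewDivergenceDemo {

    /** Hypothetical stand-in for a cluster member and its installed view. */
    static final class Node {
        final String name;
        List<String> view;

        Node(String name, List<String> initialView) {
            this.name = name;
            this.view = new ArrayList<>(initialView);
        }

        /** Mirrors the check quoted above: a view that omits this node is discarded. */
        boolean installView(List<String> newView) {
            if (!newView.contains(name)) {
                return false; // old view is kept
            }
            view = new ArrayList<>(newView);
            return true;
        }

        /** Simplified NAKACK rule: accept a message only if the sender is in our view. */
        boolean accepts(Node sender) {
            return view.contains(sender.name);
        }
    }

    public static void main(String[] args) {
        List<String> v1 = Arrays.asList("n1", "n2", "n3");
        List<String> v2 = Arrays.asList("n1", "n3");
        Node n1 = new Node("n1", v1);
        Node n2 = new Node("n2", v1);
        Node n3 = new Node("n3", v1);

        // Step 2/3: n1 publishes V2; n2 discards it because it is not a member.
        for (Node n : Arrays.asList(n1, n2, n3)) {
            n.installView(v2);
        }

        // Step 4: n2 still believes in V1, but n1 and n3 drop its messages.
        System.out.println("n2 view: " + n2.view);                // [n1, n2, n3]
        System.out.println("n1 accepts n2: " + n1.accepts(n2));   // false
        System.out.println("n3 accepts n2: " + n3.accepts(n2));   // false
    }
}
```

In this model n2 has no way to notice that its sends go nowhere, which is the gap between steps 4 and 5.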
The problem is that on n2 the application layer never learns that it can't talk to n1
and n3; thus, RPC calls fail for as long as the nodes have different views.
I would expect that if a node gets a view that doesn't contain the node itself, it should
drop all the nodes that are in that new view. So, basically, we would create two new
subgroups. This way we won't be discarding each other's messages. The application layer
needs to know at all times which nodes it can talk to.
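A minimal sketch of the suggested behaviour, assuming views are plain lists of member names. SelfExclusionSketch and resolveView are hypothetical names used for illustration; this is not how GMS is implemented and not necessarily how the issue was fixed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JGroups.
final class SelfExclusionSketch {

    /**
     * Returns the view a node could install locally.
     * If newView contains self, it is taken as is. Otherwise the node keeps
     * only the old members that are NOT listed in newView, i.e. it drops
     * everybody who has moved on to the view that excludes it.
     */
    static List<String> resolveView(String self, List<String> oldView, List<String> newView) {
        if (newView.contains(self)) {
            return new ArrayList<>(newView);
        }
        List<String> remaining = new ArrayList<>(oldView);
        remaining.removeAll(newView);
        if (!remaining.contains(self)) {
            remaining.add(self); // a node is always a member of its own view
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<String> v1 = Arrays.asList("n1", "n2", "n3");
        List<String> v2 = Arrays.asList("n1", "n3");
        // n2 is excluded from V2, so it ends up in a subgroup of its own.
        System.out.println(resolveView("n2", v1, v2)); // [n2]
        // n3 is in V2 and simply installs it.
        System.out.println(resolveView("n3", v1, v2)); // [n1, n3]
    }
}
```

With this rule n2 would report {n2} to the application instead of the stale V1, so the application would know right away that n1 and n3 are unreachable until the merge view arrives.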