[jboss-jira] [JBoss JIRA] Updated: (JGRP-1165) Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer

Thu Sep 8 19:59:26 EDT 2011

     [ https://issues.jboss.org/browse/JGRP-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

vivek v updated JGRP-1165:
--------------------------

    Attachment: pktmtntunnel.xml


> Out-of-sync views in the cluster causes NAKACK issues and invalid node list at application layer 
> -------------------------------------------------------------------------------------------------
>
>                 Key: JGRP-1165
>                 URL: https://issues.jboss.org/browse/JGRP-1165
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.8, 2.9, 2.12.1
>            Reporter: vivek v
>            Assignee: Bela Ban
>             Fix For: 2.10, 2.12.2
>
>         Attachments: pktmtntunnel.xml
>
>
> GMS (in the installView(..) method) checks whether the node itself is in the new view and, if it is not, simply discards the view:
> if(checkSelfInclusion(mbrs) == false) {
>             if(log.isWarnEnabled()) log.warn(local_addr + ": not member of view " + new_view + "; discarding it");
>             return;
>  }
> The problem with this logic is that the node keeps its old view. When it then sends messages to the members of the old view, those members discard the messages in NAKACK, because the sender is not in their new view. Here is an example:
> 1) Three nodes all have the same view - V1 {n1, n2, n3}
> 2) n1 (coordinator) suspects n2 (due to a missing heartbeat) and publishes a new view - V2 {n1, n3}
>          -  n2 discards the suspect message from n1 as FD_SOCK is still connected
> 3) n2 receives this view, but discards it due to the logic in GMS
> 4) n2 still keeps the old view V1 and continues to send messages to n1 and n3. n1 and n3 discard messages from n2 in NAKACK as it is not in their view (V2).
> 5) After a few minutes (could be 10-15 minutes or more) n1 publishes a merge view V3 {n1, n2, n3}, joining V1 and V2. Now all nodes have the same view again.
> The problem is that on n2 the application layer never learns that it cannot talk to n1 and n3, so RPC calls fail for as long as the nodes have different views.
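
The scenario above can be replayed with a small toy model. This is only a sketch: the class and method names (OutOfSyncViews, Node, accepts, replay) are made up for illustration and are not JGroups APIs, views are modelled as plain sets of node names, and the NAKACK drop is reduced to a simple "is the sender in my view" test.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of steps 1-4; hypothetical names, not the JGroups implementation.
class OutOfSyncViews {

    /** A node reduced to its name and the member set of its current view. */
    static final class Node {
        final String name;
        Set<String> view;

        Node(String name, Set<String> initialView) {
            this.name = name;
            this.view = new LinkedHashSet<>(initialView);
        }

        /** Mirrors the quoted check: a view that excludes this node is discarded. */
        boolean installView(Set<String> newView) {
            if (!newView.contains(name)) {
                return false; // view discarded, old view stays in place
            }
            view = new LinkedHashSet<>(newView);
            return true;
        }

        /** Stand-in for the receiver-side drop: only senders in the current view are accepted. */
        boolean accepts(String sender) {
            return view.contains(sender);
        }
    }

    /** Replays the scenario and reports, per receiver, whether a message from n2 is accepted. */
    static Map<String, Boolean> replay() {
        Set<String> v1 = new LinkedHashSet<>(List.of("n1", "n2", "n3"));
        Set<String> v2 = new LinkedHashSet<>(List.of("n1", "n3"));

        Node n1 = new Node("n1", v1);
        Node n2 = new Node("n2", v1);
        Node n3 = new Node("n3", v1);

        // n1 publishes V2; n2 discards it because it is not a member of V2.
        for (Node n : List.of(n1, n2, n3)) {
            n.installView(v2);
        }

        // n2 still addresses every other member of its stale view V1.
        Map<String, Boolean> accepted = new LinkedHashMap<>();
        for (Node receiver : List.of(n1, n3)) {
            if (n2.view.contains(receiver.name)) {
                accepted.put(receiver.name, receiver.accepts(n2.name));
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints {n1=false, n3=false}
    }
}
```

In this model n2 still lists n1 and n3 as members, yet both reject its messages, and nothing on n2's side signals that to the application.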
> I would expect that a node which receives a view that does not contain itself should drop, from its own view, all the nodes that are in that new view. This would effectively create two subgroups, so the two sides would no longer be discarding each other's messages. The application layer needs to know at all times which nodes it can talk to.
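
The behaviour proposed above could look roughly like the following. Again this is a sketch under assumptions, not what GMS does: membersAfter and SelfExclusionSketch are hypothetical names, views are plain sets of node names, and the question of how the resulting subgroup would elect a coordinator or later merge is left out.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of the reporter's proposal; hypothetical names, not JGroups code.
class SelfExclusionSketch {

    /**
     * Returns the member set a node would keep after receiving newView.
     * If the node is a member of newView, it adopts newView unchanged.
     * Otherwise it keeps only the old members that are absent from newView,
     * and always keeps itself.
     */
    static Set<String> membersAfter(String self, Set<String> oldView, Set<String> newView) {
        if (newView.contains(self)) {
            return new LinkedHashSet<>(newView);
        }
        Set<String> remaining = new LinkedHashSet<>(oldView);
        remaining.removeAll(newView); // drop every node that moved on to the new view
        remaining.add(self);
        return remaining;
    }

    public static void main(String[] args) {
        Set<String> v1 = new LinkedHashSet<>(List.of("n1", "n2", "n3"));
        Set<String> v2 = new LinkedHashSet<>(List.of("n1", "n3"));

        System.out.println(membersAfter("n2", v1, v2)); // prints [n2]
        System.out.println(membersAfter("n3", v1, v2)); // prints [n1, n3]
    }
}
```

With this rule n2 would end up in a subgroup of its own instead of holding on to V1, so the application on n2 would see right away that n1 and n3 are no longer reachable.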
