[jboss-jira] [JBoss JIRA] (JBMESSAGING-1902) JBMessaging cluster stops working after a node gets suspended (kill -STOP) and unsuspended (kill -CONT) after few minutes

Yong Hao Gao (Commented) (JIRA) jira-events at lists.jboss.org
Tue Oct 25 04:51:45 EDT 2011


    [ https://issues.jboss.org/browse/JBMESSAGING-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637108#comment-12637108 ] 

Yong Hao Gao commented on JBMESSAGING-1902:
-------------------------------------------

I suggest we increase NodeRefreshInterval to a reasonable value to solve this issue.

With kill -STOP, the node is put into a 'frozen' state. The other members of the cluster get a JGroups notification that it has left, but they have no way to learn the node's real state other than watching its entry in the DB cluster table. The suspended node, however, cannot update its state while it is suspended. With the current implementation, the other members will eventually consider this node dead and perform failover for it.

As we know, performing failover for a live node may cause duplicate messages, so such a failover should never happen.
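
To illustrate the timing argument, here is a sketch of the kind of staleness check the other members effectively apply. It assumes a node is declared dead when its last update in the cluster table is older than some timeout. The function name and the timeout parameter are hypothetical, and how JBoss Messaging derives its actual threshold from NodeRefreshInterval is not shown in this thread.

```shell
# Hypothetical staleness check, not the JBoss Messaging implementation.
# A node is treated as dead when its last state update in the cluster
# table is older than the timeout. All arguments are in seconds.
# Usage: is_considered_dead LAST_UPDATE NOW TIMEOUT
is_considered_dead() {
    last_update=$1
    now=$2
    timeout=$3
    # Succeeds (exit status 0) when the update is older than the timeout.
    [ $((now - last_update)) -gt "$timeout" ]
}
```

With a 300-second suspension, `is_considered_dead 0 300 60` succeeds (the node is treated as dead), while `is_considered_dead 0 300 600` fails (the node is still considered alive), which is the effect a larger interval is meant to achieve.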


                
> JBMessaging cluster stops working after a node gets suspended (kill -STOP) and unsuspended (kill -CONT) after few minutes
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: JBMESSAGING-1902
>                 URL: https://issues.jboss.org/browse/JBMESSAGING-1902
>             Project: JBoss Messaging
>          Issue Type: Feature Request
>          Components: JMS Clustering
>    Affects Versions: 1.4.8.SP3
>         Environment: JBoss EAP 5.1.0
>            Reporter: Tom Ross
>            Assignee: Yong Hao Gao
>            Priority: Blocker
>
> This is a very simple case where a two-node JBoss Messaging cluster stops working after one node is suspended (kill -STOP) for a few minutes. It appears that when the node wakes up from the suspension, it misses JGroups notifications and carries on as if nothing had happened. Meanwhile, while the node was suspended, the remaining node performed failover and deleted the suspended node from the jbm_cluster table.
> How to reproduce the problem:
> Create a two-node cluster using the all profile.
> Start node 1.
> Start node 2.
> After both nodes are running, find the pid of the JVM hosting node 2 (pid-jvm) and the pid of the shell process running it (pid-shell):
> ps -fu ${user-name} | grep java
> kill -STOP pid-jvm
> kill -STOP pid-shell
> Wait 5 minutes.
> kill -CONT pid-shell
> kill -CONT pid-jvm
> Observe that:
> node 1 has noticed that node 2 is missing and performed failover.
> node 2 has been removed from the JBM_CLUSTER table.
> JGroups is functioning normally.
> The JBM cluster is not working: node 2 is no longer part of the cluster.
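
The suspend/resume part of the reproduction steps quoted above can be scripted roughly as follows. This is a minimal sketch: suspend_resume is a hypothetical helper name, and the pids are ones you look up yourself with ps as in the steps. It stops the given processes in the order listed, waits, and resumes them in reverse order, matching the JVM-then-shell and shell-then-JVM order in the description.

```shell
# Hypothetical helper: suspend the given processes in order, wait,
# then resume them in reverse order.
# Usage: suspend_resume SECONDS PID [PID ...]
suspend_resume() {
    wait_seconds=$1
    shift
    reversed=""
    for pid in "$@"; do
        # Stop each process; give up if a pid cannot be signalled.
        # Note: processes stopped before the failure stay stopped.
        kill -STOP "$pid" || return 1
        reversed="$pid $reversed"
    done
    sleep "$wait_seconds"
    for pid in $reversed; do
        kill -CONT "$pid" || return 1
    done
}
```

For the five-minute case in the description, a call such as `suspend_resume 300 "$JVM_PID" "$SHELL_PID"` would apply, where JVM_PID and SHELL_PID are placeholder variables holding the two pids.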
