[jboss-jira] [JBoss JIRA] (JBMESSAGING-1902) JBMessaging cluster stops working after a node gets suspended (kill -STOP) and unsuspended (kill -CONT) after few minutes
Yong Hao Gao (Commented) (JIRA)
jira-events at lists.jboss.org
Mon Oct 24 04:04:45 EDT 2011
[ https://issues.jboss.org/browse/JBMESSAGING-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636781#comment-12636781 ]
Yong Hao Gao commented on JBMESSAGING-1902:
-------------------------------------------
I think the key problem is that the suspended node does not update its timestamp at all, even though it is still alive. We need to find a way to make the other nodes aware of this situation.
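One possible direction, sketched here under assumptions and not taken from the JBoss Messaging code base: the resumed node could notice the pause itself by measuring the gap between its own periodic ticks, and then re-announce itself to the cluster. The class name `SuspensionDetector`, the 30-second threshold, and the idea of rejoining after a detected gap are all hypothetical. The caller is assumed to supply timestamps from a clock that keeps advancing while the process is stopped (for example `System.nanoTime()`).

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper; not part of JBoss Messaging or JGroups.
// It flags a tick that arrives much later than expected, which is what a
// process sees after being stopped (kill -STOP) and continued (kill -CONT).
final class SuspensionDetector {
    private final long maxGapNanos;
    private long lastTickNanos;

    SuspensionDetector(long maxGap, TimeUnit unit, long startNanos) {
        this.maxGapNanos = unit.toNanos(maxGap);
        this.lastTickNanos = startNanos;
    }

    // Records a tick and returns true if the time since the previous tick
    // exceeds the allowed gap, i.e. the node was probably suspended.
    synchronized boolean tick(long nowNanos) {
        long gapNanos = nowNanos - lastTickNanos;
        lastTickNanos = nowNanos;
        return gapNanos > maxGapNanos;
    }

    public static void main(String[] args) {
        // Synthetic timestamps stand in for System.nanoTime() readings.
        SuspensionDetector detector = new SuspensionDetector(30, TimeUnit.SECONDS, 0L);
        long oneSecond = TimeUnit.SECONDS.toNanos(1);

        // Normal tick, one second after start.
        System.out.println("paused? " + detector.tick(oneSecond));

        // Next tick arrives five minutes later, as after a 5 minute suspension.
        System.out.println("paused? " + detector.tick(oneSecond + TimeUnit.MINUTES.toNanos(5)));
    }
}
```

In a real node the tick would presumably be driven by the same timer that updates the timestamp, and a `true` result would trigger whatever re-registration step the cluster needs. That step is not shown because it depends on JBM internals.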
> JBMessaging cluster stops working after a node gets suspended (kill -STOP) and unsuspended (kill -CONT) after few minutes
> -------------------------------------------------------------------------------------------------------------------------
>
> Key: JBMESSAGING-1902
> URL: https://issues.jboss.org/browse/JBMESSAGING-1902
> Project: JBoss Messaging
> Issue Type: Feature Request
> Components: JMS Clustering
> Affects Versions: 1.4.8.SP3
> Environment: JBoss EAP 5.1.0
> Reporter: Tom Ross
> Assignee: Yong Hao Gao
> Priority: Blocker
>
> This is a very simple case where a two-node JBoss Messaging cluster stops working after one node is suspended (kill -STOP) for a few minutes. It appears that when the node wakes up from the suspension, it has missed the JGroups notifications and carries on as if nothing had happened. Meanwhile, while the node was suspended, the remaining node performed failover and deleted the suspended node from the jbm_cluster table.
> How to reproduce the problem:
> Create a two-node cluster using the all profile.
> Start node 1.
> Start node 2.
> After both nodes are running,
> find the pid of the JVM hosting node 2 and the pid of the shell process running it:
> ps -fu ${user-name} | grep java
> kill -STOP pid-jvm
> kill -STOP pid-shell (the shell running the JVM)
> wait 5 minutes
> kill -CONT pid-shell (the shell running the JVM)
> kill -CONT pid-jvm
> Observe that:
> Node 1 has noticed that node 2 is missing and performed failover.
> Node 2 has been removed from the JBM_CLUSTER table.
> JGroups is functioning normally.
> The JBM cluster is not working: node 2 is no longer part of the cluster.