[jboss-jira] [JBoss JIRA] (JBMESSAGING-1902) JBMessaging cluster stops working after a node gets suspended (kill -STOP) and unsuspended (kill -CONT) after few minutes

Yong Hao Gao (Commented) (JIRA) jira-events at lists.jboss.org
Wed Oct 26 22:47:45 EDT 2011


    [ https://issues.jboss.org/browse/JBMESSAGING-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637855#comment-12637855 ] 

Yong Hao Gao commented on JBMESSAGING-1902:
-------------------------------------------

Makes sense. Another potential problem is the message sucker's connections. The connection will time out at the Remoting layer and be disconnected. But if it retries the connection when the node is un-paused, messages will be sucked over to other nodes. This also creates contention between message sucking and failover.


                
> JBMessaging cluster stops working after a node gets suspended (kill -STOP) and unsuspended (kill -CONT) after few minutes
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: JBMESSAGING-1902
>                 URL: https://issues.jboss.org/browse/JBMESSAGING-1902
>             Project: JBoss Messaging
>          Issue Type: Feature Request
>          Components: JMS Clustering
>    Affects Versions: 1.4.8.SP3
>         Environment: JBoss EAP 5.1.0
>            Reporter: Tom Ross
>            Assignee: Yong Hao Gao
>            Priority: Blocker
>
> This is a very simple case where a two-node JBoss Messaging cluster stops working after one node is suspended (kill -STOP) for a few minutes. It appears that when the node wakes up from the suspension, it misses the JGroups notifications and carries on as if nothing had happened. Meanwhile, while the node was suspended, the remaining node performed failover and deleted the suspended node from the JBM_CLUSTER table.
> How to reproduce the problem:
> 1. Create a two-node cluster using the "all" profile.
> 2. Start node 1.
> 3. Start node 2.
> 4. After both nodes are running, find the PID of the JVM hosting node 2 and the PID of the shell process running it:
>    ps -fu ${user-name} | grep java
> 5. kill -STOP pid-jvm
> 6. kill -STOP on the shell running the JVM
> 7. Wait 5 minutes.
> 8. kill -CONT on the shell running the JVM
> 9. kill -CONT pid-jvm
> Observe that:
> - node 1 has noticed that node 2 is missing and has performed failover;
> - node 2 has been removed from the JBM_CLUSTER table;
> - JGroups is functioning normally;
> - the JBM cluster is not working: node 2 is no longer part of the cluster.
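
The suspend/resume part of the reproduction can be scripted so the pause length is repeatable. This is a minimal sketch, assuming a POSIX shell and that both PIDs have already been found with ps as above; the function name suspend_for is hypothetical and not part of any JBoss tooling. Run it from a separate terminal, not from the shell being suspended.

```shell
#!/bin/sh
# Hypothetical helper (not part of JBoss): suspend the JVM and the shell
# running it with SIGSTOP, wait, then resume both with SIGCONT, in the same
# order as the manual reproduction steps.
suspend_for() {
  jvm_pid=$1      # PID of the JVM hosting node 2
  shell_pid=$2    # PID of the shell running that JVM
  seconds=$3      # how long to keep both suspended (300 = 5 minutes)

  kill -STOP "$jvm_pid"
  kill -STOP "$shell_pid"
  sleep "$seconds"
  kill -CONT "$shell_pid"
  kill -CONT "$jvm_pid"
}

# Example: suspend_for "$JVM_PID" "$SHELL_PID" 300
```

No early exit is taken on errors, so a failed SIGSTOP on one process never leaves the other one stopped indefinitely.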

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.jboss.org/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

