[
https://issues.jboss.org/browse/JBMESSAGING-1902?page=com.atlassian.jira....
]
Yong Hao Gao commented on JBMESSAGING-1902:
-------------------------------------------
I'd suggest we increase NodeRefreshInterval to a reasonable value to solve
this issue.
Using kill -STOP puts the node into a 'frozen' state. The other members of the
cluster will get a JGroups notification that it has left, but they have no way to
learn the node's real state other than watching its entry in the DB cluster table. However,
the suspended node cannot update its state during the suspension. With the current
implementation, the other members will eventually consider this node dead and perform
failover for it.
As we know, performing failover for a live node may cause duplicated messages, so this
failover should never happen.
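The reasoning above can be illustrated with a minimal sketch. It is not the JBoss Messaging implementation: the function name is_node_stale, its arguments, and the numbers used below are all hypothetical, and the exact meaning and unit of NodeRefreshInterval are not stated in this comment. The sketch only shows why a node that cannot refresh its entry for longer than the allowed interval looks dead to its peers.

```shell
#!/bin/sh
# Hypothetical staleness check (not JBoss Messaging code).
# A node is treated as dead when its last state update is older than
# the allowed interval. All three arguments are in seconds.
# Usage: is_node_stale LAST_UPDATE NOW ALLOWED_INTERVAL
is_node_stale() {
    last_update=$1
    now=$2
    allowed=$3
    # Succeeds (exit status 0) when the node has been silent for too long.
    [ $((now - last_update)) -gt "$allowed" ]
}
```

For example, is_node_stale 0 300 60 succeeds: a node that has been silent for 300 seconds (the 5-minute suspension in this report) exceeds an assumed 60-second limit, so a peer would start failover. With an assumed 600-second limit the same check fails and no failover would be triggered.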
JBMessaging cluster stops working after a node gets suspended (kill
-STOP) and unsuspended (kill -CONT) after few minutes
-------------------------------------------------------------------------------------------------------------------------
Key: JBMESSAGING-1902
URL: https://issues.jboss.org/browse/JBMESSAGING-1902
Project: JBoss Messaging
Issue Type: Feature Request
Components: JMS Clustering
Affects Versions: 1.4.8.SP3
Environment: JBoss EAP 5.1.0
Reporter: Tom Ross
Assignee: Yong Hao Gao
Priority: Blocker
This is a very simple case in which a two-node JBoss Messaging cluster stops working after
one node is suspended (kill -STOP) for a few minutes. It appears that when the node
wakes up from the suspension, it has missed the JGroups notifications and carries on as if
nothing had happened. Meanwhile, while the node was suspended, the remaining node performed
failover and deleted the suspended node from the jbm_cluster table.
How to reproduce the problem:
Create a two-node cluster using the all profile.
Start node 1.
Start node 2.
After both nodes are running, find the pid of the JVM hosting node 2 and the pid of the
shell process running it:
ps -fu ${user-name} | grep java
Then suspend both processes, wait, and resume them:
kill -STOP pid-jvm
kill -STOP pid-of-shell-running-jvm
wait 5 minutes
kill -CONT pid-of-shell-running-jvm
kill -CONT pid-jvm
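The suspend and resume steps above could be wrapped in a small helper, sketched here under assumptions: the function name suspend_resume and its argument order are hypothetical and not part of the original report. It sends SIGSTOP to each given pid in order, sleeps, then sends SIGCONT in reverse order, matching the JVM-then-shell and shell-then-JVM sequence above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report).
# Usage: suspend_resume SECONDS PID [PID...]
suspend_resume() {
    wait_seconds=$1
    shift
    # Freeze each process in the order given (e.g. JVM first, then its shell).
    for pid in "$@"; do
        kill -STOP "$pid" || return 1
    done
    sleep "$wait_seconds"
    # Build the reversed pid list so processes resume in the opposite order.
    reversed=""
    for pid in "$@"; do
        reversed="$pid $reversed"
    done
    for pid in $reversed; do
        kill -CONT "$pid" || return 1
    done
}
```

With the pids found via ps, the 5-minute scenario would be run as: suspend_resume 300 pid-jvm pid-of-shell-running-jvm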
Observe that:
node 1 has noticed that node 2 is missing and performed failover.
node 2 has been removed from the JBM_CLUSTER table.
JGroups is functioning normally.
The JBM cluster is not working; node 2 is no longer part of the cluster.