Cluster loses its master
------------------------
Key: JBAS-4313
URL: http://jira.jboss.com/jira/browse/JBAS-4313
Project: JBoss Application Server
Issue Type: Bug
Security Level: Public (Everyone can see)
Components: Clustering
Affects Versions: JBossAS-4.2.0.CR1, JBossAS-4.0.5.GA, JBossAS-4.0.4.GA, JBossAS-4.0.2 Final
Environment: Linux 2.6.x, Sun JVM 1.5.0_11-b03 and 1.6.0-b105, JBoss 4.0.2 Final (the
code that causes the problem appears to be present up to version 4.2.0.CR1), cluster
with two or more nodes.
Reporter: Bernd Köcke
Assigned To: Brian Stansberry
In a cluster with two nodes (node_a and node_b) under high workload, a restarted node
that is not the master can rejoin the cluster before the master has recognised that the
old node has gone away. As a result, the restarted node seems to appear in the cluster
view twice. The problem lies in DistributedReplicantManagerImpl (DRMI) and
ClusterPartition. The setCurrentState method of DRMI on node_b sets the replicants map,
and this map still contains the old, dead node (node_bo). The key for the new node_b
(node_bn) and for the old one is the same: node_b:1199. When the master recognises that
node_bo is dead, it sends a new cluster view without the old node: node_a, node_b(n). But
because of the implementation of HAPartitionImpl.getDeadMembers and
DRMI.membershipChanged, node_bo is never removed from the replicants map of node_bn's
DRMI. If node_a is restarted after this, node_bn becomes the new master. JGroups knows
this, but the DRMI does not: because of the old node in the replicants map, it returns
'false' from 'isMasterReplica' for every service. The restarted node_a is not the first
in the cluster view, so there is no master. This situation persists until node_bn is
restarted.
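The failure mode can be illustrated with a small, self-contained model. This is only a sketch and not the JBoss source: the class ReplicantModel, its methods, the service key "svc" and the name node_a:1099 are hypothetical. It assumes that the master lookup walks the cluster view in order and treats any name found in the map of remote replicants as another node that owns the service.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the lookup described above (not JBoss code).
final class ReplicantModel {
    private final String localNodeName;
    // Services registered by this node itself.
    private final Set<String> localServices = new HashSet<>();
    // Service key -> names of remote nodes that registered a replicant.
    private final Map<String, Set<String>> remoteReplicants = new HashMap<>();

    ReplicantModel(String localNodeName) {
        this.localNodeName = localNodeName;
    }

    void addLocal(String service) {
        localServices.add(service);
    }

    void addRemote(String service, String node) {
        remoteReplicants.computeIfAbsent(service, k -> new HashSet<>()).add(node);
    }

    void removeRemote(String service, String node) {
        Set<String> nodes = remoteReplicants.get(service);
        if (nodes != null) {
            nodes.remove(node);
        }
    }

    // Assumed rule: the first node in the view that holds the service is master.
    boolean isMasterReplica(String service, List<String> view) {
        Set<String> remote =
            remoteReplicants.getOrDefault(service, Collections.<String>emptySet());
        for (String node : view) {
            if (remote.contains(node)) {
                return false; // an earlier "remote" node appears to own the service
            }
            if (node.equals(localNodeName)) {
                return localServices.contains(service);
            }
        }
        return false;
    }
}
```

In this model, node_bn (local name node_b:1199) receives a remote entry under the same name node_b:1199 through the state transfer. Even when node_b:1199 is first in the view, isMasterReplica returns false until that stale entry is removed.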
The cause of the problem is the method ClusterPartition.generateUniqueNodeName. If the
JNDI subsystem is working, the method generates a node name of the form <server
name>:<JNDI port>. This name is not unique across a node restart. To 'solve'
the problem, I commented out the return statement for the JNDI-based node name. The
method then generates a name of the form <server name>:<generated UID>. This name is set
as additional data in JGroups and is in turn used as the key in the various maps in DRMI.
With this fix, node_bo is removed from the replicants map, but the cluster view looks a
little ugly because of the UID. Others may want to stay with the JNDI port, because
someone could be using the node names in the cluster view to determine the JNDI port of
the other members. So my suggestion is to add a configuration option that always uses the
UID instead of the JNDI port, with the JNDI port remaining the default.
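A minimal sketch of what such an option might look like follows. It is not the actual ClusterPartition code: the class NodeNames, the method generate and the flag alwaysUseUid are hypothetical names, and java.util.UUID stands in for whatever UID generator the real method uses.

```java
import java.util.UUID;

// Hypothetical sketch of a configurable node-name scheme (not JBoss code).
final class NodeNames {
    private NodeNames() {
    }

    /**
     * Builds a node name for the cluster view.
     *
     * @param serverName   name of the server
     * @param jndiPort     JNDI port, or a value <= 0 if JNDI is unavailable
     * @param alwaysUseUid if true, ignore the JNDI port and always use a UID
     */
    static String generate(String serverName, int jndiPort, boolean alwaysUseUid) {
        if (!alwaysUseUid && jndiPort > 0) {
            // Default behaviour: readable, but identical after a restart.
            return serverName + ":" + jndiPort;
        }
        // Differs on every call, so a restarted node gets a fresh key.
        return serverName + ":" + UUID.randomUUID();
    }
}
```

With alwaysUseUid set to false, two successive calls for node_b and port 1199 both yield node_b:1199, which is the collision described above. With the flag set to true, each call yields a different name, at the cost of a less readable cluster view.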
While making this change, I found a deadlock condition in DRMI. I will file a separate
bug report for it; it seems to affect only version 4.0.2.