[jboss-jira] [JBoss JIRA] Created: (JBAS-4313) Cluster loses its master

Tue Apr 10 13:01:58 EDT 2007

Cluster loses its master
------------------------

                 Key: JBAS-4313
                 URL: http://jira.jboss.com/jira/browse/JBAS-4313
             Project: JBoss Application Server
          Issue Type: Bug
      Security Level: Public (Everyone can see)
          Components: Clustering
    Affects Versions: JBossAS-4.2.0.CR1, JBossAS-4.0.5.GA, JBossAS-4.0.4.GA, JBossAS-4.0.2 Final
         Environment: Linux 2.6.x; Sun JVM 1.5.0_11-b03 and 1.6.0-b105; JBoss 4.0.2.Final (the code that causes the problem seems to be present up to version 4.2.0.CR1); cluster with two or more nodes.
            Reporter: Bernd Köcke
         Assigned To: Brian Stansberry

In a cluster with two nodes (node_a and node_b) under high load, a restarted node that is not the master can rejoin the cluster before the master has recognised that the old instance has gone away. As a result, the restarted node appears in the cluster view twice. The problem lies in DistributedReplicantManagerImpl (DRMI) and ClusterPartition. The setCurrentState method of the DRMI on node_b sets the replicants map, and this map still contains the old, dead instance of node_b (node_bo). The key for the new instance (node_bn) and for the old one is the same: node_b:1199. When the master recognises that node_bo is dead, it sends a new cluster view without the old node: node_a, node_b(n). But because of the way HAPartitionImpl.getDeadMembers and DRMI.membershipChanged are implemented, node_bo is never removed from the replicants map of node_bn's DRMI. If node_a is restarted after this, node_bn becomes the new master. JGroups knows this, but the DRMI does not: because of the old node in the replicants map, it returns 'false' from 'isMasterReplica' for every service. The restarted node_a is not first in the cluster view, so there is no master. This situation persists until node_bn is restarted.
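To illustrate why a stale entry with an identical key can go unnoticed, here is a minimal sketch. It is a simplified model, not the actual HAPartitionImpl or DRMI code: it assumes dead members are found by subtracting the new view from the old view and comparing node names only. The class name, the method name, the name node_a:1099 and the UID suffixes are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model (an assumption, not JBoss code): members are compared
// purely by node name, as they would be when the name is used as the map key.
class DeadMemberSketch {

    // Returns the names that are in oldView but no longer in newView.
    static List<String> deadMembers(List<String> oldView, List<String> newView) {
        List<String> dead = new ArrayList<>(oldView);
        dead.removeAll(newView); // removes every element that also occurs in newView
        return dead;
    }

    public static void main(String[] args) {
        // Names based on the JNDI port: old and new instance of node_b collide.
        List<String> oldByPort = Arrays.asList("node_a:1099", "node_b:1199", "node_b:1199");
        List<String> newByPort = Arrays.asList("node_a:1099", "node_b:1199");
        System.out.println("dead (port names): " + deadMembers(oldByPort, newByPort));

        // Names based on a generated UID (hypothetical values): the old instance is distinguishable.
        List<String> oldByUid = Arrays.asList("node_a:1099", "node_b:uid-old", "node_b:uid-new");
        List<String> newByUid = Arrays.asList("node_a:1099", "node_b:uid-new");
        System.out.println("dead (UID names): " + deadMembers(oldByUid, newByUid));
    }
}
```

In this model the port-based names yield an empty list of dead members, so nothing would trigger the removal of node_bo, while the UID-based names report the old instance as dead.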

The cause of the problem is the method ClusterPartition.generateUniqueNodeName. If the JNDI subsystem is working, the method generates a node name of the form <server name>:<JNDI port>. This is not unique across a node restart. To 'solve' the problem I commented out the return statement for the JNDI-based node name. The method then generates a name of the form <server name>:<generated UID>. This name is set as additional data in JGroups and is in turn used as the key in the various maps in DRMI. With this fix, node_bo is removed from the replicants map. But the cluster view looks a little ugly because of the UID, and others may want to stay with the JNDI port, for example because someone uses the node names in the cluster view to determine the JNDI port of the other members. So my suggestion is to make this a configuration option: an option to always use the UID instead of the JNDI port, with the JNDI port remaining the default.
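The suggested option could look roughly like the following sketch. It is not the existing generateUniqueNodeName implementation: the class, the method signature and the alwaysUseUid flag are hypothetical, and java.util.UUID merely stands in for whatever UID generator the real method uses.

```java
import java.util.UUID;

// Hypothetical sketch of a configurable node name; not JBoss code.
class NodeNameSketch {

    // jndiPort is null when the JNDI subsystem is not available.
    // alwaysUseUid is the proposed (hypothetical) configuration option;
    // false keeps the current port-based behaviour as the default.
    static String nodeName(String serverName, Integer jndiPort, boolean alwaysUseUid) {
        if (!alwaysUseUid && jndiPort != null) {
            // Readable, but identical before and after a restart of the node.
            return serverName + ":" + jndiPort;
        }
        // Different for every call, so a restarted node gets a new key.
        return serverName + ":" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(nodeName("node_b", 1199, false));
        System.out.println(nodeName("node_b", 1199, true));
    }
}
```

With the option switched off the name stays node_b:1199, so anything that parses the JNDI port out of the cluster view keeps working. With it switched on, each start of the node yields a distinct name.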

While making this change, however, I found a deadlock condition in DRMI. I will file a separate bug report for it; it seems to affect only version 4.0.2.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira