[ http://jira.jboss.com/jira/browse/JBAS-4313?page=all ]
Brian Stansberry resolved JBAS-4313.
------------------------------------
Resolution: Done
Summary of fixes for this (AS 5 only):
1) The mechanism by which ClusterPartition detects duplicate instances of the same
logical node in the view has been cleaned up. It is now based on a comparison of
ClusterNodeImpl.id, which comes either from the "additional_data" passed to the JGroups
channel before connecting, or from the InetAddress/port on which JGroups is listening.
If the "additional_data" approach is used, the id still has the traditional form:
${jboss.bind.address}:NamingServicePort
e.g.
192.168.0.10:1099
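For illustration, the duplicate check in point 1 can be sketched in plain Java. This is
not the actual ClusterPartition code; the class and method names here are hypothetical,
but the idea is the same: two view members carrying the same logical id indicate a
stale, restarted duplicate.

```java
import java.util.*;

// Hypothetical sketch of the duplicate detection in point (1): a view member
// whose logical id has already been seen (under a different physical JGroups
// address) is treated as a stale duplicate of a restarted node.
public class DuplicateNodeCheck {

    // Logical id in the traditional form: bindAddress + ":" + namingPort
    static String logicalId(String bindAddress, int namingPort) {
        return bindAddress + ":" + namingPort;
    }

    // Returns the logical ids that occur more than once in the view.
    // Each member is a pair: member[0] = logical id, member[1] = physical address.
    static Set<String> findDuplicates(List<String[]> view) {
        Set<String> seen = new HashSet<>();
        Set<String> dups = new LinkedHashSet<>();
        for (String[] member : view) {
            if (!seen.add(member[0])) {
                dups.add(member[0]);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        List<String[]> view = new ArrayList<>();
        // The dead incarnation of a node and its restarted replacement carry
        // the same logical id but different physical addresses (made-up values).
        view.add(new String[] { logicalId("192.168.0.10", 1099), "192.168.0.10:32771" });
        view.add(new String[] { logicalId("192.168.0.11", 1099), "192.168.0.11:32772" });
        view.add(new String[] { logicalId("192.168.0.11", 1099), "192.168.0.11:32904" });
        System.out.println(findDuplicates(view)); // prints [192.168.0.11:1099]
    }
}
```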
2) Whether the "additional_data" is set is controlled by a boolean
"assignLogicalAddresses" property on the JChannelFactory bean
(deploy/cluster/jgroups-channelfactory.sar/META-INF/multiplexer-beans.xml). The default
is true. One use case for turning this off is starting JBoss with -b 0.0.0.0, in which
case the additional_data would be an identical 0.0.0.0:1099 on every node. Another way
to solve that problem is to assign a unique value to JChannelFactory.nodeName on each
node. Setting JChannelFactory.assignLogicalAddresses to false is not recommended: the
id then falls back to the InetAddress/port, and it is quite typical for a restarted
node to end up with JGroups listening on a different port, which would prevent
detection of the duplicate.
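For reference, the property described above would be set on the JChannelFactory bean in
multiplexer-beans.xml. The fragment below is only a sketch: the bean's class attribute,
the property placement, and the nodeName value are assumptions, not copied from the
AS 5 distribution.

```xml
<!-- deploy/cluster/jgroups-channelfactory.sar/META-INF/multiplexer-beans.xml -->
<!-- Sketch only: class attribute and property placement are assumptions. -->
<bean name="JChannelFactory" class="org.jboss.ha.framework.server.JChannelFactory">
   <!-- Set to false to skip writing the logical address into additional_data,
        e.g. when every node binds to 0.0.0.0. Default: true. -->
   <property name="assignLogicalAddresses">true</property>
   <!-- Alternative when binding to 0.0.0.0: give each node a unique name
        (value shown is a made-up example). -->
   <property name="nodeName">node1</property>
</bean>
```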
Cluster loses its master
------------------------
Key: JBAS-4313
URL: http://jira.jboss.com/jira/browse/JBAS-4313
Project: JBoss Application Server
Issue Type: Bug
Security Level: Public(Everyone can see)
Components: Clustering
Affects Versions: JBossAS-4.0.2 Final, JBossAS-4.0.5.GA, JBossAS-4.0.4.GA,
JBossAS-4.2.0.CR1
Environment: Linux 2.6.x, Sun JVM 1.5.0_11-b03 and 1.6.0-b105, JBoss 4.0.2.final
(the code that causes the problem appears to be present up to version 4.2.0.CR1);
cluster with two or more nodes.
Reporter: Bernd Köcke
Assigned To: Brian Stansberry
Fix For: JBossAS-5.0.0.Beta4
Attachments: ClusterPartition.java.patch
In a cluster with two nodes (node_a and node_b) under high workload, it is possible
that a restarted node, which is not the master, rejoins the cluster before the master
has recognised that the old node has gone away. As a result, the restarted node appears
in the cluster view twice. The problem lies in DistributedReplicantManagerImpl (DRMI)
and ClusterPartition. Via DRMI's setCurrentState method on node_b, the replicants map
is populated, and this map contains the old, dead node (node_bo). The key for the new
node_b (node_bn) and the old one is the same: node_b:1199. When the master recognises
that node_bo is dead, it sends a new cluster view without the old node: node_a,
node_b(n). But because of the implementation of HAPartitionImpl.getDeadMembers and
DRMI.membershipChanged, node_bo is never removed from the replicants map of node_bn's
DRMI. If node_a is restarted after this, node_bn becomes the new master. JGroups knows
this, but the DRMI does not: because of the old node in the replicants map, it returns
false from isMasterReplica for every service. The restarted node_a is not the first
node in the cluster view, so there is no master at all. This situation persists until
node_bn is restarted.
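One plausible reading of the getDeadMembers failure described above can be sketched as
a name-based set difference. This is a simplified illustration, not the real
HAPartitionImpl/DRMI code: because the dead node_bo and the restarted node_bn share the
same name (node_b:1199), the difference between the old and new views is empty, so the
stale replicant is never pruned.

```java
import java.util.*;

// Simplified sketch (not the real HAPartitionImpl/DRMI code) of why the stale
// replicant survives: dead members are computed as a name-based set difference,
// and the dead node_bo and restarted node_bn carry the same JNDI-derived name.
public class DeadMemberSketch {

    // Dead members = names present in the old view but absent from the new one.
    static Set<String> getDeadMembers(Set<String> oldView, Set<String> newView) {
        Set<String> dead = new LinkedHashSet<>(oldView);
        dead.removeAll(newView);
        return dead;
    }

    public static void main(String[] args) {
        // node_bo (dead) and node_bn (restarted) both appear as "node_b:1199",
        // so by name the two views are identical and nothing is pruned.
        Set<String> oldView = Set.of("node_a:1199", "node_b:1199");
        Set<String> newView = Set.of("node_a:1199", "node_b:1199");
        System.out.println(getDeadMembers(oldView, newView)); // prints []
    }
}
```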
The cause of the problem is the method ClusterPartition.generateUniqueNodeName. If the
JNDI subsystem is running, the method generates the node name <server
name>:<JNDI port>, which is not unique across a node restart. To 'solve'
the problem I commented out the return statement for the JNDI-based node name. The
method then generates a name of the form <server name>:<generated UID>.
This is set as additional data in JGroups and in turn used as the key in the various
maps in DRMI. With this fix node_bo is removed from the replicants map. But the cluster
view looks a little ugly because of the UID, and others may want to keep the JNDI port.
My suggestion would therefore be to make this a configuration option: always use the
UID instead of the JNDI port when enabled, with the JNDI port as the default, since
someone may be using the node names in the cluster view to determine the JNDI port of
the other members.
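The suggested configuration option might look roughly like the sketch below. The method
signature and the boolean flag are hypothetical, and java.util.UUID stands in for
whatever UID generator generateUniqueNodeName actually uses.

```java
import java.util.UUID;

// Sketch of the reporter's suggestion: keep the JNDI-port-based name as the
// readable default, but allow opting in to a name that is unique across
// restarts. Names and the "useUid" flag are hypothetical illustrations.
public class NodeNameSketch {

    static String generateUniqueNodeName(String serverName, int jndiPort,
                                         boolean useUid) {
        if (useUid) {
            // Unique across restarts, but ugly in the cluster view.
            return serverName + ":" + UUID.randomUUID();
        }
        // Traditional form; NOT unique across a restart of the same node.
        return serverName + ":" + jndiPort;
    }

    public static void main(String[] args) {
        System.out.println(generateUniqueNodeName("node_b", 1099, false)); // prints node_b:1099
        System.out.println(generateUniqueNodeName("node_b", 1099, true));
    }
}
```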
While making this change, I found a deadlock condition in DRMI. I will file a separate
bug report for it; it seems to affect only version 4.0.2.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira