[jboss-jira] [JBoss JIRA] Created: (JBMESSAGING-1854) ConcurrentModificationException in org.jboss.messaging.core.impl.postoffice.MessagingPostOffice in Clustered JMS deployment

Thu Mar 24 14:59:45 EDT 2011

ConcurrentModificationException in org.jboss.messaging.core.impl.postoffice.MessagingPostOffice in Clustered JMS deployment
---------------------------------------------------------------------------------------------------------------------------

                 Key: JBMESSAGING-1854
                 URL: https://issues.jboss.org/browse/JBMESSAGING-1854
             Project: JBoss Messaging
          Issue Type: Bug
          Components: JMS Clustering
    Affects Versions: 1.4.7.GA
         Environment: Dell PowerEdge M1000e Chassis with 16 PowerEdge M610 blades. Each blade has 2 Intel 2.40 GHz CPUs and 32GB of memory
Windows Server 2008 64-bit.
JRE 1.6.0_22
JBM 1.4.7 deployed in JBoss AS 5.1.0.GA
oracle-persistence-service.xml, on 3-blade Oracle RAC 11.2.0.2
            Reporter: Ryan Hochstetler

We recently changed how we start JBoss, and it has uncovered a concurrency problem in JBoss Messaging's clustering.
Previously, we started each of the 32 JBoss instances serially, which, as you can imagine, takes forever.  Recently, one of our integration engineers got RHQ working, so we created two startup groups.  The first contains just the first server, which boots fully and becomes HASingleton and JGroups coordinator on all channels.  The second contains the other 31 nodes, which now start mostly in parallel.

And that's when the ConcurrentModificationExceptions began.

[22 Mar 2011 21:01:38,982] [ERROR] [org.jboss.messaging.core.impl.postoffice.GroupMember] - Caught Exception in RequestHandler
java.util.ConcurrentModificationException
	at java.util.HashMap$HashIterator.nextEntry(Unknown Source)
	at java.util.HashMap$EntryIterator.next(Unknown Source)
	at java.util.HashMap$EntryIterator.next(Unknown Source)
	at org.jboss.messaging.core.impl.postoffice.MessagingPostOffice.findNodeIDForAddress(MessagingPostOffice.java:2289)
	at org.jboss.messaging.core.impl.postoffice.MessagingPostOffice.calculateFailoverMap(MessagingPostOffice.java:2225)
	at org.jboss.messaging.core.impl.postoffice.MessagingPostOffice.handleNodeJoined(MessagingPostOffice.java:1337)
	at org.jboss.messaging.core.impl.postoffice.JoinClusterRequest.execute(JoinClusterRequest.java:68)
	at org.jboss.messaging.core.impl.postoffice.GroupMember$ControlRequestHandler.handle(GroupMember.java:648)
	at org.jgroups.blocks.MessageDispatcher.handle(MessageDispatcher.java:616)
	at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:637)
	at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:545)
	at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:368)
	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:775)
	at org.jgroups.JChannel.up(JChannel.java:1336)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:454)
	at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:486)
	at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:153)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
	at org.jgroups.protocols.FC.up(FC.java:473)
	at org.jgroups.protocols.pbcast.GMS.up(GMS.java:820)
	at org.jgroups.protocols.VIEW_SYNC.up(VIEW_SYNC.java:192)
	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:233)
	at org.jgroups.protocols.UNICAST.up(UNICAST.java:328)
	at org.jgroups.protocols.pbcast.NAKACK.handleMessage(NAKACK.java:895)
	at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:708)
	at org.jgroups.protocols.BARRIER.up(BARRIER.java:136)
	at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:167)
	at org.jgroups.protocols.FD.up(FD.java:284)
	at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:307)
	at org.jgroups.protocols.MERGE2.up(MERGE2.java:144)
	at org.jgroups.protocols.Discovery.up(Discovery.java:264)
	at org.jgroups.protocols.PING.up(PING.java:273)
	at org.jgroups.protocols.TP$ProtocolAdapter.up(TP.java:2315)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1249)
	at org.jgroups.protocols.TP.access$100(TP.java:49)
	at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1826)
	at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1805)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
[22 Mar 2011 21:01:38,983] [ERROR] [org.jgroups.blocks.RequestCorrelator] - error invoking method
java.util.ConcurrentModificationException
	at java.util.HashMap$HashIterator.nextEntry(Unknown Source)
	at java.util.HashMap$EntryIterator.next(Unknown Source)
	at java.util.HashMap$EntryIterator.next(Unknown Source)
	at org.jboss.messaging.core.impl.postoffice.MessagingPostOffice.findNodeIDForAddress(MessagingPostOffice.java:2289)
	at org.jboss.messaging.core.impl.postoffice.MessagingPostOffice.calculateFailoverMap(MessagingPostOffice.java:2225)
	at org.jboss.messaging.core.impl.postoffice.MessagingPostOffice.handleNodeJoined(MessagingPostOffice.java:1337)
	at org.jboss.messaging.core.impl.postoffice.JoinClusterRequest.execute(JoinClusterRequest.java:68)
	at org.jboss.messaging.core.impl.postoffice.GroupMember$ControlRequestHandler.handle(GroupMember.java:648)
	at org.jgroups.blocks.MessageDispatcher.handle(MessageDispatcher.java:616)
	at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:637)
	at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:545)
	at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:368)
	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:775)
	at org.jgroups.JChannel.up(JChannel.java:1336)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:454)
	at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:486)
	at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:153)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
	at org.jgroups.protocols.FC.up(FC.java:473)
	at org.jgroups.protocols.pbcast.GMS.up(GMS.java:820)
	at org.jgroups.protocols.VIEW_SYNC.up(VIEW_SYNC.java:192)
	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:233)
	at org.jgroups.protocols.UNICAST.up(UNICAST.java:328)
	at org.jgroups.protocols.pbcast.NAKACK.handleMessage(NAKACK.java:895)
	at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:708)
	at org.jgroups.protocols.BARRIER.up(BARRIER.java:136)
	at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:167)
	at org.jgroups.protocols.FD.up(FD.java:284)
	at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:307)
	at org.jgroups.protocols.MERGE2.up(MERGE2.java:144)
	at org.jgroups.protocols.Discovery.up(Discovery.java:264)
	at org.jgroups.protocols.PING.up(PING.java:273)
	at org.jgroups.protocols.TP$ProtocolAdapter.up(TP.java:2315)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1249)
	at org.jgroups.protocols.TP.access$100(TP.java:49)
	at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1826)
	at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1805)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

The CME was logged at roughly the same time on several cluster nodes.  Two nodes joined the cluster roughly 9 seconds apart.  These two messages are from the same server:
[22 Mar 2011 21:01:27,142] [ INFO] [org.jboss.messaging.core.impl.postoffice.GroupMember] - New Members : 1 ([10.15.20.168:52397])
[22 Mar 2011 21:01:36,460] [ INFO] [org.jboss.messaging.core.impl.postoffice.GroupMember] - New Members : 1 ([10.15.20.166:58958])
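
For readers unfamiliar with the failure mode in the trace above: HashMap's iterators are fail-fast, so any structural change to the map between two next() calls raises the CME, whether the change comes from another thread or from the same one.  The sketch below is a minimal single-threaded illustration of that behaviour.  The class name, keys, and "addr-*" values are made up for the example and are not taken from JBM.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Illustration only: a plain HashMap standing in for a node-ID-to-address map.
class FailFastDemo {

    // Returns true if a put() during iteration caused the iterator to throw a CME.
    static boolean iterateWhilePutting() {
        Map<Integer, String> nodes = new HashMap<Integer, String>();
        nodes.put(Integer.valueOf(1), "addr-A");
        nodes.put(Integer.valueOf(2), "addr-B");
        try {
            for (Map.Entry<Integer, String> entry : nodes.entrySet()) {
                // Structural modification mid-iteration; in the clustered case
                // this put() would come from a second thread instead.
                nodes.put(Integer.valueOf(3), "addr-C");
            }
        } catch (ConcurrentModificationException cme) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("CME raised: " + iterateWhilePutting());
    }
}
```

With two threads the timing is not deterministic, which would be consistent with the problem only showing up once the nodes started in parallel.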

MessagingPostOffice seems to be a singleton, per my heap dump.  It appears that MPO.handleNodeJoined() executes a put() and then iterates over nodeIDAddressMap (by means of calculateFailoverMap()).  If handleNodeJoined() were invoked by two JGroups threads concurrently, I can see how the CME would result.  nodeIDAddressMap is not a thread-safe collection, and does not appear to be guarded by anything.  I'm going to try to make this class thread-safe myself, since I have no delusions that you're interested in fixing this bug for me.  I assume/see that most of your attention is on HornetQ, but perhaps someone else can benefit from me documenting the problem.  I'll upload what I'm permitted by my company to disclose when I find a solid solution.  I'm hoping it's as simple as synchronizing handleNodeJoined().
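
A minimal sketch of the "synchronize handleNodeJoined()" idea, for illustration only.  GuardedPostOffice is a hypothetical stand-in and not the real MessagingPostOffice: only the names handleNodeJoined, findNodeIDForAddress, and nodeIDAddressMap come from the report, while the signatures, return values, and the simulateParallelJoins helper are assumptions.  The point is that the put() and the iteration that follows run under one monitor, so a second joining thread cannot modify the map mid-scan.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the relevant part of MessagingPostOffice.
class GuardedPostOffice {
    private final Map<Integer, String> nodeIDAddressMap = new HashMap<Integer, String>();

    // put() and the subsequent scan happen under the same monitor.
    synchronized Integer handleNodeJoined(int nodeID, String address) {
        nodeIDAddressMap.put(Integer.valueOf(nodeID), address);
        return findNodeIDForAddress(address);
    }

    // Caller must hold this object's monitor.
    private Integer findNodeIDForAddress(String address) {
        for (Map.Entry<Integer, String> entry : nodeIDAddressMap.entrySet()) {
            if (entry.getValue().equals(address)) {
                return entry.getKey();
            }
        }
        return null;
    }

    synchronized int size() {
        return nodeIDAddressMap.size();
    }

    // Fires `nodes` joins from parallel threads; returns how many failed.
    static int simulateParallelJoins(final GuardedPostOffice po, int nodes)
            throws InterruptedException {
        final CountDownLatch start = new CountDownLatch(1);
        final AtomicInteger failures = new AtomicInteger();
        Thread[] threads = new Thread[nodes];
        for (int i = 0; i < nodes; i++) {
            final int id = i;
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        start.await(); // release all threads at once
                        Integer found = po.handleNodeJoined(id, "addr-" + id);
                        if (found == null || found.intValue() != id) {
                            failures.incrementAndGet();
                        }
                    } catch (Exception ex) {
                        failures.incrementAndGet(); // includes any CME
                    }
                }
            });
            threads[i].start();
        }
        start.countDown();
        for (Thread t : threads) {
            t.join();
        }
        return failures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        GuardedPostOffice po = new GuardedPostOffice();
        int failures = simulateParallelJoins(po, 31);
        System.out.println("failures=" + failures + " size=" + po.size());
    }
}
```

Synchronizing one method only helps if every other reader and writer of the map takes the same lock; any unguarded access elsewhere in the class would leave the race open.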
