[jboss-user] [JBoss Messaging] - Re: Failover on clustered queues causes thread leak

Wed Mar 26 06:43:59 EDT 2008

Thank you for your quick reply!

We have a cluster of 2 nodes, and have killed/started 1 node a couple of times. We did the following scenario:

1. Start node 1
2. Start node 2 after node 1 has started completely
3. Once node 2 has started completely, 'kill -9' node 2 (first failover)
4. Start node 2
5. Once node 2 has started completely, 'kill -9' node 2 (second failover)
6. etc.

At the moment we have no JMS connections or sessions at either node, because we're testing the failover scenario 'pure' without any additional load.
Please note that the JMS functionality as such works very fine from the client perspective in the clustered setup (sending, receiving, reconnecting once a node a gone, etc.). Our problem at the moment only concerns the messaging-node failover with the mentioned thread-increase.

After killing node2 several times (depending on the amount of available OS memory) we get the following exception (stack trace varies, but all are OOME-native-thread):

  | 2008-03-26 10:23:03,967 ERROR [org.jgroups] uncaught exception in Thread[ServerSocket acceptor thread (channel=DefaultPartition-EntityCache),5,JGroups threa
  | ds] (thread group=org.jgroups.util.Util$1[name=JGroups threads,maxpri=10] )
  | java.lang.OutOfMemoryError: unable to create new native thread
  |         at java.lang.Thread.start0(Native Method)
  |         at java.lang.Thread.start(Thread.java:574)
  |         at org.jgroups.protocols.FD_SOCK$ServerSocketHandler.run(FD_SOCK.java:1089)
  |         at java.lang.Thread.run(Thread.java:595)
  | 

Here's an excerpt from our log from node 1 from the moment where it has started, until when the failover has finished:

  | 2008-03-26 10:01:46,914 INFO  [org.jboss.system.server.Server] JBoss (MX MicroKernel) [4.2.2.GA (build: SVNTag=JBoss_4_2_2_GA date=200710221139)] Started in 59s:380ms
  | 2008-03-26 10:02:45,989 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45958|1] [192.168.86.201:45958, 192.168.86.202:45969]
  | 2008-03-26 10:02:46,041 INFO  [org.jboss.cache.statetransfer.StateTransferGenerator_140] returning the state for tree rooted in /(1024 bytes)
  | 2008-03-26 10:02:53,732 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] New cluster view for partition devel (id: 1, delta: 1) : [192.168.86.201:1099, 192.168.86.202:1099]
  | 2008-03-26 10:02:53,734 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] I am (192.168.86.201:1099) received membershipChanged event:
  | 2008-03-26 10:02:53,734 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] Dead members: 0 ([])
  | 2008-03-26 10:02:53,734 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] New Members : 1 ([192.168.86.202:1099])
  | 2008-03-26 10:02:53,734 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] All Members : 2 ([192.168.86.201:1099, 192.168.86.202:1099])
  | 2008-03-26 10:02:56,742 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45964|1] [192.168.86.201:45964, 192.168.86.202:45972]
  | 2008-03-26 10:02:59,322 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45966|1] [192.168.86.201:45966, 192.168.86.202:45974]
  | 2008-03-26 10:03:35,464 WARN  [org.jgroups.blocks.ConnectionTable] peer closed connection, trying to re-send msg
  | 2008-03-26 10:03:35,464 ERROR [org.jgroups.blocks.ConnectionTable] 2nd attempt to send data failed too
  | 2008-03-26 10:03:35,465 WARN  [org.jgroups.blocks.ConnectionTable] peer closed connection, trying to re-send msg
  | 2008-03-26 10:03:35,465 ERROR [org.jgroups.blocks.ConnectionTable] 2nd attempt to send data failed too
  | 2008-03-26 10:03:35,466 WARN  [org.jgroups.blocks.ConnectionTable] peer closed connection, trying to re-send msg
  | 2008-03-26 10:03:35,466 ERROR [org.jgroups.blocks.ConnectionTable] 2nd attempt to send data failed too
  | 2008-03-26 10:03:35,633 ERROR [org.jboss.remoting.transport.socket.SocketClientInvoker] Got marshalling exception, exiting
  | java.net.SocketException: end of file
  |         at org.jboss.remoting.transport.socket.MicroSocketClientInvoker.transport(MicroSocketClientInvoker.java:604)
  |         at org.jboss.remoting.transport.bisocket.BisocketClientInvoker.transport(BisocketClientInvoker.java:418)
  |         at org.jboss.remoting.MicroRemoteClientInvoker.invoke(MicroRemoteClientInvoker.java:122)
  |         at org.jboss.remoting.ConnectionValidator.doCheckConnection(ConnectionValidator.java:133)
  |         at org.jboss.remoting.ConnectionValidator.run(ConnectionValidator.java:308)
  |         at java.util.TimerThread.mainLoop(Timer.java:512)
  |         at java.util.TimerThread.run(Timer.java:462)
  | 2008-03-26 10:03:36,972 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] Suspected member: 192.168.86.202:8800
  | 2008-03-26 10:03:37,031 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] New cluster view for partition devel (id: 2, delta: -1) : [192.168.86.201:1099]
  | 2008-03-26 10:03:37,032 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] I am (192.168.86.201:1099) received membershipChanged event:
  | 2008-03-26 10:03:37,032 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] Dead members: 1 ([192.168.86.202:1099])
  | 2008-03-26 10:03:37,032 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] New Members : 0 ([])
  | 2008-03-26 10:03:37,032 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] All Members : 1 ([192.168.86.201:1099])
  | 2008-03-26 10:03:37,039 INFO  [org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] JBoss Messaging is failing over for failed node 2. If there are many messages to reload this may take some time...
  | 2008-03-26 10:03:37,043 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45958|2] [192.168.86.201:45958]
  | 2008-03-26 10:03:37,047 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45964|2] [192.168.86.201:45964]
  | 2008-03-26 10:03:37,048 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45966|2] [192.168.86.201:45966]
  | 2008-03-26 10:03:37,511 WARN  [org.jboss.remoting.LeasePinger] LeasePinger[SocketClientInvoker[19985cb, bisocket://192.168.86.202:4457](5c4o2e9-na0je7-fe9q826g-1-fe9qb3hb-j)] failed to ping to server: Can not get connection to server. Problem establishing socket connection for InvokerLocator [bisocket://192.168.86
  | .202:4457/?JBM_clientMaxPoolSize=200&clientLeasePeriod=10000&clientSocketClass=org.jboss.jms.client.remoting.ClientSocketWrapper&dataType=jms&
  | marshaller=org.jboss.jms.wireformat.JMSWireFormat&numberOfCallRetries=1&numberOfRetries=10&pingFrequency=214748364&pingWindowFactor=10&socket.check_connection=false&timeout=
  | 0&unmarshaller=org.jboss.jms.wireformat.JMSWireFormat]
  | 2008-03-26 10:03:37,985 WARN  [org.jboss.remoting.Client] unable to remove remote callback handler: Can not get connection to server. Problem establishing socket connection for InvokerLocator [bisocket://192.168.86.202:4457/?JBM_clientMaxPoolSize=200&clientLeasePeriod=10000&clientSocketClass=org.jboss.jms.client.r
  | emoting.ClientSocketWrapper&dataType=jms&marshaller=org.jboss.jms.wireformat.JMSWireFormat&numberOfCallRetries=1&numberOfRetries=10&pingFrequency=214748364&pingWindowFactor=10&
  | socket.check_connection=false&timeout=0&unmarshaller=org.jboss.jms.wireformat.JMSWireFormat]
  | 2008-03-26 10:03:37,985 WARN  [org.jboss.remoting.LeasePinger] LeasePinger[SocketClientInvoker[19985cb, bisocket://192.168.86.202:4457](5c4o2e9-na0je7-fe9q826g-1-fe9qb3hb-j)] failed sending disconnect for client lease for client with session ID 5c4o2e9-na0je7-fe9q826g-1-fe9qb3ha-h
  | 2008-03-26 10:03:37,986 ERROR [org.jboss.remoting.MicroRemoteClientInvoker] error shutting down lease pinger
  | 2008-03-26 10:03:37,990 INFO  [org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] JBoss Messaging failover completed
  | 2008-03-26 10:03:57,382 WARN  [org.jboss.jms.server.connectionmanager.SimpleConnectionManager] ConnectionManager[14790be] cannot look up remoting session ID 5c4o2e9-lf5156-fe9q9x46-1-fe9qb3cg-e
  | 2008-03-26 10:03:57,382 WARN  [org.jboss.jms.server.connectionmanager.SimpleConnectionManager] A problem has been detected with the connection to remote client 5c4o2e9-lf5156-fe9q9x46-1-fe9qb3cg-e, jmsClientID=null. It is possible the client has exited without closing its connection(s) or the network has failed. All connection resources corresponding to that client process will now be removed.
  | 

After we start node 2 again, the log on node 1 looks like this:

  | 2008-03-26 10:04:24,333 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45958|3] [192.168.86.201:45958, 192.168.86.202:45976]
  | 2008-03-26 10:04:24,370 INFO  [org.jboss.cache.statetransfer.StateTransferGenerator_140] returning the state for tree rooted in /(1024 bytes)
  | 2008-03-26 10:04:31,502 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] New cluster view for partition devel (id: 3, delta: 1) : [192.168.86.201:1099, 192.168.86.202:1099]
  | 2008-03-26 10:04:31,502 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] I am (192.168.86.201:1099) received membershipChanged event:
  | 2008-03-26 10:04:31,502 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] Dead members: 0 ([])
  | 2008-03-26 10:04:31,502 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] New Members : 1 ([192.168.86.202:1099])
  | 2008-03-26 10:04:31,502 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] All Members : 2 ([192.168.86.201:1099, 192.168.86.202:1099])
  | 2008-03-26 10:04:34,472 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45964|3] [192.168.86.201:45964, 192.168.86.202:45979]
  | 2008-03-26 10:04:36,692 INFO  [org.jboss.cache.TreeCache] viewAccepted(): [192.168.86.201:45966|3] [192.168.86.201:45966, 192.168.86.202:45981]
  | 

In the meantime, we've configured all our EQs and DLQs as non-clustered, reducing the thread-increase to ~75 new threads per failover instead of ~200. This allows us to failover a couple of extra times before the OOME pops up, but still doesn't solve the real problem...

We've also tried to decrease the initial thread stack size with the -Xss JVM parameter, but this doesn't appear to make a difference on our 2.4 kernel wrt the maximum number of threads the JVM can start/keep alive.

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4138923#4138923

Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=4138923