[jboss-user] [JBoss Messaging] - Failover on clustered queues causes thread leak

Wed Mar 26 05:46:33 EDT 2008

Hi,

We have a setup of 2 clustered JBoss servers with clustered messaging with ~150 clustered queues (including EQs and DLQs). If we 'kill -9' one of the nodes, the failover mechanism kicks in, but it creates ~200 new threads which are all in a WAITING state. These threads keep waiting forever, and on every failover this happens again (thus, every failover increases the active thread count of the surviving node by ~200).

Since we still run on a Linux 2.4 kernel, with a practical upper thread count limit of ~600, this is a big problem: if a single node fails once or twice, the 'surviving' node dies as a consequence, which defeats the purpose of the failover mechanism.

We would be very thankful for suggestions on how to fix what appears to be a thread leak.

So far we have observed the following:

  | Active thread count before failover: ~150
  | Active thread count after 1 failover: ~350
  | Active thread count after 2 failovers: ~550
  | The active thread count does not decrease over time (it seems the threads are blocked forever)
  | A failover scenario without clustered JBoss Messaging shows no apparent thread leak
  | A failover scenario with only 3 clustered queues still shows the thread leak, but at a much lower rate (~5 threads per failover), so the leak appears to grow linearly with the number of clustered queues.
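
The observations above suggest a roughly linear model: about 150 threads at baseline plus about 200 per failover, against a practical limit of about 600. A minimal sketch of that arithmetic follows. The class and method names are hypothetical, and the figures are the approximate ones from the list above:

```java
public class FailoverBudget {

    /** Estimated thread count after the given number of failovers, assuming a linear leak. */
    static int threadsAfter(int baseline, int leakPerFailover, int failovers) {
        return baseline + leakPerFailover * failovers;
    }

    /** Number of failovers that still fit under the thread limit. */
    static int failoversWithinLimit(int baseline, int leakPerFailover, int threadLimit) {
        if (leakPerFailover <= 0) {
            throw new IllegalArgumentException("leakPerFailover must be positive");
        }
        if (baseline > threadLimit) {
            return 0;
        }
        // Integer division rounds down: a partially fitting failover does not count.
        return (threadLimit - baseline) / leakPerFailover;
    }

    public static void main(String[] args) {
        // Approximate figures taken from the observations in this post.
        int baseline = 150;
        int leakPerFailover = 200;
        int threadLimit = 600;

        for (int n = 0; n <= 3; n++) {
            System.out.println(n + " failover(s): ~"
                    + threadsAfter(baseline, leakPerFailover, n) + " threads");
        }
        System.out.println("Failovers within limit: "
                + failoversWithinLimit(baseline, leakPerFailover, threadLimit));
    }
}
```

With these figures the estimate is 550 threads after two failovers and 750 after a third, so a third failover would exceed the limit.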
  | 
  | 
  | Upgrading our kernel to 2.6 is not an option at the moment, and it would only reduce the frequency of the symptoms rather than fix the problem. Also, we cannot use a multicast protocol stack because it is not supported in our production environment.
  | 
  | Here's our configuration for both clustered messaging nodes, which are dedicated messaging servers (thus, no other custom applications have been deployed):
  | 
  | 
  |   | JBoss AS 4.2.2.GA
  |   | JBoss Messaging 1.4.0sp3
  |   | JGroups 2.4.1 SP-4 
  |   | 
  | 
  | After a failover, many threads with increasing numbers remain that look like this (only the number in the thread name differs, e.g. after 1 failover it goes up to Thread-274, after 2 failovers up to Thread-473):
  | 
  | 
  |   |                 Thread: Thread-50 : priority:5, demon:true, threadId:268, threadState:WAITING, lockName:java.lang.Object@62685f
  |   | 
  |   |                     java.lang.Object.wait(Native Method)
  |   |                     java.lang.Object.wait(Object.java:474)
  |   |                     EDU.oswego.cs.dl.util.concurrent.LinkedQueue.take(LinkedQueue.java:122)
  |   |                     EDU.oswego.cs.dl.util.concurrent.QueuedExecutor$RunLoop.run(QueuedExecutor.java:83)
  |   |                     java.lang.Thread.run(Thread.java:595)
  |   | 
  |   | 
  | 
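
To track how many of these threads exist without reading a full dump, something like the following could be run inside the server JVM (for example from a small servlet or MBean, which is not shown here). It is only a sketch: the class and method names are hypothetical, it relies on Thread.getAllStackTraces() (available since Java 5), and it matches on the QueuedExecutor$RunLoop frame seen in the dump above.

```java
import java.util.Map;

public class LeakedThreadCounter {

    /** Frame class taken from the thread dump in this post. */
    static final String RUN_LOOP_CLASS =
            "EDU.oswego.cs.dl.util.concurrent.QueuedExecutor$RunLoop";

    /** Returns true if any frame of the stack belongs to the given class. */
    static boolean hasFrame(StackTraceElement[] stack, String className) {
        for (StackTraceElement frame : stack) {
            if (className.equals(frame.getClassName())) {
                return true;
            }
        }
        return false;
    }

    /** Counts the threads in the dump whose stack contains a frame of the given class. */
    static int countMatching(Map<Thread, StackTraceElement[]> dump, String className) {
        int count = 0;
        for (StackTraceElement[] stack : dump.values()) {
            if (hasFrame(stack, className)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Only meaningful when executed in the JVM that hosts JBoss Messaging.
        int count = countMatching(Thread.getAllStackTraces(), RUN_LOOP_CLASS);
        System.out.println("Threads in QueuedExecutor run loop: " + count);
    }
}
```

Comparing this count before and after a failover would show whether the extra ~200 threads are all QueuedExecutor run loops or whether other thread types are involved as well.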
  | Here's the output of the JMX bean jboss.system/ServerInfo on the surviving node after 2 consecutive failovers:
  | 
  | 
  |   | HostAddress  		java.lang.String  	R 	192.168.86.9  	MBean Attribute.
  |   | AvailableProcessors 	java.lang.Integer 	R 	1 		MBean Attribute.
  |   | OSArch 			java.lang.String 	R 	i386 		MBean Attribute.
  |   | OSVersion 		java.lang.String 	R 	2.4.27-vmware-k7-nosmp 	MBean Attribute.
  |   | HostName 		java.lang.String 	R 	xxx 		MBean Attribute.
  |   | JavaVendor 		java.lang.String 	R 	Sun Microsystems Inc. 	MBean Attribute.
  |   | JavaVMName 		java.lang.String 	R 	Java HotSpot(TM) Server VM 	MBean Attribute.
  |   | FreeMemory 		java.lang.Long 		R 	87777680 	MBean Attribute.
  |   | ActiveThreadGroupCount 	java.lang.Integer 	R 	10 		MBean Attribute.
  |   | TotalMemory 		java.lang.Long 		R 	166629376 	MBean Attribute.
  |   | JavaVMVersion 		java.lang.String 	R 	1.5.0_07-b03 	MBean Attribute.
  |   | ActiveThreadCount 	java.lang.Integer 	R 	548 		MBean Attribute.
  |   | JavaVMVendor 		java.lang.String 	R 	Sun Microsystems Inc. 	MBean Attribute.
  |   | OSName 			java.lang.String 	R 	Linux 		MBean Attribute.
  |   | JavaVersion 		java.lang.String 	R 	1.5.0_07 	MBean Attribute.
  |   | MaxMemory 		java.lang.Long 		R 	265486336 	MBean Attribute.
  |   | 
  | 
  | Our JGroups protocol stack is the following (used for both the JBoss cluster and the JBoss Messaging cluster post office data and control channels):
  | 
  | 
  |   | STATE_TRANSFER
  |   | use_flush=false
  |   | up_thread=false
  |   | down_thread=false
  |   | 
  |   | GMS
  |   | shun=true
  |   | print_local_addr=true
  |   | up_thread=false
  |   | view_bundling=true
  |   | join_timeout=3000
  |   | join_retry_timeout=2000
  |   | down_thread=false
  |   | 
  |   | STABLE
  |   | max_bytes=400000
  |   | up_thread=false
  |   | stability_delay=1000
  |   | desired_avg_gossip=50000
  |   | down_thread=false
  |   | 
  |   | NAKACK
  |   | max_xmit_size=60000
  |   | up_thread=false
  |   | retransmit_timeout=300,600,1200,2400,4800
  |   | use_mcast_xmit=false
  |   | discard_delivered_msgs=true
  |   | down_thread=false
  |   | gc_lag=0
  |   | 
  |   | VERIFY_SUSPECT
  |   | up_thread=false
  |   | timeout=1500
  |   | down_thread=false
  |   | 
  |   | FD
  |   | max_tries=5
  |   | shun=true
  |   | up_thread=false
  |   | timeout=10000
  |   | down_thread=false
  |   | 
  |   | FD_SOCK
  |   | up_thread=false
  |   | down_thread=false
  |   | 
  |   | MERGE2
  |   | max_interval=10000
  |   | up_thread=false
  |   | down_thread=false
  |   | min_interval=2000
  |   | 
  |   | TCPPING
  |   | port_range=3
  |   | num_initial_members=13
  |   | up_thread=false
  |   | initial_hosts=j2msgtest1[8800],j2msgtest2[8800]
  |   | timeout=3000
  |   | down_thread=false
  |   | 
  |   | TCP
  |   | discard_incompatible_packets=true
  |   | sock_conn_timeout=300
  |   | enable_bundling=false
  |   | bind_addr=192.168.86.201
  |   | max_bundle_size=64000
  |   | use_outgoing_packet_handler=false
  |   | use_send_queues=false
  |   | down_thread=false
  |   | start_port=8800
  |   | recv_buf_size=20000000
  |   | skip_suspected_members=true
  |   | send_buf_size=640000
  |   | use_incoming_packet_handler=true
  |   | loopback=true
  |   | up_thread=false
  |   | tcp_nodelay=true
  |   | max_bundle_timeout=30
  |   | 
  | 
  | Thank you in advance!
  | 
  | With regards,
  | 
  | Thijs Reus
  | Click&Buy Services AG

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4138904#4138904
