[jboss-user] [JBoss Messaging] - Failover on clustered queues causes thread leak
lerdek
do-not-reply at jboss.com
Wed Mar 26 05:46:33 EDT 2008
Hi,
We have a setup of 2 clustered JBoss servers with clustered messaging with ~150 clustered queues (including EQs and DLQs). If we 'kill -9' one of the nodes, the failover mechanism kicks in, but it creates ~200 new threads which are all in a WAITING state. These threads keep waiting forever, and on every failover this happens again (thus, every failover increases the active thread count of the surviving node by ~200).
Since we still run on a Linux 2.4 kernel, with a practical upper thread count limit of ~600, this is a big problem: if a single node fails once or twice, the 'surviving' node dies as a consequence, which defeats the purpose of having a failover mechanism.
We would be very thankful for suggestions on how to fix what appears to be a thread leak.
We have already observed the following:
| Active thread count before failover: ~150
| Active thread count after 1 failover: ~350
| Active thread count after 2 failovers: ~550
| The active thread count does not decrease over time (it seems the threads are blocked forever)
| A failover without clustered JBoss Messaging shows no apparent thread leak
| A failover with only 3 clustered queues also leaks threads, but far fewer (~5 threads per failover), so the leak appears to grow linearly with the number of clustered queues.
|
|
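To make the arithmetic behind these numbers explicit, here is a minimal sketch that projects the thread count after a given number of failovers and works out how many failovers fit under the thread limit. The class and method names are made up for illustration; the inputs (~150 baseline threads, ~200 leaked per failover, ~600 limit) are the approximate figures quoted above, and the linear growth is only what we appear to observe.

```java
/**
 * Illustrative projection of the observed thread growth.
 * All names here are hypothetical; the model is a simple linear one.
 */
final class FailoverThreadProjection {

    /** Projected thread count after the given number of failovers. */
    static int threadsAfter(int baseline, int leakedPerFailover, int failovers) {
        return baseline + leakedPerFailover * failovers;
    }

    /** Number of failovers that still fit under the thread limit. */
    static int failoversSurvived(int baseline, int leakedPerFailover, int threadLimit) {
        if (leakedPerFailover <= 0) {
            throw new IllegalArgumentException("leakedPerFailover must be positive");
        }
        if (threadLimit < baseline) {
            return 0;
        }
        // Integer division rounds down: a partial failover does not count.
        return (threadLimit - baseline) / leakedPerFailover;
    }

    public static void main(String[] args) {
        int baseline = 150;          // approximate thread count before any failover
        int leakedPerFailover = 200; // approximate threads leaked per failover
        int threadLimit = 600;       // practical limit on our 2.4 kernel

        for (int n = 0; n <= 3; n++) {
            System.out.println(n + " failover(s): ~"
                    + threadsAfter(baseline, leakedPerFailover, n) + " threads");
        }
        System.out.println("Failovers survived under the limit: "
                + failoversSurvived(baseline, leakedPerFailover, threadLimit));
    }
}
```

With these inputs the surviving node stays under the limit for two failovers (~550 threads) and would exceed it on the third (~750 threads), which matches what we see.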
| At the moment it is not an option to upgrade our kernel to 2.6, which would in fact only reduce the frequency of the symptoms, but not fix the problem. Also, we cannot use a multicast protocol stack because it is not supported on our production environment.
|
| Here's our configuration for both clustered messaging nodes, which are dedicated messaging servers (thus, no other custom applications have been deployed):
|
|
| | JBoss AS 4.2.2.GA
| | JBoss Messaging 1.4.0sp3
| | JGroups 2.4.1 SP-4
| |
|
| After a failover, a lot of increasingly numbered threads are lying around that look like this (only the thread-name-number differs, e.g. after 1 failover it goes to Thread-274, after 2 failovers to Thread-473):
|
|
| | Thread: Thread-50 : priority:5, demon:true, threadId:268, threadState:WAITING, lockName:java.lang.Object@62685f
| |
| | java.lang.Object.wait(Native Method)
| | java.lang.Object.wait(Object.java:474)
| | EDU.oswego.cs.dl.util.concurrent.LinkedQueue.take(LinkedQueue.java:122)
| | EDU.oswego.cs.dl.util.concurrent.QueuedExecutor$RunLoop.run(QueuedExecutor.java:83)
| | java.lang.Thread.run(Thread.java:595)
| |
| |
|
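For anyone who wants to count such threads instead of reading the dump by eye, the following is a rough sketch using the standard java.lang.management API (available since Java 5). It counts threads in the WAITING state whose stack trace contains a given class-name fragment, such as "QueuedExecutor" from the trace above. The class and method names are hypothetical, and the code only inspects the JVM it runs in, so it would have to run inside the server JVM (or be adapted to a remote JMX connection) to be of use here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/**
 * Illustrative helper (hypothetical names): counts WAITING threads in the
 * current JVM whose stack trace mentions a given class-name fragment.
 */
final class WaitingThreadCounter {

    static int countWaitingIn(String classNameFragment) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Request full stack traces for all live threads.
        ThreadInfo[] infos = mx.getThreadInfo(mx.getAllThreadIds(), Integer.MAX_VALUE);
        int count = 0;
        for (ThreadInfo info : infos) {
            // An entry is null if the thread died in the meantime.
            if (info == null || info.getThreadState() != Thread.State.WAITING) {
                continue;
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().contains(classNameFragment)) {
                    count++;
                    break; // count each thread once
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Default fragment taken from the stack trace in this post.
        String fragment = args.length > 0 ? args[0] : "QueuedExecutor";
        System.out.println("WAITING threads with '" + fragment + "' on the stack: "
                + countWaitingIn(fragment));
    }
}
```

Comparing this count before and after a failover would show directly how many of the new threads are idle QueuedExecutor run loops.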
| Here's the output of the JMX bean jboss.system/ServerInfo of the surviving node after 2 consecutive failovers:
|
|
| | Name                    Type               Access  Value                       Description
| | HostAddress             java.lang.String   R       192.168.86.9                MBean Attribute.
| | AvailableProcessors     java.lang.Integer  R       1                           MBean Attribute.
| | OSArch                  java.lang.String   R       i386                        MBean Attribute.
| | OSVersion               java.lang.String   R       2.4.27-vmware-k7-nosmp      MBean Attribute.
| | HostName                java.lang.String   R       xxx                         MBean Attribute.
| | JavaVendor              java.lang.String   R       Sun Microsystems Inc.       MBean Attribute.
| | JavaVMName              java.lang.String   R       Java HotSpot(TM) Server VM  MBean Attribute.
| | FreeMemory              java.lang.Long     R       87777680                    MBean Attribute.
| | ActiveThreadGroupCount  java.lang.Integer  R       10                          MBean Attribute.
| | TotalMemory             java.lang.Long     R       166629376                   MBean Attribute.
| | JavaVMVersion           java.lang.String   R       1.5.0_07-b03                MBean Attribute.
| | ActiveThreadCount       java.lang.Integer  R       548                         MBean Attribute.
| | JavaVMVendor            java.lang.String   R       Sun Microsystems Inc.       MBean Attribute.
| | OSName                  java.lang.String   R       Linux                       MBean Attribute.
| | JavaVersion             java.lang.String   R       1.5.0_07                    MBean Attribute.
| | MaxMemory               java.lang.Long     R       265486336                   MBean Attribute.
| |
|
| Our JGroups protocol stack is the following (used for both the JBoss cluster and the JBoss Messaging post office data and control channels):
|
|
| | STATE_TRANSFER
| | use_flush=false
| | up_thread=false
| | down_thread=false
| |
| | GMS
| | shun=true
| | print_local_addr=true
| | up_thread=false
| | view_bundling=true
| | join_timeout=3000
| | join_retry_timeout=2000
| | down_thread=false
| |
| | STABLE
| | max_bytes=400000
| | up_thread=false
| | stability_delay=1000
| | desired_avg_gossip=50000
| | down_thread=false
| |
| | NAKACK
| | max_xmit_size=60000
| | up_thread=false
| | retransmit_timeout=300,600,1200,2400,4800
| | use_mcast_xmit=false
| | discard_delivered_msgs=true
| | down_thread=false
| | gc_lag=0
| |
| | VERIFY_SUSPECT
| | up_thread=false
| | timeout=1500
| | down_thread=false
| |
| | FD
| | max_tries=5
| | shun=true
| | up_thread=false
| | timeout=10000
| | down_thread=false
| |
| | FD_SOCK
| | up_thread=false
| | down_thread=false
| |
| | MERGE2
| | max_interval=10000
| | up_thread=false
| | down_thread=false
| | min_interval=2000
| |
| | TCPPING
| | port_range=3
| | num_initial_members=13
| | up_thread=false
| | initial_hosts=j2msgtest1[8800],j2msgtest2[8800]
| | timeout=3000
| | down_thread=false
| |
| | TCP
| | discard_incompatible_packets=true
| | sock_conn_timeout=300
| | enable_bundling=false
| | bind_addr=192.168.86.201
| | max_bundle_size=64000
| | use_outgoing_packet_handler=false
| | use_send_queues=false
| | down_thread=false
| | start_port=8800
| | recv_buf_size=20000000
| | skip_suspected_members=true
| | send_buf_size=640000
| | use_incoming_packet_handler=true
| | loopback=true
| | up_thread=false
| | tcp_nodelay=true
| | max_bundle_timeout=30
| |
|
| Thank you in advance!
|
| With regards,
|
| Thijs Reus
| Click&Buy Services AG
View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4138904#4138904