[ https://jira.jboss.org/jira/browse/JBMESSAGING-1456?page=com.atlassian.ji... ]
Zach Kurey commented on JBMESSAGING-1456:
-----------------------------------------
Howard,
Sorry for the slow response; glad you got the client up and running. I just wanted to
respond to some of your assumptions and clarify how the test is being run.
"This is because Spring will hold a pool of consumers for efficiency. When all
messages are received, the number of consumers will drop down to 5 according to my
observation. "
The 'cacheLevel' property for the listener container in this test client is set to
0 (CACHE_NONE). There should be no pooling of connections, sessions, or consumers. This
leads Spring to continually open a connection, create a session, create a consumer, and
then close the consumer/session/connection on every poll for new messages. This bug
would probably not be observed as long as the underlying connection factory was a proper
connection pool, but in the default case there is no pool. This leads to the undesirable
case of connections being opened and closed quite rapidly.
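For reference, a minimal sketch of that kind of listener container setup (the connection
factory argument, queue name, listener body, and the max-consumer value are illustrative
assumptions, not the exact test client code; 0/CACHE_NONE and the minimum of 5 consumers
come from the client settings described here):

    import javax.jms.ConnectionFactory;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import org.springframework.jms.listener.DefaultMessageListenerContainer;

    public class NoCacheListenerSetup {

        // cf and queueName are assumed to come from JNDI lookups of the clustered
        // connection factory and queue/testDistributedQueue.
        public DefaultMessageListenerContainer createContainer(ConnectionFactory cf,
                                                               String queueName) {
            DefaultMessageListenerContainer container = new DefaultMessageListenerContainer();
            container.setConnectionFactory(cf);
            container.setDestinationName(queueName);
            // CACHE_NONE (0): no caching of connections, sessions, or consumers, so every
            // poll opens a connection/session/consumer and closes them again.
            container.setCacheLevel(DefaultMessageListenerContainer.CACHE_NONE);
            container.setSessionTransacted(true);
            container.setConcurrentConsumers(5);      // consumer.min.count in the client
            container.setMaxConcurrentConsumers(15);  // illustrative upper bound
            container.setMessageListener(new MessageListener() {
                public void onMessage(Message message) {
                    // process the message
                }
            });
            container.afterPropertiesSet();
            return container;
        }
    }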
"The mean time to failure for a stuck message may vary depending on the different
test client configures and on different hardwares(machines). That may cause
misunderstandings between us because your observation may be very different from
mine."
The way I've been running the test is to produce messages in bursts of 350 messages at
a time. Then we sleep for 30-40 seconds between bursts, letting the queue completely drain
before sending more messages. You can control this with the message.delay property in the
client. Also, I've been running with two nodes, so each node only gets messages
published to it every 60 seconds as the producer is round-robining across the two. This makes
it much easier to witness consumer counts on a particular instance dropping below 5
(5 is just another client setting: consumer.min.count). Once messages are published in
a burst to a particular node, the client's threads should all be busy working on that one
node, though they should still be creating and closing connections on the non-busy node
quite quickly. The connections on the non-busy node may be opened and closed quickly
enough not to be visible in the console, which can make the consumer count appear to be 0
at times. Regardless, eventually a node, usually the one that is also running the MySQL
instance and a consumer locally on the same server, will have its consumer count
reach a non-zero floor. More importantly, the delivering count will reach a non-zero floor,
which should never happen in this test scenario as each node should get drained entirely
with each burst.
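A rough sketch of the burst/sleep producer loop described above, assuming the default
JNDI bindings for the clustered factory and the test queue (the 35-second value stands in
for the 30-40 second message.delay setting; everything else here is illustrative, not the
actual client code):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.DeliveryMode;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.naming.InitialContext;

    public class BurstProducer {
        public static void main(String[] args) throws Exception {
            int burstSize = 350;            // messages per burst
            long messageDelayMs = 35000L;   // message.delay: 30-40s pause between bursts

            InitialContext ctx = new InitialContext();
            // Assumed JNDI names; the clustered factory should round-robin new
            // connections across the two nodes, so successive bursts land on
            // alternating nodes.
            ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ClusteredXAConnectionFactory");
            Queue queue = (Queue) ctx.lookup("queue/testDistributedQueue");

            while (true) {
                Connection connection = cf.createConnection();
                Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
                MessageProducer producer = session.createProducer(queue);
                producer.setDeliveryMode(DeliveryMode.PERSISTENT);
                for (int i = 0; i < burstSize; i++) {
                    producer.send(session.createTextMessage("burst message " + i));
                }
                session.commit();
                connection.close();
                // Let both nodes drain completely before the next burst.
                Thread.sleep(messageDelayMs);
            }
        }
    }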
Also, I'm not sure what prefetch size is set on the connection factory you're using, but
leaving it at the default, 150, leads to the issue being reproduced much more quickly.
With the prefetch size reduced to 1 the issue can still be reproduced, but much less
frequently.
"However, you cannot shut down the consumer process by kill and then observe --
messages perhaps are being received and not acknowledged. The consumer processes (or
threads) must be alive all the time when observing."
The issue is always initially observed with the client consumers still active. Once it is
clear that the message delivering count is never going to reach 0 again (wait any amount of
time; it won't go back to 0), we've then killed the clients and observed that the
delivering count is still non-zero. As far as whether or not killing the clients is a
valid test, I disagree. If I kill the clients I would always expect the state of the
broker to eventually determine that the clients no longer exist and that message delivery
was not successful, returning both the delivering and consumer counts to 0, but maybe I'm
misinterpreting your comment.
Messages stuck in being-delivered state in cluster
--------------------------------------------------
Key: JBMESSAGING-1456
URL:
https://jira.jboss.org/jira/browse/JBMESSAGING-1456
Project: JBoss Messaging
Issue Type: Bug
Affects Versions: 1.4.0.SP3_CP03
Reporter: Justin Bertram
Assignee: Howard Gao
Priority: Critical
Attachments: kill3_thread_dump.txt, thread_dump.txt
Messages become "stuck" in being-delivered state when clients use a clustered
XA connection factory in a cluster of at least 2 nodes.
JBoss setup:
-2 nodes of JBoss EAP 4.3 CP02
-commented out "ClusterPullConnectionFactory" in messaging-service.xml to
prevent message redistribution and eliminate the "message suckers" as the
potential culprit
-MySQL backend using the default mysql-persistence-service.xml (from
<JBOSS_HOME>/docs/examples/jms)
Client setup:
-both nodes have a client which is a separate process (i.e. not inside JBoss)
-clients are Spring based
-one client produces and consumes, the other client just consumes
-both clients use the ClusteredXAConnectionFactory from the default
connection-factories-service.xml
-both clients publish to and consume from "queue/testDistributedQueue"
-clients are configured to send persistent messages, use AUTO_ACKNOWLEDGE, and use
transacted sessions (see the consumer sketch after this list)
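A minimal sketch of what such a consumer might look like at the plain-JMS level, assuming
the default JNDI bindings from connection-factories-service.xml and the test queue name
above (the receive timeout and loop structure are illustrative assumptions):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.naming.InitialContext;

    public class TestQueueConsumer {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();
            // Assumed JNDI bindings for the clustered factory and the distributed queue.
            ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ClusteredXAConnectionFactory");
            Queue queue = (Queue) ctx.lookup("queue/testDistributedQueue");

            Connection connection = cf.createConnection();
            connection.start();
            // Transacted session: the acknowledge-mode argument is ignored when transacted=true.
            Session session = connection.createSession(true, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(queue);

            Message message;
            while ((message = consumer.receive(5000)) != null) {
                // process the message, then commit so the broker treats delivery as complete
                session.commit();
            }
            connection.close();
        }
    }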
Symptoms of the issue:
-when running the clients I watch the JMX-Console for the
"queue/testDistributedQueue"
-as the consumers pull messages off the queue I can see the MessageCount and
DeliveringCount go to 0 every so often
-after a period of time (usually a few hours) the MessageCount and DeliveringCount
never go back to 0
-I "kill" the clients and wait for the DeliveringCount to go to 0, but it
never does
-after the clients are killed the ConsumerCount for the queue will drop, but never to 0
when messages are "stuck"
-a thread dump reveals at least one JBM server session that is apparently stuck (it
never goes away) - ostensibly this is the consumer that is showing in the JMX-Console for
"queue/testDistributedQueue"
-a "killall -3 java" doesn't produce anything from the clients so I know
their dead
-nothing is in any DLQ or expiry queue
-the database contains as many rows in the JBM_MSG and JBM_MSG_REF tables as the
DeliveringCount in the JMX-Console
-rebooting the node with the stuck messages frees the messages to be consumed (i.e.
un-sticks them)
Other notes:
-nothing else is happening on either node but running the client and running JBoss
-this only appears to happen when a clustered connection factory is used. I tested
using a normal connection factory and after 24 hours couldn't reproduce a stuck
message.