[jboss-jira] [JBoss JIRA] Commented: (JBMESSAGING-1456) Messages stuck in being-delivered state in cluster

Fri Nov 28 04:55:36 EST 2008

    [ https://jira.jboss.org/jira/browse/JBMESSAGING-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12440296#action_12440296 ] 

Howard Gao commented on JBMESSAGING-1456:
-----------------------------------------

Hi,

Here is my current status and some thoughts.

I kept my first several test runs very short. I sent up to 1000 messages, then stopped the producer and let the consumers continue to run. I didn't see any stuck messages in the console.
Eventually the message count dropped to zero. When I increased the number of producers to 2, I got an out-of-memory error. Now I am trying to move the test client to a separate machine. The consumer count will not reach zero until you finally shut down the receiving processes, because Spring holds a pool of consumers for efficiency. According to my observation, once all messages have been received the number of consumers drops to 5.

I think once the test is started, we should wait for a period until we find that something is wrong with the delivery count and the consumers. Actually I think this is hard to tell, because it is quite normal for the delivery count and consumer count on the clustered queue to change all the time, partly because the JBoss Messaging server is working continuously (making the delivery count dynamic), and partly because Spring dynamically creates consumers while receiving messages (see its DefaultMessageListenerContainer doc).

The mean time to failure for a stuck message may vary depending on the test client configuration and on the hardware (machines). That may cause misunderstandings between us, because your observations may be very different from mine.

Once you feel something has already gone wrong, it doesn't matter how you shut down your producer thread, because stuck messages are unlikely to be caused by producers. However, you cannot shut down the consumer process with kill and then observe: messages may have been received but not yet acknowledged. The consumer processes (or threads) must stay alive the whole time you are observing.
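
One way to make "something is wrong" less subjective is to sample the delivery count at a fixed interval, after the producer has stopped and while the consumers are still alive, and to flag the queue only when the value sits at the same non-zero number for several samples in a row. The sketch below only illustrates that idea. The class StuckMessageHeuristic, the method looksStuck and the window size are hypothetical, not part of JBoss Messaging or Spring, and the samples are plain integers that would have to be collected by hand (for example from the JMX-Console).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for discussion only; not a JBoss Messaging or Spring API. */
final class StuckMessageHeuristic {

    /**
     * Returns true when the last `window` samples are all the same non-zero
     * value. Assumes the samples were taken after the producer stopped and
     * while the consumers were still running.
     */
    static boolean looksStuck(List<Integer> deliveringCounts, int window) {
        int size = deliveringCounts.size();
        if (window < 2 || size < window) {
            return false; // not enough samples to say anything yet
        }
        int last = deliveringCounts.get(size - 1);
        if (last == 0) {
            return false; // the queue drained, nothing is stuck
        }
        for (int i = size - window; i < size; i++) {
            if (deliveringCounts.get(i) != last) {
                return false; // the count is still moving
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample series, for illustration only.
        List<Integer> draining = Arrays.asList(40, 12, 3, 0, 0, 0);
        List<Integer> flat = Arrays.asList(40, 12, 7, 7, 7, 7);
        System.out.println(looksStuck(draining, 4)); // false
        System.out.println(looksStuck(flat, 4));     // true
    }
}
```

A flat non-zero count is only a hint, not proof. The sampling interval and window would need to be chosen so that a slow but healthy consumer is not mistaken for a stuck one.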

So next I suggest:

1) Using a single test package for discussing results among us. I'll soon send out my test case as a sample; if you agree, we can use it as the standard test.

2) Using a unified procedure as well. This means we all use the same JBoss Messaging cluster configuration and the same steps to run the test. Later today or tomorrow I'll send out a detailed configuration and the steps for your reference.

I'll base my reports and results entirely on the above two things. Hopefully we can reach a clear conclusion ASAP.

> Messages stuck in being-delivered state in cluster
> --------------------------------------------------
>
>                 Key: JBMESSAGING-1456
>                 URL: https://jira.jboss.org/jira/browse/JBMESSAGING-1456
>             Project: JBoss Messaging
>          Issue Type: Bug
>    Affects Versions: 1.4.0.SP3_CP03
>            Reporter: Justin Bertram
>            Assignee: Howard Gao
>            Priority: Critical
>         Attachments: kill3_thread_dump.txt, thread_dump.txt
>
>
> Messages become "stuck" in being-delivered state when clients use a clustered XA connection factory in a cluster of at least 2 nodes.
> JBoss setup:
>   -2 nodes of JBoss EAP 4.3 CP02
>   -commented out "ClusterPullConnectionFactory" in messaging-service.xml to prevent message redistribution and eliminate the "message suckers" as the potential culprit
>   -MySQL backend using the default mysql-persistence-service.xml (from <JBOSS_HOME>/docs/examples/jms)
> Client setup:
>   -both nodes have a client which is a separate process (i.e. not inside JBoss)
>   -clients are Spring based
>   -one client produces and consumes, the other client just consumes
>   -both clients use the ClusteredXAConnectionFactory from the default connection-factories-service.xml
>   -both clients publish to and consume from "queue/testDistributedQueue"
>   -clients are configured to send persistent messages, use AUTO_ACKNOWLEDGE, and transacted sessions
> Symptoms of the issue:
>   -when running the clients I watch the JMX-Console for the "queue/testDistributedQueue"
>   -as the consumers pull messages off the queue I can see the MessageCount and DeliveringCount go to 0 every so often
>   -after a period of time (usually a few hours) the MessageCount and DeliveringCount never go back to 0
>   -I "kill" the clients and wait for the DeliveringCount to go to 0, but it never does
>   -after the clients are killed the ConsumerCount for the queue will drop, but never to 0 when messages are "stuck"
>   -a thread dump reveals at least one JBM server session that is apparently stuck (it never goes away) - ostensibly this is the consumer that is showing in the JMX-Console for "queue/testDistributedQueue"
>   -a "killall -3 java" doesn't produce anything from the clients so I know they're dead
>   -nothing is in any DLQ or expiry queue
>   -the database contains as many rows in the JBM_MSG and JBM_MSG_REF tables as the DeliveringCount in the JMX-Console
>   -rebooting the node with the stuck messages frees the messages to be consumed (i.e. un-sticks them)
> Other notes:
>   -nothing else is happening on either node but running the client and running JBoss
>   -this only appears to happen when a clustered connection factory is used.  I tested using a normal connection factory and after 24 hours couldn't reproduce a stuck message.
