[jboss-jira] [JBoss JIRA] Commented: (JBMESSAGING-1456) Messages stuck in being-delivered state in cluster

Howard Gao (JIRA) jira-events at lists.jboss.org
Tue May 26 05:38:01 EDT 2009


    [ https://jira.jboss.org/jira/browse/JBMESSAGING-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12469107#action_12469107 ] 

Howard Gao commented on JBMESSAGING-1456:
-----------------------------------------

As this long issue draws to an end, let me give a brief summary in simple Q&A form:

Why can messages get stuck?

It is because of an inconsistency in connection failure detection between the server and the client.

Both client and server rely on Remoting's ConnectionListener mechanism to detect a connection failure. Remoting uses a client ping to detect a broken connection on the client side, and a lease ping to detect a lost connection on the server side. The client ping and the lease ping work separately. When a client ping fails, Remoting notifies the client-side listeners. When a lease ping fails, Remoting notifies the server-side listeners. Before the fix, the server side and the client side in Remoting were, by design, not necessarily 'bound together'. That is, a client ping failure did not always result in a server lease ping failure.
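To make the pre-fix behaviour concrete, here is a minimal sketch in plain Java. It is only a model: Detector, FailureListener and the method names are hypothetical and are not the JBoss Remoting API. It shows two detectors with separate listener lists, so a client ping failure reaches only the client-side listener.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model; none of these types are the JBoss Remoting API. */
final class UnboundDetectionDemo {

    /** Stand-in for a connection failure listener. */
    interface FailureListener {
        void connectionFailed(String connectionId);
    }

    /** One detector (client ping or lease ping) with its own listener list. */
    static final class Detector {
        private final List<FailureListener> listeners = new ArrayList<>();

        void addListener(FailureListener listener) {
            listeners.add(listener);
        }

        /** Notifies only the listeners registered on this detector. */
        void pingFailed(String connectionId) {
            for (FailureListener listener : listeners) {
                listener.connectionFailed(connectionId);
            }
        }
    }

    /** Returns which sides were notified when only the client ping fails. */
    static List<String> simulateClientPingFailure() {
        List<String> notified = new ArrayList<>();
        Detector clientPing = new Detector();
        Detector leasePing = new Detector();
        clientPing.addListener(id -> notified.add("client:" + id));
        leasePing.addListener(id -> notified.add("server:" + id));

        // The client ping fails; the lease ping does not fail in this scenario.
        clientPing.pingFailed("conn-1");
        return notified;
    }

    public static void main(String[] args) {
        System.out.println(simulateClientPingFailure());
    }
}
```

In this model the server-side listener is never called, which corresponds to the situation described above where the server is not told about the failure.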

In JBM, the server delivers messages to a client over the connection established by the client. If a client connection failure happens while some messages are still in delivering mode, the client may stop working and be unable to call close() on the connection. Meanwhile the server side maintains all the state associated with the connection, including the sessions and the messages that have been delivered but not yet acked. If the server is not notified of the connection failure, it won't clean up the connection and the associated state. That means the messages are kept in delivering mode forever -- until the server node is shut down and restarted.
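The effect on the server-side state can be sketched as follows. This is a simplified model, not JBM code: the Server class, its queue and its per-connection 'delivering' map are assumptions made for illustration. A client receives three messages, acks one, and then dies without close() and without any failure notification reaching the server.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of server-side delivery state; not JBoss Messaging code. */
final class StuckDeliveryDemo {

    static final class Server {
        private final Deque<String> queue = new ArrayDeque<>();
        // Messages handed to a connection but not yet acked, keyed by connection id.
        private final Map<String, List<String>> delivering = new HashMap<>();

        void send(String message) {
            queue.addLast(message);
        }

        /** Moves the next queued message into the 'delivering' state for a connection. */
        void deliver(String connectionId) {
            String message = queue.pollFirst();
            if (message != null) {
                delivering.computeIfAbsent(connectionId, k -> new ArrayList<>()).add(message);
            }
        }

        /** An ack removes the message from the 'delivering' state. */
        void ack(String connectionId, String message) {
            List<String> inFlight = delivering.get(connectionId);
            if (inFlight != null) {
                inFlight.remove(message);
            }
        }

        int deliveringCount() {
            int count = 0;
            for (List<String> inFlight : delivering.values()) {
                count += inFlight.size();
            }
            return count;
        }
    }

    static int deliveringCountAfterSilentClientDeath() {
        Server server = new Server();
        server.send("m1");
        server.send("m2");
        server.send("m3");
        server.deliver("conn-1");
        server.deliver("conn-1");
        server.deliver("conn-1");
        server.ack("conn-1", "m1");
        // The client dies here: no close(), and the server is never notified,
        // so nothing removes the remaining in-flight messages.
        return server.deliveringCount();
    }

    public static void main(String[] args) {
        System.out.println(deliveringCountAfterSilentClientDeath());
    }
}
```

Nothing in this model ever removes the two unacked messages, which mirrors a DeliveringCount that never returns to 0.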

By 'binding' the client and server connection failure detection together, Remoting guarantees that whenever a client connection failure has been detected, there will always be a corresponding notification on the server side. Therefore JBoss Messaging can correctly clean up the resources on the server side, including cancelling the messages in delivery state back to the queue for redelivery.
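The guarantee can be illustrated with another small model. It does not show how Remoting implements the binding internally; BoundDetector, Server and FailureListener are hypothetical names, and returning cancelled messages to the head of the queue in their original order is an assumption of this sketch. The point is only that a detected client failure always produces a server-side notification, and the server then cancels the in-flight messages back to the queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of 'bound' failure detection; not the Remoting or JBM API. */
final class BoundDetectionDemo {

    interface FailureListener {
        void connectionFailed(String connectionId);
    }

    static final class Server implements FailureListener {
        final Deque<String> queue = new ArrayDeque<>();
        final Map<String, List<String>> delivering = new HashMap<>();

        void deliver(String connectionId) {
            String message = queue.pollFirst();
            if (message != null) {
                delivering.computeIfAbsent(connectionId, k -> new ArrayList<>()).add(message);
            }
        }

        int deliveringCount() {
            int count = 0;
            for (List<String> inFlight : delivering.values()) {
                count += inFlight.size();
            }
            return count;
        }

        /** Cleanup: cancel unacked messages back to the queue for redelivery. */
        @Override
        public void connectionFailed(String connectionId) {
            List<String> inFlight = delivering.remove(connectionId);
            if (inFlight == null) {
                return;
            }
            // Assumption: put messages back at the head, preserving their order.
            for (int i = inFlight.size() - 1; i >= 0; i--) {
                queue.addFirst(inFlight.get(i));
            }
        }
    }

    /** A detected client failure is always reported to both sides. */
    static final class BoundDetector {
        private final List<FailureListener> clientListeners = new ArrayList<>();
        private final List<FailureListener> serverListeners = new ArrayList<>();

        void addClientListener(FailureListener listener) { clientListeners.add(listener); }
        void addServerListener(FailureListener listener) { serverListeners.add(listener); }

        void clientFailureDetected(String connectionId) {
            for (FailureListener listener : clientListeners) {
                listener.connectionFailed(connectionId);
            }
            for (FailureListener listener : serverListeners) {
                listener.connectionFailed(connectionId);
            }
        }
    }

    static Server simulate() {
        Server server = new Server();
        server.queue.addLast("m1");
        server.queue.addLast("m2");
        server.deliver("conn-1");
        server.deliver("conn-1");

        BoundDetector detector = new BoundDetector();
        detector.addServerListener(server);
        detector.clientFailureDetected("conn-1");
        return server;
    }

    public static void main(String[] args) {
        Server server = simulate();
        System.out.println(server.deliveringCount() + " delivering, queue=" + server.queue);
    }
}
```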






> Messages stuck in being-delivered state in cluster
> --------------------------------------------------
>
>                 Key: JBMESSAGING-1456
>                 URL: https://jira.jboss.org/jira/browse/JBMESSAGING-1456
>             Project: JBoss Messaging
>          Issue Type: Bug
>    Affects Versions: 1.4.0.SP3_CP03, 1.4.0.SP3.CP07
>            Reporter: Justin Bertram
>            Assignee: Howard Gao
>            Priority: Blocker
>             Fix For: 1.4.0.SP3.CP08, 1.4.4.GA
>
>         Attachments: DeliveringCount.png, kill3_thread_dump.txt, logs-and-config.zip, MessageStucked.png, RemoveAllMessagesException.png, test-1456-jars.zip, thread_dump.txt
>
>
> Messages become "stuck" in being-delivered state when clients use a clustered XA connection factory in a cluster of at least 2 nodes.
> JBoss setup:
>   -2 nodes of JBoss EAP 4.3 CP02
>   -commented out "ClusterPullConnectionFactory" in messaging-service.xml to prevent message redistribution and eliminate the "message suckers" as the potential culprit
>   -MySQL backend using the default mysql-persistence-service.xml (from <JBOSS_HOME>/docs/examples/jms)
> Client setup:
>   -both nodes have a client which is a separate process (i.e. not inside JBoss)
>   -clients are Spring based
>   -one client produces and consumes, the other client just consumes
>   -both clients use the ClusteredXAConnectionFactory from the default connection-factories-service.xml
>   -both clients publish to and consume from "queue/testDistributedQueue"
>   -clients are configured to send persistent messages, use AUTO_ACKNOWLEDGE, and transacted sessions
> Symptoms of the issue:
>   -when running the clients I watch the JMX-Console for the "queue/testDistributedQueue"
>   -as the consumers pull messages off the queue I can see the MessageCount and DeliveringCount go to 0 every so often
>   -after a period of time (usually a few hours) the MessageCount and DeliveringCount never go back to 0
>   -I "kill" the clients and wait for the DeliveringCount to go to 0, but it never does
>   -after the clients are killed the ConsumerCount for the queue will drop, but never to 0 when messages are "stuck"
>   -a thread dump reveals at least one JBM server session that is apparently stuck (it never goes away) - ostensibly this is the consumer that is showing in the JMX-Console for "queue/testDistributedQueue"
>   -a "killall -3 java" doesn't produce anything from the clients so I know they're dead
>   -nothing is in any DLQ or expiry queue
>   -the database contains as many rows in the JBM_MSG and JBM_MSG_REF tables as the DeliveringCount in the JMX-Console
>   -rebooting the node with the stuck messages frees the messages to be consumed (i.e. un-sticks them)
> Other notes:
>   -nothing else is happening on either node but running the client and running JBoss
>   -this only appears to happen when a clustered connection factory is used.  I tested using a normal connection factory and after 24 hours couldn't reproduce a stuck message.


        


