[
https://jira.jboss.org/jira/browse/JBMESSAGING-1456?page=com.atlassian.ji...
]
Howard Gao commented on JBMESSAGING-1456:
-----------------------------------------
Here is an update.
--- Analysis of the Stuck Message ---
1. Message A arrived at the queue on node0.
2. The client created conn1/sess1/cons1 to receive messages.
3. Message A was being delivered to cons1, but the delivery failed at the remoting
layer due to a timeout. node0 therefore marked cons1 as dead and delivered messages to
other consumers. However, the timeout did not cause remoting to consider the connection
broken (perhaps the connection was just occasionally slow), so JBM on node0 never got a
connection-failure notification, and conn1/sess1/cons1 on node0 were not cleaned up
(which means message A was not put back into the queue for redelivery).
4. Before the client closed conn1/sess1/cons1, a client ping timeout occurred, which
triggered client-side failover. As a result, a new connection to node1
(conn2/sess2/cons2) was created internally. To the client this happened transparently, so
it looked as if messages kept arriving on the same consumer, although internally the
consumer had changed.
5. The client then tried to close conn1/sess1/cons1 (it was actually closing
conn2/sess2/cons2 at this point), and node1 closed conn2/sess2/cons2 accordingly.
6. The client created another set of conn/sess/cons and continued to work.
7. At this point, conn1/sess1/cons1 remained open on the server side, so message A was
never put back into the queue: it was stuck.
8. Message A only gets redelivered if a) we shut down and restart the client process, or
b) we shut down and restart node0.
9. During client failover, the client tries to clean up the underlying remoting
Clients of the current connection, in the following order (before failing over to other
nodes):
Client.setDisconnectTimeout(0)
Client.removeListener();
Client.disconnect();
The disconnect() method stops the client ping and terminates the client lease. Terminating
the lease involves a remote invocation, so the call is unlikely to succeed, given that the
broken underlying connection is the very reason for the failover.
If the disconnect() call fails, the client-side ping is stopped but the server is not
notified. The remoting server will eventually lose the lease ping and notify the
server-side connection listener, where the proper cleanup is performed. Message A is then
put back into the queue for delivery. Everything is fine: no stuck message, no long-lived
connections/sessions/consumers.
However, on a network that is not so stable (meaning the latency is not reasonably
consistent: occasionally very slow, but the connection rarely dies outright), the
disconnect() call may succeed completely.
In that case the remoting server receives the signal and treats it as a normal
disconnection from the client, so no connection listener is invoked on the server side.
The effect is that the underlying remoting connection is shut down normally, but all the
JBM objects (open connections, sessions and consumers) associated with that remoting
connection stay alive forever unless the server node is shut down. When that happens,
message A never gets a chance to be put back into the queue for redelivery.
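To make the missing step concrete, here is a minimal sketch (not the actual JBM server
code) of the server-side path described above: when the lease ping is lost, remoting
notifies the registered connection listener, and that callback is where the server-side
JBM objects would be torn down and message A returned to the queue. The cleanup helper
below is hypothetical.

import org.jboss.remoting.Client;
import org.jboss.remoting.ConnectionListener;

public class ServerCleanupListener implements ConnectionListener
{
   // Called by remoting when the client lease expires (ping lost) or the connection fails.
   // A "normal" disconnect from the client skips this path entirely, which is the problem.
   public void handleConnectionException(Throwable throwable, Client client)
   {
      cleanUpJmsState(client); // hypothetical: close conn1/sess1/cons1, cancel in-flight deliveries
   }

   private void cleanUpJmsState(Client failedClient)
   {
      // hypothetical placeholder: put messages in being-delivered state (e.g. message A)
      // back on the queue so they can be redelivered to other consumers
   }
}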
--- Proposed Fix ---
Based on the above, I suggest the following fix:
When client failover happens, we do only the client-side, unilateral cleanup, without
trying to tell the remoting server. The remoting server will then always detect a
connection failure, regardless of the actual network condition, so the server-side JBM
cleanup is guaranteed to run. To do this, remoting needs to support operations similar to
the following:
Client.disconnectUnilateral()
{
   stopClientPing();           // normal cleanup, as before
   stopLeasepingUnilateral();  // stop the lease ping from the client side, without contacting the server
   ....
}
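For illustration only, here is a rough sketch of how the client-side cleanup before
failover might then look. disconnectUnilateral() is the hypothetical operation proposed
above, and the other calls simply mirror the sequence listed in step 9; this is not
existing remoting API usage.

// Proposed failover cleanup: purely local teardown, no remote invocation, so the server
// is guaranteed to miss the lease ping and run its connection-listener cleanup.
void cleanUpBeforeFailover(Client remotingClient)
{
   remotingClient.setDisconnectTimeout(0);  // as in the current code
   remotingClient.removeListener();         // as in the current code
   // remotingClient.disconnect();          // current behaviour: may succeed and suppress server-side cleanup
   remotingClient.disconnectUnilateral();   // proposed: stop client and lease pings locally, never contact the server
}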
Messages stuck in being-delivered state in cluster
--------------------------------------------------
Key: JBMESSAGING-1456
URL: https://jira.jboss.org/jira/browse/JBMESSAGING-1456
Project: JBoss Messaging
Issue Type: Bug
Affects Versions: 1.4.0.SP3_CP03
Reporter: Justin Bertram
Assignee: Howard Gao
Priority: Critical
Fix For: 1.4.0.SP3.CP08, 1.4.3.GA
Attachments: kill3_thread_dump.txt, thread_dump.txt
Messages become "stuck" in being-delivered state when clients use a clustered
XA connection factory in a cluster of at least 2 nodes.
JBoss setup:
-2 nodes of JBoss EAP 4.3 CP02
-commented out "ClusterPullConnectionFactory" in messaging-service.xml to
prevent message redistribution and eliminate the "message suckers" as the
potential culprit
-MySQL backend using the default mysql-persistence-service.xml (from
<JBOSS_HOME>/docs/examples/jms)
Client setup:
-both nodes have a client which is a separate process (i.e. not inside JBoss)
-clients are Spring based
-one client produces and consumes, the other client just consumes
-both clients use the ClusteredXAConnectionFactory from the default
connection-factories-service.xml
-both clients publish to and consume from "queue/testDistributedQueue"
-clients are configured to send persistent messages and to use AUTO_ACKNOWLEDGE with
transacted sessions
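For reference, a minimal stand-alone consumer approximating the setup above might look as
follows (the real clients are Spring based, so this is not their code; the JNDI names are
assumptions based on the default JBM 1.4 bindings).

import javax.jms.*;
import javax.naming.InitialContext;

public class TestConsumer
{
   public static void main(String[] args) throws Exception
   {
      InitialContext ic = new InitialContext(); // jndi.properties points at one of the two nodes
      ConnectionFactory cf = (ConnectionFactory) ic.lookup("ClusteredXAConnectionFactory");
      Queue queue = (Queue) ic.lookup("queue/testDistributedQueue");

      Connection conn = cf.createConnection();
      try
      {
         // transacted session; the ack mode argument is ignored for transacted sessions
         Session session = conn.createSession(true, Session.AUTO_ACKNOWLEDGE);
         MessageConsumer consumer = session.createConsumer(queue);
         conn.start();

         Message m;
         while ((m = consumer.receive(5000)) != null)
         {
            // process the message, then commit so DeliveringCount drops back to 0
            session.commit();
         }
      }
      finally
      {
         conn.close();
      }
   }
}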
Symptoms of the issue:
-when running the clients I watch the JMX-Console for the
"queue/testDistributedQueue"
-as the consumers pull messages off the queue I can see the MessageCount and
DeliveringCount go to 0 every so often
-after a period of time (usually a few hours) the MessageCount and DeliveringCount
never go back to 0
-I "kill" the clients and wait for the DeliveringCount to go to 0, but it
never does
-after the clients are killed the ConsumerCount for the queue will drop, but never to 0
when messages are "stuck"
-a thread dump reveals at least one JBM server session that is apparently stuck (it
never goes away) - ostensibly this is the consumer that is showing in the JMX-Console for
"queue/testDistributedQueue"
-a "killall -3 java" doesn't produce anything from the clients so I know
their dead
-nothing is in any DLQ or expiry queue
-the database contains as many rows in the JBM_MSG and JBM_MSG_REF tables as the
DeliveringCount in the JMX-Console
-rebooting the node with the stuck messages frees the messages to be consumed (i.e.
un-sticks them)
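The counters watched in the JMX-Console above can also be read programmatically; the
following sketch shows one way to do it. The RMIAdaptor JNDI name and the destination
ObjectName are assumptions based on typical JBoss EAP 4.x / JBM 1.4 defaults.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.naming.InitialContext;

public class QueueCounters
{
   public static void main(String[] args) throws Exception
   {
      InitialContext ic = new InitialContext();
      MBeanServerConnection server =
            (MBeanServerConnection) ic.lookup("jmx/invoker/RMIAdaptor");
      ObjectName queue =
            new ObjectName("jboss.messaging.destination:service=Queue,name=testDistributedQueue");

      // the same attributes shown in the JMX-Console
      System.out.println("MessageCount    = " + server.getAttribute(queue, "MessageCount"));
      System.out.println("DeliveringCount = " + server.getAttribute(queue, "DeliveringCount"));
      System.out.println("ConsumerCount   = " + server.getAttribute(queue, "ConsumerCount"));
   }
}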
Other notes:
-nothing else is happening on either node but running the client and running JBoss
-this only appears to happen when a clustered connection factory is used. I tested
using a normal connection factory and after 24 hours couldn't reproduce a stuck
message.