[jboss-jira] [JBoss JIRA] Commented: (JBMESSAGING-1822) MessageSucker failures cause the delivery of the failed message to stall
Yong Hao Gao (JIRA)
jira-events at lists.jboss.org
Fri Oct 29 04:48:54 EDT 2010
[ https://jira.jboss.org/browse/JBMESSAGING-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560210#action_12560210 ]
Yong Hao Gao commented on JBMESSAGING-1822:
-------------------------------------------
I have finished the implementation. During the coding process I've made various changes to my previous proposal. I'll follow up with some tests to verify it. Below is what I actually did:
Changes to the code:
1. Introduce a new state in the JBM_MSG_REF table's STATE column. The new state 'S' marks that the message is in a special "to be sucked" state.
2. Change the sucking process as follows:
a) When a message M is ready to be sucked, it is passed to the remote consumer (ServerConsumerEndpoint) for delivery. On accepting M, this remote consumer updates M's state to 'S' (see the sketch after this list). It then goes on to actually deliver M to the MessageSucker on the target node.
b) When the MessageSucker receives M, it first acknowledges M and then sends it to the local queue; M's state is changed to 'C' (normal).
c) When the acknowledgement arrives at the source node, the session simply forgets the message (without any DB operations).
d) Failover -- when merging in the failed channel (mergeIn), the failover node will try to claim messages in the 'S' state.
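To illustrate step a), here is a minimal sketch of the 'C' -> 'S' transition (not the actual JBM code; the class and method names are made up, and I assume JBM_MSG_REF's MESSAGE_ID and CHANNEL_ID columns identify the reference):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SuckStateSketch
{
   /** Flip a reference from 'C' to 'S' before handing M to the sucker. */
   public static boolean markToBeSucked(Connection db, long messageId, long channelId)
      throws SQLException
   {
      String sql = "UPDATE JBM_MSG_REF SET STATE = 'S' " +
                   "WHERE MESSAGE_ID = ? AND CHANNEL_ID = ? AND STATE = 'C'";
      PreparedStatement ps = db.prepareStatement(sql);
      try
      {
         ps.setLong(1, messageId);
         ps.setLong(2, channelId);
         // exactly one row updated means we won the 'C' -> 'S' transition
         return ps.executeUpdate() == 1;
      }
      finally
      {
         ps.close();
      }
   }
}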
Failure handling
The sucking process has to handle the following failures. Suppose we have a message M that is in the middle of being sucked. 'Source node' refers to the node where the source channel (from which M is sucked) resides; 'target node' refers to the node where the target channel (to which M is sucked) resides.
I. Source node crash. Handling differs depending on where M is at the moment of the crash.
I.a. M is being processed at the source node.
Normal processing steps: M's state is changed from 'C' to 'S', and M is then sent for delivery to the target node (sucker).
I.a.1 M's state is still 'C'. That means M is still a normal message; the sucking process has no effect on it.
I.a.2 M's state has been changed to 'S'. M will be merged to the backup node on server failover and then loaded as a normal message for delivery.
I.b. M is being processed at the target node.
Normal processing steps: M is first acked and then sent to the local queue (M is updated to the target channel and its state is changed from 'S' to 'C').
I.b.1 M's state is 'S' and M hasn't been acked, so the ack will fail. M remains in 'S' on the source channel and the target node will never complete the suck. M will be merged to another channel for redelivery.
I.b.2 M's state is 'S' and M has been acked successfully. M will go on to be sent to the local queue. However, there is contention between this send and failover: failover will try to merge any message of the source channel that is in the 'S' state, while the send will try to update M to the target channel. We guarantee that only one of the two actions succeeds, i.e. M will either be merged to another channel or be sucked to the target channel, as sketched below.
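Here is a sketch of how the I.b.2 contention can be kept safe (again made-up names, not necessarily the actual implementation): both the suck completion and the failover merge run a conditional UPDATE keyed on the current channel and state, so the database guarantees exactly one of them wins per message:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SuckContentionSketch
{
   /** Suck completion: move M to the target channel and back to 'C'. */
   public static boolean completeSuck(Connection db, long messageId,
                                      long sourceChannel, long targetChannel)
      throws SQLException
   {
      String sql = "UPDATE JBM_MSG_REF SET CHANNEL_ID = ?, STATE = 'C' " +
                   "WHERE MESSAGE_ID = ? AND CHANNEL_ID = ? AND STATE = 'S'";
      PreparedStatement ps = db.prepareStatement(sql);
      try
      {
         ps.setLong(1, targetChannel);
         ps.setLong(2, messageId);
         ps.setLong(3, sourceChannel);
         // false means failover has already merged M to another channel
         return ps.executeUpdate() == 1;
      }
      finally
      {
         ps.close();
      }
   }

   /** Failover merge: claim every 'S' reference of the failed channel. */
   public static int claimSuckedRefs(Connection db, long failedChannel,
                                     long failoverChannel)
      throws SQLException
   {
      String sql = "UPDATE JBM_MSG_REF SET CHANNEL_ID = ?, STATE = 'C' " +
                   "WHERE CHANNEL_ID = ? AND STATE = 'S'";
      PreparedStatement ps = db.prepareStatement(sql);
      try
      {
         ps.setLong(1, failoverChannel);
         ps.setLong(2, failedChannel);
         // each claimed reference is then redelivered as a normal message
         return ps.executeUpdate();
      }
      finally
      {
         ps.close();
      }
   }
}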
II. Target node crash. Again, handling differs depending on where M is at the moment of the crash.
II.a. M is being processed at the source node.
M will be processed as normal; only the delivery will fail, since the target node is down. The connection failure will then come in and close the sessions, and the sessions will in turn cancel all existing deliveries, including M's. When cancelling M, its state will also be changed back to 'C' if it is already 'S'.
II.b. M is being processed at the target node.
II.b.1 M hasn't been acked yet. It will eventually be cancelled at the source node for redelivery.
II.b.2 M has been acked successfully but hasn't been sent to the local queue. When the source node eventually gets the connection failure notification, it will close the related session, and M will be reclaimed from the DB and put back on the channel for redelivery (a sketch of the cancel path follows).
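The cancel path used in both II.a and II.b could look like this sketch (made-up names): reverting the reference back to 'C' in place makes M a normal message on the source channel again:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SuckCancelSketch
{
   /** Cancel a suck delivery: revert 'S' back to 'C' on the source channel. */
   public static void cancelSuck(Connection db, long messageId, long channelId)
      throws SQLException
   {
      String sql = "UPDATE JBM_MSG_REF SET STATE = 'C' " +
                   "WHERE MESSAGE_ID = ? AND CHANNEL_ID = ? AND STATE = 'S'";
      PreparedStatement ps = db.prepareStatement(sql);
      try
      {
         ps.setLong(1, messageId);
         ps.setLong(2, channelId);
         // a no-op if M was never flipped to 'S', i.e. it is still a normal message
         ps.executeUpdate();
      }
      finally
      {
         ps.close();
      }
   }
}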
III. Cases that involve crashes of both source and target nodes
Depending on the order and timing of the two crashes, some cases reduce to the one-crash cases above. We only consider the case where M is left in state 'C' or 'S' and none of the above handlings has succeeded.
If M is in the 'C' state after both nodes have crashed, M is either on the source channel or on the target channel. In either case M is a normal message; we don't need to worry about it.
If M is in the 'S' state after both nodes have crashed, M was in the middle of being sucked and we cannot know how far the process had gone before the crash. It is the responsibility of the source node to reclaim M to its source channel on startup and then redeliver it as a normal message (see the sketch below). If the crashed source server is failed over, it is the failover server that reclaims M.
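The startup reclaim could look like the following sketch (made-up names; the real code would run this per channel when the node, or its failover, boots):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class StartupReclaimSketch
{
   /** At boot, turn every interrupted suck of this channel back into a normal message. */
   public static int reclaimInterruptedSucks(Connection db, long channelId)
      throws SQLException
   {
      String sql = "UPDATE JBM_MSG_REF SET STATE = 'C' " +
                   "WHERE CHANNEL_ID = ? AND STATE = 'S'";
      PreparedStatement ps = db.prepareStatement(sql);
      try
      {
         ps.setLong(1, channelId);
         // the reclaimed references are then loaded and redelivered normally
         return ps.executeUpdate();
      }
      finally
      {
         ps.close();
      }
   }
}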
IV. Cluster Connection failures
Sometimes both servers are alive but the network that links them becomes problematic. Whenever that happens, each node behaves as if the other has failed. We guarantee that only one node gets M and processes it successfully. For example, while the source node's session is closing and cancelling M's delivery, the target node may send in the ack of M. We synchronize these two actions so that M is either cancelled back to the source channel (causing the ack to fail) or acked before the cancel, as sketched below.
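A minimal sketch of that cancel/ack synchronization (class and field names are made up): both paths take the same lock, so either the cancel runs first and the late ack fails, or the ack completes first and the cancel becomes a no-op:

public class DeliverySyncSketch
{
   private final Object lock = new Object();
   private boolean cancelled;
   private boolean acked;

   /** Source session closing: cancel the delivery back to the source channel. */
   public void cancel()
   {
      synchronized (lock)
      {
         if (!acked)
         {
            cancelled = true; // M goes back to the source channel
         }
      }
   }

   /** Ack arriving from the target node's MessageSucker. */
   public boolean acknowledge()
   {
      synchronized (lock)
      {
         if (cancelled)
         {
            return false; // ack fails; the target must not keep M
         }
         acked = true; // ack wins; a later cancel becomes a no-op
         return true;
      }
   }
}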
Conclusion
So, for the issue specific to this JIRA report: when the send fails, the message will be reclaimed by the source node (case II.b.2) for redelivery.
> MessageSucker failures cause the delivery of the failed message to stall
> ------------------------------------------------------------------------
>
> Key: JBMESSAGING-1822
> URL: https://jira.jboss.org/browse/JBMESSAGING-1822
> Project: JBoss Messaging
> Issue Type: Bug
> Components: Messaging Core
> Affects Versions: 1.4.6.GA
> Reporter: david.boeren
> Assignee: Yong Hao Gao
> Fix For: Unscheduled
>
> Attachments: helloworld.zip
>
>
> The MessageSucker is responsible for migrating messages between different members of a cluster; it is a consumer of the remote queue from which it receives messages destined for the queue on the local cluster member.
> The onMessage routine, at its most basic, does the following:
> - bookkeeping for the incoming message, including expiry
> - acknowledge the incoming message
> - attempt to deliver to the local queue
> When the delivery fails, the result is the *appearance* of lost messages. Those messages which are processed during the failure are not redelivered, but they still exist in the database.
> The only way I have found to trigger the redelivery of those messages is to redeploy the queue containing the messages and/or restart that app server. Obviously neither approach is acceptable.
> In order to trigger the error I created a SOA cluster which shared *only* the JMS database and nothing else. I modified the helloworld quickstart to display a counter of messages consumed, clustered the *esb* queue, and then used byteman to trigger the faults.
> The byteman rule is as follows, the quickstart will be attached.
> RULE throw every fifth send
> INTERFACE ProducerDelegate
> METHOD send
> AT ENTRY
> IF callerEquals("MessageSucker.onMessage", true) && (incrementCounter("throwException") % 5 == 0)
> DO THROW new IllegalStateException("Deliberate exception")
> ENDRULE
> This results in an exception being thrown for every fifth message. Once the delivery has quiesced, examine the JBM_MSG and JBM_MSG_REF tables to see the messages which have not been delivered.
> The clusters are ports-default and ports-01; the client seeds the gateway by sending 300 messages to the default.
> Adding up the counter from each server *plus* the message count from JBM_MSG results in 300 (or multiples thereof for more executions).
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira