[jboss-jira] [JBoss JIRA] Issue Comment Edited: (JBMESSAGING-1822) MessageSucker failures cause the delivery of the failed message to stall

Yong Hao Gao (JIRA) jira-events at lists.jboss.org
Mon Oct 25 11:33:54 EDT 2010


    [ https://jira.jboss.org/browse/JBMESSAGING-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558954#action_12558954 ] 

Yong Hao Gao edited comment on JBMESSAGING-1822 at 10/25/10 11:33 AM:
----------------------------------------------------------------------


As XA is not the preferred way to address this issue, I have come up with an idea that can hopefully solve it. I give it here for discussion.

1. Introduce a new state in the JBM_MSG_REF table's STATE column. The new state 'S' marks that the message is in a special "to be sucked" state.
2. Change the sucking process as follows:

a) When a message M is ready to be sucked, it is put to the remote consumer (ServerConsumerEndpoint) for delivery. On accepting M, this remote consumer updates M's state to 'S'. Then it goes on to actually deliver M to the MessageSucker on another node.

b) When the message sucker receives M, it first acknowledges it and then sends it to the local queue; after that, the message state changes back to 'C' (normal).

c) When the acknowledgement arrives at the source node, the session simply forgets the message (without any DB operations).
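To make the handshake concrete, here is a minimal in-memory sketch of steps a) - c). All names here (SuckSketch, SourceNode, acceptForSucking, etc.) are illustrative stand-ins, not the real JBoss Messaging API; the maps stand in for the JBM_MSG_REF STATE column on the source and the local queue on the target.

```java
import java.util.HashMap;
import java.util.Map;

public class SuckSketch {
    static final char NORMAL = 'C';        // normal state in JBM_MSG_REF
    static final char TO_BE_SUCKED = 'S';  // proposed "to be sucked" state

    static class SourceNode {
        // Stand-in for the STATE column of JBM_MSG_REF: msgId -> state
        Map<Long, Character> msgRefState = new HashMap<>();

        // a) the remote consumer accepts M and marks it 'S' before delivery
        void acceptForSucking(long msgId) {
            msgRefState.put(msgId, TO_BE_SUCKED);
        }

        // c) on acknowledgement the session simply forgets M (no DB work)
        void onAcknowledge(long msgId) {
            msgRefState.remove(msgId);
        }
    }

    static class TargetNode {
        Map<Long, Character> localQueue = new HashMap<>();

        // b) the sucker acks first, then puts M on the local queue as 'C'
        void onMessage(long msgId, SourceNode source) {
            source.onAcknowledge(msgId);
            localQueue.put(msgId, NORMAL);
        }
    }

    public static void main(String[] args) {
        SourceNode src = new SourceNode();
        TargetNode tgt = new TargetNode();
        src.acceptForSucking(42L);  // step a): M is now 'S' at the source
        tgt.onMessage(42L, src);    // steps b) + c): ack, then local 'C'
        System.out.println("source still tracks M: "
                + src.msgRefState.containsKey(42L));
        System.out.println("target state of M: " + tgt.localQueue.get(42L));
    }
}
```

The point of putting M into 'S' before delivery is that a crash anywhere in the middle of the handshake leaves M in a state that normal delivery will skip, which is what the failure handling relies on.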


Failure handling

The sucking process will have to handle the following failures:

1. Source node crash. That will leave M in either the 'C' (normal) or 'S' state. If M is in the normal state, it will be delivered as normal when the node is restarted. If M is already in the 'S' state, it won't be delivered as a normal message when the node is restarted, no matter whether it has been delivered to the sucker or not. At the target node, if the sucker has received the message and updated it successfully, M will already be in the local queue. If the sucker has received the message but failed to update it, M is left in the 'S' state.

When the source node comes up again, the sucker will reconnect to the source node and register a remote consumer. On startup, the remote consumer will first check in the DB whether there is a message in the 'S' state; if so, it picks up this message and delivers it to the sucker right away.

2. Target node crash. That will leave M in one of several situations:

1st - M has been put to the remote consumer but hasn't been updated to the 'S' state. Then M will eventually be cancelled back to the queue for redelivery.
2nd - M has been put to the remote consumer and updated to the 'S' state, but not yet delivered. When the remote consumer is eventually closed, it will acknowledge M to the session so the session forgets it. That prevents the session from redelivering M while it is in the 'S' state. When the target node comes up and a new remote consumer is registered, the message will be picked up and delivered (the first thing such a remote consumer does when started).
3rd - M has been put to the remote consumer, updated to the 'S' state, and delivered to the sucker on the target node, but the acknowledgement failed. This is equivalent to the 2nd case, i.e. when the target node is back, M will be redelivered.
4th - M has been delivered to the target node and acked, but its state update failed (still 'S'). Same as the 3rd case, i.e. M will be picked up when the target node is back and a remote consumer is registered.

3. Both nodes crash. We just track the state of M. When either node starts up again, it ignores messages marked with the 'S' state. Once a sucker is created and registers a remote consumer, its first task is to look up any message in the 'S' state and deliver it if one exists.
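All three failure cases converge on one recovery rule: on restart, a message left in 'C' is (re)delivered normally, while a message left in 'S' is skipped by normal delivery and handed to the sucker by the first remote consumer that registers. A minimal sketch of that startup scan, with hypothetical names (the map stands in for JBM_MSG_REF rows, and the comment shows the SQL it would correspond to):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecoverySketch {
    // On registration, a remote consumer first collects every reference
    // left in the 'S' state -- in spirit:
    //   SELECT MESSAGE_ID FROM JBM_MSG_REF WHERE STATE = 'S'
    // -- and hands those to the sucker before normal delivery resumes.
    static List<Long> strandedRefs(Map<Long, Character> msgRef) {
        List<Long> stranded = new ArrayList<>();
        for (Map.Entry<Long, Character> e : msgRef.entrySet()) {
            if (e.getValue() == 'S') stranded.add(e.getKey());
        }
        return stranded;
    }

    public static void main(String[] args) {
        Map<Long, Character> msgRef = new LinkedHashMap<>();
        msgRef.put(1L, 'C'); // normal: cancelled back / redelivered as usual
        msgRef.put(2L, 'S'); // stranded mid-suck (any of the cases above)
        System.out.println("deliver to sucker first: " + strandedRefs(msgRef));
    }
}
```

Because the scan runs before the consumer serves any new messages, a stranded M cannot be overtaken or duplicated by normal delivery, regardless of which of the crash cases left it in 'S'.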



> MessageSucker failures cause the delivery of the failed message to stall
> ------------------------------------------------------------------------
>
>                 Key: JBMESSAGING-1822
>                 URL: https://jira.jboss.org/browse/JBMESSAGING-1822
>             Project: JBoss Messaging
>          Issue Type: Bug
>          Components: Messaging Core
>    Affects Versions: 1.4.6.GA
>            Reporter: david.boeren
>            Assignee: Yong Hao Gao
>             Fix For: Unscheduled
>
>         Attachments: helloworld.zip
>
>
> The MessageSucker is responsible for migrating messages between different members of a cluster; it is a consumer of the remote queue from which it receives messages destined for the queue on the local cluster member. 
> The onMessage routine, at its most basic, does the following: 
> - bookkeeping for the incoming message, including expiry 
> - acknowledge the incoming message 
> - attempt to deliver to the local queue 
> When the delivery fails, the result is the *appearance* of lost messages. Those messages which are processed during the failure are not redelivered, but they still exist in the database. 
> The only way I have found to trigger the redelivery of those messages is to redeploy the queue containing the messages and/or restart that app server. Obviously neither approach is acceptable. 
> In order to trigger the error I created a SOA cluster which *only* shared the JMS database, and no other. I modified the helloworld quickstart to display a counter of messages consumed, clustered the *esb* queue, and then used byteman to trigger the faults. 
> The byteman rule is as follows, the quickstart will be attached. 
> RULE throw every fifth send 
> INTERFACE ProducerDelegate 
> METHOD send 
> AT ENTRY 
> IF callerEquals("MessageSucker.onMessage", true) && (incrementCounter("throwException") % 5 == 0) 
> DO THROW new IllegalStateException("Deliberate exception") 
> ENDRULE 
> This results in an exception being thrown for every fifth message. Once the delivery has quiesced, examine the JBM_MSG and JBM_MSG_REF tables to see the messages which have not been delivered. 
> The clusters are ports-default and ports-01, the client seeds the gateway by sending 300 messages to the default. 
> Adding up the counter from each server *plus* the message count from JBM_MSG results in 300 (or multiples thereof for more executions).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

