[ https://jira.jboss.org/browse/JBMESSAGING-1822?page=com.atlassian.jira.pl... ]
Yong Hao Gao commented on JBMESSAGING-1822:
-------------------------------------------
I have finished the implementation. During the coding process, I've made various
changes to my previous proposal. I'll go on with some tests to verify it. Below is
what I actually did:
Changes to the code:
1. Introduce a new state in the JBM_MSG_REF table's STATE column. The new state
'S' marks that the message is in a special "to be sucked" state.
2. Change the sucking process as follows (a rough sketch of these state transitions
is given after this list):
a) When a message M is ready to be sucked, it is handed to the remote consumer
(ServerConsumerEndpoint) for delivery. On accepting M, this remote consumer updates
M's state to 'S' and then goes on to actually deliver M to the MessageSucker on the
target node.
b) When the MessageSucker receives M, we acknowledge it and then send it to the
local queue; M's state is changed back to 'C' (normal).
c) When the acknowledgement arrives at the source node, the session simply forgets
the message (without any DB operations).
d) Failover -- when the failed channel is merged in (mergeIn), it will try to claim
'S' messages.
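For illustration, a minimal sketch (in Java) of the STATE values and of who performs each
transition, assuming the flow is exactly as listed above. The class and constant names are
made up for the sketch; only the values 'C' and 'S' and the JBM_MSG_REF table come from the
proposal.

// Hypothetical summary of the STATE values and the transitions described above.
public final class MessageRefState
{
   public static final char NORMAL = 'C';       // ordinary reference in JBM_MSG_REF
   public static final char TO_BE_SUCKED = 'S'; // reference handed to the remote suck consumer

   // Flow for one message M:
   //   source node, step a): NORMAL -> TO_BE_SUCKED, then M is delivered to the MessageSucker
   //   target node, step b): M is acked, moved to the target channel, TO_BE_SUCKED -> NORMAL
   //   source node, step c): on receiving the ack, M is simply forgotten (no DB work)
   //   failover,    step d): mergeIn claims any reference still in TO_BE_SUCKED
   private MessageRefState()
   {
   }
}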
Failure handling
The sucking process will have to handle the following failures. Suppose we have a message
M that is just in the process of being sucked. 'Source node' refers to the node
where the source channel (from which M is sucked) resides. 'Target node' refers to
the node where the target channel (to which M is sucked) resides.
I. Source node crash. Handling differs depending on where M is at the moment of the
crash.
I.a. M is being processed at the source node.
Normal processing steps: M's state is changed from 'C' to 'S', and M is then sent for
delivery to the target node (the sucker).
I.a.1 M's state is still 'C'. That means M is still a normal message; the sucking
process has had no effect on it.
I.a.2 M's state has been changed to 'S'. M will be merged to the backup node on
server failover and will then be loaded as a normal message for delivery.
I.b. M is being processed at the target node.
Normal processing steps: M is first acked and then sent to the local queue (M is
updated to the target channel and its state changed from 'S' to 'C').
I.b.1 M's state is 'S' and it hasn't been acked yet, so the ack will fail. M remains
in 'S' with the source channel, and the target node will never get M sucked. M will
be merged to another channel for redelivery.
I.b.2 M's state is 'S' and it has been acked successfully. M will go on to be sent to
the local queue. However, there is contention between this sending and failover:
failover will try to merge any M that is in the 'S' state of the source channel, while
the sending will try to update M to the target channel. We guarantee that only one of
the two actions succeeds, i.e. M will either be merged to another channel or be sucked
to the target channel (one way to enforce this with a conditional update is sketched
below).
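A minimal sketch of how such an "only one wins" guarantee could look, assuming a
conditional UPDATE keyed on the current STATE value. The class and method names and the
exact columns other than JBM_MSG_REF and STATE are assumptions, not the actual JBM schema
or code.

// Hypothetical sketch: the suck completion and the failover merge both race to
// update the same JBM_MSG_REF row; the WHERE clause lets only one of them match.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SuckStateUpdater
{
   /**
    * Try to move a sucked reference from the source channel to the target channel,
    * but only if it is still in the 'S' state. Returns true if this caller won the
    * race, false if the competing action (e.g. the failover merge) already claimed it.
    */
   public boolean claimForTargetChannel(Connection conn, long messageId,
                                        long sourceChannelId, long targetChannelId)
      throws SQLException
   {
      String sql = "UPDATE JBM_MSG_REF SET CHANNEL_ID = ?, STATE = 'C' " +
                   "WHERE MESSAGE_ID = ? AND CHANNEL_ID = ? AND STATE = 'S'";
      try (PreparedStatement ps = conn.prepareStatement(sql))
      {
         ps.setLong(1, targetChannelId);
         ps.setLong(2, messageId);
         ps.setLong(3, sourceChannelId);
         // At most one of the racing updates can match the WHERE clause,
         // so the update count tells us whether we won.
         return ps.executeUpdate() == 1;
      }
   }
}

The failover merge would use the same kind of guarded UPDATE with its own target
CHANNEL_ID, so whichever statement runs second simply matches no rows.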
II. Target node crash. Again, handling differs depending on where M is at the moment
of the crash.
II.a. M is being processed at the source node.
M will be processed as normal; only the delivery will fail because the target node is
down. However, the connection failure will come in and close the sessions, and in turn
the sessions will cancel all existing deliveries, including M. When M is canceled, its
state will also be changed back to 'C' if it is already in the 'S' state.
II.b. M is being processed at the target node.
II.b.1 M hasn't been acked yet. It will eventually be canceled at the source node for
redelivery.
II.b.2 M has been acked successfully but hasn't been sent to the local queue yet. When
the source node eventually gets the connection failure notification, it will close the
related session, and as part of that close M will be reclaimed from the DB and put back
on the channel for redelivery (a sketch of this reaction is given below).
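A rough sketch of the source-node reaction to a target-node crash, assuming a listener
that closes the affected sessions. All the type and method names here are illustrative,
not the actual JBM classes; they only mirror the behaviour described in cases II.a and
II.b above.

// Hypothetical connection-failure handling on the source node: closing a session
// cancels its outstanding deliveries ('S' -> 'C', cases II.a / II.b.1) and reclaims
// references that were acked but never reached the target's local queue (case II.b.2).
import java.util.List;

public class SuckConnectionFailureListener
{
   interface SuckSession
   {
      void cancelOutstandingDeliveries();    // flips undelivered refs back from 'S' to 'C'
      List<Long> reclaimAckedButUnsent();    // pulls 'S' refs for this session back from the DB
      void redeliver(List<Long> messageIds); // puts reclaimed refs back on the source channel
      void close();
   }

   public void onConnectionFailure(List<SuckSession> sessionsForDeadNode)
   {
      for (SuckSession session : sessionsForDeadNode)
      {
         session.cancelOutstandingDeliveries();
         session.redeliver(session.reclaimAckedButUnsent());
         session.close();
      }
   }
}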
III. Cases that involve crashes of both source and target nodes
Depending on the order and timing of the two crashes, some cases are the same as the
one-crash cases above. We only need to consider the case where M is left in a specific
state, 'C' or 'S', and none of the above handlings has succeeded.
If M is in the 'C' state after both nodes have crashed, M is either with the source
channel or with the target channel. In either case M is a normal message, so we don't
need to worry about it.
If M is in the 'S' state after both nodes have crashed, M is in the middle of the
sucking process and we cannot know how far the process had gone before the crashes. It
is the responsibility of the source node to reclaim it to its source channel at startup
and then redeliver it as a normal message (a sketch of such a startup reclaim is given
below). If the crashed source server is failed over, it is the failover server that
reclaims M.
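A short sketch of what that startup reclaim could look like, again assuming direct
access to JBM_MSG_REF; the class and method names are hypothetical.

// Hypothetical startup reclaim: flip every reference that an interrupted suck left
// in 'S' back to 'C' for a locally hosted (or failed-over) channel, so that it is
// loaded as a normal message.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class StartupSuckReclaim
{
   /** Returns the number of references reclaimed for the given channel. */
   public int reclaim(Connection conn, long channelId) throws SQLException
   {
      try (PreparedStatement ps = conn.prepareStatement(
              "UPDATE JBM_MSG_REF SET STATE = 'C' WHERE STATE = 'S' AND CHANNEL_ID = ?"))
      {
         ps.setLong(1, channelId);
         return ps.executeUpdate();
      }
   }
}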
IV. Cluster connection failures
Sometimes both servers are alive but the network linking them becomes problematic.
Whenever that happens, each node behaves as if the other has failed. We guarantee that
only one node gets M and processes it successfully. For example, while the source node
session is closing and canceling M's delivery, the target node may send in the ack of
M. These two actions are synchronized so that M is either canceled back to the source
channel (resulting in an ack failure) or acked before the cancel (a sketch of this
synchronization is given below).
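A minimal sketch of that cancel/ack synchronization, using a per-delivery lock. The class
and method names are made up for the illustration; only the "exactly one of the two wins"
behaviour comes from the description above.

// Hypothetical per-delivery synchronization: cancel (driven by the source node treating
// the link as dead) and ack (arriving late from the target node) are serialized, so
// exactly one of them takes effect.
public class SuckDelivery
{
   private final Object lock = new Object();
   private boolean cancelled;
   private boolean acked;

   /** Called while the source session is closing; wins only if the ack has not arrived. */
   public boolean cancel()
   {
      synchronized (lock)
      {
         if (acked)
         {
            return false; // the ack got there first, nothing to cancel
         }
         cancelled = true; // the reference goes back to the source channel ('S' -> 'C')
         return true;
      }
   }

   /** Called when the target node's ack arrives; fails if the delivery was already cancelled. */
   public boolean ack()
   {
      synchronized (lock)
      {
         if (cancelled)
         {
            return false; // reported back to the target node as an ack failure
         }
         acked = true; // the reference will simply be forgotten at the source (step c)
         return true;
      }
   }
}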
Conclusion
So, for the issue specific to this JIRA report: when the send is broken, the message
will be reclaimed by the source node (case II.b.2) for redelivery.
MessageSucker failures cause the delivery of the failed message to stall
------------------------------------------------------------------------
Key: JBMESSAGING-1822
URL: https://jira.jboss.org/browse/JBMESSAGING-1822
Project: JBoss Messaging
Issue Type: Bug
Components: Messaging Core
Affects Versions: 1.4.6.GA
Reporter: david.boeren
Assignee: Yong Hao Gao
Fix For: Unscheduled
Attachments: helloworld.zip
The MessageSucker is responsible for migrating messages between different members of a
cluster; it is a consumer of the remote queue from which it receives messages destined
for the queue on the local cluster member.
The onMessage routine, at its most basic, does the following:
- bookkeeping for the incoming message, including expiry
- acknowledge the incoming message
- attempt to deliver to the local queue
When the delivery fails, the result is the *appearance* of lost messages. Those messages
which are processed during the failure are not redelivered, but they still exist in the
database.
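The ordering described above is the crux of the problem. A rough sketch (not the actual
org.jboss.messaging code; the interfaces and names below are invented for illustration)
of an onMessage that acks before it delivers locally:

// Hypothetical illustration of the ack-before-deliver ordering: an exception thrown
// by the local send leaves the message acked on the remote node but never enqueued
// locally, so it sits in the database and is not redelivered.
public class MessageSuckerSketch
{
   interface SuckedMessage
   {
      boolean isExpired(); // bookkeeping: expiry check
      void acknowledge();  // acks against the remote (source) queue
   }

   interface LocalQueue
   {
      void send(SuckedMessage message); // may throw, e.g. the injected IllegalStateException
   }

   private final LocalQueue localQueue;

   public MessageSuckerSketch(LocalQueue localQueue)
   {
      this.localQueue = localQueue;
   }

   public void onMessage(SuckedMessage message)
   {
      if (message.isExpired())
      {
         return;                // bookkeeping for the incoming message
      }
      message.acknowledge();    // acknowledge the incoming message
      localQueue.send(message); // attempt local delivery -- a failure here strands the message
   }
}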
The only way I have found to trigger the redelivery of those messages is to redeploy the
queue containing the messages and/or restart that app server. Obviously neither approach
is acceptable.
In order to trigger the error I created a SOA cluster which *only* shared the JMS
database, and nothing else. I modified the helloworld quickstart to display a counter
of messages consumed, clustered the *esb* queue, and then used byteman to trigger the
faults. The byteman rule is as follows; the quickstart is attached.
RULE throw every fifth send
INTERFACE ProducerDelegate
METHOD send
AT ENTRY
IF callerEquals("MessageSucker.onMessage", true) &&
(incrementCounter("throwException") % 5 == 0)
DO THROW new IllegalStateException("Deliberate exception")
ENDRULE
This results in an exception being thrown for every fifth message. Once the delivery has
quiesced, examine the JBM_MSG and JBM_MSG_REF tables to see the messages which have not
been delivered.
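For reference, a minimal sketch of that check, counting the rows left in JBM_MSG once
delivery has quiesced. The JDBC URL and credentials are placeholders for the shared JMS
database; only the table name comes from this report.

// Hypothetical helper: counts the messages still sitting in JBM_MSG.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class CountStrandedMessages
{
   public static void main(String[] args) throws SQLException
   {
      // args: <jdbc-url> <user> <password> for the shared JMS database
      try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2]);
           Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM JBM_MSG"))
      {
         rs.next();
         System.out.println("Messages still in JBM_MSG: " + rs.getLong(1));
      }
   }
}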
The cluster members are ports-default and ports-01; the client seeds the gateway by
sending 300 messages to the default.
Adding up the counter from each server *plus* the message count from JBM_MSG results in
300 (or multiples thereof for more executions).