[jboss-dev-forums] [Design of Messaging on JBoss (Messaging/JBoss)] - Re: Client failover redeliveries discussion

ovidiu.feodorov@jboss.com do-not-reply at jboss.com
Tue Oct 24 20:28:29 EDT 2006


Clebert wrote : 
  | If the consumer receives a message from CallBack but if it didn't send an ACK yet, after the failover, the server not knowing the message might throw an exception (messageId not known).
  | 
  | There are a couple of use cases we have to consider.
  | - Persistent Messages. (how to treat a redelivery).
  | - Should we send the list of previously ACKs to the server?
  | - Should we ignore ACKs for non existent messages on the server?
  | 


Let's systematize the cases we need to deal with on fail-over a bit. In order to do that, we need to understand what is maintained as transitory state (in-flight messages and acknowledgments) by the client-side delegate hierarchy.

A client may happen to be sending a message when the failure occurs. If the message is sent individually (not in the context of a transaction), then the on-going synchronous invocation is going to fail, the client code will catch the exception, and most likely will retry sending the message, hopefully over a correctly failed-over connection. So there is nothing else to do here at this stage. (There may be something, though: for a truly seamless client fail-over, we need to handle this situation and retry sending the message ourselves, without the client code noticing anything unusual, except maybe a send call that takes longer than usual.)
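
Just to make this concrete, here is a rough sketch of what such a transparent retry could look like on the client side. All the collaborator names (ProducerDelegate, FailoverDetector and their methods) are made up for illustration; this is not existing API:

import javax.jms.JMSException;
import javax.jms.Message;

// Illustrative sketch only: hide the send failure from client code by retrying
// over the failed-over connection. The collaborators below are hypothetical
// placeholders, not existing JBoss Messaging API.
public class RetryingProducer
{
   // hypothetical collaborators, declared here only so the sketch compiles
   public interface ProducerDelegate
   {
      void send(Message m) throws JMSException;
   }

   public interface FailoverDetector
   {
      boolean isServerFailure(JMSException e);
      void waitForFailoverCompletion();
   }

   private final ProducerDelegate delegate;
   private final FailoverDetector failoverDetector;

   public RetryingProducer(ProducerDelegate delegate, FailoverDetector failoverDetector)
   {
      this.delegate = delegate;
      this.failoverDetector = failoverDetector;
   }

   public void send(Message m) throws JMSException
   {
      while (true)
      {
         try
         {
            delegate.send(m);      // synchronous invocation to the server
            return;
         }
         catch (JMSException e)
         {
            if (!failoverDetector.isServerFailure(e))
            {
               throw e;            // a genuine error, let the client code see it
            }
            // block until the client-side fail-over completes, then retry the
            // send; to the client code this just looks like a slower send() call
            failoverDetector.waitForFailoverCompletion();
         }
      }
   }
}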

If the client happens to send a message in the context of a transaction when the failure occurs, we could either throw an exception and discard everything, or go for the more elegant solution of transparently copying the transactional state (the corresponding TxState instance) into the new ResourceManager and sending the messages over the new connection when the transaction commits. We're probably not doing this right now, but this is what we should be doing.
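
Something along these lines (simplified stand-in types, not the real ResourceManager/TxState classes, just the idea of copying the pending state across):

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: on fail-over, copy the pending transactional state
// (the TxState instances) from the old ResourceManager into the new one, so the
// buffered sends/acks are replayed over the new connection at commit time.
// The nested types are simplified stand-ins, not the real JBoss Messaging classes.
public class TxFailoverSketch
{
   public static class TxState
   {
      // messages sent and acknowledgments accumulated in this transaction,
      // not yet committed on the server
   }

   public static class ResourceManager
   {
      private final Map<Object, TxState> transactions = new HashMap<Object, TxState>();

      public Map<Object, TxState> getTransactions()
      {
         return transactions;
      }
   }

   // called during client-side fail-over, before the first post-failover commit
   public void transferTransactionalState(ResourceManager oldRM, ResourceManager newRM)
   {
      newRM.getTransactions().putAll(oldRM.getTransactions());
      oldRM.getTransactions().clear();
   }
}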

It is also interesting to consider what happens when the failure occurs right in the middle of a "sendTransaction()" invocation.

These are examples of use cases we need to deal with where sending and fail-over are concerned. Each such use case must be captured in its own test case. I will start working towards that.

The series of use cases to be considered continues with situations in which messages being received are affected by failure.

Messages arrive at the MessageCallbackHandler instance, which is a consumer delegate accessory (there is a one-to-one relationship between MessageCallbackHandler instances and consumer delegate instances). At this stage, it really doesn't make any difference whether they are persistent or non-persistent messages; they are all treated the same.

If the failure occurs while a message is being sent to the MessageCallbackHandler, the in-flight message instance is simply lost (there will be some error log statements on the server side if it's only a network failure, or nothing at all if the server box goes down). For a non-persistent message, this is the expected outcome in case of failure, and a persistent message will eventually be recovered from the database, so everything will end up fine.

Nothing to do here from a client-failover perspective.

What is more interesting is what happens with the messages that are already in the MessageCallbackHandler's buffer.

For a seamless fail-over, they will need to be transferred into the new MessageCallbackHandler's buffer. Also important: immediately after the failover condition is detected, any in-progress read should be completed, and no further reads should be accepted until the client-side fail-over is complete ("client-side failover lockdown"). The first post-failover read should be done from the new MessageCallbackHandler's buffer.
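
Roughly, what I mean by "client-side failover lockdown" around the consumer buffer, with made-up names (this is not the real MessageCallbackHandler, just the idea):

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Illustrative sketch only: buffer transfer plus the read lockdown during fail-over.
public class BufferedConsumer
{
   private final LinkedList<Object> buffer = new LinkedList<Object>();
   private boolean failingOver;

   // called when a message arrives from the server
   public synchronized void addToBuffer(Object message)
   {
      buffer.addLast(message);
      notifyAll();
   }

   // post-failover reads block until the lockdown is lifted
   public synchronized Object receive() throws InterruptedException
   {
      while (failingOver || buffer.isEmpty())
      {
         wait();
      }
      return buffer.removeFirst();
   }

   // called as soon as the failover condition is detected
   public synchronized void startFailover()
   {
      failingOver = true;
   }

   // the new handler adopts whatever was buffered on the old one
   public synchronized List<Object> drainBuffer()
   {
      List<Object> salvaged = new ArrayList<Object>(buffer);
      buffer.clear();
      return salvaged;
   }

   // called on the new handler once the client-side fail-over is complete
   public synchronized void completeFailover(List<Object> salvaged)
   {
      buffer.addAll(salvaged);
      failingOver = false;
      notifyAll();
   }
}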

Contrary to what has been discussed so far on this thread, I think we can also salvage non-persistent messages, with a minimum of effort. I'll address this issue again later. The acknowledgments for these messages (persistent and non-persistent) will be sent by the new connection delegate.

We also have the acknowledgments accumulated in a transaction on the client side. This case should be dealt with similarly to the way we handle transacted messages (copy the TxState instance).

Again, for all these cases we should have tests.

Clebert wrote : 
  | Second point also:
  | What to do when a durable subscriber gets the queue refilled?
  | - The client will probably receive the message again. I would just ignore redeliveries.
  | 

Tim wrote : 
  | Yes - we should send the ids of every persistent message as part of the failover protocol - the server then repopulates the delivery list in the server consumer endpoint
  | 

This will help us avoid the situation where a "failed-over" message is consumed on the client side, the consumer delegate sends the acknowledgment, and the new server consumer endpoint doesn't know what the client is talking about, because there is no active delivery for that message. I think we can go a step further and also send the IDs of non-persistent messages that have been "failed-over" on the client side. This way, the client will continue to receive (and successfully acknowledge) non-persistent messages that would otherwise have been lost.
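
On the client side, this step of the failover protocol could look roughly like this (recreateDeliveries() and the surrounding names are hypothetical, just to illustrate the flow):

import java.util.List;
import java.util.Map;

// Illustrative sketch only: for every consumer that survived fail-over, report
// the ids of the messages still sitting in its buffer (persistent and
// non-persistent alike) to the new server, so it can recreate the deliveries.
public class ClientFailoverProtocolSketch
{
   // hypothetical view of the new connection, declared here so the sketch compiles
   public interface FailoverConnection
   {
      void recreateDeliveries(String consumerId, List<Long> bufferedMessageIds);
   }

   public void reportBufferedMessages(FailoverConnection newConnection,
                                      Map<String, List<Long>> bufferedIdsPerConsumer)
   {
      for (Map.Entry<String, List<Long>> entry : bufferedIdsPerConsumer.entrySet())
      {
         // the server uses these ids to repopulate the ServerConsumerEndpoint
         // delivery list, so the subsequent acknowledgments will be recognized
         newConnection.recreateDeliveries(entry.getKey(), entry.getValue());
      }
   }
}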

Tim wrote : 
  | Clebert wrote : 
  |   | - Should we ignore ACKs for non existent messages on the server?
  |   | 
  | Non existent messages on the server will be non persistent messages that didn't survive the failover.
  | They should be removed from the client side list on failover so the acks will never get sent.
  | 

Not necessarily. See my comment above. We could also include the IDs of non-persistent messages in the list of message IDs sent to the server as part of the failover protocol, and thus be able to "salvage" those messages as well. I don't see any problem if we do that. We get better fault tolerance.


Clebert wrote : 
  | failover receives a new ClientConnectionDelegate as the parameter. The idea is to get a new connection, but keep the actual delegates we are using.
  | 

We need to make sure that the "new" connection is then properly discarded so it will eventually be garbage collected. A minor thing anyway, just an observation.

Clebert wrote : 
  | Creating a new connection on the new server will create a new server Object, consequently a new ServerId.
  | 

You probably mean that creating a new connection will subsequently create a new server-side connection endpoint instance, and, because we're connecting to a different server node, we need to use its serverID instead of the "dead" node's serverID, right?


Clebert wrote : 
  | Tim wrote : 
  |   | Clebert wrote : 
  |   |   | Second point also:
  |   |   | What to do when a durable subscriber gets the queue refilled?
  |   |   | - The client will probably receive the message again. I would just ignore redeliveries.
  |   |   | 
  |   | I don't understand the issue here. Can you explain more?
  |   | 
  | 
  | The consumer is going to be recreated the same way the connection on the previous example. (creating a new consumer / replacing the IDs on the old objects).
  | 
  | On that case, the new server will think it's a new client connecting and it will resend non committed messages from a durable subscription. (So I hope).
  | 
  | On that case, I'm considering to ignore message previously sent, and on the list of CurrentTransaction.ACKs(). I just want to know if this is everybody's expected behavior.
  | 

What about the "fail-over protocol"? Your statement above seems to assume that the new server node is called into without any "preparation", just as a completely new client creating new connection, session and consumer endpoints would do. This is not going to work; those server-side objects need to undergo a "post-failover" preparation phase, where deliveries for the client-side failed-over messages are created, and so forth.
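
To illustrate, the server-side "post-failover" preparation phase could look roughly like this, with simplified stand-ins for the channel and the consumer endpoint (not the real classes):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: for each message id reported by the failed-over
// client, pull the corresponding reference out of the channel and register it
// as an active delivery on the recreated consumer endpoint, so that later
// acknowledgments find a matching delivery.
public class ServerFailoverPreparationSketch
{
   public static class Channel
   {
      private final Map<Long, Object> references = new HashMap<Long, Object>();

      public void addReference(Long messageId, Object ref) { references.put(messageId, ref); }
      public Object removeReference(Long messageId)        { return references.remove(messageId); }
   }

   public static class ServerConsumerEndpoint
   {
      private final Map<Long, Object> deliveries = new HashMap<Long, Object>();

      public void addDelivery(Long messageId, Object ref)  { deliveries.put(messageId, ref); }
   }

   public void prepareEndpoint(ServerConsumerEndpoint endpoint, Channel channel,
                               List<Long> clientBufferedIds)
   {
      for (Long messageId : clientBufferedIds)
      {
         Object ref = channel.removeReference(messageId);
         if (ref != null)
         {
            endpoint.addDelivery(messageId, ref);
         }
         // for a non-persistent message this node may have no reference; whether
         // we still record the id (to salvage the message, as argued above) or
         // drop it is the open question in this thread
      }
   }
}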

Tim wrote : 
  | So to summarise:
  | 1) Detect failover
  | 2) Find the "correct" failover server. (This may take several hops)
  | 3) Let the server "stall" you until server failover has completed
  | 4) Recreate the connections, sessions, consumers, producers and browsers. (Swapping ids here sounds fine)
  | 5) Delete any non persistent messages from the client list of unacked messages in any sessions in the failed connection.
  | 6) Send a list of the ids of the persistent messages for each consumer that failed to the server. For each list recreate the ServerConsumerEndpoint delivery list by removing the refs from the channel and creating deliveries and putting in the list.
  | 7) The connection is now ready to be used 
  | 

Clebert wrote : 
  | At this point I'm not detecting the failure at remoting level yet.
  | 

At this point, given the way the Remoting discussions are evolving, we can safely assume we'll use Remoting's connection failure detection facilities to detect our connection failures.
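
Assuming we indeed go with Remoting's ConnectionListener facility, the hookup would look roughly like this (the FailoverStarter collaborator is a placeholder, not existing code):

import org.jboss.remoting.Client;
import org.jboss.remoting.ConnectionListener;

// Sketch only: let Remoting tell us the connection (lease) to the server is gone,
// and use that to kick off the client-side fail-over sequence.
public class ConnectionFailureListener implements ConnectionListener
{
   // hypothetical collaborator that coordinates the client-side failover
   public interface FailoverStarter
   {
      void connectionFailed(Throwable cause);
   }

   private final FailoverStarter failoverStarter;

   public ConnectionFailureListener(FailoverStarter failoverStarter)
   {
      this.failoverStarter = failoverStarter;
   }

   public void handleConnectionException(Throwable throwable, Client client)
   {
      // Remoting detected the failure; start the fail-over procedure
      failoverStarter.connectionFailed(throwable);
   }
}

// registration, somewhere in the connection setup code:
//    remotingClient.addConnectionListener(new ConnectionFailureListener(starter));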

Tim wrote : 
  | So to summarise:
  | ...
  | 3) Let the server "stall" you until server failover has completed
  | ...
  | 

What exactly does this mean?

Tim wrote : 
  | So to summarise:
  | ...
  | 5) Delete any non persistent messages from the client list of unacked messages in any sessions in the failed connection.
  | ...
  | 

Why? See my comment above. Why do you think "salvaging" non-persistent messages too isn't going to work?

Tim wrote : 
  | So to summarise:
  | ...
  | 6) Send a list of the ids of the persistent messages for each consumer that failed to the server. For each list recreate the ServerConsumerEndpoint delivery list by removing the refs from the channel and creating deliveries and putting in the list.
  | ...
  | 

Non-persistent message ids too.

We also need to ensure that the client code is safely prevented from using the connection(s) in a transitory state during the fail-over. This would be the "client-side failover lockdown" ....


Tim wrote : 
  | Note: We must also ensure that no new connections are created on the failover node while old connections are being recreated, otherwise we can have a situation where the new connections grab the messages which have already been delivered to consumers in the failed connection!
  | 

... and this would be the corresponding "server-side failover lockdown".
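
Conceptually, something as simple as a gate around connection creation would do (made-up names, just to illustrate the server-side lockdown):

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only: while failed-over connections are being recreated,
// brand-new client connections are held at the door so they cannot grab messages
// that already belong to the failed-over consumers.
public class FailoverGate
{
   private final ReadWriteLock gate = new ReentrantReadWriteLock();

   // called at the start / end of the server-side failover procedure
   public void lockdown() { gate.writeLock().lock(); }
   public void release()  { gate.writeLock().unlock(); }

   // called on the ordinary "create connection" path; the recreation of
   // failed-over connections would bypass this and go through the failover path
   public void waitUntilFailoverComplete()
   {
      gate.readLock().lock();
      gate.readLock().unlock();
   }
}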

Clebert wrote : 
  | Tim wrote : 
  |   | I don't see why anything would be received twice (apart from non persistent
  |   | messages since we lose the acks for them but that's fine)
  |   | 
  | There are some scenarios I'm thinking about, all of them under incomplete transactions.
  | 
  | I will create few testcases and will come up with them to the discussion later.
  | 

I have started a list of scenarios at the beginning of this post. The way we should deal with this situation is to create a test case for each of the use cases we are going to support. This way we make sure that if we refactor the code later, it still behaves the way we originally wanted it to. Ideally, no new functionality should be added without a test case for it.
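
As a starting point, the tests could be sketched along these lines (skeletons only; the scenario set-up implied by the comments is hypothetical):

import junit.framework.TestCase;

// Skeleton only, to illustrate "one test case per failover use case"; the test
// names follow the scenarios listed at the beginning of this post.
public class SendFailoverTest extends TestCase
{
   public void testNonTransactedSendInterruptedByFailure() throws Exception
   {
      // start a clustered pair, open a connection to the primary node, kill the
      // primary while a send() is in flight, and assert that the message ends up
      // on the failover node exactly once
   }

   public void testTransactedSendInterruptedByFailure() throws Exception
   {
      // same, but inside a transacted session: the TxState must be carried over
      // and the commit must happen on the failed-over connection
   }

   public void testBufferedMessagesSurviveFailover() throws Exception
   {
      // messages already sitting in the MessageCallbackHandler buffer must be
      // consumable and acknowledgeable after the client-side failover completes
   }
}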

Clebert wrote : 
  | Every time a new serverObject is created [...]
  | 

Every time a new server-side endpoint is created ... :)







View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3980578#3980578

Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=3980578



More information about the jboss-dev-forums mailing list