Following our meeting today with Bela, Brian, Clebert, Ovidiu and myself, here are my
thoughts on what we need to do:
We need to maintain a cluster-wide consistent mapping of node -> List<FailoverInfo>,
where FailoverInfo contains the JGroups address of the failover node (actually the async
stack) and the remoting invoker locator.
Whether the actual remoting invoker locator is used depends on whether we use remoting,
but in any case it is some kind of address the client can use to make a connection to a
server.
i.e.
| public class FailoverInfo
| {
|    Address address;        // JGroups address of the failover node (async stack)
|    InvokerLocator locator; // address the client uses to connect to the server
| }
|
We need to maintain a list per node, since any particular node can have one or more failover
nodes.
We should look at using the DistributedReplicantManager code from JBoss AS clustering,
which Brian has separated into its own jar, for actually maintaining this state across
the cluster.
This has nice functionality such as being able to register listeners to respond to
changes.
If we can't use this then it's not a big deal just to use JGroups state directly
as we already do for managing the binding state.
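For example, registering our failover map with it might look something like this. This is
only a sketch: I'm assuming the JBoss AS DistributedReplicantManager/ReplicantListener API,
and the standalone jar may differ; the key name is made up.
| // Sketch: assumes the JBoss AS DistributedReplicantManager API
| // (org.jboss.ha.framework.interfaces); details may differ in the
| // standalone jar Brian extracted.
| drm.add("FAILOVER_MAP", failoverInfoList); // register this node's List<FailoverInfo>
|
| drm.registerListener("FAILOVER_MAP", new ReplicantListener()
| {
|    public void replicantsChanged(String key, List newReplicants, int newReplicantsViewId)
|    {
|       // Recompute the node -> List<FailoverInfo> mapping and push the
|       // change to connected clients over the control channel.
|    }
| });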
When a node joins the group we should generate the failover info list for that node before
registering it with the group.
A simple algorithm would be to consider the JGroups view list as a circular buffer and
choose a node's failover nodes as the next entries after it in the view list. We would
have to be careful not to choose addresses on the current node as failovers. We should
make this pluggable.
(Question: Given a JGroups address how do we know if it is on the same physical box as
another JGroups address? If the machine has multiple NICs this may be tricky. Maybe we
need to propagate some other kind of machine id in the state too? Or we could propagate a
list of NICs per box in the state.)
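As a rough sketch of the pluggable policy (interface and class names are just illustrative,
not final):
| import java.util.ArrayList;
| import java.util.List;
|
| import org.jgroups.Address;
|
| // Hypothetical pluggable interface - names are illustrative.
| public interface FailoverNodePolicy
| {
|    List<Address> chooseFailoverNodes(List<Address> view, Address node, int maxFailovers);
| }
|
| // Default policy: treat the view as a circular buffer and pick the next
| // entries after the given node, skipping the node itself.
| class CircularFailoverNodePolicy implements FailoverNodePolicy
| {
|    public List<Address> chooseFailoverNodes(List<Address> view, Address node, int maxFailovers)
|    {
|       List<Address> failovers = new ArrayList<Address>();
|       int start = view.indexOf(node);
|       for (int i = 1; i < view.size() && failovers.size() < maxFailovers; i++)
|       {
|          Address candidate = view.get((start + i) % view.size());
|          if (!candidate.equals(node)) // TODO: also skip addresses on the same physical box
|          {
|             failovers.add(candidate);
|          }
|       }
|       return failovers;
|    }
| }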
The JGroups address would be needed when doing in-memory message replication, i.e. for
replicating messages from one server node to another.
When changes in the failover map occur due to nodes joining or leaving the group, this needs
to be propagated to clients, and can probably be done on some kind of special control
channel that we can multiplex on top of the transport assuming we use our own multiplexing
transport.
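On the client side this could surface as a simple callback (a hypothetical interface, just
to illustrate the shape of it):
| import java.util.List;
| import java.util.Map;
|
| // Hypothetical client-side callback, invoked when the server pushes a new
| // failover mapping over the control channel. Names are illustrative.
| public interface FailoverMapListener
| {
|    void failoverMapChanged(Map<String, List<FailoverInfo>> newMapping); // key = node id
| }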
When a failover occurs, the first failover node detects the failure of the node by a change
in view and then takes over responsibility for the failed node's partial queues.
If there are any persistent messages to load, they are then loaded from storage. Loading
from storage won't be necessary if in-memory message replication is being used, since
the messages will already be in memory on the failover node.
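In JGroups terms the detection could look roughly like this (the takeover helpers and the
previousView field are hypothetical):
| import java.util.Iterator;
|
| import org.jgroups.Address;
| import org.jgroups.View;
|
| // Sketch: detect failed members on a view change. previousView is an
| // assumed field holding the last view; isFirstFailoverNodeFor() and
| // takeOverPartialQueues() are hypothetical helpers.
| public void viewAccepted(View newView)
| {
|    for (Iterator i = previousView.getMembers().iterator(); i.hasNext(); )
|    {
|       Address member = (Address)i.next();
|       if (!newView.containsMember(member) && isFirstFailoverNodeFor(member))
|       {
|          // Take over the failed node's partial queues; load persistent
|          // messages from storage unless in-memory replication already
|          // has them.
|          takeOverPartialQueues(member);
|       }
|    }
|    previousView = newView;
| }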
Around the same time, the client will receive a connection exception in the connection
listener and assume the server has failed (should we retry on the current server first in
case the failure was transitory and the server hasn't really failed? E.g. if someone
temporarily disconnected the network cable).
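With JBoss Remoting this would hang off the connection listener, something like the
following (the retry helpers are made-up names):
| import org.jboss.remoting.Client;
| import org.jboss.remoting.ConnectionListener;
|
| // Sketch: assumes JBoss Remoting's ConnectionListener. The helper
| // methods are hypothetical.
| public class FailureListener implements ConnectionListener
| {
|    public void handleConnectionException(Throwable throwable, Client client)
|    {
|       // Optionally retry the current server first, in case the failure
|       // was transitory (e.g. a briefly unplugged network cable).
|       if (!retrySucceeds(client))
|       {
|          startClientSideFailover(client);
|       }
|    }
| }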
If the client determines the server really has failed, then it tries another server based
on its client side load balancing policy. This server may not be the correct failover
server for the failed node due to difficulties in synchronizing the client and server side
failover mapping.
In this case the server tells the client the locator of the correct server, and the client
tries to connect there. This process continues until the client finds the correct server
or a maximum number of tries is reached at which point it will give up.
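The client-side loop might look roughly like this (all names hypothetical):
| // Sketch: follow server redirects until we reach the correct failover
| // node or give up. All names are hypothetical.
| InvokerLocator target = loadBalancingPolicy.chooseNext();
| for (int attempt = 0; attempt < maxFailoverAttempts; attempt++)
| {
|    ConnectResponse response = tryConnect(target, failedNodeID);
|    if (response.isCorrectFailoverNode())
|    {
|       return response.getConnection();
|    }
|    target = response.getCorrectLocator(); // server redirects us to the right node
| }
| throw new MessagingException("Failover failed after " + maxFailoverAttempts + " attempts");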
When the client connects to the failover server the server failover process (reloading of
queues) may not have completed yet. In this case the server will block the invocation
until the failover is complete. (I.e. it won't send a response to the connect until it
is complete).
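Server side, this blocking could be as simple as a latch per failed node (sketch; the latch
map is assumed):
| import java.util.concurrent.CountDownLatch;
|
| // Sketch: the connect invocation blocks until failover (queue reloading)
| // for the failed node has completed. failoverLatches is an assumed map,
| // each latch counted down when the takeover finishes.
| public void waitForFailover(String failedNodeID) throws InterruptedException
| {
|    CountDownLatch latch = (CountDownLatch)failoverLatches.get(failedNodeID);
|    if (latch != null)
|    {
|       latch.await(); // the connect response is not sent until this returns
|    }
| }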
Once the client has successfully connected to the correct server it then recreates any
session, consumer, producer and browser objects that existed before failure for the failed
connection.
It then sends a list of <message_id, destination> pairs corresponding to unacked messages
in each of the sessions on the connection that failed. Based on this, the server recreates
the delivery list in the server consumer delegates.
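Roughly (hypothetical names again):
| import java.util.ArrayList;
| import java.util.List;
|
| // Sketch: gather <message_id, destination> pairs for all unacked messages
| // on the failed connection and send them to the failover server, which
| // rebuilds the delivery lists from them. All names are hypothetical.
| List<AckInfo> unacked = new ArrayList<AckInfo>();
| for (ClientSession session : failedConnection.getSessions())
| {
|    for (Delivery d : session.getUnackedDeliveries())
|    {
|       unacked.add(new AckInfo(d.getMessageID(), d.getDestination()));
|    }
| }
| newConnection.sendUnackedList(unacked);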
If the failed node is subsequently resurrected, then it is not such a simple matter to
just move the connections back to the original node since there may be unacknowledged
messages in live sessions. If we move the connections then any non-persistent messages
might get redelivered.
Therefore we can only safely move back connections if there are no unacked messages in any
sessions.
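i.e. something like (hypothetical names):
| // Sketch: only migrate connections back to the resurrected node when no
| // session on the connection holds unacked messages.
| boolean safeToMoveBack = true;
| for (ServerSession session : connection.getSessions())
| {
|    if (session.hasUnackedMessages())
|    {
|       safeToMoveBack = false;
|       break;
|    }
| }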
This is probably part of a bigger question of how we redistribute connections over many
nodes when we suddenly add a lot of nodes to the cluster.
For the first pass we should probably not bother, since this is tricky. However, if we want
to be able to automatically spread load smoothly and get benefits when adding new nodes to
a cluster with already created connections, we should consider this.
We should also consider being able to bring down a node smoothly from the management
console without losing sessions - i.e. move them transparently to another node. Again this
is not a high priority but something to think about.
Any more thoughts?
(BTW We should probably put all of this in a wiki...)