Michael Musgrove [
https://community.jboss.org/people/mmusgrov] created the discussion
"Ideas for speeding up resource recovery"
To view the discussion, visit:
https://community.jboss.org/message/806978#806978
--------------------------------------------------------------
*h2. Problem:*
For various reasons the recovery system cannot be migrated to a different host, which
means that in failure scenarios transactions cannot always be recovered in a timely manner.
Initial discussion of the problem is captured in the following JIRAs:
https://issues.jboss.org/browse/JBTM-312
https://issues.jboss.org/browse/JBTM-1359
*h2. Background:*
The Narayana project is logically comprised of a transaction manager (TM) and a recovery
manager (RcMgr). When the TM decides to commit a transaction it indelibly records details
of each participant involved in the transaction in an Object Store (ObjStore) (we have
many object store implementations but this is not particularly relevant to the problem at
hand). If the TM cannot complete the transaction, due to failures in the system,
responsibility for completing the transaction passes to the RcMgr which takes its input
from the contents of the ObjStore.
In a JTA environment we record representations of participants in the ObjStore as "XA
records" identified by XIDs which in turn embed a node id. This node id corresponds
to the TM instance that enlisted the participant into the transaction. In this way the
indelible record of the participant can be traced back to the initiating TM (but more
importantly it provides a way to pass responsibility for progressing the participant to a
particular RcMgr).
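To make the node-id embedding concrete, here is a minimal sketch of an Xid whose global transaction id carries the node id of the enlisting TM. This is illustrative only: Narayana's real XID layout and format id differ, and the `NodeXid` class and its string encoding are assumptions made for this example.

```java
import javax.transaction.xa.Xid;
import java.nio.charset.StandardCharsets;

// Illustrative Xid that embeds the node id of the TM which enlisted the
// participant, so a logged record can be traced back to its TM instance.
final class NodeXid implements Xid {
    static final int FORMAT_ID = 0x20005; // hypothetical format id
    private final byte[] gtrid;
    private final byte[] bqual;

    NodeXid(String nodeId, long sequence) {
        // Encode "<nodeId>:<sequence>" as the global transaction id.
        this.gtrid = (nodeId + ":" + sequence).getBytes(StandardCharsets.UTF_8);
        this.bqual = new byte[] { 1 };
    }

    /** Extracts the node id, letting a RcMgr decide whether a record is "ours". */
    static String nodeIdOf(Xid xid) {
        String gtrid = new String(xid.getGlobalTransactionId(), StandardCharsets.UTF_8);
        return gtrid.substring(0, gtrid.indexOf(':'));
    }

    @Override public int getFormatId() { return FORMAT_ID; }
    @Override public byte[] getGlobalTransactionId() { return gtrid.clone(); }
    @Override public byte[] getBranchQualifier() { return bqual.clone(); }
}
```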
Conventionally we run the TM and RcMgr within a JEE app server, where the default
configuration is for both to operate on the same ObjStore. In this configuration the RcMgr
will only recover ObjStore records whose node id matches the co-located TM. Generally this
scheme works fine (although separating them would increase availability) since the
recovery system has access to the same set of XA resources that the TM does, so recovery
will proceed as soon as the relevant XA resources become available. Problems may arise if:
1. we require a different recovery system to try to recover the same node ids;
2. multiple recovery systems try to recover the same node ids;
3. some XA resources become permanently inaccessible (either due to configuration
changes or because information embedded in records contains hardcoded host addresses that
no longer exist);
4. the node ids are changed;
5. the underlying object store becomes inaccessible.
In the following we primarily discuss JTS, but REST-AT is similar. We also briefly allude
to XTS and largely ignore JTA, since JTA should "just work" with the proposed
solution. The real questions are: will this approach work, does it contain any
show-stoppers, and are there any issues, XTS aside, which have not been addressed?
*h2. Solution:*
This post proposes a solution that will overcome each of these problems by grouping a set
of transaction managers into an HA cluster with each TM configured to use the same set of
datasources and a single shared object store. The recovery system runs on a single node in
the cluster and operates on the shared object store. Participants recover on the same node
as the recovery system (which in general will be a different node from the one where the
participant last failed). This approach resolves the above-mentioned problems in the
following manner:
1. uses an HA singleton capable of recovering all node ids. A simple approach is to
configure the node id list with a wildcard (i.e. all XA records become eligible for
recovery regardless of their node id). This means we do not need to monitor changes in
cluster membership. As the singleton migrates around the cluster it is responsible for
starting the recovery system on the new node (if the singleton migrates for reasons other
than node failure it is responsible for stopping the original RcMgr);
2. ensure that only one node is driving top down recovery (this will be the HA
singleton);
3. document the administrator's responsibilities, i.e. each node must be configured
with the same set of datasources (we log the JNDI name of the resource, so as long as the
JNDI names don't change we should be fine). Refer to the next section for how to
resolve the third issue regarding IP addresses changing;
4. provided the recovery system recovers all nodes (as required in 1 above) this is not an
issue;
5. use highly available storage (such as a database or a Storage Area Network) for the
ObjStore, or use data replication (if customers really want this we should defer it to a
future PRD since it is non-trivial).
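The wildcard filtering and singleton lifecycle in point 1 can be sketched as follows. The class and method names (`RecoveryNodeFilter`, `RecoverySingleton`) are hypothetical, not Narayana or WildFly API; a real HA singleton would hook into the cluster's election callbacks.

```java
import java.util.Set;

// The wildcard entry makes every logged record eligible for recovery
// regardless of which TM created it, so the singleton need not track
// cluster membership.
final class RecoveryNodeFilter {
    static final String WILDCARD = "*";
    private final Set<String> recoverableNodes;

    RecoveryNodeFilter(Set<String> recoverableNodes) {
        this.recoverableNodes = recoverableNodes;
    }

    boolean eligible(String recordNodeId) {
        return recoverableNodes.contains(WILDCARD)
                || recoverableNodes.contains(recordNodeId);
    }
}

// Lifecycle the HA singleton must honour: start recovery when elected,
// stop it cleanly when migrating away for reasons other than node failure.
final class RecoverySingleton {
    private volatile boolean recoveryRunning;

    void onElectedMaster() { recoveryRunning = true;  /* start the RcMgr here */ }
    void onResignMaster()  { recoveryRunning = false; /* stop the RcMgr cleanly */ }
    boolean isRecoveryRunning() { return recoveryRunning; }
}
```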
*h2. Problems:*
[Note that the default algorithm for constructing UIDs embeds the IP address of the host
on which the TM is running but this is fine since we only do so in order to construct
unique ids]
In JTS the recovery system and resources communicate via IORs, which embed endpoint
information referring to the IP address at which the objects can be contacted. This
creates two difficulties:
1. If a participant resource migrates to a new host then the IOR of the participant held
by the RcMgr (running as a HA singleton) becomes invalid;
2. If the RcMgr migrates to a new host then participants will no longer be able to call
back to the recovery coordinator (as occurs during bottom up recovery) since the IOR of
the RcMgr will change.
The first difficulty is resolved by bottom up recovery: when the participant is recreated
on the RcMgr node by the recovery system (note that the record that was logged at prepare
time contains the information needed to do this) it will inform the RcMgr that it has moved
to a new location (using the replay_completion call on the recovery coordinator). There is
a window where this scheme will fail, since the RcMgr may migrate after the participant has
migrated but before the participant has contacted the RcMgr. In note 2 below we mention
that since the ObjStore is globally accessible and since the RcMgr IOR is written to the
ObjStore, participants will always be able to discover the RcMgr, so it should be a
straightforward modification to the RcMgr to close this window.
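The discovery scheme relied on here, i.e. publishing the RcMgr IOR in the shared ObjStore so a migrated participant can always find the current RcMgr, could look something like this sketch. The `RecoveryEndpointRegistry` class, the file name, and the file-based layout are assumptions for illustration, not the actual ObjStore format.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch: the RcMgr publishes its current endpoint (in JTS, its IOR) to a
// well-known location in the shared ObjStore on each migration; a recovering
// participant re-reads it before calling replay_completion.
final class RecoveryEndpointRegistry {
    private final Path endpointFile;

    RecoveryEndpointRegistry(Path objStoreRoot) {
        // Well-known location in the shared ObjStore.
        this.endpointFile = objStoreRoot.resolve("RecoveryCoordinator.ref");
    }

    /** Called by the RcMgr each time the HA singleton migrates to a new node. */
    void publish(String ior) throws IOException {
        Path tmp = endpointFile.resolveSibling("RecoveryCoordinator.ref.tmp");
        Files.writeString(tmp, ior);
        // Rename-based replace so a participant never reads a half-written
        // reference (a rename within one directory is atomic on POSIX stores).
        Files.move(tmp, endpointFile, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Called by a recovering participant to locate the current RcMgr. */
    String lookup() throws IOException {
        return Files.readString(endpointFile);
    }
}
```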
The second difficulty becomes an issue if the RcMgr does not have a record of the
participant (which can happen if the TM fails before writing its prepare log but after
some of the participants have recorded their prepare decision). In this case top down
recovery cannot proceed, nor can bottom up recovery, since the IOR of the RcMgr is no
longer valid. Since the participant and RcMgr are now running on the same node it should
be a fairly non-invasive modification to the RcMgr to close this window.
*h2. Notes:*
1. With this proposed solution the master node becomes a bottleneck, since that is where
participant resources recover, but this is also a benefit since the recovery coordinator
and participants are now co-located (conventionally they all reside on different nodes),
thus speeding up recovery;
2. The IOR of the RcMgr is written to the ObjStore so each time the RcMgr migrates it must
rewrite this IOR;
3. Analysis of the effect on REST-AT is similar to the JTS case (which is expected since
REST-AT was modeled on JTS);
4. The effect of the solution on XTS needs further discussion; certainly
participant and coordinator endpoints will no longer be valid. If the solution fails for
XTS we may need to fall back on demanding IP failover to support the XTS case.
*h2. Restrictions:*
* only works in an HA cluster (not a problem since this is an EAP feature)
* JNDI names of XA resources must be duplicated on all nodes in the cluster
* the ObjStore must use storage that is accessible to all nodes and be highly available
* databases cannot be shared by Application Servers not in the cluster (see next section
for details)
*h2. Remaining Problems:*
1. An HA singleton is a bean marked with the @Singleton annotation. This bean will be
responsible for starting the recovery system. Since beans are started after AS subsystems
we have the problem that recovery will start much later than it does at present.
2. Databases/XA resources are typically used by more than one application server, and when
asked for in-doubt transactions they will return all of them. This will be a problem if
the returned list contains transactions created by a server that is not part of the
cluster, since it is then possible for two different recovery systems to recover the same
transaction branch. We can resolve this problem either by demanding that resources are
dedicated to the cluster or by configuring the RcMgr to only recover a given set of node
ids. Implementing the latter option requires a cluster membership monitor which tells the
RcMgr when nodes come and go (adding complexity to the solution).
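The second option above, filtering rather than dedicating resources, can be sketched as follows: the RcMgr keeps only those in-doubt Xids returned by XAResource.recover() whose embedded node id belongs to this cluster. The `InDoubtFilter` class and the node-id encoding in the gtrid are hypothetical, not Narayana's actual layout.

```java
import javax.transaction.xa.Xid;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Keep only the in-doubt branches created by TMs belonging to this cluster,
// leaving branches from foreign servers for their own recovery systems.
final class InDoubtFilter {
    static List<Xid> ownedBy(List<Xid> inDoubt, Set<String> clusterNodeIds) {
        List<Xid> owned = new ArrayList<>();
        for (Xid xid : inDoubt) {
            String gtrid = new String(xid.getGlobalTransactionId(), StandardCharsets.UTF_8);
            int sep = gtrid.indexOf(':');
            String nodeId = sep >= 0 ? gtrid.substring(0, sep) : "";
            if (clusterNodeIds.contains(nodeId)) {
                owned.add(xid);
            }
        }
        return owned;
    }

    // Minimal Xid used for illustration; encodes "<nodeId>:<sequence>".
    static Xid xid(String nodeId, int seq) {
        byte[] gtrid = (nodeId + ":" + seq).getBytes(StandardCharsets.UTF_8);
        return new Xid() {
            @Override public int getFormatId() { return 0x20005; }
            @Override public byte[] getGlobalTransactionId() { return gtrid.clone(); }
            @Override public byte[] getBranchQualifier() { return new byte[] { 1 }; }
        };
    }
}
```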