On 5 Jan 2011, at 17:19, Jonathan Halliday wrote:
On 01/05/2011 03:42 PM, Mircea Markus wrote:
>
> On 5 Jan 2011, at 14:51, Mircea Markus wrote:
>
>> FYI, a discussion I have with Jonathan around recovery support from TM
>>
>> On 5 Jan 2011, at 14:43, Jonathan Halliday wrote:
>>> On 01/05/2011 02:18 PM, Mircea Markus wrote:
>>>
>>>> I don't know how the TM recovery process picks up the XAResource
instance on which to call XAResource.recover, but I imagine it expects this method to
return all the prepared (or heuristically completed) transactions from the _whole transaction
branch_, i.e. from the entire cluster.
>>>
>>> all from the logical RM, which you happen to implement as a cluster, yes.
>>>
>>>> I'm asking this because right now there's no way for a node to
know all the prepared transactions in the entire cluster. This is doable but would involve
a broadcast to query the cluster, which might be costly (time and bandwidth).
>>>
>>> right. not to mention it should, strictly speaking, block or fail if any node
is unreachable, which kinda sucks from an availability perspective.
> So if a node does not respond to the broadcast, is it incorrect to return the
prepared transactions received from the other nodes? (Is this because the TM expects to
receive some tx that it knows for sure to be prepared?) Or would a "best effort"
be "good enough"? (e.g. I broadcast the query and return all the results
received in 1 sec)
hmm, interesting question.
Keep in mind that the XA spec dates from a time when a typical large clustered RM was 2-3
Oracle nodes on the same LAN segment. It simply isn't geared to a world where the
number of nodes is so large and widely distributed that the probability of *all* of them
being available simultaneously is pretty slim. Likewise the number of transaction managers
connected to a resource was assumed to be small, often 1, rather than the large N we see
on modern clusters / clouds. As a result, the spec either fails to give guidance on some
issues because they weren't significant at the time it was written, or
implies/mandates behaviour that is counterproductive in modern environments.
Thus IMO some compromises are necessary to make XA usable in the real world, especially
at scale. To further complicate matters, these are split across RM and TM, with different
vendors having different views on the subject. My advice is geared to the way JBossTS
drives XA recovery - other TMs may behave differently and make greater or lesser
assumptions about compliance with the letter of the spec. As a result you may find that
making your RM work with multiple vendors' TMs requires a) configuration options and
b) a lot of painful testing. Likewise JBossTS contains code paths and config options
geared to dealing with bugs or non-compliant behaviour in various vendors' RMs.
Now, on to the specific question: The list returned should, strictly speaking, be
complete. There are two problems with that. First, you have to be able to reach all your
cluster nodes to build a complete list which, as previously mentioned, is pretty unlikely
in a sufficiently large cluster. Your practical strategies are thus as you say: either a)
throw an XAException(XAER_RMFAIL) if any node is unreachable within a reasonable timeout
and accept that this may mean an unnecessary delay in recovering the subset of tx that are
known or b) return a partial list on a best effort basis. The latter approach allows the
transaction manager to deal with at least some of the in-doubt tx, which may in turn mean
releasing resources/locks in the RM. In general I'd favour that option as having
higher practical value in terms of allowing the best possible level of service to be
maintained in the face of ongoing failures.
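A rough sketch of the two options, using a hypothetical NodeReply type in place of real cluster RPC machinery (none of these names are Infinispan or JTA API; a real implementation would throw XAException(XAER_RMFAIL) where the strict variant throws below):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the two recovery-scan strategies: strict (fail if any node is
// unreachable) vs. best effort (return what the reachable nodes reported).
// NodeReply is a hypothetical stand-in for a per-node RPC result.
public class RecoverScan {

    record NodeReply(boolean reachable, List<String> inDoubtXids) {}

    /** Strategy (a): fail the whole scan if any node is unreachable. */
    static List<String> strictRecover(Map<String, NodeReply> replies) {
        List<String> all = new ArrayList<>();
        for (Map.Entry<String, NodeReply> e : replies.entrySet()) {
            if (!e.getValue().reachable()) {
                // A real XAResource would throw new XAException(XAException.XAER_RMFAIL) here.
                throw new IllegalStateException("node unreachable: " + e.getKey());
            }
            all.addAll(e.getValue().inDoubtXids());
        }
        return all;
    }

    /** Strategy (b): best effort; unreachable nodes' tx surface on a later scan. */
    static List<String> bestEffortRecover(Map<String, NodeReply> replies) {
        List<String> partial = new ArrayList<>();
        for (NodeReply r : replies.values()) {
            if (r.reachable()) {
                partial.addAll(r.inDoubtXids());
            }
        }
        return partial;
    }
}
```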
JBossTS will rescan every N minutes (2 by default) and thus you can simply include any
newly discovered in-doubt tx as they become known due to e.g. partitioned nodes rejoining
the cluster, and the TM will deal with them when they are first seen. Note however that
some TMs assume that if they scan an RM and that RM does not subsequently crash, no new
in-doubt transactions will occur except from heuristics. Let's gloss over how they can
even detect a crash/recover of the RM if the driver masks it with failover or the event
happens during a period when the TM makes no call on the driver. Such a TM will perform a
recovery scan once at TM startup and not repeat it. In such a case you may have in-doubt tx
from nodes unavailable at that crucial time subsequently sitting around for a prolonged
period, tying up precious resources and potentially blocking subsequent updates. Most RM
vendors provide some kind of management capability for admins to view and manually force
completion of in-doubt tx. A command line tool, JMX, a web GUI, whatever, just so long as it
exists.
When a node crashes, all the transactions that node owns (i.e. tx which
originated on that node, with the XAResource instance residing on that node) automatically
roll back, so that no resources (mainly locks) are held. The only thing we need to make
sure of, though, is that the transaction ids in question (the ones that heuristically rolled
back) are returned by the XAResource.recover method - doable in the same way we handle
prepares. I imagine that we'll have to keep these XIDs until XAResource.forget(XID) is
called, am I right? Is it common/possible for people to use a TM _without_ recovery? If so,
this "held heuristically completed TX" functionality should be configurable
(enabled/disabled) in order to avoid memory leaks (no recovery means .forget never gets
called)
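A minimal sketch of that "held heuristically completed TX" bookkeeping, with the enable/disable switch suggested above (names are illustrative, not Infinispan API):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: retain heuristically completed transaction ids so recover() can
// report them, dropping each one only when the TM calls forget(). Retention
// can be disabled when no recovery process will ever call forget().
public class HeuristicLog {

    private final boolean retainHeuristics; // disable if the TM never runs recovery
    private final Set<String> heuristicXids = ConcurrentHashMap.newKeySet();

    public HeuristicLog(boolean retainHeuristics) {
        this.retainHeuristics = retainHeuristics;
    }

    /** Called when a node crash forces a heuristic rollback of a tx branch. */
    public void recordHeuristicRollback(String xid) {
        if (retainHeuristics) {
            heuristicXids.add(xid); // kept for inclusion in recover() results
        }
    }

    /** These ids must be included in the XAResource.recover results. */
    public Set<String> inDoubt() {
        return Set.copyOf(heuristicXids);
    }

    /** XAResource.forget(xid): the TM has seen the heuristic; we may drop it. */
    public void forget(String xid) {
        heuristicXids.remove(xid);
    }
}
```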
Another interesting issue is what constitutes an 'in-doubt'
tx. Pretty much all RMs will include heuristically completed tx in the recovery list. Some
will include tx branches that have prepared but not yet committed or rolled back. Some
will include such only if they have been in the prepared state for greater than some
threshold length of time (a few seconds, i.e. a couple of orders of magnitude longer than a
tx would normally be expected to hold that state). There is also the question of when a tx
should be removed from the list. The wording of the spec

  'Two consecutive invocation of [recover] that starts from the beginning of the list
  must return the same list of transaction branches unless one of the following takes place:
  - the transaction manager invokes the commit, forget, prepare, or rollback method
    for that resource manager, between the two consecutive invocation of the recovery scan
  ...'

seems to imply a single transaction manager.
Doesn't this also imply that the
prepare threshold isn't the spec's way? I.e. even though the TM doesn't call any
method on the RM, the RM returns a new XID in the result of XAResource.recover when the
threshold is reached.
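The prepare-age threshold under discussion amounts to a filter like the following sketch (illustrative only; `preparedAt` is assumed to map each prepared branch to the time its prepare completed):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: report a prepared branch as in-doubt only once it has been sitting
// in the prepared state longer than a configurable threshold.
public class PrepareThreshold {

    /** @param preparedAt xid -> time (ms) at which prepare completed */
    static List<String> inDoubt(Map<String, Long> preparedAt, long nowMs, long thresholdMs) {
        return preparedAt.entrySet().stream()
                .filter(e -> nowMs - e.getValue() >= thresholdMs) // only "old" prepared tx
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```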
In cases where more than one TM is connected to the RM, the list
clearly can't be considered stable between calls, as that would require prolonged
blocking. Thus a reasonable TM should not expect a stable list. However, it's less
clear how it should react to items that appear and disappear arbitrarily over consecutive
calls to recover.
In the case of JBossTS, it assumes the RM may include tx branches that are actually
proceeding normally in the results of a recovery scan. It thus caches the list, sleeps for
an interval considered long enough for any normal prepared tx to complete, then rescans
and compares the results. Any tx appearing in both scans is considered genuinely in need
of recovery. If an RM includes in its recover results a normally proceeding tx branch and
the TM does not perform such a backoff, it may, in a clustered environment, rollback a tx
branch that another TM will try to commit a split second later, thus resulting in an
unnecessary heuristic outcome. The worst scenario I've ever seen was a certain large
DB vendor who considered tx to be in-doubt not just from the time they were prepared, but
from the moment they were started. ouch.
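The backoff-and-rescan behaviour described above boils down to intersecting the two scan results; a minimal sketch:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the JBossTS-style backoff: scan, sleep long enough for any
// normally proceeding prepared tx to complete, rescan, and treat only the
// branches present in *both* scans as genuinely in need of recovery.
public class BackoffScan {

    static Set<String> genuinelyInDoubt(List<String> firstScan, List<String> secondScan) {
        Set<String> stable = new HashSet<>(firstScan);
        stable.retainAll(new HashSet<>(secondScan)); // survived the backoff interval
        return stable;
    }
}
```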
Naturally most well behaved TMs will have some sense of tx branch ownership and not
recover unrecognised tx, but it's one of those delightful gray areas where you may
benefit from being a little paranoid.
So each TM would only care to recover the
transactions it manages? Sort of makes sense.
Which, as it happens, gives you a perfect excuse to return a partial
list of tx - in terms of the way TM recovery works there is nothing to distinguish a tx
that's missing due to node outage from one that is missing due to not yet having
reached the in-doubt time threshold.
Of course it does not end with the recovery scan. Let's take the case where a
specific tx is identified as in-doubt and returned from the recover call. The TM may then
call e.g. commit on it. Processing that commit may involve the driver/server talking to
multiple cluster nodes, some of which may not respond. Indeed this is the case during a
normal commit too, not just one resulting from a recovery scan. You need to think very
carefully about what the result of a failed call should be. A lot of bugs we've seen
result from resource managers using incorrect XAException codes or transaction managers
misinterpreting them. Be aware of the semantics of specific error codes and be as precise
as possible when throwing errors.
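As a sketch of what being precise with error codes looks like, the mapping below uses the real javax.transaction.xa.XAException constants; the internal failure categories are illustrative assumptions, not drawn from any particular RM:

```java
import javax.transaction.xa.XAException;

// Sketch: map internal commit-time failures to precise XA error codes so the
// TM can react correctly. Failure categories here are illustrative; the codes
// and their broad meanings come from the XA/JTA specifications.
public class CommitErrors {

    enum Failure { RM_UNAVAILABLE, UNKNOWN_XID, HEURISTIC_ROLLBACK, INTERNAL_ERROR }

    static int toXaErrorCode(Failure f) {
        switch (f) {
            case RM_UNAVAILABLE:     return XAException.XAER_RMFAIL; // RM unavailable: TM may retry later
            case UNKNOWN_XID:        return XAException.XAER_NOTA;   // branch not known to this RM
            case HEURISTIC_ROLLBACK: return XAException.XA_HEURRB;   // branch heuristically rolled back
            default:                 return XAException.XAER_RMERR;  // internal, non-transient RM error
        }
    }
}
```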
Good to know! Something to discuss/review at our
next meeting.
> I would expect a single (periodic) XAResource.recover call per cluster(assuming the
cluster is the RM) / transaction manager. Am I wrong?
Only in so far as you are assuming one logical TM per cluster, with cluster here meaning
JBossAS cluster as distinct from a possible separate infinispan RM cluster.
In practice although a group of JBossAS servers may be clustering e.g. web sessions or
EJBs, they don't cluster TMs. Each JBossAS instance is a logically separate and
independently functioning TM and each is responsible for running its own recovery. Note
that there are ways of clustering the JTS version of the TM, but we don't advertise or
support them and as a result, to the best of my knowledge, no end user actually runs with
such a configuration. Likewise for running a single out of process recovery subsystem which
is shared by multiple TMs/JBossAS nodes.
>>> you mean a logically separate resource manager, yes. You are basically
talking about not doing interposition in the driver/server but rather relying on the
transaction manager to handle multiple resources. It may make your implementation simpler
but probably less performant on the critical path (transaction commit) vs. recovery.
> I don't know exactly how many RPCs happen in this approach, with the TM handling
multiple resources. I imagine the TM would do an XAResource.prepare for each of the nodes
involved. In this XAResource.prepare call I would have to implement the logic of going
remotely to each involved node. Then the same for XAResource.commit. Is that so? (If so
then this is pretty much what we already do in ISPN when it comes to commit/rollback.)
> One of the advantages of allowing the TM to handle each individual node is that we can
benefit from some nice TM features like read-only optimisation or 1PC for single
participants (these are to be implemented anyway in ISPN).
Although the RPC topology can matter, particularly where WAN hops are involved, the
critical difference is actually where the logging responsibility lies.
If the TM has more than one XAResource enlisted, it has to write to disk between the
prepare and commit phases. So where it has got e.g. two infinispan nodes as separate
XAResources, the tx is bottlenecked on disk I/O even though the RM is entirely in (network
distributed) RAM.
Right! I was looking in the wrong place.
Where the clustered RM presents itself as a single resource, the TM won't necessarily
log, due to automatic one phase commit optimisation.
However...
In that case your driver/server has to guarantee correct behaviour in the event of node
failure during the commit call. In other words, it has to do its own persistence if the
state updates span more than one cluster node. In the case of infinispan that's likely
to be via additional RPCs for replication rather than by a disk write as in most database
cluster RMs.
In short, you're trading disk I/O in the TM against additional network I/O in the
clustered RM. All else being equal I think having the RM do the work will perform better
at scale, but that's just a guess.
+1
You more or less need the code for that either way, as even where a
single infinispan node is a logically separate RM, it still needs to talk to its cluster
mates (or the disk I guess) for persistence or it can't guarantee durability of the
tx.
Thanks Jonathan!
Jonathan.
--
------------------------------------------------------------
Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor,
Berkshire, SL4 1TE, United Kingdom.
Registered in UK and Wales under Company Registration No. 3798903 Directors: Michael
Cunningham (USA), Charlie Peters (USA), Matt Parsons (USA) and Brendan Lane (Ireland)