[infinispan-dev] XAResource.isSameRM

Mircea Markus mircea.markus at jboss.com
Thu Jan 6 05:45:04 EST 2011


On 5 Jan 2011, at 17:19, Jonathan Halliday wrote:

> On 01/05/2011 03:42 PM, Mircea Markus wrote:
>> 
>> On 5 Jan 2011, at 14:51, Mircea Markus wrote:
>> 
>>> FYI, a discussion I have with Jonathan around recovery support from TM
>>> 
>>> On 5 Jan 2011, at 14:43, Jonathan Halliday wrote:
>>>> On 01/05/2011 02:18 PM, Mircea Markus wrote:
>>>> 
>>>>> I don't know how the TM recovery process picks up the XAResource instance on which to call XAResource.recover, but I imagine it expects this method to return all the prepared (or heuristically completed) transactions from the _whole transaction branch_, i.e. from the entire cluster.
>>>> 
>>>> all from the logical RM, which you happen to implement as a cluster, yes.
>>>> 
>>>>> I'm asking this because right now there's no way for a node to know all the prepared transactions in the entire cluster. This is doable but would involve a broadcast to query the cluster, which might be costly (time and bandwidth).
>>>> 
>>>> right. not to mention it should, strictly speaking, block or fail if any node is unreachable, which kinda sucks from an availability perspective.
>> So if a node does not respond to the broadcast, is it incorrect to return the prepared transactions received from the other nodes? (is this because the TM expects to receive some tx that it knows for sure to be prepared?) Or would a "best effort" be "good enough"? (e.g. I broadcast the query and return all the results received within 1 second)
> 
> hmm, interesting question.
> 
> Keep in mind that the XA spec dates from a time when a typical large clustered RM was 2-3 oracle nodes on the same LAN segment. It simply isn't geared to a world where the number of nodes is so large and widely distributed that the probability of *all* of them being available simultaneously is pretty slim. Likewise the number of transaction managers connected to a resource was assumed to be small, often 1, rather than the large N we see on modern clusters / clouds. As a result, the spec either fails to give guidance on some issues because they weren't significant at the time it was written, or implies/mandates behaviour that is counter productive in modern environments.
> 
> Thus IMO some compromises are necessary to make XA usable in the real world, especially at scale. To further complicate matters, these are split across RM and TM, with different vendors having different views on the subject. My advice is geared to the way JBossTS drives XA recovery - other TMs may behave differently and make greater or lesser assumptions about compliance with the letter of the spec. As a result you may find that making your RM work with multiple vendors' TMs requires a) configuration options and b) a lot of painful testing. Likewise JBossTS contains code paths and config options geared to dealing with bugs or non-compliant behaviour in various vendors' RMs.
> 
> Now, on to the specific question: The list returned should, strictly speaking, be complete. There are two problems with that. First, you have to be able to reach all your cluster nodes to build a complete list which, as previously mentioned, is pretty unlikely in a sufficiently large cluster. Your practical strategies are thus as you say: either a) throw an XAException(XAER_RMFAIL) if any node is unreachable within a reasonable timeout and accept that this may mean an unnecessary delay in recovering the subset of tx that are known or b) return a partial list on a best effort basis. The latter approach allows the transaction manager to deal with at least some of the in-doubt tx, which may in turn mean releasing resources/locks in the RM. In general I'd favour that option as having higher practical value in terms of allowing the best possible level of service to be maintained in the face of ongoing failures.
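To make (a) vs (b) concrete, here is a rough sketch of what the scan could look like on our side. The NodeView abstraction and the strict flag are made up for the example, not existing Infinispan API:

import java.util.ArrayList;
import java.util.List;
import javax.transaction.xa.XAException;
import javax.transaction.xa.Xid;

// Hypothetical per-node view of the cluster, used only for this sketch.
interface NodeView {
   boolean isReachable();
   List<Xid> preparedAndHeuristicXids();
}

class ClusterRecoveryScan {
   private final List<NodeView> nodes;
   private final boolean strict; // true = option (a), false = option (b) best effort

   ClusterRecoveryScan(List<NodeView> nodes, boolean strict) {
      this.nodes = nodes;
      this.strict = strict;
   }

   // would back XAResource.recover(TMSTARTRSCAN)
   Xid[] collectInDoubt() throws XAException {
      List<Xid> inDoubt = new ArrayList<Xid>();
      for (NodeView node : nodes) {
         if (!node.isReachable()) {
            if (strict) {
               // option (a): refuse to return an incomplete list
               throw new XAException(XAException.XAER_RMFAIL);
            }
            continue; // option (b): skip the unreachable node and return a partial list
         }
         inDoubt.addAll(node.preparedAndHeuristicXids());
      }
      return inDoubt.toArray(new Xid[inDoubt.size()]);
   }
}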
> 
> JBossTS will rescan every N minutes (2 by default) and thus you can simply include any newly discovered in-doubt tx as they become known due to e.g. partitioned nodes rejoining the cluster, and the TM will deal with them when they are first seen. Note however that some TMs assume that if they scan an RM and that RM does not subsequently crash, no new in-doubt transactions will occur except from heuristics. Let's gloss over how they can even detect a crash/recover of the RM if the driver masks it with failover or the event happens during a period when the TM makes no call on the driver. Such a TM will perform a recovery scan once at TM startup and not repeat it. In such a case you may have in-doubt tx from nodes unavailable at that crucial time subsequently sitting around for a prolonged period, tying up precious resources and potentially blocking subsequent updates. Most RM vendors provide some kind of management capability for admins to view and manually force completion of in-doubt tx: command line tool, JMX, web GUI, whatever, just so long as it exists.
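For that admin tooling, something as simple as a JMX interface would probably do. This is only a hypothetical shape, nothing like it exists in Infinispan today:

import javax.management.MXBean;

// Hypothetical management interface for the manual-completion tooling mentioned above;
// Xids are exposed as strings so a generic JMX console can drive it.
@MXBean
public interface InDoubtTxManagement {
   String[] listInDoubtTransactions();   // human-readable descriptions of in-doubt Xids
   void forceCommit(String xid);         // admin decides the outcome manually
   void forceRollback(String xid);
   void forget(String xid);              // drop a heuristic record once it has been dealt with
}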
When a node crashes, all the transactions that node owns (i.e. tx which originated on that node, with the XAResource instance residing on that node) automatically roll back, so that no resources (mainly locks) are held. The only thing we need to make sure of is that the corresponding transaction ids (the ones that were heuristically rolled back) are returned by the XAResource.recover method - doable in the same way we handle prepares. I imagine that we'll have to keep these XIDs until XAResource.forget(XID) is called, am I right? Is it common/possible for people to use a TM _without_ recovery? If so, this "hold heuristically completed TX" functionality should be configurable (enabled/disabled) in order to avoid memory leaks (no recovery means .forget never gets called).
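Something along these lines is what I have in mind for holding the heuristically completed XIDs (all names made up; the recoveryEnabled switch is the configurable part mentioned above):

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import javax.transaction.xa.Xid;

// Sketch of the "hold heuristically completed XIDs until forget()" idea; not Infinispan code.
class HeuristicCompletionRegistry {
   private final boolean recoveryEnabled; // if false, retain nothing (avoids the leak)
   private final Set<Xid> heuristicallyCompleted =
      Collections.newSetFromMap(new ConcurrentHashMap<Xid, Boolean>());

   HeuristicCompletionRegistry(boolean recoveryEnabled) {
      this.recoveryEnabled = recoveryEnabled;
   }

   // called when a crashed node's transactions are rolled back heuristically
   void recordHeuristicRollback(Xid xid) {
      if (recoveryEnabled) {
         heuristicallyCompleted.add(xid);
      }
   }

   // contributes to the XAResource.recover() results
   Set<Xid> inDoubtXids() {
      return Collections.unmodifiableSet(heuristicallyCompleted);
   }

   // called from XAResource.forget(Xid)
   void forget(Xid xid) {
      heuristicallyCompleted.remove(xid);
   }
}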

> Another interesting issue is what constitutes an 'in-doubt' tx. Pretty much all RMs will include heuristically completed tx in the recovery list. Some will include tx branches that have prepared but not yet committed or rolled back. Some will include such only if they have been in the prepared state for greater than some threshold length of time (a few seconds, i.e. a couple of orders of magnitude longer than a tx would normally be expected to hold that state). There is also the question of when a tx should be removed from the list. The wording of the spec
> 
> 'Two consecutive invocation of [recover] that starts from the beginning of the list must return the same list
> of transaction branches unless one of the following takes place:
>       - the transaction manager invokes the commit, forget, prepare, or rollback method for that resource
>       manager, between the two consecutive invocation of the recovery scan
> ...'
> 
> seems to imply a single transaction manager.
Doesn't this also imply that the prepare threshold isn't the spec's way? I.e. even though the TM doesn't call any method on the RM, the RM returns a new XID in the result of XAResource.recover when the threshold is reached.
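For reference, the threshold approach described above would look roughly like this on the RM side (sketch only, not Infinispan code):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.transaction.xa.Xid;

// Only branches that have sat in the prepared state longer than the threshold are reported as in-doubt.
class PreparedXidTracker {
   private final long thresholdMillis; // e.g. a few seconds
   private final Map<Xid, Long> preparedAt = new ConcurrentHashMap<Xid, Long>();

   PreparedXidTracker(long thresholdMillis) {
      this.thresholdMillis = thresholdMillis;
   }

   void markPrepared(Xid xid) {
      preparedAt.put(xid, System.currentTimeMillis());
   }

   void markCompleted(Xid xid) {
      preparedAt.remove(xid);
   }

   List<Xid> overThreshold() {
      long now = System.currentTimeMillis();
      List<Xid> result = new ArrayList<Xid>();
      for (Map.Entry<Xid, Long> e : preparedAt.entrySet()) {
         if (now - e.getValue() > thresholdMillis) {
            result.add(e.getKey());
         }
      }
      return result;
   }
}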
> In cases where more than one TM is connected to the RM, the list clearly can't be considered stable between calls, as that would require prolonged blocking. Thus a reasonable TM should not expect a stable list. However, it's less clear how it should react to items that appear and disappear arbitrarily over consecutive calls to recover.
> 
> In the case of JBossTS, it assumes the RM may include tx branches that are actually proceeding normally in the results of a recovery scan. It thus caches the list, sleeps for an interval considered long enough for any normal prepared tx to complete, then rescans and compares the results. Any tx appearing in both scans is considered genuinely in need of recovery. If an RM includes in its recover results a normally proceeding tx branch and the TM does not perform such a backoff, it may, in a clustered environment, roll back a tx branch that another TM will try to commit a split second later, thus resulting in an unnecessary heuristic outcome. The worst scenario I've ever seen was a certain large DB vendor who considered tx to be in-doubt not just from the time they were prepared, but from the moment they were started. ouch.
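If I read that right, the backoff amounts to something like the following on the TM side (a simplification of what JBossTS actually does; it also assumes the Xid implementation has sane equals/hashCode):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

class BackoffRecoveryScan {
   // Returns only the Xids that appear in two scans separated by backoffMillis:
   // anything still in-doubt after the backoff is considered genuinely in need of recovery.
   static Set<Xid> genuinelyInDoubt(XAResource resource, long backoffMillis)
         throws XAException, InterruptedException {
      Set<Xid> firstPass = new HashSet<Xid>(
         Arrays.asList(resource.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN)));
      Thread.sleep(backoffMillis); // long enough for a normally proceeding prepared tx to complete
      Set<Xid> secondPass = new HashSet<Xid>(
         Arrays.asList(resource.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN)));
      firstPass.retainAll(secondPass);
      return firstPass;
   }
}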
> 
> Naturally most well behaved TMs will have some sense of tx branch ownership and not recover unrecognised tx, but it's one of those delightful gray areas where you may benefit from being a little paranoid.
So each TM would only care to recover the transactions it manages? Sort of makes sense.
> Which, as it happens, gives you a perfect excuse to return a partial list of tx - in terms of the way TM recovery works there is nothing to distinguish a tx that's missing due to node outage from one that is missing due to not yet having reached the in-doubt time threshold.
> 
> Of course it does not end with the recovery scan. Let's take the case where a specific tx is identified as in-doubt and returned from the recover call. The TM may then call e.g. commit on it. Processing that commit may involve the driver/server talking to multiple cluster nodes, some of which may not respond. Indeed this is the case during a normal commit too, not just one resulting from a recovery scan. You need to think very carefully about what the result of a failed call should be. A lot of bugs we've seen result from resource managers using incorrect XAException codes or transaction managers misinterpreting them. Be aware of the semantics of specific error codes and be as concise as possible when throwing errors.
Good to know! Something to discuss/review at our next meeting.
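As a starting point for that discussion, an illustrative (not definitive) mapping of failure conditions to XAException codes on commit; the exact choices need the careful review suggested above:

import javax.transaction.xa.XAException;

class CommitOutcome {
   static void signalUnknownXid() throws XAException {
      throw new XAException(XAException.XAER_NOTA);   // the RM does not recognise this Xid
   }
   static void signalRmUnavailable() throws XAException {
      throw new XAException(XAException.XAER_RMFAIL); // RM unreachable; the TM may retry later
   }
   static void signalHeuristicRollback() throws XAException {
      throw new XAException(XAException.XA_HEURRB);   // branch was heuristically rolled back
   }
   static void signalRetry() throws XAException {
      throw new XAException(XAException.XA_RETRY);    // commit could not complete now; retry
   }
}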
> 
>> I would expect a single (periodic) XAResource.recover call per cluster (assuming the cluster is the RM) / transaction manager. Am I wrong?
> 
> Only in so far as you are assuming one logical TM per cluster, with cluster here meaning JBossAS cluster as distinct from a possible separate infinispan RM cluster.
> 
> In practice although a group of JBossAS servers may be clustering e.g. web sessions or EJBs, they don't cluster TMs. Each JBossAS instance is a logically separate and independently functioning TM and each is responsible for running its own recovery. Note that there are ways of clustering the JTS version of the TM, but we don't advertise or support them and as a result, to the best of my knowledge, no end user actually runs with such a configuration. Likewise for running a single out-of-process recovery subsystem which is shared by multiple TMs/JBossAS nodes.
> 
>>>> you mean a logically separate resource manager, yes. You are basically talking about not doing interposition in the driver/server but rather relying on the transaction manager to handle multiple resources. It may make your implementation simpler but probably less performant on the critical path (transaction commit) vs. recovery.
>> I don't know exactly how many RPCs happen in this approach, with the TM handling multiple resources. I imagine the TM would do an XAResource.prepare for each of the nodes involved. In this XAResource.prepare call I would have to implement the logic of going remotely to each involved node. Then the same for XAResource.commit. Is that so? (If so then this is pretty much what we already do in ISPN when it comes to commit/rollback.)
>> One of the advantages of allowing the TM to handle each individual node is that we can benefit from some nice TM features like read-only optimisation or 1PC for single participants (these are to be implemented anyway in ISPN).
> 
> Although the RPC topology can matter, particularly where WAN hops are involved, the critical difference is actually where the logging responsibility lies.
> 
> If the TM has more than one XAResource enlisted, it has to write to disk between the prepare and commit phases. So where it has got e.g. two infinispan nodes as separate XAResources, the tx is bottlenecked on disk I/O even though the RM is entirely in (network distributed) RAM.
Right! I was looking in the wrong place.
> 
> Where the clustered RM presents itself as a single resource, the TM won't necessarily log, due to automatic one phase commit optimisation.
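To spell out the two shapes against the standard JTA API (the TransactionManager and the XAResource arguments are placeholders, not actual Infinispan objects):

import javax.transaction.TransactionManager;
import javax.transaction.xa.XAResource;

class EnlistmentShapes {

   // (a) each Infinispan node as a separate XAResource: with two resources enlisted the TM
   //     must force a log record to disk between the prepare and commit phases.
   static void perNode(TransactionManager tm, XAResource nodeA, XAResource nodeB) throws Exception {
      tm.begin();
      tm.getTransaction().enlistResource(nodeA);
      tm.getTransaction().enlistResource(nodeB);
      // ... work ...
      tm.commit(); // full 2PC with a disk write in the TM
   }

   // (b) the cluster presented as a single logical RM: one XAResource lets the TM apply the
   //     one-phase commit optimisation and skip the log write.
   static void singleLogicalRm(TransactionManager tm, XAResource wholeCluster) throws Exception {
      tm.begin();
      tm.getTransaction().enlistResource(wholeCluster);
      // ... work ...
      tm.commit(); // 1PC; durability becomes the cluster's responsibility
   }
}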
> 
> However...
> 
> In that case your driver/server has to guarantee correct behaviour in the event of node failure during the commit call. In other words, it has to do its own persistence if the state updates span more than one cluster node. In the case of infinispan that's likely to be via additional RPCs for replication rather than by a disk write as in most database cluster RMs.
> 
> In short, you're trading disk I/O in the TM against additional network I/O in the clustered RM. All else being equal I think having the RM do the work will perform better at scale, but that's just a guess.
+1
> You more or less need the code for that either way, as even where a single infinispan node is a logically separate RM, it still needs to talk to its cluster mates (or the disk I guess) for persistence or it can't guarantee durability of the tx.
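Roughly, durability via replication instead of a TM disk force would look like this (the NodeChannel abstraction and the ack counting are made up purely for illustration):

import java.util.List;

// Hypothetical synchronous channel to a backup node.
interface NodeChannel {
   boolean sendCommitRecordSync(byte[] xidBytes); // returns false if the node did not ack
}

class ReplicatedCommitDurability {
   // The commit is only acknowledged to the TM once enough replicas have stored the decision.
   static boolean makeDurable(byte[] xidBytes, List<NodeChannel> replicas, int requiredAcks) {
      int acks = 0;
      for (NodeChannel replica : replicas) {
         if (replica.sendCommitRecordSync(xidBytes)) {
            acks++;
         }
      }
      return acks >= requiredAcks;
   }
}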
Thanks Jonathan!
> 
> Jonathan.
> 



