[infinispan-dev] XAResource.isSameRM

Thu Jan 6 06:48:08 EST 2011

On 01/06/2011 10:45 AM, Mircea Markus wrote:

> When a node crashes, all the transactions that node owns (i.e. tx which were originated on that node and XAResource instance residing on that node) automatically roll back, so that no resources (locks mainly) are held. The only thing we need to make sure though is that the given transaction ids (the ones that heuristically rolled back) are returned by the XAResource.recover method - doable in the same way we handle prepares. I imagine that we'll have to keep these XIDs until XAResource.forget(XID) is called, am I right?

I was under the impression a node does not own a tx. It may 
own a *branch* of that tx. Take the case where node 
JBossAS-1 starts a tx, propagates it to node JBossAS-2 and 
both JBossAS-1 and JBossAS-2 then simultaneously contact 
different nodes of the infinispan cluster in the scope of 
that tx. Each node would see a different branch of the same 
global tx. You presumably don't want to have to sync the 
ownership for each new tx across the cluster? The only way 
you could tie an entire tx to a single infinispan node is 
something like consistent hashing on the tx context, to 
force the decision of which node the driver connects to.
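A minimal sketch of that consistent hashing idea, assuming the driver can see the global transaction id as a byte array. TxNodeRouter, its constructor arguments and the node names are hypothetical and are not part of Infinispan or any driver API. Only the global id is hashed, so every branch of one global tx lands on the same node without syncing ownership across the cluster.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper, not an Infinispan API: maps a global transaction id
// to one cluster node so that every branch of that tx reaches the same node.
class TxNodeRouter {
    // Hash ring: position on the ring -> node name.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    TxNodeRouter(List<String> nodes, int virtualNodesPerNode) {
        for (String node : nodes) {
            // Several ring positions per node to spread the load more evenly.
            for (int i = 0; i < virtualNodesPerNode; i++) {
                byte[] key = (node + "#" + i).getBytes(StandardCharsets.UTF_8);
                ring.put(Arrays.hashCode(key), node);
            }
        }
    }

    // Only the global id is hashed; the branch qualifier is ignored on purpose.
    String nodeFor(byte[] globalTransactionId) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes configured");
        }
        int hash = Arrays.hashCode(globalTransactionId);
        // First ring position at or after the hash, wrapping around to the start.
        Integer slot = ring.ceilingKey(hash);
        return ring.get(slot != null ? slot : ring.firstKey());
    }
}
```

Two drivers built with the same node list resolve the same global id to the same node. The sketch says nothing about what happens when that node is down, which is exactly where the availability trade-off shows up.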

An XAResource does not reside in an infinispan node 
(although there may be something equivalent holding tx state 
on the server side) - it's a client/driver side construct. 
Given that the driver does transparent failover / load 
balancing and such, the XAResource can't be said to belong 
to a specific infinispan node unless you throw away some of 
the clustering availability advantages. It's really a 
question of where you are going to put the clustering 
intelligence - in a smart client side driver or in a server 
side component that acts as a kind of routing proxy.

With your 'rollback tx branch on node crash' model you are 
failing to provide ACID semantics for the cluster as a 
whole. You can abort them before the prepare stage, but post 
prepare it's not an option as you'll piss off clients who 
expect the cluster to behave correctly as long as a majority 
of its nodes survive. I'm not saying it's flat out wrong, 
just that it needs to be very clearly documented in order to 
avoid getting whined at by disgruntled users. My 
understanding is you're pitching infinispan not as a 
volatile cache, but an in-memory data grid. In that model 
the node does not own the tx, the cluster owns the tx and is 
responsible for masking node failures from the client. 
Rollback of prepared tx on node failure is therefore not an 
option - some part of the tx state may already have been 
committed by surviving nodes and you'll get inconsistencies. 
You need to replicate enough information to avoid that, 
otherwise the client app is going to have to explicitly 
provide logic to do the reconciliation, which sucks.
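To make the pre-prepare / post-prepare distinction concrete, here is a rough sketch of the bookkeeping on node crash. BranchTable and its methods are hypothetical names, not Infinispan code, and the sketch assumes the table itself is replicated so that it survives the crash of the node it describes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bookkeeping, not Infinispan code: decides what happens to the
// branches a node was handling when that node crashes.
// Assumption: this table is replicated and outlives the crashed node.
class BranchTable {
    enum State { ACTIVE, PREPARED }

    private final Map<String, State> stateByBranch = new LinkedHashMap<>();
    private final Map<String, String> nodeByBranch = new LinkedHashMap<>();

    void register(String branchId, String node) {
        stateByBranch.put(branchId, State.ACTIVE);
        nodeByBranch.put(branchId, node);
    }

    void markPrepared(String branchId) {
        if (!stateByBranch.containsKey(branchId)) {
            throw new IllegalArgumentException("unknown branch " + branchId);
        }
        stateByBranch.put(branchId, State.PREPARED);
    }

    // Rolls back only the branches that had not yet prepared and returns them.
    // Prepared branches are kept: other nodes may already have committed
    // their part of the same tx, so the outcome is left to the TM.
    List<String> onNodeCrash(String node) {
        List<String> rolledBack = new ArrayList<>();
        for (Map.Entry<String, String> e : nodeByBranch.entrySet()) {
            boolean onCrashedNode = e.getValue().equals(node);
            if (onCrashedNode && stateByBranch.get(e.getKey()) == State.ACTIVE) {
                rolledBack.add(e.getKey());
            }
        }
        for (String branchId : rolledBack) {
            stateByBranch.remove(branchId);
            nodeByBranch.remove(branchId);
        }
        return rolledBack;
    }

    boolean isInDoubt(String branchId) {
        return stateByBranch.get(branchId) == State.PREPARED;
    }
}
```

A branch that is still in doubt after the crash would then need to be completed on a surviving node once the TM calls commit or rollback, which is the part that requires the replicated state.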

> Is it common/possible for people to use TM _without_ recovery? If so, this "held heuristic completed TX" functionality should be configurable (enabled/disabled) in order to avoid memory leaks (no recovery means .forget never gets called).

It is not common. That said, JBossTS has for similar reasons 
got a 'give up after N hours' config option which will 
eventually abandon tx that have not recovered. It's off 
(i.e. never give up) by default but a small number of users 
find it handy. Most just use the admin tooling to manually 
clean up the small number of unrecoverable situations - it's 
safer in most cases.
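For illustration only, a 'give up after N hours' option could look roughly like the sketch below. This is not the JBossTS implementation or its configuration API; HeuristicLog and its methods are made-up names, and a null timeout stands in for the 'never give up' default.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical store of heuristically completed tx ids, not JBossTS code.
class HeuristicLog {
    // null means "never give up", mirroring the off-by-default behaviour.
    private final Duration giveUpAfter;
    private final Map<String, Instant> completedAt = new LinkedHashMap<>();

    HeuristicLog(Duration giveUpAfter) {
        this.giveUpAfter = giveUpAfter;
    }

    void record(String xid, Instant when) {
        completedAt.put(xid, when);
    }

    // The normal path: the TM calls forget once recovery has dealt with the tx.
    void forget(String xid) {
        completedAt.remove(xid);
    }

    // The fallback path: drops entries older than the timeout and returns them
    // so they can be logged for an administrator.
    List<String> expire(Instant now) {
        List<String> dropped = new ArrayList<>();
        if (giveUpAfter == null) {
            return dropped;
        }
        completedAt.entrySet().removeIf(e -> {
            boolean tooOld = Duration.between(e.getValue(), now).compareTo(giveUpAfter) > 0;
            if (tooOld) {
                dropped.add(e.getKey());
            }
            return tooOld;
        });
        return dropped;
    }

    int size() {
        return completedAt.size();
    }
}
```

Returning the dropped ids rather than discarding them silently keeps the manual clean-up route open for the cases where an automatic timeout turns out to be the wrong call.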

>> Another interesting issue is what constitutes an 'in-doubt' tx. Pretty much all RMs will include heuristically completed tx in the recovery list. Some will include tx branches that have prepared but not yet committed or rolled back. Some will include such only if they have been in the prepared state for greater than some threshold length of time (a few seconds i.e. a couple of orders of magnitude longer than a tx would normally be expected to hold that state). There is also the question of when a tx should be removed from the list. The wording of the spec
>>
>> 'Two consecutive invocation of [recover] that starts from the beginning of the list must return the same list
>> of transaction branches unless one of the following takes place:
>>        - the transaction manager invokes the commit, forget, prepare, or rollback method for that resource
>>        manager, between the two consecutive invocation of the recovery scan
>> ...'
>>
>> seems to imply a single transaction manager.

> Doesn't this also imply that the prepare-threshold isn't the spec's way? I.e. even though TM doesn't call any method on the RM, the RM returns a new XID in the result of XAResource.recover when the threshold is reached.

The definition of what constitutes an 'in-doubt' tx for 
purposes of inclusion in the recovery list is not well 
defined by the spec.
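One possible reading of the threshold variant, sketched under assumptions: heuristically completed branches are always listed, prepared branches only once they have sat in that state for at least the threshold. RecoveryList and its methods are hypothetical names and the xids are plain strings for brevity; a real RM would return Xid instances from XAResource.recover.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical RM-side view of what recover() would report, not a real RM.
class RecoveryList {
    enum State { PREPARED, HEURISTIC }

    private static final class Branch {
        final State state;
        final Instant since;

        Branch(State state, Instant since) {
            this.state = state;
            this.since = since;
        }
    }

    private final Map<String, Branch> branches = new LinkedHashMap<>();

    void add(String xid, State state, Instant since) {
        branches.put(xid, new Branch(state, since));
    }

    // Called when the TM commits, rolls back or forgets the branch.
    void remove(String xid) {
        branches.remove(xid);
    }

    // Heuristic branches are always in doubt; prepared branches only after
    // they have held that state for at least preparedThreshold.
    List<String> inDoubt(Instant now, Duration preparedThreshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Branch> e : branches.entrySet()) {
            Branch b = e.getValue();
            boolean longPrepared =
                Duration.between(b.since, now).compareTo(preparedThreshold) >= 0;
            if (b.state == State.HEURISTIC || longPrepared) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Note that with this rule an entry can appear between two scans purely because time passed, with no TM call in between. Entries only disappear through remove, so the list is stable in the direction that matters for not losing branches.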

In a world where there is only one TM driving the RM and 
that TM performs recovery before starting up and running new 
tx, no new tx will be added to the recovery list. In the 
real world new ones are being added continuously as the 
system is always under load.

The concern is more around the stability of the list with 
respect to not *removing* things except in response to TM 
activity. i.e. if a node goes away, should you return a 
cached snapshot of its tx for list stability, or exclude 
them? Both options carry risks.

Jonathan.
