[jboss-as7-dev] Examples of heuristic outcome and other transaction errors

Jonathan Halliday jonathan.halliday at redhat.com
Mon Jul 11 07:00:49 EDT 2011


On 07/11/2011 12:31 AM, Misty Stanley-Jones wrote:
> Hi all,
>
> I am documenting Transactions for EAP6, and I need some information and examples about:
>
> * Heuristic outcomes
> * Error handling in general
>
> Specifically I would like to see examples of error messages, and some troubleshooting techniques. I am at a loss to know how to write examples that will generate errors. Thanks for your help!
>

Hmm, yes, forcing heuristic outcomes is not easy - the system does 
everything in its power to avoid them, so making one occur is hard work. 
However it's a common request from QE, GSS and the occasional customer, 
so here is the stock answer:

1) run under high load and disable, either in software or hardware, one 
or more of the resource managers e.g. physically pull out the network 
cable connecting the app server to the resource manager.  This is a 
probabilistic outcome - if the load is high enough chances are you'll 
hit at least one transaction at the right point in the lifecycle to 
cause a heuristic. Finding it in the huge server log files may take a 
while though :-)

2) Run a single transaction, hold it on the debugger at the critical 
point, disable the RM and allow execution to continue. More elegant, but 
requires some understanding of the source code (hint: BasicAction.prepare)

3) Write a dummy XAResource for enlistment into the transaction and have 
it fail at an appropriate point e.g. in the commit call.

4) Use Byteman to instrument an existing XAResource to fail at an 
appropriate point.
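
As a rough illustration of option 3, here is a minimal sketch of a dummy 
XAResource that votes OK at prepare and then reports a heuristic rollback 
from commit. The class name HeuristicFailingXAResource is made up for this 
example; only the javax.transaction.xa types are real. It is a test aid 
under those assumptions, not something taken from the TS testsuite.

```java
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

/**
 * Hypothetical test-only resource (name is an assumption for this sketch):
 * it agrees to commit during prepare, then claims at commit time that it
 * has already rolled back on its own, i.e. a heuristic rollback.
 */
class HeuristicFailingXAResource implements XAResource {
    private int timeoutSeconds = 0;

    @Override
    public int prepare(Xid xid) throws XAException {
        return XA_OK; // vote to commit so the coordinator proceeds to phase two
    }

    @Override
    public void commit(Xid xid, boolean onePhase) throws XAException {
        // Fail at the critical point: report a heuristic rollback.
        throw new XAException(XAException.XA_HEURRB);
    }

    // The remaining methods do nothing interesting in this sketch.
    @Override public void rollback(Xid xid) throws XAException { }
    @Override public void start(Xid xid, int flags) throws XAException { }
    @Override public void end(Xid xid, int flags) throws XAException { }
    @Override public void forget(Xid xid) throws XAException { }
    @Override public Xid[] recover(int flag) throws XAException { return new Xid[0]; }
    @Override public boolean isSameRM(XAResource other) throws XAException { return other == this; }
    @Override public int getTransactionTimeout() throws XAException { return timeoutSeconds; }

    @Override
    public boolean setTransactionTimeout(int seconds) throws XAException {
        timeoutSeconds = seconds;
        return true;
    }
}
```

To exercise it you would enlist an instance in a transaction alongside at 
least one other resource, so that the transaction manager runs a full 
two-phase commit rather than a one-phase optimisation. The commit should 
then surface as a heuristic exception to the caller, though exactly which 
one is likely to depend on what the other participants did.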

These approaches all have one thing in common: they are intended for 
software engineers. Unless you have a tame developer to hand, your best 
bet will be to find a pre-existing example. The TS testsuite has 
several, although not in the most user-friendly form. There was also 
some work done for the anti-FUD video which is probably still hanging 
around somewhere.

As for troubleshooting advice:

Heuristics typically fall into two groups - those caused by transient 
failures in the environment and those caused by coding errors. The 
former are more common in production or stress tests, whilst the latter 
are normally weeded out at development time.

Heuristics caused by transient failures rarely need much in the way of 
root cause analysis - the network break / db server outage / power loss 
/ etc is normally readily apparent. The tricky bit is resolving by hand 
any transactions that are in a heuristically completed state. The TS 
system only automatically resolves pending transactions during recovery, 
not heuristically completed ones. Cleaning those up can require 
identifying which RMs were involved, examining state in the transaction 
manager and RMs and manually forcing log cleanup and data reconciliation 
in one or more of them. It's not easy and the exact steps depend on the 
specific RMs involved (each has its own management tooling) and the data 
content of the transaction branch in the RMs.  I'm in the process of 
adding additional logging output to assist with this type of thing, but 
from the point of view of the AS7/EAP6 user the biggest problem right 
now is we're missing an integrated tx objectstore browser frontend :-(

Heuristics caused by misbehaving (i.e. not spec compliant) RMs are less 
common but do occur occasionally. Unless the user is also the RM author, 
troubleshooting these is normally a job for GSS and the TS devs. The 
root cause is normally some disagreement or misunderstanding in the 
interpretation of an arcane point in the spec. In practical terms the 
cleanup is simpler, as in dev environments simply deleting all the 
relevant log records in the RM and app server will reset things - data 
reconciliation is rarely a significant concern here.

As for general (non-heuristic) error handling in transactions, the key 
cases are:

- transaction timed out in the background and the business logic thread 
did not notice. This frequently manifests in hibernate suddenly being 
unable to get a db connection for lazy loading. To recover, begin a new 
hibernate session. If it's happening frequently, lengthen the timeout, 
restructure your code or tune the environment.

- transaction already running on a thread, normally because thread 
pooling is used and something forgot to disassociate an old transaction 
before releasing the thread back to the pool. Use finally blocks for 
cleanup and be aware that TransactionManager.[commit|rollback] do 
disassociation but Transaction.[commit|rollback] don't. Yes I'm looking 
at you Mr. Spring Framework.

- recovery failed due to non-serializable resource. Usually caused by 
failing to configure the recovery plugin properly and hopefully a thing 
of the past now that it is at last done automatically by the server :-)

- the system disallowed enlistment of a second local resource. This is 
just a problem of user education - the difference between local-ds and 
xa-ds is not well understood.
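
The thread association case above can be sketched with a toy model. 
ToyTransactionManager and ToyTransaction below are hypothetical stand-ins 
written for this example, not the JTA interfaces or the TS implementation. 
They only mimic the behaviour described: completing via the manager clears 
the thread association, completing via the transaction object does not, so 
a finally block has to clear it before the thread goes back to the pool.

```java
/** Toy stand-in for a transaction; NOT the JTA Transaction interface. */
class ToyTransaction {
    boolean completed = false;

    // Mimics the behaviour described for Transaction.commit:
    // completes the transaction but leaves the thread association alone.
    void commit() { completed = true; }
}

/** Toy stand-in for a transaction manager; NOT the JTA TransactionManager. */
class ToyTransactionManager {
    private final ThreadLocal<ToyTransaction> current = new ThreadLocal<>();

    void begin() {
        if (current.get() != null) {
            throw new IllegalStateException("transaction already associated with this thread");
        }
        current.set(new ToyTransaction());
    }

    ToyTransaction getTransaction() { return current.get(); }

    // Mimics the behaviour described for TransactionManager.commit:
    // completes the transaction and disassociates it from the thread.
    void commit() {
        ToyTransaction tx = current.get();
        if (tx == null) throw new IllegalStateException("no transaction on this thread");
        tx.commit();
        current.remove();
    }

    // Disassociates whatever is on the thread without completing it.
    ToyTransaction suspend() {
        ToyTransaction tx = current.get();
        current.remove();
        return tx;
    }
}

class PooledWork {
    /** Runs one unit of work and always leaves the thread clean for the next user. */
    static void runUnit(ToyTransactionManager tm) {
        tm.begin();
        try {
            // Completing through the transaction object leaves it on the thread.
            tm.getTransaction().commit();
        } finally {
            if (tm.getTransaction() != null) {
                tm.suspend(); // clear the stale association before thread reuse
            }
        }
    }
}
```

Calling PooledWork.runUnit twice on the same thread works in this model 
because of the finally block. Without it, the second begin() fails with 
the "already associated" error, which is the toy equivalent of the 
problem described.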

I think those cover a majority of the support load, but the GSS guys may 
have others.

Jonathan.

-- 
Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod 
Street, Windsor, Berkshire, SL4 1TE, United Kingdom.
Registered in UK and Wales under Company Registration No. 3798903 
Directors: Michael Cunningham (USA), Charlie Peters (USA), Matt Parsons 
(USA) and Brendan Lane (Ireland).

