On 4 Apr 2009, at 16:16, Mircea Markus wrote:

Hi,

The current implementation of tx in JBC/Infinispan might result in heuristic transactions: e.g. if the coordinator cannot send a commit message (2nd phase of 2PC) to some of the participants within a given timeout, data might be committed on some nodes and rolled back on others.

?  If the coord (and I assume you mean the transaction coordinator, not the JGroups channel coordinator) doesn't broadcast a commit, none of the other nodes would have committed this state.  I don't see how you have a situation where it is committed on some and rolled back on others.

Perhaps you mean that the tx coordinator has broadcast a commit, some nodes receive it, and the tx coordinator dies before all of them have received it.  That also assumes you are not using multicast (if you are, they all receive the commit message at the same time).  But we recommend you use multicast anyway, so I'm not sure this is such a problem.
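
To make that scenario concrete, here is a minimal Java sketch of it.  It is a toy simulation, not JBC/Infinispan code: the class and method names are hypothetical, it assumes commits are delivered one node at a time (i.e. no multicast), and it assumes a prepared participant that never hears the commit rolls back on timeout.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseCommitSketch {
    enum State { PREPARED, COMMITTED, ROLLED_BACK }

    // Hypothetical participant; phase 1 (prepare) is assumed to have succeeded already.
    static final class Participant {
        final String name;
        State state = State.PREPARED;

        Participant(String name) { this.name = name; }

        void commit() { state = State.COMMITTED; }

        // Assumption: a participant still in PREPARED when the timeout fires rolls back.
        void timeoutRollback() {
            if (state == State.PREPARED) state = State.ROLLED_BACK;
        }
    }

    // The tx coordinator sends commit node by node and "dies" after
    // deliveredBeforeCrash participants have received it.
    static List<Participant> run(int participants, int deliveredBeforeCrash) {
        List<Participant> nodes = new ArrayList<>();
        for (int i = 0; i < participants; i++) {
            nodes.add(new Participant("node" + i));
        }
        for (int i = 0; i < Math.min(deliveredBeforeCrash, participants); i++) {
            nodes.get(i).commit();
        }
        for (Participant p : nodes) {
            p.timeoutRollback();
        }
        return nodes;
    }

    // True when some nodes committed and others rolled back (the heuristic outcome).
    static boolean isMixedOutcome(List<Participant> nodes) {
        boolean committed = false;
        boolean rolledBack = false;
        for (Participant p : nodes) {
            committed |= p.state == State.COMMITTED;
            rolledBack |= p.state == State.ROLLED_BACK;
        }
        return committed && rolledBack;
    }
}
```

With three nodes, `run(3, 1)` leaves one node committed and two rolled back, while `run(3, 3)` and `run(3, 0)` both end in a consistent state.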

Even worse, there is no way to take action and recover from the failure. Would it make sense to have a tx failure recovery mechanism in Infinispan?

Well, it depends.  If it is used as a cache for a db, then "recovery" is to just empty the cache.  Otherwise, if you want to treat it as a distributed in-memory db, "recovery" here would mean emptying the cache instance in question, and doing a state transfer from a neighbour (REPL) or re-hashing keys (DIST).
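
Roughly, the two kinds of "recovery" look like the sketch below.  Plain `Map`s stand in for cache instances here, and the class and method names are made up for illustration; a real state transfer or re-hash goes through the cache's own machinery, not a `putAll`.

```java
import java.util.Map;

public class CacheRecoverySketch {
    // Cache in front of a db: drop everything; later reads reload from the db.
    static void recoverAsDbCache(Map<String, String> local) {
        local.clear();
    }

    // REPL-style recovery: empty the suspect instance, then copy a neighbour's state.
    static void recoverFromNeighbour(Map<String, String> local,
                                     Map<String, String> neighbour) {
        local.clear();
        local.putAll(neighbour);
    }
}
```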

I'm referring here to something similar to the way DBs work, i.e. based on persistent tx logs, external notifications, etc. Even though I didn't see any such request on the forums, I guess such a feature is mandatory for certain systems, e.g. a financial application. Wdyt?

Persistent tx logs can be just as error-prone, unless you checkpoint open files to disk via OS system calls to ensure all kernel and hardware caches are flushed.  But this is *very* slow.  
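
For reference, forcing a log record to disk from Java looks something like the sketch below.  `TxLogSketch` and `appendDurably` are hypothetical names; `FileChannel.force(true)` is the standard NIO call that asks the OS to write the file's content and metadata to the storage device.  Whether the device's own write cache is flushed as well depends on the OS and hardware, so treat this as illustrative rather than a durability guarantee.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TxLogSketch {
    // Appends one record to the log and does not return until the OS
    // has been asked to push it to the storage device.
    static void appendDurably(Path log, String record) throws IOException {
        try (FileChannel ch = FileChannel.open(log,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // The expensive part: one forced flush per record.
            ch.force(true);
        }
    }
}
```

Calling `force` on every commit is what makes this slow; batching several records per `force` call trades durability for throughput, which is essentially the checkpoint-at-intervals approach.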

AFAIK the way DBs do this - including Oracle - is to checkpoint at intervals, but this still allows for windows where your persistent tx log could be out of date or corrupt.

Cheers
--
Manik Surtani
Lead, JBoss Cache