[infinispan-dev] heuristic transactions & failure recovery

Mon Apr 6 12:01:27 EDT 2009

On 6 Apr 2009, at 15:12, Mircea Markus wrote:

> Manik Surtani wrote:
>>
>> On 4 Apr 2009, at 16:16, Mircea Markus wrote:
>>
>>> Hi,
>>>
>>> The current implementation of tx in JBC/infinispan might result in
>>> heuristic transactions: e.g. if the coordinator cannot send a
>>> commit message (2nd phase of 2PC) within a given timeout to some
>>> of the participants, this might result in data being committed on
>>> some nodes and rolled back on others.
>>
>> ?  If the coord (and I assume you mean the transaction coordinator,  
>> not the JGroups channel coordinator) doesn't broadcast a commit,  
>> none of the other nodes would have committed this state.  I don't  
>> see how you have a situation where it is committed on some and  
>> rolled back on others.
>>
>> Perhaps you mean if the tx coordinator has broadcast a commit, some  
>> receive the commit and before all receive the commit the tx  
>> coordinator dies.
> yes, this is the scenario I had in mind.
>> And you are not using multicast (if you are they all receive the  
>> commit message at the same time).  But we recommend you use  
>> multicast anyway so I'm not so sure if this is such a problem.
> Generally speaking, not all messages are received *at the same time*.
> JGroups only guarantees that they will be received.
> Let's say that we have 3 nodes, A, B and C. A starts a tx, does a put
> ("k","v"), then commits the tx. During commit the following happens:
> 1) prepare is broadcast
>   B prepares and holds locks
>   C prepares and holds locks
> 2) A sees B and C voted okay, so it triggers a commit:
> - B receives the commit msg and applies changes (for good!)
> - A does not manage to send the message to C *in the given timeout*.
> At this point, the RPC call returns and A rolls back; C will also
> roll back after a while (tx timeout). But B will have the changes
> applied, and this will result in atomicity being violated.

Yes, but this is allowed in 2PC.  This leaves the tx in a state of  
STATUS_UNKNOWN, and it is up to the transaction manager to initiate a  
recovery *if* the resources are XA compliant and support recovery.
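
For illustration only, the A/B/C scenario above can be simulated with a
small Python sketch. Everything here (the Node class, the
reachable_for_commit flag, the outcome strings) is hypothetical and made
up for this example; none of it is JBC, Infinispan or JTA API. It only
shows how a timed-out commit message leaves B committed while A and C
roll back.

```python
from enum import Enum


class TxState(Enum):
    ACTIVE = "active"
    PREPARED = "prepared"
    COMMITTED = "committed"
    ROLLED_BACK = "rolled_back"


class Node:
    """Hypothetical participant; not an Infinispan or JBoss Cache class."""

    def __init__(self, name, reachable_for_commit=True):
        self.name = name
        self.reachable_for_commit = reachable_for_commit
        self.state = TxState.ACTIVE
        self.data = {}
        self.pending = {}

    def prepare(self, changes):
        # Phase 1: stage the changes and (conceptually) acquire locks.
        self.pending = dict(changes)
        self.state = TxState.PREPARED
        return True  # vote "okay"

    def commit(self):
        # Phase 2: an unreachable node simulates the commit RPC timing out.
        if not self.reachable_for_commit:
            raise TimeoutError(f"commit to {self.name} timed out")
        self.data.update(self.pending)
        self.pending = {}
        self.state = TxState.COMMITTED

    def rollback(self):
        # Discard staged changes, e.g. when the local tx timeout fires.
        self.pending = {}
        self.state = TxState.ROLLED_BACK


def two_phase_commit(coordinator, participants, changes):
    """Return 'committed', 'rolled_back' or 'heuristic_mixed'."""
    coordinator.prepare(changes)
    if not all(p.prepare(changes) for p in participants):
        for node in [coordinator, *participants]:
            node.rollback()
        return "rolled_back"

    timed_out = []
    for p in participants:
        try:
            p.commit()
        except TimeoutError:
            timed_out.append(p)

    if not timed_out:
        coordinator.commit()
        return "committed"

    # The coordinator gives up and rolls back; nodes that never received
    # the commit roll back too once their own tx timeout expires.
    coordinator.rollback()
    for p in timed_out:
        p.rollback()
    any_committed = any(p.state is TxState.COMMITTED for p in participants)
    return "heuristic_mixed" if any_committed else "rolled_back"
```

Running it with C unreachable during the second phase, e.g.
two_phase_commit(Node("A"), [Node("B"), Node("C", reachable_for_commit=False)], {"k": "v"}),
leaves the change applied on B only, which is the mixed outcome a
transaction manager would have to recover from.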

>>> Even worse, there is no way to take action and recover from the
>>> failure. Would it make sense to have a tx failure recovery
>>> mechanism in infinispan?
>>
>> Well, it depends.  If it is used as a cache for a db, then  
>> "recovery" is to just empty the cache.  Otherwise, if you want to  
>> treat it as a distributed in-memory db, "recovery" here would mean  
>> emptying the cache instance in question, and doing a state transfer  
>> from a neighbour (REPL) or re-hashing keys (DIST).
>>
> Yes. But right now, if a situation like the one I described happens,
> no admin will be notified, and inconsistent resources will be
> exposed to users. I'm thinking about a recovery mechanism in which
> (continuing the previous example):
> - C keeps locks on resources and does not allow users to see them
> until it can take a decision
> - when communication between A and C is re-established, A informs C
> that it should roll back the tx
> (Of course this is a simplistic solution, the problem is more  
> complex, e.g. A might die in between).
>
>>> I'm referring here to something similar to the way DBs work, i.e.
>>> based on persistent tx logs, external notifications etc.? Even
>>> though I didn't see any such request on the forums, I guess such a
>>> feature is mandatory for certain systems, e.g. a financial
>>> application. Wdyt?
>>
>> Persistent tx logs can be just as error-prone, unless you  
>> checkpoint open files to disk via OS system calls to ensure all  
>> kernel and hardware caches are flushed.  But this is *very* slow.
> Agreed. But it ensures correctness for those that need it.
>>
>> AFAIK the way DBs do this - including Oracle - is to checkpoint at  
>> intervals, but this still allows for windows where your persistent  
>> tx log could be out of date or corrupt.
> Not sure about that - the logs can be used, in the case of heuristic
> tx, for moving the system to a consistent state.

Provided there were no system failures between the collecting and the  
storing of these logs.  :-)
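
As a rough illustration of what forcing a tx log record to disk through
OS system calls involves, here is a minimal Python sketch. The function
name and the record format are made up for this example. Note that
fsync only asks the OS to flush; whether the device's own cache is also
flushed depends on the OS and hardware, and a crash before the call
returns still loses the record.

```python
import os


def append_durably(path, record):
    """Append one record and ask the OS to push it to stable storage."""
    with open(path, "ab") as log:
        log.write(record.encode("utf-8") + b"\n")
        log.flush()             # drain Python's userspace buffer to the kernel
        os.fsync(log.fileno())  # ask the kernel to write its cache to the device
```

Doing this on every prepare and commit is where the slowness comes from:
each call blocks until the OS reports the write as complete.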

--
Manik Surtani
manik at jboss.org
Lead, JBoss Cache
http://www.jbosscache.org