Mircea's initial fix [1] for ISPN-777 was on the right track but incomplete, as the comments in [2] show.
I've made a few more changes [3] to this, which will be in 4.2.0.CR3. The purpose of this email is to summarise my changes:
* The self-destruct check for transactions was only present in LockingInterceptor.visitLockControlCommand() [4]. It should also run when visiting prepare commands, since that is the other place where locks can be acquired while a remote node disappears.
* The check above is still not enough. There is a race between conducting the check on a transaction and actually acquiring a lock. E.g., a remote transaction may seem valid in [4], but by the time the thread acquires the lock the node could have died, rendering its transaction stale. To handle this, I have added a validity field to RemoteTransaction [5], which is marked invalid whenever a RollbackCommand is invoked [6] [7]. This allows the stale transaction cleanup task to flag such transactions as invalid even while they are being processed. Finally, I check this flag whenever locks are acquired and entries are written to [8], and release the locks appropriately if the transaction has been flagged invalid [9].
* I've also added a stress test that demonstrates this problem better (much more repeatable), based on the original test submitted by the reporter.
* Some other minor tweaks like better logging and toString() impls. :-)
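
To illustrate the race in the second point, here is a minimal sketch of the check, lock, re-check idea. It is not the actual Infinispan code: SketchRemoteTx, SketchLockManager and the raceWindow hook are hypothetical names made up for this example, and a single ReentrantLock stands in for the real per-key locking.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a remote transaction carrying a validity flag.
class SketchRemoteTx {
    private final AtomicBoolean valid = new AtomicBoolean(true);

    // Would be called by a rollback or by a stale-transaction cleanup task.
    void invalidate() { valid.set(false); }

    boolean isValid() { return valid.get(); }
}

// Hypothetical lock manager guarding a single entry with one lock.
class SketchLockManager {
    private final ReentrantLock lock = new ReentrantLock();

    // Returns true only if the lock is held AND the transaction is still valid.
    // raceWindow simulates work happening between the first check and the
    // lock acquisition, e.g. the originating node dying.
    boolean acquire(SketchRemoteTx tx, Runnable raceWindow) {
        if (!tx.isValid()) {
            return false;          // first check: already known to be stale
        }
        raceWindow.run();          // the transaction may be invalidated here
        lock.lock();
        if (!tx.isValid()) {       // second check: catches the race
            lock.unlock();         // release so a stale tx never holds the lock
            return false;
        }
        return true;
    }

    boolean isLocked() { return lock.isLocked(); }

    void release() {
        if (lock.isHeldByCurrentThread()) {
            lock.unlock();
        }
    }
}
```

Passing stale::invalidate as the raceWindow simulates the node dying between the first check and the lock acquisition. In that case acquire() returns false and no lock is left behind.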
Cheers
Manik
--
Manik Surtani
Lead, Infinispan
Lead, JBoss Cache