[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103

Wed Jun 13 11:52:09 EDT 2007

Looking through this in detail, I see that the main problem is my  
assumption that the timeout parameter in the JGroups RpcDispatcher's  
callRemoteMethods() only starts ticking once the transport layer puts  
the call on the wire.  IMO this is misleading and wrong: the timeout  
should start ticking the moment I make an API call into the  
RpcDispatcher.  
The very fact that I provide a timeout means that I shouldn't ever  
expect the call to block forever no matter what the circumstance (a  
bit like calling Object.wait() with no params vs calling it with a  
timeout).  Just had a chat with Jason to bounce a few ideas off him,  
and here is what I can do in JBoss Cache to work around this (IMO an  
ugly workaround):

*** JBC workaround (does not work; just looked through FC code in  
HEAD and it swallows InterruptedException - move on!)

- All calls to the RpcDispatcher register with a Map, containing a  
reference to the Thread and the timeout before making the  
RpcDispatcher call
- JBC runs a separate "RpcTimeoutMonitor" thread which periodically  
checks threads making RpcDispatcher calls against their timeouts,  
interrupting those that have taken too long.
- The code calling the RpcDispatcher is wrapped in a try block that  
catches InterruptedException and throws a timeout exception to  
signify an RPC timeout.

The problem with this approach is the extra overhead of a  
RpcTimeoutMonitor thread.  The fact that the timeout will not be 100%  
accurate is not a problem - a "best effort" is good enough, so a call  
that only times out after 2200 ms despite a timeout param of 2000  
should not be of consequence.  At least calls don't get stuck,  
regardless of why or where in the RPC process they are held up.
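
For illustration, the monitor idea could be sketched roughly as below.
This is only a sketch under stated assumptions: the class name
RpcTimeoutMonitor comes from the description above, but the call()
helper, the poll interval and the use of a Callable to stand in for
the RpcDispatcher invocation are all hypothetical.  It also only
helps if the blocked code responds to interruption, which, as noted
above, FC in HEAD does not.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not part of JBoss Cache or JGroups.
final class RpcTimeoutMonitor implements Runnable {
    // Calling thread -> absolute deadline (System.nanoTime() based).
    private final Map<Thread, Long> deadlines = new ConcurrentHashMap<>();
    private final long pollMillis;

    RpcTimeoutMonitor(long pollMillis) {
        this.pollMillis = pollMillis;
    }

    // Registers the caller, runs the RPC (represented by a Callable here),
    // and turns an interrupt into a timeout exception.
    <T> T call(Callable<T> rpc, long timeoutMillis) throws Exception {
        Thread self = Thread.currentThread();
        deadlines.put(self, System.nanoTime() + timeoutMillis * 1_000_000L);
        try {
            return rpc.call();
        } catch (InterruptedException e) {
            throw new TimeoutException("RPC timed out after ~" + timeoutMillis + " ms");
        } finally {
            // A real version must also handle an interrupt that arrives
            // just after the call has completed.
            deadlines.remove(self);
        }
    }

    // Periodically interrupts callers whose deadline has passed.
    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long now = System.nanoTime();
            for (Map.Entry<Thread, Long> e : deadlines.entrySet()) {
                boolean expired = now - e.getValue() >= 0;
                // Remove first so each caller is interrupted at most once.
                if (expired && deadlines.remove(e.getKey(), e.getValue())) {
                    e.getKey().interrupt();
                }
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException ie) {
                return; // monitor shut down
            }
        }
    }
}
```

The poll interval bounds how late a timeout can fire, which is where
the "2200 ms instead of 2000" kind of inaccuracy would come from.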

*** More inefficient JBC workaround

- All calls to the RpcDispatcher happen in a separate thread, using a  
ThreadPoolExecutor
- The app thread then waits for timeout ms, and if the RPC call  
hasn't completed, throws a timeout exception - so even if the thread  
is stuck in, say, FC, at least JBC can roll back the tx and release  
locks, etc.

This is inefficient because each and every RPC call will happen in a  
separate thread, plus there is potential ugliness around orphaned  
threads stuck in a JGroups protocol.
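
A minimal sketch of this second workaround, assuming the RPC is
wrapped in a Callable (the TimedRpcInvoker class and its invoke()
method are hypothetical names, not existing JBC code).  The pool
returned by Executors.newCachedThreadPool() is a ThreadPoolExecutor
under the hood.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not part of JBoss Cache or JGroups.
final class TimedRpcInvoker {
    // Daemon threads so a worker stuck in a protocol cannot block JVM exit.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "rpc-caller");
        t.setDaemon(true);
        return t;
    });

    // Runs the RPC on a pool thread; the app thread waits at most timeoutMillis.
    <T> T invoke(Callable<T> rpc, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> result = pool.submit(rpc);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: if the worker ignores interrupts it stays
            // orphaned, but the caller can still roll back and release locks.
            result.cancel(true);
            throw e;
        }
    }
}
```

The app thread gets its timeout no matter where the worker is stuck,
at the cost of a thread hand-off per call.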

*** JGroups fix - I think this can be done easily, where any blocking  
operations in JGroups protocols make use of the timeout parameter.   
Again, this will not provide 100% timeout accuracy, only a "best  
effort", but like I said IMO this is ok.  (At the moment FC loops  
until it has enough creds.  I think this loop should timeout using  
the same timeout param.)  Now passing this param will involve a  
transient field in Event which RequestCorrelator could use to set the  
timeout.  Protocols like FC can then use this timeout to determine  
how long it should loop for when waiting for creds.
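
To make the idea of a bounded credit wait concrete, here is a rough
sketch of a loop that gives up once the timeout has elapsed.  It is
not the actual FC code: CreditGate, acquire() and replenish() are
hypothetical names, and how the timeout would reach the protocol
(e.g. via the Event) is left out.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a flow-control protocol's credit accounting.
final class CreditGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition creditsAvailable = lock.newCondition();
    private long credits;

    CreditGate(long initialCredits) {
        this.credits = initialCredits;
    }

    // Waits for enough credits, but never longer than timeoutMillis.
    // Returns true if the credits were taken, false on timeout or interrupt.
    boolean acquire(long needed, long timeoutMillis) {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (credits < needed) {
                if (remainingNanos <= 0) {
                    return false; // timed out instead of looping forever
                }
                try {
                    // awaitNanos returns an estimate of the time still left.
                    remainingNanos = creditsAvailable.awaitNanos(remainingNanos);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // do not swallow it
                    return false;
                }
            }
            credits -= needed;
            return true;
        } finally {
            lock.unlock();
        }
    }

    // Called when the receiver grants more credits.
    void replenish(long amount) {
        lock.lock();
        try {
            credits += amount;
            creditsAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

A false return would then be translated into an RPC timeout further
up the stack.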

Thoughts?  My preferred option is the last one, since it gives the  
timeout param in the RpcDispatcher more meaning.

Cheers,
--
Manik Surtani

Lead, JBoss Cache
JBoss, a division of Red Hat