[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103

Bela Ban bela at jboss.com
Thu Jun 14 05:23:14 EDT 2007


Is this with JGroups 2.5 or 2.4 ? In versions prior to 2.5, we 
recommended removing FC when making synchronous method calls.

From JBCCACHE-1103:
1) Thread1 on node1 calls get(fqnA), which is not in-VM, so the CCL 
tries to do a clustered get. The CacheLoaderInterceptor lock for fqnA 
is held at this time.
2) Thread1 blocks in FC, waiting for credits.
3) A replication message for fqnA arrives from node2.
4) The IncomingPacketHandler thread blocks waiting for the 
CacheLoaderInterceptor lock for fqnA.
5) The FC credit responses cannot be delivered, because the 
IncomingPacketHandler thread that would process them is blocked, so we 
deadlock.
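
To make the shape of this deadlock explicit, here is a minimal, 
self-contained sketch. It has nothing to do with the actual JBC or 
JGroups classes - the names only mirror the roles in the scenario 
above - and it deliberately hangs when run:

import java.util.concurrent.CountDownLatch;

// Illustrates the deadlock shape only: one thread holds a lock and
// blocks waiting for "credits"; the only thread that could deliver
// them first needs that same lock.
public class DeadlockShape {
    // fqnALock stands in for the CacheLoaderInterceptor lock, credits for FC credits
    private static final Object fqnALock = new Object();
    private static final CountDownLatch credits = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Thread caller = new Thread(new Runnable() {
            public void run() {
                synchronized(fqnALock) {       // step 1: lock for fqnA is held
                    try {
                        credits.await();       // step 2: blocks in "FC" waiting for credits
                    }
                    catch(InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }
        }, "Thread1");

        Thread incoming = new Thread(new Runnable() {
            public void run() {
                synchronized(fqnALock) {       // step 4: blocks on the same lock ...
                    credits.countDown();       // ... so the credits never arrive (step 5)
                }
            }
        }, "IncomingPacketHandler");

        caller.start();
        incoming.start();
        caller.join();                         // never returns
    }
}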


Why can credit responses not arrive ? This is 2.5, not 2.4, right ? In 
2.5, credit responses arrive as OOB messages, so they will always be 
delivered, unless you disabled the OOB thread pool. Or is this 2.4 ? 
Then remove FC from the stack. Oh shit, just checked the case, this *is* 
1.4.1 SP3 ! Yes, then remove FC from the stack; because you make 
synchronous method calls, this is not an issue, as we don't need flow 
control in that case.

Comments inline.

Manik Surtani wrote:
> Looking through this in detail, I see that the main problem is my 
> assumption that the timeout parameter in the JGroups RpcDispatcher's 
> callRemoteMethods() only starts ticking once the transport layer puts 
> the call on the wire.  IMO this is misleading and wrong; the timeout 
> should start ticking the moment I make an API call into the RpcDispatcher.  
> The very fact that I provide a timeout means that I shouldn't ever 
> expect the call to block forever no matter what the circumstance (a 
> bit like calling Object.wait() with no params vs calling it with a 
> timeout).  Just had a chat with Jason to bounce a few ideas off him, 
> and here is what I can do in JBoss Cache to work around this (IMO an 
> ugly workaround):
>
> *** JBC workaround (does not work; just looked through FC code in HEAD 
> and it swallows InterruptedException - move on!)

Where does FC swallow an InterruptedException ? The only place where I 
catch an exception is in handleDownMessage():

catch(InterruptedException e) {
    // set the interrupted flag again, so the caller's thread
    // can handle the interrupt as well
    Thread.currentThread().interrupt();
}


This does not swallow the exception; rather, it passes the interrupt on 
to the calling thread, as suggested by JCIP (Java Concurrency in Practice).


> - All calls to the RpcDispatcher register with a Map, containing a 
> reference to the Thread and the timeout, before making the 
> RpcDispatcher call.
> - JBC runs a separate "RpcTimeoutMonitor" thread which periodically 
> checks threads making RpcDispatcher calls against their timeouts, 
> interrupting those that have taken too long.
> - The code calling the RpcDispatcher is wrapped in a try block that 
> catches InterruptedException and throws a timeout exception to 
> signify an RPC timeout.
>
> The problem with this approach is the extra overhead of the 
> RpcTimeoutMonitor thread.  The fact that the timeout will not be 100% 
> accurate is not a problem - a "best effort" is good enough, so if a 
> call only times out after 2200 ms even though it was called with a 
> timeout param of 2000 ms, that is of no consequence.  At least calls 
> don't get stuck, regardless of why or where in the RPC process they 
> are held up.
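
As a rough sketch only - RpcTimeoutMonitor and its fields are invented 
for illustration, this is not actual JBC code - the monitor described 
above would look roughly like this:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the monitor approach described above; the class and field
// names are invented, this is not actual JBC code.
public class RpcTimeoutMonitor extends Thread {
    // caller thread -> absolute deadline in ms
    private final Map<Thread,Long> deadlines = new ConcurrentHashMap<Thread,Long>();

    public RpcTimeoutMonitor() {
        setDaemon(true);
        setName("RpcTimeoutMonitor");
    }

    /** Called by a caller just before it makes the RpcDispatcher call. */
    public void register(long timeoutMs) {
        deadlines.put(Thread.currentThread(), System.currentTimeMillis() + timeoutMs);
    }

    /** Called by the caller in a finally block after the call returns. */
    public void unregister() {
        deadlines.remove(Thread.currentThread());
    }

    public void run() {
        while(!isInterrupted()) {
            long now = System.currentTimeMillis();
            for(Map.Entry<Thread,Long> entry: deadlines.entrySet()) {
                if(now > entry.getValue()) {
                    entry.getKey().interrupt();   // best effort: wake up the stuck caller
                    deadlines.remove(entry.getKey());
                }
            }
            try {
                Thread.sleep(500);                // accuracy is only "best effort"
            }
            catch(InterruptedException e) {
                break;
            }
        }
    }
}

The caller would wrap the dispatcher call in a try/finally around 
register()/unregister() and convert an interrupt into a timeout 
exception - the part that falls apart if the interrupt gets swallowed 
along the way, as noted above.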
>
> *** More inefficient JBC workaround
>
> - All calls to the RpcDispatcher happen in a separate thread, using a 
> ThreadPoolExecutor
> - The app thread then waits for timeout ms, and if the RPC call hasn't 
> completed, throws a timeout exception - so even if the thread is stuck 
> in, say, FC, at least JBC can roll back the tx and release locks, etc.
>
> Inefficient, because each and every RPC call will happen in a separate 
> thread, plus there is potential ugliness around orphaned threads stuck 
> in a JGroups protocol.
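
A sketch of this variant - the names are invented and this is not 
actual JBC code - assuming the RPC call is wrapped in a Callable:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the "one thread per RPC call" variant; names are invented.
public class RpcCallWrapper {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    /** Runs the RPC call in a pool thread and waits at most timeoutMs for it. */
    public <T> T callWithTimeout(Callable<T> rpcCall, long timeoutMs) throws Exception {
        Future<T> future = pool.submit(rpcCall);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch(TimeoutException e) {
            future.cancel(true);  // best effort; the pool thread may stay stuck in a JGroups protocol
            throw e;              // the caller can then roll back the tx and release locks
        }
    }
}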
>
> *** JGroups fix - I think this can be done easily, where any blocking 
> operations in JGroups protocols make use of the timeout parameter.  
> Again, this will not provide 100% timeout accuracy, but a "best 
> effort", and like I said IMO this is ok.  (At the moment FC loops 
> until it has enough credits.  I think this loop should time out using 
> the same timeout param.)  Passing this param would involve a transient 
> field in Event which RequestCorrelator could use to set the timeout.  
> Protocols like FC could then use this timeout to determine how long 
> they should loop when waiting for credits.
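
As a sketch of what is being proposed here (not the actual FC code; the 
field names are made up), the open-ended credit loop would become a 
bounded wait:

// Sketch only, not the actual FC code; field names are made up.
public class BoundedCreditWait {
    private final Object creditLock = new Object();
    private long availableCredits = 0;

    /** Returns true if enough credits were obtained before the timeout expired. */
    public boolean decrementCredits(long bytesNeeded, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized(creditLock) {
            while(availableCredits < bytesNeeded) {
                long remaining = deadline - System.currentTimeMillis();
                if(remaining <= 0)
                    return false;            // give up rather than block forever
                creditLock.wait(remaining);  // woken by replenishCredits()
            }
            availableCredits -= bytesNeeded;
            return true;
        }
    }

    /** Called when a credit response arrives from a receiver. */
    public void replenishCredits(long credits) {
        synchronized(creditLock) {
            availableCredits += credits;
            creditLock.notifyAll();
        }
    }
}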
>
> Thoughts?  My preferred option is the last one, since it gives the 
> timeout param in the RpcDispatcher more meaning. 

No, that's a bad solution because the design of a flow control protocol 
should not be influenced by an application level workaround. In 
addition, if you have 500 threads, all timing out at the same time, you 
will have a steady flow of messages, defeating the purpose of flow 
control in the first place.

On top of that, we *cannot* do that because, if we run on top of TCP, a 
write() might block anyway if the TCP receiver sets the sliding window to 
0 ! So sending data on top of TCP will block (similar to FC) when TCP 
throttles the sender.

By the way, the same issue of 'timed method calls' came up in CORBA 
years ago, e.g. invoke foo(), but it should take 450ms tops.

What *could* be done here is to add an option to RpcDispatcher to use 
separate threads from a thread pool to dispatch requests and correlate 
responses. The caller would create a task (which sends the request and 
waits for all responses, possibly listening for cancellation), submit 
it to the pool and get a future. It would then wait on the future for N 
milliseconds and either return with the results received so far, or 
throw a timeout exception, whatever fits.
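
As a sketch of that idea (GroupCall and the partial-results handling 
are invented here, this is not an existing RpcDispatcher API):

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the thread-pool option described above; GroupCall and the
// partial-results handling are invented, not an existing JGroups API.
public class TimedDispatch {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    /** A task that sends the request and collects responses as they arrive. */
    public interface GroupCall<T> extends Callable<List<T>> {
        List<T> partialResults();   // responses received so far
    }

    /** Waits at most timeoutMs, then returns whatever responses have arrived. */
    public <T> List<T> dispatch(GroupCall<T> call, long timeoutMs) throws Exception {
        Future<List<T>> future = pool.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);   // all responses arrived in time
        }
        catch(TimeoutException e) {
            future.cancel(true);            // the task may listen for this cancellation
            return call.partialResults();   // or throw a timeout exception instead
        }
    }
}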

This issue cannot and should not be solved at the FC level, by 
'bypassing' flow control ! Note that, under normal circumstances, and 
with 2.5, FC should never block for an extended time.

-- 
Bela Ban
Lead JGroups / JBoss Clustering team
JBoss - a division of Red Hat


