[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103

Brian Stansberry brian.stansberry at redhat.com
Thu Jun 14 08:48:39 EDT 2007


This is HTTP session repl; the overall cache is REPL_ASYNC, so FC is 
needed.  It is only the "clustered get" call made by the 
ClusteredCacheLoader that is synchronous.  The CCL uses the 
RpcDispatcher directly and makes a sync call, since those are the 
semantics needed for that particular call.
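
For anyone not familiar with the CCL, the call it makes looks roughly 
like the sketch below.  This is only an illustration of the pattern, not 
the actual JBC code; the _clusteredGet method name, the argument types 
and the way the dispatcher is wired up are all placeholders.

import org.jgroups.blocks.GroupRequest;
import org.jgroups.blocks.MethodCall;
import org.jgroups.blocks.RpcDispatcher;
import org.jgroups.util.RspList;

// Rough sketch of a "clustered get": a synchronous RPC to the rest of the
// group, even though ordinary replication on this cache is asynchronous.
public class ClusteredGetSketch {
    private final RpcDispatcher disp;   // assumed to be wired to the cache's channel
    private final long timeout;         // per-call timeout in ms

    public ClusteredGetSketch(RpcDispatcher disp, long timeout) {
        this.disp = disp;
        this.timeout = timeout;
    }

    public RspList clusteredGet(String fqn) {
        // _clusteredGet is a placeholder method name; GET_ALL makes the call
        // block until all members reply or the timeout expires.
        MethodCall call = new MethodCall("_clusteredGet",
                new Object[]{fqn}, new Class[]{String.class});
        return disp.callRemoteMethods(null /* all members */, call,
                GroupRequest.GET_ALL, timeout);
    }
}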

Background -- use of CCL is an experiment to try to get around problems 
with initial state transfer with large states.

Good point about OOB in 2.5; that should prevent this situation. :)

Bela Ban wrote:
> Is this with JGroups 2.5 or 2.4 ? In versions prior to 2.5, we 
> recommended removing FC when making synchronous method calls.
> 
>  From 1103:
> 1) Thread1 on node1 is trying to get(fqnA) which is not in-VM, so the 
> CCL tries to do a clustered get. The CacheLoaderInterceptor lock for 
> fqnA is held at this time.
> 2) Thread1 blocks in FC waiting for credits.
> 3) Replication message for fqnA arrives from node2.
> 4) IncomingPacketHandler thread will block waiting for the 
> CacheLoaderInterceptor lock for fqnA.
> 5) FC credits cannot arrive, so we deadlock.
> 
> 
> Why can credit responses not arrive ? This is 2.5, not 2.4 right ? In 
> 2.5, credit responses arrive as OOB messages, so they will always be 
> delivered, unless you disabled the OOB thread pool. Or is this 2.4 ? 
> Then remove FC from the stack. Oh shit, just checked the case, this *is* 
> 1.4.1 SP3 ! Yes, then remove FC from the stack; because you make 
> synchronous method calls, this is not an issue as we don't need flow 
> control in that case.
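
For what it's worth, removing FC just means deleting the FC element from 
whatever stack definition the cache uses.  Something along these lines, 
though the protocol list below is only a placeholder and not a tuned or 
complete stack:

import org.jgroups.JChannel;

public class NoFcStackSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder stack string: the usual protocols, minus FC.  In a real
        // deployment you would edit the cache's ClusterConfig / stack definition
        // rather than hard-coding a string like this.
        String propsWithoutFC =
                "UDP(mcast_addr=228.1.2.3;mcast_port=45566):" +
                "PING:MERGE2:FD_SOCK:FD:VERIFY_SUSPECT:" +
                "pbcast.NAKACK:UNICAST:pbcast.STABLE:pbcast.GMS:" +
                // FC would normally appear here
                "FRAG2:pbcast.STATE_TRANSFER";
        JChannel ch = new JChannel(propsWithoutFC);
        ch.connect("demo-cluster");
        ch.close();
    }
}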
> 
> Comments inline.
> 
> Manik Surtani wrote:
>> Looking through this in detail, I see that the main problem is my 
>> assumption that the timeout parameter in the JGroups RpcDispatcher's 
>> callRemoteMethods() only starts ticking once the transport layer puts 
>> the call on the wire.  IMO this is misleading and wrong, and should 
>> start ticking the moment I make an API call into the RpcDispatcher.  
>> The very fact that I provide a timeout means that I shouldn't ever 
>> expect the call to block forever no matter what the circumstance (a 
>> bit like calling Object.wait() with no params vs calling it with a 
>> timeout).  Just had a chat with Jason to bounce a few ideas off him, 
>> and here is what I can do in JBoss Cache to work around this (IMO an 
>> ugly workaround):
>>
>> *** JBC workaround (does not work; just looked through FC code in HEAD 
>> and it swallows InterruptedException - move on!)
> 
> Where does FC swallow an InterruptedException ? The only place where I 
> catch an exception is in handleDownMessage():
> 
> catch(InterruptedException e) {
>     // set the interrupted flag again, so the caller's thread
>     // can handle the interrupt as well
>     Thread.currentThread().interrupt();
> }
> 
> 
> This does not swallow the exception; rather, it passes the interrupt on 
> to the calling thread, as suggested by JCIP.
> 
> 
>> - All calls to the RpcDispatcher first register with a Map holding a 
>> reference to the calling Thread and its timeout, before making the 
>> RpcDispatcher call.
>> - JBC runs a separate "RpcTimeoutMonitor" thread which periodically 
>> checks the threads making RpcDispatcher calls against their timeouts, 
>> interrupting those that have taken too long.
>> - The code calling the RpcDispatcher is wrapped in a try block that 
>> catches InterruptedException and throws a timeout exception to signify 
>> an RPC timeout.
>>
>> The problem with this approach is the extra overhead of an 
>> RpcTimeoutMonitor thread.  The fact that the timeout will not be 100% 
>> accurate is not a problem - a "best effort" is good enough, so a call 
>> that only times out after 2200 ms even though it was made with a 
>> timeout param of 2000 ms is of no consequence.  At least calls don't 
>> get stuck, regardless of why or where in the RPC process they are held 
>> up.
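
To make that first option concrete, a monitor along those lines might 
look roughly like this.  The RpcTimeoutMonitor name comes from the 
description above; everything else is invented for illustration:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: callers register before an RPC and deregister afterwards;
// a daemon thread interrupts any caller that has exceeded its timeout.
public class RpcTimeoutMonitor extends Thread {
    // caller thread -> absolute deadline in ms (System.currentTimeMillis() based)
    private final Map<Thread, Long> deadlines = new ConcurrentHashMap<Thread, Long>();

    public RpcTimeoutMonitor() {
        setDaemon(true);
        setName("RpcTimeoutMonitor");
    }

    public void register(long timeoutMillis) {
        deadlines.put(Thread.currentThread(), System.currentTimeMillis() + timeoutMillis);
    }

    public void deregister() {
        deadlines.remove(Thread.currentThread());
    }

    public void run() {
        while (!isInterrupted()) {
            long now = System.currentTimeMillis();
            for (Map.Entry<Thread, Long> entry : deadlines.entrySet()) {
                if (now > entry.getValue()) {
                    entry.getKey().interrupt();   // best effort: may land in FC's wait
                    deadlines.remove(entry.getKey());
                }
            }
            try {
                Thread.sleep(100);                // check period; accuracy is "best effort"
            } catch (InterruptedException ie) {
                return;
            }
        }
    }
}

The caller would register before the RpcDispatcher call, translate an 
InterruptedException into a timeout exception, and always deregister in 
a finally block.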
>>
>> *** More inefficient JBC workaround
>>
>> - All calls to the RpcDispatcher happen in a separate thread, using a 
>> ThreadPoolExecutor
>> - The app thread then waits for timeout ms, and if the RPC call hasn't 
>> completed, throws a timeout exception - so even if the thread is stuck 
>> in, say, FC, at least JBC can roll back the tx and release locks, etc.
>>
>> Inefficient because each and every RPC call happens in a separate 
>> thread, plus there is potential ugliness around orphaned threads stuck 
>> in a JGroups protocol.
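
Sketched out, that second option is basically the following pattern.  
Again, the dispatcher call is the same hypothetical one as above and 
none of this is actual JBC code:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.jgroups.blocks.GroupRequest;
import org.jgroups.blocks.MethodCall;
import org.jgroups.blocks.RpcDispatcher;
import org.jgroups.util.RspList;

// Sketch: dispatch the RPC from a pooled thread and bound the caller's wait
// with a Future, so the caller can give up even if the worker is stuck in FC.
public class BoundedRpcSketch {
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final RpcDispatcher disp;

    public BoundedRpcSketch(RpcDispatcher disp) {
        this.disp = disp;
    }

    public RspList callWithHardTimeout(final MethodCall call, final long timeout)
            throws Exception {
        Future<RspList> future = pool.submit(new Callable<RspList>() {
            public RspList call() {
                // The pooled thread may still block in FC, but the caller no longer does.
                return disp.callRemoteMethods(null, call, GroupRequest.GET_ALL, timeout);
            }
        });
        try {
            return future.get(timeout, TimeUnit.MILLISECONDS);
        } catch (TimeoutException te) {
            future.cancel(true);   // best effort; the worker may stay stuck in a protocol
            throw te;              // the caller can now roll back the tx and release locks
        }
    }
}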
>>
>> *** JGroups fix - I think this can be done easily, where any blocking 
>> operations in JGroups protocols make use of the timeout parameter.  
>> Again, this will not provide 100% timeout accuracy, but a "best 
>> effort"; like I said, IMO this is ok.  (At the moment FC loops until 
>> it has enough credits.  I think this loop should time out using the 
>> same timeout param.)  Passing this param down will involve a transient 
>> field in Event which RequestCorrelator could use to set the timeout.  
>> Protocols like FC can then use this timeout to determine how long they 
>> should loop while waiting for credits.
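
The "loop with a timeout" idea boils down to the standard 
deadline-bounded wait pattern, i.e. something like the following.  This 
is a generic illustration only, not actual FC code:

import java.util.concurrent.TimeoutException;

// Generic illustration: wait for a condition (e.g. "enough credits") but give
// up once the caller-supplied timeout has elapsed.
public class BoundedWaitSketch {
    private final Object lock = new Object();
    private long credits;

    public void waitForCredits(long required, long timeoutMillis)
            throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (lock) {
            while (credits < required) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0)
                    throw new TimeoutException(
                            "gave up waiting for credits after " + timeoutMillis + " ms");
                lock.wait(remaining);
            }
            credits -= required;
        }
    }

    public void addCredits(long amount) {
        synchronized (lock) {
            credits += amount;
            lock.notifyAll();
        }
    }
}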
>>
>> Thoughts?  My preferred option is the last one, since it gives the 
>> timeout param in the RpcDispatcher more meaning. 
> 
> No, that's a bad solution because the design of a flow control protocol 
> should not be influenced by an application level workaround. In 
> addition, if you have 500 threads, all timing out at the same time, you 
> will have a steady flow of messages, defeating the purpose of flow 
> control in the first place.
> 
> On top of that, we *cannot* do that because if we run on top of TCP, a 
> write() might block anyway if the TCP receiver sets the sliding window to 
> 0 ! So the sending of data on top of TCP will block (similar to FC) when 
> TCP throttles the sender.
> 
> By the way, some decades ago the same issue of 'timed method calls' 
> occurred in CORBA, e.g. invoke foo() but it should take 450ms tops.
> 
> What *could* be done here is to add an option to RpcDispatcher to use 
> separate threads from a thread pool to dispatch requests and correlate 
> responses. So, the caller would create a task (which sends the request 
> and waits for all responses, possibly listening for cancellation), 
> submit it to the pool and get a future. Then wait on the future for N 
> milliseconds and return with the current results after that time, or 
> throw a timeout exception, whatever.
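
That would look roughly like the sketch below.  It is deliberately 
generic - the member list, the sendAndReceive() helper and the response 
map are all invented for illustration, not JGroups or JBC API - but it 
shows how a caller that gives up after N ms can still return whatever 
responses have arrived:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: the task stores responses as they arrive, so a timed-out caller
// still sees a partial result set instead of nothing.
public class PartialResultsRpcSketch {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public Map<String, Object> callWithPartialResults(final Iterable<String> members,
                                                      long timeoutMillis)
            throws InterruptedException {
        final Map<String, Object> responses = new ConcurrentHashMap<String, Object>();
        Future<?> future = pool.submit(new Runnable() {
            public void run() {
                for (String member : members) {
                    // sendAndReceive() stands in for the real per-member request.
                    responses.put(member, sendAndReceive(member));
                }
            }
        });
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);   // wait for all responses
        } catch (TimeoutException te) {
            future.cancel(true);        // or let it run on; either way, return what we have
        } catch (ExecutionException ee) {
            // a real implementation would record the failure per member
        }
        return responses;               // complete if everyone replied in time, partial otherwise
    }

    private Object sendAndReceive(String member) {
        return "response-from-" + member;   // dummy stand-in for a blocking unicast RPC
    }
}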
> 
> This issue cannot and should not be solved at the FC level, by 
> 'bypassing' flow control ! Note that, under normal circumstances, and 
> with 2.5, FC should never block for an extended time.
> 


-- 
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat
brian.stansberry at redhat.com



