Re: [jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103

Thursday, 14 June 2007

Ok, so I think we have a solution here.

Bela's created http://jira.jboss.com/jira/browse/JGRP-533 (see task  
for details on this)

Once this is done (JGroups 2.4.1.SP4) I can fix JBCACHE-1103 by  
patching the CCL to make use of this JGroups fix.  This will mean a  
new release for JBC (1.4.1.SP4)

Cheers,
Manik

On 14 Jun 2007, at 13:48, Brian Stansberry wrote:

...
 This is http session repl; the overall cache is REPL_ASYNC, so FC  
 is needed.  It is only the "clustered get" call made by the  
 ClusteredCacheLoader that makes a sync call.  The CCL directly uses  
 the RpcDispatcher and uses a sync call since that is the semantic  
 needed for that particular call.

 Background -- use of CCL is an experiment to try to get around  
 problems with initial state transfer with large states.

 Good point about OOB in 2.5; that should prevent this situation. :)

 Bela Ban wrote:
> Is this with JGroups 2.5 or 2.4 ? In versions prior to 2.5, we  
> recommended removing FC when making synchronous method calls.
>  From 1103:
> 1) Thread1 on node1 is trying to get(fqnA) which is not in-VM, so  
> the CCL tries to do a clustered get. The CacheLoaderInterceptor  
> lock for fqnA is held at this time.
> 2) Thread1 blocks in FC waiting for credits.
> 3) Replication message for fqnA arrives from node2.
> 4) IncomingPacketHandler thread will block waiting for the  
> CacheLoaderInterceptor lock for fqnA.
> 5) FC credits cannot arrive, so we deadlock.
> Why can credit responses not arrive ? This is 2.5, not 2.4 right ?  
> In 2.5, credit responses arrive as OOB messages, so they will  
> always be delivered, unless you disabled the OOB thread pool. Or is  
> this 2.4 ? Then remove FC from the stack. Oh shit, just checked  
> the case, this *is* 1.4.1 SP3 ! Yes, then remove FC from the  
> stack; because you make synchronous method calls, this is not an  
> issue as we don't need flow control in that case.
> Comments inline.
> Manik Surtani wrote:
>> Looking through this in detail, I see that the main problem is my  
>> assumption that the timeout parameter in the JGroups  
>> RpcDispatcher's callRemoteMethods() only starts ticking once the  
>> transport layer puts the call on the wire.  IMO this is  
>> misleading and wrong, and should start ticking the moment I make  
>> an API call into the RpcDispatcher.  The very fact that I provide  
>> a timeout means that I shouldn't ever expect the call to block  
>> forever no matter what the circumstance (a bit like calling  
>> Object.wait() with no params vs calling it with a timeout).  Just  
>> had a chat with Jason to bounce a few ideas off him, and here is  
>> what I can do in JBoss Cache to work around this (IMO an ugly  
>> workaround):
>>
>> *** JBC workaround (does not work; just looked through FC code in  
>> HEAD and it swallows InterruptedException - move on!)
> Where does FC swallow an InterruptedException ? The only place  
> where I catch an exception is in handleDownMessage():
> catch(InterruptedException e) {
>     // set the interrupted flag again, so the caller's thread can
>     // handle the interrupt as well
>     Thread.currentThread().interrupt();
> }
> This does not swallow the exception; it rather passes it on to the  
> calling thread, as suggested by JCIP.
>> - All calls to the RpcDispatcher register with a Map, containing a  
>> reference to the Thread and the timeout, before making the  
>> RpcDispatcher call
>> - JBC runs a separate "RpcTimeoutMonitor" thread which  
>> periodically checks threads making RpcDispatcher calls against  
>> their timeouts, interrupting those that have taken too long.
>> - The code calling the RpcDispatcher is wrapped in a try block  
>> that catches interrupted exceptions and throws a timeout  
>> exception to signify RPC timeout.
>>
>> The problem with this approach is the extra overhead of a  
>> RpcTimeoutMonitor thread.  The fact that the timeout will not be  
>> 100% accurate is not a problem - a "best effort" is good enough,  
>> so a call that only times out after 2200 ms even though it was  
>> called with a timeout param of 2000 should not be of  
>> consequence.  At least calls don't get stuck, regardless of why  
>> or where in the RPC process it is held up.
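
The monitor idea described above could look roughly like the following in Java. This is a minimal sketch under stated assumptions, not JBoss Cache code: the class name `RpcTimeoutMonitor` comes from the message, but the `Registration` helper, the `call()` wrapper, and the scan interval are hypothetical, and the RPC itself is stood in for by a plain `Callable`. It only helps if the blocked code responds to interruption or re-sets the interrupt flag.

```java
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the "RpcTimeoutMonitor" idea; not JBoss Cache code.
final class RpcTimeoutMonitor {

    // One entry per in-flight call: the calling thread and its deadline.
    private static final class Registration {
        final Thread thread;
        final long deadlineNanos;
        boolean done;         // guarded by synchronized (this)
        boolean interrupted;  // guarded by synchronized (this)

        Registration(Thread thread, long deadlineNanos) {
            this.thread = thread;
            this.deadlineNanos = deadlineNanos;
        }
    }

    private final Set<Registration> pending = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scanner =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "RpcTimeoutMonitor");
                t.setDaemon(true);
                return t;
            });

    RpcTimeoutMonitor(long scanIntervalMillis) {
        scanner.scheduleAtFixedRate(this::interruptOverdue,
                scanIntervalMillis, scanIntervalMillis, TimeUnit.MILLISECONDS);
    }

    // Periodic scan: interrupt every caller whose deadline has passed.
    private void interruptOverdue() {
        long now = System.nanoTime();
        for (Registration reg : pending) {
            synchronized (reg) {
                if (!reg.done && !reg.interrupted && now - reg.deadlineNanos >= 0) {
                    reg.interrupted = true;
                    reg.thread.interrupt();
                }
            }
        }
    }

    // Runs the call on the caller's own thread; best-effort timeout only.
    <T> T call(Callable<T> rpc, long timeoutMillis) throws Exception {
        Registration reg = new Registration(Thread.currentThread(),
                System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis));
        pending.add(reg);
        T result = null;
        Exception failure = null;
        boolean timedOut = false;
        try {
            result = rpc.call();
        } catch (Exception e) {
            failure = e;
        } finally {
            pending.remove(reg);
            synchronized (reg) {
                reg.done = true;            // the monitor will not interrupt after this
                timedOut = reg.interrupted;
            }
        }
        if (timedOut) {
            Thread.interrupted();           // clear the flag the monitor set
            throw new TimeoutException("RPC exceeded " + timeoutMillis + " ms");
        }
        if (failure != null) {
            throw failure;
        }
        return result;
    }

    void shutdown() {
        scanner.shutdownNow();
    }
}
```

The timeout is only as accurate as the scan interval, which matches the "best effort" expectation above. The sketch also cannot tell a monitor interrupt apart from an unrelated interrupt that arrives after the deadline.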
>>
>> *** More inefficient JBC workaround
>>
>> - All calls to the RpcDispatcher happen in a separate thread,  
>> using a ThreadPoolExecutor
>> - The app thread then waits for timeout ms, and if the RPC call  
>> hasn't completed, throws a timeout exception - so even if the  
>> thread is stuck in, say, FC, at least JBC can roll back the tx  
>> and release locks, etc.
>>
>> Inefficient because each and every RPC call will happen in a  
>> separate thread + potential ugliness around orphaned threads  
>> stuck in a JGroups protocol.
>>
>> *** JGroups fix - I think this can be done easily, where any  
>> blocking operations in JGroups protocols make use of the timeout  
>> parameter.  Again, this will not provide 100% timeout accuracy,  
>> only a "best effort", but like I said IMO this is ok.  (At the  
>> moment FC loops until it has enough creds.  I think this loop  
>> should timeout using the same timeout param.)  Now passing this  
>> param will involve a transient field in Event which  
>> RequestCorrelator could use to set the timeout.  Protocols like  
>> FC can then use this timeout to determine how long they should loop  
>> for when waiting for creds.
>>
>> Thoughts?  My preferred option is the last one, since it gives  
>> the timeout param in the RpcDispatcher more meaning.
> No, that's a bad solution because the design of a flow control  
> protocol should not be influenced by an application level  
> workaround. In addition, if you have 500 threads, all timing out  
> at the same time, you will have a steady flow of messages,  
> defeating the purpose of flow control in the first place.
> On top of that, we *cannot* do that because if we run on top of  
> TCP, a write() might block anyway if the TCP receiver set the  
> sliding window to 0 ! So the sending of data on top of TCP will  
> block (similar to FC) when TCP throttles the sender.
> By the way, some decades ago the same issue of 'timed method  
> calls' occurred in CORBA, e.g. invoke foo() but it should take  
> 450ms tops.
> What *could* be done here is to add an option to RpcDispatcher to  
> use separate threads from a thread pool to dispatch requests and  
> correlate responses. So, the caller would create a task (which  
> sends the request and waits for all responses, possibly listening  
> for cancellation), submit it to the pool and get a future. Then  
> wait on the future for N milliseconds and return with the current  
> results after that time, or throw a timeout exception, whatever.
> This issue cannot and should not be solved at the FC level, by  
> 'bypassing' flow control ! Note that, under normal circumstances,  
> and with 2.5, FC should never block for an extended time.
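
The thread-pool option described above (submit the request as a task, then wait on a future for N milliseconds) can be sketched with plain `java.util.concurrent`. This is an illustration only, not an RpcDispatcher API: the `TimedRpc` class and its `invoke()` method are hypothetical names, and the actual request/response handling is represented by a `Callable`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; the Callable stands in for "send request, wait for responses".
final class TimedRpc {
    private final ExecutorService pool;

    TimedRpc(ExecutorService pool) {
        this.pool = pool;
    }

    <T> T invoke(Callable<T> rpc, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = pool.submit(rpc);
        try {
            // The caller waits at most timeoutMillis, wherever the worker is blocked.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException e) {
            // Ask the worker to stop; it only stops if it responds to interruption.
            future.cancel(true);
            throw e;
        }
    }
}
```

The caller gets control back after the timeout and can roll back and release locks, but a worker that ignores the interrupt stays blocked in the pool, which is the orphaned-thread concern raised earlier in the thread.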

 -- 
 Brian Stansberry
 Lead, AS Clustering
 JBoss, a division of Red Hat
 brian.stansberry(a)redhat.com

--
Manik Surtani

Lead, JBoss Cache
JBoss, a division of Red Hat
