[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103
Brian Stansberry
brian.stansberry at redhat.com
Thu Jun 14 08:48:39 EDT 2007
This is HTTP session replication; the overall cache is REPL_ASYNC, so FC is
needed. It is only the "clustered get" call made by the
ClusteredCacheLoader that makes a sync call. The CCL directly uses the
RpcDispatcher and uses a sync call since that is the semantic needed for
that particular call.
Background -- use of CCL is an experiment to try to get around problems
with initial state transfer with large states.
Good point about OOB in 2.5; that should prevent this situation. :)
Bela Ban wrote:
> Is this with JGroups 2.5 or 2.4 ? In versions prior to 2.5, we
> recommended removing FC when making synchronous method calls.
>
> From 1103:
> 1) Thread1 on node1 is trying to get(fqnA) which is not in-VM, so the
> CCL tries to do a clustered get. The CacheLoaderInterceptor lock for
> fqnA is held at this time.
> 2) Thread1 blocks in FC waiting for credits.
> 3) Replication message for fqnA arrives from node2.
> 4) IncomingPacketHandler thread will block waiting for the
> CacheLoaderInterceptor lock for fqnA.
> 5) FC credits cannot arrive, so we deadlock.
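The cycle above can be reproduced in miniature. The following is a minimal, self-contained sketch (all names are hypothetical stand-ins, not JBC/JGroups classes): a `ReentrantLock` plays the CacheLoaderInterceptor lock, a drained `Semaphore` plays the exhausted FC credits, and timed waits are used so the demo terminates instead of deadlocking:

```java
import java.util.concurrent.*;
import java.util.concurrent.locks.*;

// Hypothetical sketch of the JBCACHE-1103 cycle: the caller holds the node
// lock while waiting for FC credits, but credits can only be replenished by
// the incoming thread that needs that same lock.
public class FcDeadlockSketch {
    static final ReentrantLock nodeLock = new ReentrantLock(); // stand-in: lock for fqnA
    static final Semaphore credits = new Semaphore(0);         // stand-in: FC credits, exhausted

    public static boolean clusteredGet(long timeoutMs) throws InterruptedException {
        nodeLock.lock();                                       // step 1: lock held for fqnA
        try {
            // step 2: block in FC waiting for credits
            return credits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            nodeLock.unlock();
        }
    }

    public static boolean deliverReplication(long timeoutMs) throws InterruptedException {
        // step 4: the incoming thread needs the node lock before it can
        // process anything -- including the credit replenishment
        if (!nodeLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) return false;
        try {
            credits.release();                                 // never reached in the real deadlock
            return true;
        } finally {
            nodeLock.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Boolean> caller = pool.submit(() -> clusteredGet(500));
        Thread.sleep(100); // let the caller grab the node lock first
        Future<Boolean> incoming = pool.submit(() -> deliverReplication(200));
        System.out.println("incoming got lock: " + incoming.get()); // blocked on node lock
        System.out.println("caller got credits: " + caller.get());  // credits never arrive
        pool.shutdown();
    }
}
```

With real (unbounded) waits, both threads block forever: step 5 in the description above.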
>
>
> Why can credit responses not arrive ? This is 2.5, not 2.4, right ? In
> 2.5, credit responses arrive as OOB messages, so they will always be
> delivered, unless you disabled the OOB thread pool. Or is this 2.4 ?
> Then remove FC from the stack. Oh shit, just checked the case, this *is*
> 1.4.1 SP3 ! Yes, then remove FC from the stack; because you make
> synchronous method calls, this is not an issue as we don't need flow
> control in that case.
>
> Comments inline.
>
> Manik Surtani wrote:
>> Looking through this in detail, I see that the main problem is my
>> assumption that the timeout parameter in the JGroups RpcDispatcher's
>> callRemoteMethods() only starts ticking once the transport layer puts
>> the call on the wire. IMO this is misleading and wrong, and should
>> start ticking the moment I make an API call into the RpcDispatcher.
>> The very fact that I provide a timeout means that I shouldn't ever
>> expect the call to block forever no matter what the circumstance (a
>> bit like calling Object.wait() with no params vs calling it with a
>> timeout). Just had a chat with Jason to bounce a few ideas off him,
>> and here is what I can do in JBoss Cache to work around this (IMO an
>> ugly workaround):
>>
>> *** JBC workaround (does not work; just looked through FC code in HEAD
>> and it swallows InterruptedException - move on!)
>
> Where does FC swallow an InterruptedException ? The only place where I
> catch an exception is in handleDownMessage():
>
>     catch(InterruptedException e) {
>         // set the interrupted flag again, so the caller's thread can
>         // handle the interrupt as well
>         Thread.currentThread().interrupt();
>     }
>
>
> This does not swallow the exception; it rather passes it on to the
> calling thread, as suggested by JCIP.
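The restore-the-interrupt idiom referenced here can be shown in isolation. This is a hypothetical sketch, not FC's actual code: a method that cannot propagate `InterruptedException` re-asserts the flag so callers further up the stack can still observe the interrupt:

```java
import java.util.concurrent.Semaphore;

// Sketch of the restore-the-interrupt idiom (as recommended by JCIP):
// catch, re-assert the flag, and let the caller decide what to do.
public class ReinterruptSketch {
    public static boolean awaitCredits(Semaphore credits) {
        try {
            credits.acquire();                  // may block, like FC's credit wait
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the flag for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt();     // simulate an interrupt arriving
        boolean got = awaitCredits(new Semaphore(0));
        // the flag survives the catch block, so the caller can still see it
        System.out.println(got + " " + Thread.interrupted()); // false true
    }
}
```

So the exception is not swallowed in the JCIP sense, but note that a caller which never checks the flag still blocks as if it were.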
>
>
>> - All calls to the RpcDispatcher register with a Map containing a
>> reference to the Thread and the timeout, before making the
>> RpcDispatcher call
>> - JBC runs a separate "RpcTimeoutMonitor" thread which periodically
>> checks threads making RpcDispatcher calls against their timeouts,
>> interrupting those that have taken too long.
>> - The code calling the RpcDispatcher is wrapped in a try block that
>> catches InterruptedException and throws a timeout exception to signify
>> RPC timeout.
>>
>> The problem with this approach is the extra overhead of a
>> RpcTimeoutMonitor thread. The fact that the timeout will not be 100%
>> accurate is not a problem - a "best effort" is good enough; a call that
>> only times out after 2200 ms despite a timeout param of 2000 ms is of
>> no consequence. At least calls don't get stuck, regardless of why or
>> where in the RPC process they are held up.
>>
>> *** More inefficient JBC workaround
>>
>> - All calls to the RpcDispatcher happen in a separate thread, using a
>> ThreadPoolExecutor
>> - The app thread then waits for timeout ms, and if the RPC call hasn't
>> completed, throws a timeout exception - so even if the thread is stuck
>> in, say, FC, at least JBC can roll back the tx and release locks, etc.
>>
>> Inefficient because each and every RPC call will happen in a separate
>> thread + potential ugliness around orphaned threads stuck in a JGroups
>> protocol.
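The second workaround can be sketched directly with `java.util.concurrent` (the method and class names here are hypothetical; the real call would be the RpcDispatcher invocation): submit the RPC to a pool and bound the wait with `Future.get(timeout)`:

```java
import java.util.concurrent.*;

// Sketch of the thread-pool workaround: the pool thread may stay stuck in
// FC, but the application thread regains control after the timeout and can
// roll back the tx and release locks.
public class TimedRpcSketch {
    static final ExecutorService pool = Executors.newCachedThreadPool();

    public static String callWithTimeout(Callable<String> rpc, long timeoutMs)
            throws Exception {
        Future<String> f = pool.submit(rpc);     // RPC runs on a pool thread
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);                      // best effort; the thread may
            throw e;                             // remain orphaned in a protocol
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callWithTimeout(() -> "fast", 1000));
        try {
            callWithTimeout(() -> { Thread.sleep(60_000); return "slow"; }, 200);
        } catch (TimeoutException e) {
            System.out.println("timed out; caller can roll back");
        }
        pool.shutdownNow();
    }
}
```

The cost is exactly the inefficiency described above: one extra thread per RPC, and `cancel(true)` only helps if the blocking code responds to interruption.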
>>
>> *** JGroups fix - I think this can be done easily, where any blocking
>> operations in JGroups protocols make use of the timeout parameter.
>> Again, this will not provide 100% timeout accuracy, but a "best
>> effort"; like I said, IMO this is ok. (At the moment FC loops until it
>> has enough credits. I think this loop should time out using the same
>> timeout param.) Now passing this param will involve a transient field
>> in Event which RequestCorrelator could use to set the timeout.
>> Protocols like FC can then use this timeout to determine how long they
>> should loop for when waiting for credits.
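The proposed bounded credit wait might look like the following sketch. The Event plumbing that would carry the timeout down the stack is assumed, not shown, and a `Semaphore` stands in for FC's credit accounting:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposal: instead of looping indefinitely,
// the credit wait is bounded by the remaining portion of the RPC timeout
// (best-effort accuracy only, as discussed above).
public class BoundedCreditWaitSketch {
    public static boolean waitForCredits(Semaphore credits, int needed, long timeoutMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        long left;
        while ((left = deadline - System.nanoTime()) > 0) {
            // wait for a replenishment, but never past the deadline
            if (credits.tryAcquire(needed, left, TimeUnit.NANOSECONDS))
                return true;
        }
        return false; // caller sees a timeout instead of blocking forever
    }

    public static void main(String[] args) throws Exception {
        Semaphore credits = new Semaphore(3);
        System.out.println(waitForCredits(credits, 2, 100)); // enough credits
        System.out.println(waitForCredits(credits, 2, 100)); // only 1 left: times out
    }
}
```

Note this only illustrates the mechanics; Bela's objection below is that bounding the wait at the FC level is the wrong place for the fix, independent of how it is coded.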
>>
>> Thoughts? My preferred option is the last one, since it gives the
>> timeout param in the RpcDispatcher more meaning.
>
> No, that's a bad solution because the design of a flow control protocol
> should not be influenced by an application level workaround. In
> addition, if you have 500 threads, all timing out at the same time, you
> will have a steady flow of messages, defeating the purpose of flow
> control in the first place.
>
> On top of that, we *cannot* do that because if we run on top of TCP, a
> write() might block anyway if the TCP receiver set the sliding window to
> 0 ! So the sending of data on top of TCP will block (similar to FC) when
> TCP throttles the sender.
>
> By the way, some decades ago the same issue of 'timed method calls'
> occurred in CORBA, e.g. invoke foo() but it should take 450ms tops.
>
> What *could* be done here is to add an option to RpcDispatcher to use
> separate threads from a thread pool to dispatch requests and correlate
> responses. So, the caller would create a task (which sends the request
> and waits for all responses, possibly listening for cancellation),
> submit it to the pool and get a future. Then wait on the future for N
> milliseconds and return with the current results after that time, or
> throw a timeout exception, whatever.
>
> This issue cannot and should not be solved at the FC level, by
> 'bypassing' flow control ! Note that, under normal circumstances, and
> with 2.5, FC should never block for an extended time.
>
--
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat
brian.stansberry at redhat.com