[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCACHE-1103
Manik Surtani
manik at jboss.org
Thu Jun 14 12:39:15 EDT 2007
Ok, so I think we have a solution here.
Bela has created http://jira.jboss.com/jira/browse/JGRP-533 (see the
task for details).
Once this is done (JGroups 2.4.1.SP4), I can fix JBCACHE-1103 by
patching the CCL to make use of this JGroups fix. This will mean a
new release of JBC (1.4.1.SP4).
Cheers,
Manik
On 14 Jun 2007, at 13:48, Brian Stansberry wrote:
> This is http session repl; the overall cache is REPL_ASYNC, so FC
> is needed. It is only the "clustered get" call made by the
> ClusteredCacheLoader that makes a sync call. The CCL directly uses
> the RpcDispatcher and uses a sync call since that is the semantic
> needed for that particular call.
>
> Background -- use of CCL is an experiment to try to get around
> problems with initial state transfer with large states.
>
> Good point about OOB in 2.5; that should prevent this situation. :)
>
> Bela Ban wrote:
>> Is this with JGroups 2.5 or 2.4 ? In versions prior to 2.5, we
>> recommended removing FC when making synchronous method calls.
>> From 1103:
>> 1) Thread1 on node1 is trying to get(fqnA) which is not in-VM, so
>> the CCL tries to do a clustered get. The CacheLoaderInterceptor
>> lock for fqnA is held at this time.
>> 2) Thread1 blocks in FC waiting for credits.
>> 3) Replication message for fqnA arrives from node2.
>> 4) IncomingPacketHandler thread will block waiting for the
>> CacheLoaderInterceptor lock for fqnA.
>> 5) FC credits cannot arrive, so we deadlock.
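The five steps above can be sketched as a small Java program. This is a hypothetical illustration, not JBC or JGroups code; `FcDeadlockSketch`, `fqnALock`, and `credits` are invented names standing in for the CacheLoaderInterceptor lock and the FC credit wait:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Thread1 holds the node lock for fqnA while it waits for FC credits,
// and the incoming-packet thread needs that same lock before it can
// deliver anything, so the credits never arrive.
public class FcDeadlockSketch {

    static final ReentrantLock fqnALock = new ReentrantLock();
    static final CountDownLatch credits = new CountDownLatch(1);

    // Returns true if the "incoming packet handler" could not get the
    // lock, i.e. the deadlock condition holds.
    public static boolean demonstrate() {
        Thread thread1 = new Thread(() -> {
            try {
                fqnALock.lockInterruptibly();  // step 1: interceptor lock held
                try {
                    credits.await();           // step 2: block in FC for credits
                } finally {
                    fqnALock.unlock();
                }
            } catch (InterruptedException ignored) {
            }
        });
        thread1.start();
        try {
            Thread.sleep(100);                 // let thread1 lock and block

            // steps 3-4: the handler needs the same lock to deliver the
            // replication message (and, transitively, the credits)
            boolean blocked = !fqnALock.tryLock(300, TimeUnit.MILLISECONDS);

            thread1.interrupt();               // break the cycle to clean up
            thread1.join();
            return blocked;                    // step 5: deadlock
        } catch (InterruptedException ie) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + demonstrate());
    }
}
```

The sketch interrupts Thread1 only so the program can terminate; in the real scenario nothing breaks the cycle.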
>> Why can credit responses not arrive ? This is 2.5, not 2.4, right ?
>> In 2.5, credit responses arrive as OOB messages, so they will
>> always be delivered, unless you disabled the OOB thread pool. Or is
>> this 2.4 ? Then remove FC from the stack. Oh shit, just checked
>> the case, this *is* 1.4.1.SP3 ! Yes, then remove FC from the
>> stack; because you make synchronous method calls, this is not an
>> issue, as we don't need flow control in that case.
>> Comments inline.
>> Manik Surtani wrote:
>>> Looking through this in detail, I see that the main problem is my
>>> assumption that the timeout parameter in the JGroups
>>> RpcDispatcher's callRemoteMethods() only starts ticking once the
>>> transport layer puts the call on the wire. IMO this is
>>> misleading and wrong; the timeout should start ticking the moment
>>> I make an API call into the RpcDispatcher. The very fact that I provide
>>> a timeout means that I shouldn't ever expect the call to block
>>> forever no matter what the circumstance (a bit like calling
>>> Object.wait() with no params vs calling it with a timeout). Just
>>> had a chat with Jason to bounce a few ideas off him, and here is
>>> what I can do in JBoss Cache to work around this (IMO an ugly
>>> workaround):
>>>
>>> *** JBC workaround (does not work; just looked through FC code in
>>> HEAD and it swallows InterruptedException - move on!)
>> Where does FC swallow an InterruptedException ? The only place
>> where I catch an exception is in handleDownMessage():
>> catch(InterruptedException e) {
>>     // set the interrupted flag again, so the caller's thread can
>>     // handle the interrupt as well
>>     Thread.currentThread().interrupt();
>> }
>> This does not swallow the exception; it rather passes it on to the
>> calling thread, as suggested by JCIP.
>>> - All calls to the RpcDispatcher register with a Map, containing a
>>> reference to the Thread and the timeout, before making the
>>> RpcDispatcher call.
>>> - JBC runs a separate "RpcTimeoutMonitor" thread which
>>> periodically checks threads making RpcDispatcher calls against
>>> their timeouts, interrupting those that have taken too long.
>>> - The code calling the RpcDispatcher is wrapped in a try block
>>> that catches InterruptedException and throws a timeout
>>> exception to signify RPC timeout.
>>>
>>> The problem with this approach is the extra overhead of a
>>> RpcTimeoutMonitor thread. The fact that the timeout will not be
>>> 100% accurate is not a problem - a "best effort" is good enough,
>>> so a call that only times out after 2200 ms despite a timeout
>>> param of 2000 ms should not be of consequence. At least calls
>>> don't get stuck, regardless of why or where in the RPC process
>>> they are held up.
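The monitor-thread workaround described above could look roughly like this. All names here (`RpcTimeoutMonitor`, `callWithTimeout`) are hypothetical, not actual JBC code, and the whole scheme depends on the blocked thread actually honouring the interrupt:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Callers register their thread and deadline before the RPC; a daemon
// thread interrupts any caller that overstays its timeout.
public class RpcTimeoutMonitor extends Thread {

    private static final Map<Thread, Long> deadlines = new ConcurrentHashMap<>();

    public RpcTimeoutMonitor() {
        setDaemon(true);
    }

    @Override
    public void run() {
        while (true) {
            long now = System.currentTimeMillis();
            for (Map.Entry<Thread, Long> e : deadlines.entrySet()) {
                if (now > e.getValue()) {        // best effort: may fire late
                    deadlines.remove(e.getKey());
                    e.getKey().interrupt();
                }
            }
            try {
                Thread.sleep(50);                // polling granularity
            } catch (InterruptedException ie) {
                return;
            }
        }
    }

    // Caller side: register, make the (here, simulated) RPC, translate
    // an interrupt into a timeout exception.
    public static void callWithTimeout(Runnable rpc, long timeoutMs) {
        deadlines.put(Thread.currentThread(),
                      System.currentTimeMillis() + timeoutMs);
        try {
            rpc.run();
        } finally {
            deadlines.remove(Thread.currentThread());
            Thread.interrupted();                // clear any late interrupt
        }
    }

    // Demonstrates a stuck "RPC" being cut short by the monitor.
    public static boolean demo() {
        new RpcTimeoutMonitor().start();
        try {
            callWithTimeout(() -> {
                try {
                    Thread.sleep(5_000);         // simulate a call stuck in FC
                } catch (InterruptedException ie) {
                    throw new RuntimeException("RPC timed out");
                }
            }, 200);
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + demo());
    }
}
```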
>>>
>>> *** More inefficient JBC workaround
>>>
>>> - All calls to the RpcDispatcher happen in a separate thread,
>>> using a ThreadPoolExecutor
>>> - The app thread then waits for timeout ms, and if the RPC call
>>> hasn't completed, throws a timeout exception - so even if the
>>> thread is stuck in, say, FC, at least JBC can roll back the tx
>>> and release locks, etc.
>>>
>>> Inefficient because each and every RPC call will happen in a
>>> separate thread + potential ugliness around orphaned threads
>>> stuck in a JGroups protocol.
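This second workaround is essentially a pool-and-future pattern. A minimal sketch, with hypothetical names (`RpcViaExecutor`, `callWithTimeout`) and a `Callable` standing in for the real dispatcher call:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Run every RPC on a pool thread; the application thread gives up after
// the timeout even if the pool thread stays stuck in FC.
public class RpcViaExecutor {

    private static final ExecutorService pool =
        Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);                  // don't keep the JVM alive
            return t;
        });

    // Returns the RPC result, or throws TimeoutException so the caller
    // can roll back the tx and release locks. The pool thread may stay
    // orphaned inside the JGroups stack - the ugliness noted above.
    public static <T> T callWithTimeout(Callable<T> rpc, long timeoutMs)
            throws Exception {
        Future<T> future = pool.submit(rpc);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException te) {
            future.cancel(true);                // best effort; may be ignored
            throw te;
        }
    }

    // Helper for demonstration: does a call that sleeps sleepMs time out?
    public static boolean timesOut(long sleepMs, long timeoutMs) {
        try {
            callWithTimeout(() -> { Thread.sleep(sleepMs); return "rsp"; },
                            timeoutMs);
            return false;
        } catch (Exception e) {
            return e instanceof TimeoutException;
        }
    }

    public static void main(String[] args) {
        System.out.println("stuck call timed out: " + timesOut(10_000, 200));
        System.out.println("fast call timed out: " + timesOut(0, 1_000));
    }
}
```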
>>>
>>> *** JGroups fix - I think this can be done easily, where any
>>> blocking operations in JGroups protocols make use of the timeout
>>> parameter. Again, this will not provide 100% timeout accuracy,
>>> only a "best effort", but like I said, IMO this is ok. (At the
>>> moment FC loops until it has enough credits. I think this loop
>>> should time out using the same timeout param.) Passing this
>>> param will involve a transient field in Event which
>>> RequestCorrelator could use to set the timeout. Protocols like
>>> FC can then use this timeout to determine how long they should
>>> loop while waiting for credits.
>>>
>>> Thoughts? My preferred option is the last one, since it gives
>>> the timeout param in the RpcDispatcher more meaning.
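The proposed FC-side change amounts to a bounded credit wait. A sketch under stated assumptions; `CreditWaitSketch` is an invented class, not the real JGroups FC protocol:

```java
// The credit wait loop honours the timeout carried down with the Event
// instead of looping until credits arrive.
public class CreditWaitSketch {

    private long credits;
    private final Object lock = new Object();

    public CreditWaitSketch(long initialCredits) {
        this.credits = initialCredits;
    }

    // Called when credit responses arrive from receivers.
    public void replenish(long amount) {
        synchronized (lock) {
            credits += amount;
            lock.notifyAll();
        }
    }

    // Block until 'needed' credits are available or the timeout expires.
    // Returns false on timeout so the sender can fail the call instead
    // of hanging - best-effort accuracy, as discussed above.
    public boolean acquire(long needed, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (lock) {
            while (credits < needed) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                lock.wait(remaining);
            }
            credits -= needed;
            return true;
        }
    }

    // With no credits the first acquire times out; after replenishing,
    // the second succeeds.
    public static boolean demo() {
        try {
            CreditWaitSketch fc = new CreditWaitSketch(0);
            boolean first = fc.acquire(10, 200);
            fc.replenish(10);
            boolean second = fc.acquire(10, 200);
            return !first && second;
        } catch (InterruptedException ie) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("demo ok: " + demo());
    }
}
```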
>> No, that's a bad solution because the design of a flow control
>> protocol should not be influenced by an application level
>> workaround. In addition, if you have 500 threads, all timing out
>> at the same time, you will have a steady flow of messages,
>> defeating the purpose of flow control in the first place.
>> On top of that, we *cannot* do that because if we run on top of
>> TCP, a write() might block anyway if the TCP receiver set the
>> sliding window to 0 ! So the sending of data on top of TCP will
>> block (similar to FC) when TCP throttles the sender.
>> By the way, some decades ago the same issue of 'timed method
>> calls' occurred in CORBA, e.g. invoke foo() but it should take
>> 450ms tops.
>> What *could* be done here is to add an option to RpcDispatcher to
>> use separate threads from a thread pool to dispatch requests and
>> correlate responses. So, the caller would create a task (which
>> sends the request and waits for all responses, possibly listening
>> for cancellation), submit it to the pool and get a future. Then
>> wait on the future for N milliseconds and return with the current
>> results after that time, or throw a timeout exception, whatever.
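The option Bela outlines here could look roughly like the following sketch with simulated cluster members. Everything in it (`DispatchWithFuture`, `castWithTimeout`, the sequential member delays) is hypothetical and simplified, not the real RpcDispatcher API:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// The request/response correlation runs as a pool task; the caller
// waits on the future for at most N ms and takes whatever responses
// have arrived by then.
public class DispatchWithFuture {

    private static final ExecutorService pool =
        Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });

    // Each "member" responds after its own delay; responses are
    // collected as they arrive so partial results survive a timeout.
    public static List<String> castWithTimeout(long[] memberDelaysMs,
                                               long timeoutMs) {
        List<String> responses = new CopyOnWriteArrayList<>();
        Future<?> task = pool.submit(() -> {
            for (int i = 0; i < memberDelaysMs.length; i++) {
                try {
                    Thread.sleep(memberDelaysMs[i]);
                } catch (InterruptedException ie) {
                    return;                      // task listens for cancellation
                }
                responses.add("rsp-from-member-" + i);
            }
        });
        try {
            task.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            task.cancel(true);                   // stop waiting for stragglers
        }
        return responses;                        // current results at timeout
    }

    public static void main(String[] args) {
        // member 2 never responds in time; after 500 ms we return the
        // two responses that did arrive
        System.out.println(castWithTimeout(new long[]{50, 50, 60_000}, 500));
    }
}
```

The key difference from the FC-level idea: the timeout lives entirely in the dispatcher layer, and the flow control protocol is left untouched.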
>> This issue cannot and should not be solved at the FC level, by
>> 'bypassing' flow control ! Note that, under normal circumstances,
>> and with 2.5, FC should never block for an extended time.
>
>
> --
> Brian Stansberry
> Lead, AS Clustering
> JBoss, a division of Red Hat
> brian.stansberry at redhat.com
>
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat