Ok, so I think we have a solution here.
Bela's created
http://jira.jboss.com/jira/browse/JGRP-533 (see task
for details on this)
Once this is done (JGroups 2.4.1.SP4) I can fix JBCACHE-1103 by
patching the CCL to make use of this JGroups fix. This will mean a
new release for JBC (1.4.1.SP4).
Cheers,
Manik
On 14 Jun 2007, at 13:48, Brian Stansberry wrote:
This is http session repl; the overall cache is REPL_ASYNC, so FC
is needed. It is only the "clustered get" call made by the
ClusteredCacheLoader that makes a sync call. The CCL directly uses
the RpcDispatcher and uses a sync call since that is the semantic
needed for that particular call.
Background -- use of CCL is an experiment to try to get around
problems with initial state transfer with large states.
Good point about OOB in 2.5; that should prevent this situation. :)
Bela Ban wrote:
> Is this with JGroups 2.5 or 2.4 ? In versions prior to 2.5, we
> recommended to remove FC when making synchronous method calls.
> From 1103:
> 1) Thread1 on node1 is trying to get(fqnA) which is not in-VM, so
> the CCL tries to do a clustered get. The CacheLoaderInterceptor
> lock for fqnA is held at this time.
> 2) Thread1 blocks in FC waiting for credits.
> 3) Replication message for fqnA arrives from node2.
> 4) IncomingPacketHandler thread will block waiting for the
> CacheLoaderInterceptor lock for fqnA.
> 5) FC credits cannot arrive, so we deadlock.
> Why can credit responses not arrive ? This is 2.5, not 2.4 right ?
> In 2.5, credit responses arrive as OOB messages, so they will
> always be delivered, unless you disabled the OOB thread pool. Or is
> this 2.4 ? Then remove FC from the stack. Oh shit, just checked
> the case, this *is* 1.4.1 SP3 ! Yes, then remove FC from the
> stack; because you make synchronous method calls, this is not an
> issue as we don't need flow control in that case.
> Comments inline.
> Manik Surtani wrote:
>> Looking through this in detail, I see that the main problem is my
>> assumption that the timeout parameter in the JGroups
>> RpcDispatcher's callRemoteMethods() only starts ticking once the
>> transport layer puts the call on the wire. IMO this is
>> misleading and wrong, and should start ticking the moment I make
>> an API call into the RpcDispatcher. The very fact that I provide
>> a timeout means that I shouldn't ever expect the call to block
>> forever no matter what the circumstance (a bit like calling
>> Object.wait() with no params vs calling it with a timeout). Just
>> had a chat with Jason to bounce a few ideas off him, and here is
>> what I can do in JBoss Cache to work around this (IMO an ugly
>> workaround):
>>
>> *** JBC workaround (does not work; just looked through FC code in
>> HEAD and it swallows InterruptedException - move on!)
> Where does FC swallow an InterruptedException ? The only place
> where I catch an exception is in handleDownMessage():
> catch(InterruptedException e) {
>     // set the interrupted flag again, so the caller's thread can
>     // handle the interrupt as well
>     Thread.currentThread().interrupt();
> }
> This does not swallow the exception; it rather passes it on to the
> calling thread, as suggested by JCIP.
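For reference, a minimal Java sketch of the idiom described in the reply above: catch InterruptedException, then set the flag again so the caller can still see the interrupt. The names (InterruptFlagDemo, blockingStep) are hypothetical and the CountDownLatch merely stands in for "waiting for credits"; this is not the JGroups FC code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from the JGroups FC source.
class InterruptFlagDemo {

    // Stands in for a blocking protocol step. Returns true if the latch
    // was released in time, false on timeout or interrupt.
    static boolean blockingStep(CountDownLatch credits, long waitMs) {
        try {
            return credits.await(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // The flag was cleared when the exception was thrown;
            // set it again so the calling code can react to it.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // simulate a pending interrupt
        boolean gotCredits = blockingStep(new CountDownLatch(1), 1000);
        // Thread.interrupted() reads and clears the flag.
        boolean visibleToCaller = Thread.interrupted();
        System.out.println("gotCredits=" + gotCredits
                + ", interrupt visible to caller=" + visibleToCaller);
    }
}
```

Whether the interrupt actually ends the wait still depends on what the surrounding code does after the catch block; the sketch only shows that the flag itself is not lost.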
>> - All calls to the RpcDispatcher register with a Map, containing a
>> reference to the Thread and the timeout, before making the
>> RpcDispatcher call
>> - JBC runs a separate "RpcTimeoutMonitor" thread which
>> periodically checks threads making RpcDispatcher calls against
>> their timeouts, interrupting those that have taken too long.
>> - The code calling the RpcDispatcher is wrapped in a try block
>> that catches InterruptedException and throws a timeout
>> exception to signify RPC timeout.
>>
>> The problem with this approach is the extra overhead of a
>> RpcTimeoutMonitor thread. The fact that the timeout will not be
>> 100% accurate is not a problem - a "best effort" is good enough,
>> so a call that only times out after 2200 ms, even though it was
>> called with a timeout param of 2000, should not be of
>> consequence. At least calls don't get stuck, regardless of why
>> or where in the RPC process they are held up.
>>
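A rough Java sketch of the monitor idea described above, to make the moving parts concrete. Everything here is an assumption for illustration: the class shape, the register/unregister helpers, and the use of a plain Callable in place of the real RpcDispatcher call. It only helps if the blocked call actually responds to interruption, which is exactly the objection raised about FC.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not JBoss Cache code.
class RpcTimeoutMonitor implements Runnable {
    // Calling thread -> deadline, in System.nanoTime() units.
    private final Map<Thread, Long> deadlines = new HashMap<>();
    private final long checkIntervalMs;

    RpcTimeoutMonitor(long checkIntervalMs) {
        this.checkIntervalMs = checkIntervalMs;
    }

    @Override
    public void run() {
        try {
            while (true) {
                interruptOverdue();
                Thread.sleep(checkIntervalMs);
            }
        } catch (InterruptedException stop) {
            // Interrupting the monitor thread itself shuts it down.
        }
    }

    // Interrupts every registered caller whose deadline has passed, once.
    private synchronized void interruptOverdue() {
        long now = System.nanoTime();
        Iterator<Map.Entry<Thread, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Thread, Long> e = it.next();
            if (now - e.getValue() >= 0) {
                it.remove();
                e.getKey().interrupt();
            }
        }
    }

    private synchronized void register(Thread t, long deadlineNanos) {
        deadlines.put(t, deadlineNanos);
    }

    // Returns false if the monitor already removed (and interrupted) t.
    private synchronized boolean unregister(Thread t) {
        return deadlines.remove(t) != null;
    }

    // Runs rpc on the calling thread. Any InterruptedException is reported
    // as a timeout, so an unrelated interrupt would be misreported too.
    <T> T call(Callable<T> rpc, long timeoutMs) throws Exception {
        Thread caller = Thread.currentThread();
        register(caller, System.nanoTime() + timeoutMs * 1_000_000L);
        try {
            return rpc.call();
        } catch (InterruptedException e) {
            throw new TimeoutException("RPC exceeded " + timeoutMs + " ms");
        } finally {
            if (!unregister(caller)) {
                // The monitor fired; clear a flag that may have been set
                // after the call had already returned.
                Thread.interrupted();
            }
        }
    }
}
```

The monitor would be started once on a daemon thread, e.g. new Thread(monitor, "RpcTimeoutMonitor"), and each RPC wrapped as monitor.call(() -> doRpc(), 2000), where doRpc is whatever makes the dispatcher call. Accuracy is bounded by the check interval, which matches the "best effort" expectation above.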
>> *** More inefficient JBC workaround
>>
>> - All calls to the RpcDispatcher happen in a separate thread,
>> using a ThreadPoolExecutor
>> - The app thread then waits for timeout ms, and if the RPC call
>> hasn't completed, throws a timeout exception - so even if the
>> thread is stuck in, say, FC, at least JBC can roll back the tx
>> and release locks, etc.
>>
>> Inefficient because each and every RPC call will happen in a
>> separate thread + potential ugliness around orphaned threads
>> stuck in a JGroups protocol.
>>
>> *** JGroups fix - I think this can be done easily, where any
>> blocking operations in JGroups protocols make use of the timeout
>> parameter. Again, this will not provide 100% timeout accuracy,
>> only a "best effort", but like I said IMO this is ok. (At the
>> moment FC loops until it has enough creds. I think this loop
>> should timeout using the same timeout param.) Now passing this
>> param will involve a transient field in Event which
>> RequestCorrelator could use to set the timeout. Protocols like
>> FC can then use this timeout to determine how long they should loop
>> for when waiting for creds.
>>
>> Thoughts? My preferred option is the last one, since it gives
>> the timeout param in the RpcDispatcher more meaning.
> No, that's a bad solution because the design of a flow control
> protocol should not be influenced by an application level
> workaround. In addition, if you have 500 threads, all timing out
> at the same time, you will have a steady flow of messages,
> defeating the purpose of flow control in the first place.
> On top of that, we *cannot* do that because if we run on top of
> TCP, a write() might block anyway if the TCP receiver set the
> sliding window to 0 ! So the sending of data on top of TCP will
> block (similar to FC) when TCP throttles the sender.
> By the way, some decades ago the same issue of 'timed method
> calls' occurred in CORBA, e.g. invoke foo() but it should take
> 450ms tops.
> What *could* be done here is to add an option to RpcDispatcher to
> use separate threads from a thread pool to dispatch requests and
> correlate responses. So, the caller would create a task (which
> sends the request and waits for all responses, possibly listening
> for cancellation), submit it to the pool and get a future. Then
> wait on the future for N milliseconds and return with the current
> results after that time, or throw a timeout exception, whatever.
> This issue cannot and should not be solved at the FC level, by
> 'bypassing' flow control ! Note that, under normal circumstances,
> and with 2.5, FC should never block for an extended time.
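The thread-pool idea (submit the request as a task, get a future, wait on it for N milliseconds) can be sketched in plain java.util.concurrent terms. This is only an illustration under stated assumptions: TimedRpc and its call method are hypothetical names, the Callable stands in for "send the request and wait for all responses", and no JGroups API is used or implied.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not an RpcDispatcher option that exists today.
class TimedRpc {
    private final ExecutorService pool;

    TimedRpc(ExecutorService pool) {
        this.pool = pool;
    }

    // Runs rpc on a pool thread and waits at most timeoutMs for its result.
    <T> T call(Callable<T> rpc, long timeoutMs) throws Exception {
        Future<T> future = pool.submit(rpc);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ask the worker to stop. A worker blocked in code that ignores
            // interrupts stays stuck, but the caller is free to move on.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Surface the task's own failure rather than the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }
}
```

With this shape the caller can roll back and release locks once the TimeoutException arrives, at the cost of one pool thread per in-flight call and the possibility of orphaned workers, as noted earlier in the thread.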
--
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat
brian.stansberry(a)redhat.com
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat