ClusteredCacheLoader deadlocks and JBCCACHE-1103
by Manik Surtani
Looking through this in detail, I see that the main problem is my
assumption that the timeout parameter in the JGroups RpcDispatcher's
callRemoteMethods() only starts ticking once the transport layer puts
the call on the wire. IMO this is misleading and wrong; the timeout
should start ticking the moment I make an API call into the RpcDispatcher.
The very fact that I provide a timeout means that I shouldn't ever
expect the call to block forever no matter what the circumstance (a
bit like calling Object.wait() with no params vs calling it with a
timeout). Just had a chat with Jason to bounce a few ideas off him,
and here is what I can do in JBoss Cache to work around this (IMO an
ugly workaround):
*** JBC workaround (does not work; just looked through FC code in
HEAD and it swallows InterruptedException - move on!)
- All calls to the RpcDispatcher register with a Map, containing a
reference to the Thread and the timeout before making the
RpcDispatcher call
- JBC runs a separate "RpcTimeoutMonitor" thread which periodically
checks threads making RpcDispatcher calls against their timeouts,
interrupting those that have taken too long.
- The code calling the RpcDispatcher is wrapped in a try block that
catches InterruptedException and throws a timeout exception to
signify an RPC timeout.
The problem with this approach is the extra overhead of a
RpcTimeoutMonitor thread. The fact that the timeout will not be 100%
accurate is not a problem - a "best effort" is good enough, so a call
that only times out after 2200 ms despite being made with a timeout
param of 2000 should not be of consequence. At least calls don't get
stuck, regardless of why or where in the RPC process they are held up.
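To make the monitor idea concrete, here is a rough sketch of the three steps above. It is not JBoss Cache code: RpcTimeoutMonitor, BlockingCall, RpcTimeoutException and callWithTimeout are names made up for illustration, and the lambda passed in stands in for the real RpcDispatcher call. It only helps when the blocked code actually honours interrupts, which is exactly why this approach fails if a protocol swallows InterruptedException.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical watchdog: interrupts threads whose registered RPC deadline has passed. */
final class RpcTimeoutMonitor {
    /** Unchecked stand-in for whatever timeout exception the cache would throw. */
    static final class RpcTimeoutException extends RuntimeException {
        RpcTimeoutException(String msg, Throwable cause) { super(msg, cause); }
    }

    /** Stand-in for the blocking RpcDispatcher call (assumed to be interruptible). */
    interface BlockingCall<T> { T call() throws InterruptedException; }

    // Calling thread -> absolute deadline in System.nanoTime() units.
    private final Map<Thread, Long> deadlines = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scanner;

    RpcTimeoutMonitor(long scanIntervalMillis) {
        scanner = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "RpcTimeoutMonitor");
            t.setDaemon(true);
            return t;
        });
        scanner.scheduleAtFixedRate(this::scan, scanIntervalMillis,
                scanIntervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Periodic check: interrupt every registered thread that is past its deadline. */
    private void scan() {
        long now = System.nanoTime();
        for (Map.Entry<Thread, Long> e : deadlines.entrySet()) {
            if (now - e.getValue() >= 0 && deadlines.remove(e.getKey(), e.getValue())) {
                e.getKey().interrupt();
            }
        }
    }

    /** Registers the caller, runs the call, and maps an interrupt to a timeout. */
    <T> T callWithTimeout(BlockingCall<T> rpc, long timeoutMillis) {
        Thread self = Thread.currentThread();
        deadlines.put(self, System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis));
        try {
            return rpc.call();
        } catch (InterruptedException e) {
            // Simplification: any interrupt is treated as a timeout from the monitor.
            throw new RpcTimeoutException("RPC exceeded " + timeoutMillis + " ms", e);
        } finally {
            // Best effort only: a late interrupt can still arrive after this point.
            deadlines.remove(self);
        }
    }

    void shutdown() { scanner.shutdownNow(); }
}
```

The timeout is only as accurate as the scan interval, which matches the "best effort" point above: with a 200 ms scan, a 2000 ms timeout may fire as late as roughly 2200 ms.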
*** More inefficient JBC workaround
- All calls to the RpcDispatcher happen in a separate thread, using a
ThreadPoolExecutor
- The app thread then waits for timeout ms, and if the RPC call
hasn't completed, throws a timeout exception - so even if the thread
is stuck in, say, FC, at least JBC can roll back the tx and release
locks, etc.
Inefficient because each and every RPC call will happen in a separate
thread + potential ugliness around orphaned threads stuck in a
JGroups protocol.
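A minimal sketch of this second approach, using only java.util.concurrent. PooledRpcInvoker, invoke and RpcTimeoutException are hypothetical names, and the Callable stands in for the actual RpcDispatcher call. Note that cancel(true) merely interrupts the worker, so a thread stuck in code that ignores interrupts stays orphaned, as described above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs each RPC on a pool thread so the caller can give up. */
final class PooledRpcInvoker {
    /** Unchecked stand-in for whatever timeout exception the cache would throw. */
    static final class RpcTimeoutException extends RuntimeException {
        RpcTimeoutException(String msg) { super(msg); }
    }

    // newCachedThreadPool is backed by a ThreadPoolExecutor; daemon threads so
    // orphaned workers do not keep the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "rpc-worker");
        t.setDaemon(true);
        return t;
    });

    /** Runs the call on a worker and waits at most timeoutMillis for the result. */
    <T> T invoke(Callable<T> rpc, long timeoutMillis) {
        Future<T> future = pool.submit(rpc);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; it may remain stuck if the interrupt is swallowed.
            future.cancel(true);
            throw new RpcTimeoutException("RPC exceeded " + timeoutMillis + " ms");
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            throw new RuntimeException(e);
        }
    }

    void shutdown() { pool.shutdownNow(); }
}
```

Here the clock starts when the app thread calls invoke(), so the caller gets control back in time to roll back the tx and release locks, whatever the worker is stuck on.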
*** JGroups fix - I think this can be done easily, where any blocking
operations in JGroups protocols make use of the timeout parameter.
Again, this will not provide 100% timeout accuracy, only a "best
effort", but like I said IMO this is ok. (At the moment FC loops
until it has enough creds. I think this loop should time out using
the same timeout param.) Now passing this param will involve a
transient field in Event which RequestCorrelator could use to set the
timeout. Protocols like FC can then use this timeout to determine
how long it should loop for when waiting for creds.
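To illustrate what a bounded credit loop could look like, here is a sketch of a wait that gives up once the caller's timeout has elapsed. This is not the FC code from JGroups: CreditGate, acquire and replenish are made-up names, and the timeout is passed as a plain method argument rather than through an Event field.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical credit gate; NOT the JGroups FC implementation. */
final class CreditGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition creditsAvailable = lock.newCondition();
    private long credits;

    CreditGate(long initialCredits) { this.credits = initialCredits; }

    /**
     * Waits until enough credits are available, but never longer than the timeout.
     * Returns true if the credits were taken, false on timeout or interrupt.
     */
    boolean acquire(long needed, long timeoutMillis) {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (credits < needed) {
                if (remainingNanos <= 0) {
                    return false; // timeout used up: stop looping
                }
                try {
                    // awaitNanos returns an estimate of the time still left to wait.
                    remainingNanos = creditsAvailable.awaitNanos(remainingNanos);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // do not swallow the interrupt
                    return false;
                }
            }
            credits -= needed;
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Called when the receiver hands credits back. */
    void replenish(long amount) {
        lock.lock();
        try {
            credits += amount;
            creditsAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

A sender would call acquire(messageSize, timeout) and treat a false return as an RPC timeout instead of blocking forever.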
Thoughts? My preferred option is the last one, since it gives the
timeout param in the RpcDispatcher more meaning.
Cheers,
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat