Is this with JGroups 2.5 or 2.4 ? In versions prior to 2.5, we
recommended removing FC when making synchronous method calls.
From 1103:
1) Thread1 on node1 is trying to get(fqnA) which is not in-VM, so the
CCL tries to do a clustered get. The CacheLoaderInterceptor lock for
fqnA is held at this time.
2) Thread1 blocks in FC waiting for credits.
3) Replication message for fqnA arrives from node2.
4) IncomingPacketHandler thread will block waiting for the
CacheLoaderInterceptor lock for fqnA.
5) FC credits cannot arrive, so we deadlock.
Why can credit responses not arrive ? This is 2.5, not 2.4 right ? In
2.5, credit responses arrive as OOB messages, so they will always be
delivered, unless you disabled the OOB thread pool. Or is this 2.4 ?
Then remove FC from the stack. Oh shit, just checked the case, this *is*
1.4.1 SP3 ! Yes, then remove FC from the stack; because you make
synchronous method calls, this is not an issue as we don't need flow
control in that case.
Comments inline.
Manik Surtani wrote:
Looking through this in detail, I see that the main problem is my
assumption that the timeout parameter in the JGroups RpcDispatcher's
callRemoteMethods() only starts ticking once the transport layer puts
the call on the wire. IMO this is misleading and wrong; the timeout
should start ticking the moment I make an API call into the RpcDispatcher.
The very fact that I provide a timeout means that I shouldn't ever
expect the call to block forever no matter what the circumstance (a
bit like calling Object.wait() with no params vs calling it with a
timeout). Just had a chat with Jason to bounce a few ideas off him,
and here is what I can do in JBoss Cache to work around this (IMO an
ugly workaround):
*** JBC workaround (does not work; just looked through FC code in HEAD
and it swallows InterruptedException - move on!)
Where does FC swallow an InterruptedException ? The only place where I
catch an exception is in handleDownMessage():
    catch(InterruptedException e) {
        // set the interrupted flag again, so the caller's thread can
        // handle the interrupt as well
        Thread.currentThread().interrupt();
    }
This does not swallow the exception; it rather passes it on to the
calling thread, as suggested by JCIP.
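To make the difference between swallowing and passing on concrete, here is a minimal, self-contained sketch. It is not JGroups code: the class InterruptRestoreSketch, its methods, and the queue standing in for credits are all made up for illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from the FC protocol.
class InterruptRestoreSketch {

    // Waits for one "credit"; returns false on timeout or interrupt.
    static boolean awaitCredit(BlockingQueue<Integer> credits, long timeoutMs) {
        try {
            return credits.poll(timeoutMs, TimeUnit.MILLISECONDS) != null;
        } catch (InterruptedException e) {
            // The interrupted status is cleared when the exception is thrown,
            // so set it again for the code further up the call stack.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Interrupts the current thread, runs awaitCredit(), and reports
    // whether the interrupted status is still visible afterwards.
    static boolean flagSurvivesCatch() {
        Thread.currentThread().interrupt();
        awaitCredit(new LinkedBlockingQueue<>(), 1000);
        return Thread.interrupted(); // reads and clears the flag
    }
}
```

If the catch block were empty, flagSurvivesCatch() would return false and the caller would never learn about the interrupt.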
- All calls to the RpcDispatcher register with a Map, containing a
reference to the Thread and the timeout, before making the
RpcDispatcher call
- JBC runs a separate "RpcTimeoutMonitor" thread which periodically
checks threads making RpcDispatcher calls against their timeouts,
interrupting those that have taken too long.
- The code calling the RpcDispatcher is wrapped in a try block that
catches interrupted exceptions and throws a timeout exception to
signify RPC timeout.
The problem with this approach is the extra overhead of a
RpcTimeoutMonitor thread. The fact that the timeout will not be 100%
accurate is not a problem - a "best effort" is good enough, so a call
that only times out after 2200 ms, even though it was called with a
timeout param of 2000, should not be of consequence. At least calls
don't get stuck, regardless of why or where in the RPC process they are
held up.
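A rough sketch of how such a monitor might look, under several assumptions: only the name RpcTimeoutMonitor comes from the description above, while the BlockingCall interface, invoke(), and the scan interval are invented for illustration and do not exist in JBoss Cache or JGroups. It also only helps if the blocked code actually responds to interrupts.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the proposed monitor; not JBoss Cache code.
class RpcTimeoutMonitor {

    // Stand-in for a blocking RpcDispatcher call.
    interface BlockingCall {
        void run() throws InterruptedException;
    }

    // Calling thread -> deadline, in System.nanoTime() units.
    private final Map<Thread, Long> deadlines = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scanner =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "RpcTimeoutMonitor");
            t.setDaemon(true); // do not keep the JVM alive
            return t;
        });

    RpcTimeoutMonitor(long scanIntervalMs) {
        scanner.scheduleAtFixedRate(this::interruptOverdue,
            scanIntervalMs, scanIntervalMs, TimeUnit.MILLISECONDS);
    }

    // Interrupts every registered thread whose deadline has passed.
    private synchronized void interruptOverdue() {
        long now = System.nanoTime();
        deadlines.forEach((thread, deadline) -> {
            // remove(key, value) makes sure a registration is hit only once
            if (now - deadline >= 0 && deadlines.remove(thread, deadline)) {
                thread.interrupt();
            }
        });
    }

    // Registers the caller, runs the call, and reports an interrupt as a
    // timeout. Note: an interrupt from any other source is reported as a
    // timeout too; a real implementation would have to tell them apart.
    void invoke(BlockingCall call, long timeoutMs) throws TimeoutException {
        Thread self = Thread.currentThread();
        deadlines.put(self,
            System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs));
        try {
            call.run();
        } catch (InterruptedException e) {
            throw new TimeoutException(
                "RPC did not complete within " + timeoutMs + " ms");
        } finally {
            synchronized (this) {
                deadlines.remove(self);
                Thread.interrupted(); // drop an interrupt that raced with completion
            }
        }
    }

    void shutdown() {
        scanner.shutdownNow();
    }
}
```

With a scan interval of 200 ms, a call made with a 2000 ms timeout is interrupted somewhere between 2000 and roughly 2200 ms, which matches the "best effort" accuracy described above.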
*** More inefficient JBC workaround
- All calls to the RpcDispatcher happen in a separate thread, using a
ThreadPoolExecutor
- The app thread then waits for timeout ms, and if the RPC call hasn't
completed, throws a timeout exception - so even if the thread is stuck
in, say, FC, at least JBC can roll back the tx and release locks, etc.
Inefficient because each and every RPC call will happen in a separate
thread + potential ugliness around orphaned threads stuck in a JGroups
protocol.
*** JGroups fix - I think this can be done easily, where any blocking
operations in JGroups protocols make use of the timeout parameter.
Again, this will not provide 100% timeout accuracy, only a "best
effort", but like I said IMO this is ok. (At the moment FC loops
until it has enough creds. I think this loop should timeout using the
same timeout param.) Now passing this param will involve a transient
field in Event which RequestCorrelator could use to set the timeout.
Protocols like FC can then use this timeout to determine how long to
loop when waiting for creds.
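For illustration, a bounded credit wait could be sketched as below. This is not the FC implementation: TimedCreditGate and its methods are hypothetical, and how the timeout would travel from RequestCorrelator down to the protocol (the Event field mentioned above) is left out entirely.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical credit gate; not the JGroups FC protocol.
class TimedCreditGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition creditsAvailable = lock.newCondition();
    private long credits;

    TimedCreditGate(long initialCredits) {
        credits = initialCredits;
    }

    // Called when a credit replenishment arrives from a receiver.
    void replenish(long amount) {
        lock.lock();
        try {
            credits += amount;
            creditsAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Reserves 'needed' credits, waiting at most timeoutMs for them.
    // Returns false if the credits did not arrive in time.
    boolean acquire(long needed, long timeoutMs) throws InterruptedException {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lockInterruptibly();
        try {
            while (credits < needed) {
                if (remainingNanos <= 0) {
                    return false; // give up instead of looping forever
                }
                // awaitNanos returns an estimate of the time still left
                remainingNanos = creditsAvailable.awaitNanos(remainingNanos);
            }
            credits -= needed;
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

The sketch deliberately leaves open what the sender does when acquire() returns false; it only shows that the wait itself can be bounded.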
Thoughts? My preferred option is the last one, since it gives the
timeout param in the RpcDispatcher more meaning.
No, that's a bad solution because the design of a flow control protocol
should not be influenced by an application level workaround. In
addition, if you have 500 threads, all timing out at the same time, you
will have a steady flow of messages, defeating the purpose of flow
control in the first place.
On top of that, we *cannot* do that because if we run on top of TCP, a
write() might block anyway if the TCP receiver set the sliding window to
0 ! So the sending of data on top of TCP will block (similar to FC) when
TCP throttles the sender.
By the way, some decades ago the same issue of 'timed method calls'
occurred in CORBA, e.g. invoke foo() but it should take 450ms tops.
What *could* be done here is to add an option to RpcDispatcher to use
separate threads from a thread pool to dispatch requests and correlate
responses. So, the caller would create a task (which sends the request
and waits for all responses, possibly listening for cancellation),
submit it to the pool and get a future. Then wait on the future for N
milliseconds and return with the current results after that time, or
throw a timeout exception, whatever.
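A minimal sketch of that idea, with loud caveats: this is not an existing RpcDispatcher option, TimedGroupCall and callAll() are made-up names, and the BlockingQueue of replies merely stands in for whatever would really correlate responses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not part of JGroups.
class TimedGroupCall {

    // Waits up to timeoutMs for 'expected' replies and returns whatever
    // has been received by then. 'replies' is an assumed stand-in for the
    // channel on which responses show up.
    static List<String> callAll(BlockingQueue<String> replies, int expected,
                                long timeoutMs, ExecutorService pool) {
        List<String> received = new CopyOnWriteArrayList<>();

        // The task runs on a pool thread, so the caller never blocks in
        // the transport or in flow control itself.
        Future<?> task = pool.submit(() -> {
            try {
                while (received.size() < expected) {
                    received.add(replies.take()); // interruptible wait
                }
            } catch (InterruptedException e) {
                // Cancelled by the caller: keep the flag set and stop.
                Thread.currentThread().interrupt();
            }
        });

        try {
            task.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            task.cancel(true); // interrupts the pool thread, best effort
        } catch (InterruptedException e) {
            task.cancel(true);
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
        // Snapshot of the current results, complete or not.
        return new ArrayList<>(received);
    }
}
```

Throwing a timeout exception instead of returning partial results would only change the TimeoutException branch. A pool thread that is stuck somewhere uninterruptible stays stuck after cancel(true); the caller just no longer waits for it.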
This issue cannot and should not be solved at the FC level, by
'bypassing' flow control ! Note that, under normal circumstances, and
with 2.5, FC should never block for an extended time.
--
Bela Ban
Lead JGroups / JBoss Clustering team
JBoss - a division of Red Hat