On 15 Jun 2007, at 13:09, Bela Ban wrote:
Jason T. Greene wrote:
> That should be ok though because the CCL will still timeout on that
> lock. The original problem was that the same thread with the CCL lock
> was blocking on an FC lock so that the CCL lock would never be
> released
> (since the FC lock was higher in the stack).
See my previous reply. Yes, it blocked on FC.down() because it
didn't receive credits in up(). But up() wasn't called because
there was a replication message ahead of it in the queue that
blocked on the FQN held by the CCL.
So to tackle this, my suggestions were, in this order:
#1 Don't hold a lock while making a synchronous cluster method
call. That's a big no-no, especially in pre-2.5 releases. We had
lots of bugs in the clustering code due to such code. Then Brian
cleaned up all of it... :-)
#2 The timeout mechanism in JGroups which uses threads. Ugly, and a
hack, and only needed for 2.4. As I argued, this will avoid the
deadlock, but it will constantly time out (assuming some traffic).
The root cause of this is #1.
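
To make #1 concrete, here is a minimal sketch in present-day Java. It is not taken from the JBoss Cache or JGroups code: the class, the method names and the Supplier standing in for the synchronous cluster call are all hypothetical. The first method blocks on the remote call while still holding the lock, which is the pattern behind the deadlock; the second does the local work under the lock, releases it, and only then makes the blocking call.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in JBoss Cache or JGroups.
class LockScopeSketch {
    private final ReentrantLock nodeLock = new ReentrantLock();
    private String localState = "initial";

    // Anti-pattern: the synchronous cluster call runs while the lock is held.
    // If the reply depends on another thread that needs nodeLock, nothing progresses.
    String replicateHoldingLock(Supplier<String> syncClusterCall) {
        nodeLock.lock();
        try {
            localState = "updated";
            return syncClusterCall.get() + ":" + localState;
        } finally {
            nodeLock.unlock();
        }
    }

    // Preferred: mutate local state under the lock, release it, then block on the cluster.
    String replicateAfterUnlock(Supplier<String> syncClusterCall) {
        String snapshot;
        nodeLock.lock();
        try {
            localState = "updated";
            snapshot = localState;
        } finally {
            nodeLock.unlock();
        }
        return syncClusterCall.get() + ":" + snapshot;
    }

    boolean isLockHeldByCurrentThread() {
        return nodeLock.isHeldByCurrentThread();
    }
}
```

Whether the local update can safely be separated from the replication call like this depends on the consistency guarantees the caller needs, so take it as a shape rather than a drop-in fix.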
Let me look into why we had #1 anyway. Originally the 1.2.x codebase
used a synchronized block on the CacheLoaderInterceptor for this
which meant that only one thread could pass through this interceptor
at any given time. I changed this to lock on the Fqn in question so
that, at least when the Fqns didn't overlap, multiple threads could go
through this interceptor.
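
Roughly, the difference between the two approaches looks like the sketch below. This is not the actual CacheLoaderInterceptor code. The class and method names are made up, Fqns are represented as plain strings, and it only locks on the exact Fqn, so it ignores parent/child overlap. The point is just that one lock per Fqn lets threads working on different Fqns pass through at the same time, where a single synchronized block would serialise all of them.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical sketch of per-Fqn locking; Fqns are modelled as plain strings.
class PerFqnLockSketch {
    // One lock per Fqn, created lazily. Entries are never removed in this sketch.
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Runs the action while holding only the lock for the given Fqn.
    <T> T withFqnLock(String fqn, Supplier<T> action) {
        ReentrantLock lock = locks.computeIfAbsent(fqn, k -> new ReentrantLock());
        lock.lock();
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }

    // True if some thread currently holds the lock for this Fqn.
    boolean isLocked(String fqn) {
        ReentrantLock lock = locks.get(fqn);
        return lock != null && lock.isLocked();
    }
}
```

For example, while a thread is inside withFqnLock("/a/b", ...), isLocked("/a/c") is still false, so a second thread loading /a/c would not be held up.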
The reason behind it seems to be so that the CacheLoader impl does
not have to deal with concurrent calls on the same node, but thinking
about it, I feel this is something that should be handled in each
CacheLoader impl, which should be thread safe.
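
As a rough idea of what "thread safe in the impl" could mean, here is a sketch of an in-memory loader that copes with concurrent calls on the same node by itself. It does not implement the real CacheLoader interface; the class, its two methods and the string-based Fqns are assumptions made for the example. Each update goes through ConcurrentHashMap.compute, which is atomic per key, so the interceptor would not need to hold a lock around the call.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory loader; NOT the JBoss Cache CacheLoader interface.
class ThreadSafeLoaderSketch {
    // Fqn (as a string) -> immutable attribute map for that node.
    private final ConcurrentHashMap<String, Map<String, String>> store = new ConcurrentHashMap<>();

    // Atomically adds one attribute to a node. compute() serialises updates per Fqn,
    // and each update publishes a fresh immutable copy of the attribute map.
    void put(String fqn, String key, String value) {
        store.compute(fqn, (k, attrs) -> {
            Map<String, String> copy = (attrs == null) ? new HashMap<>() : new HashMap<>(attrs);
            copy.put(key, value);
            return Collections.unmodifiableMap(copy);
        });
    }

    // Returns the current attributes of a node, or an empty map if it is unknown.
    Map<String, String> get(String fqn) {
        return store.getOrDefault(fqn, Collections.emptyMap());
    }
}
```

A loader backed by a file system or a database would need its own mechanism (file locks, transactions), so this only shows the principle.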