[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103

Mon Jun 18 19:08:01 EDT 2007

On 18 Jun 2007, at 11:48, Manik Surtani wrote:

>
> On 15 Jun 2007, at 13:09, Bela Ban wrote:
>
>>
>>
>> Jason T. Greene wrote:
>>> That should be ok though because the CCL will still timeout on that
>>> lock. The original problem was that the same thread with the CCL  
>>> lock
>>> was blocking on an FC lock so that the CCL lock would never be  
>>> released
>>> (since the FC lock was higher in the stack).
>>
>> See my previous reply. Yes, it blocked on FC.down() because it  
>> didn't receive credits in up(). But up() wasn't called because  
>> there was a replication message ahead of it in the queue that  
>> blocked on the FQN held by the CCL.
>>
>> So to tackle this, my suggestion were, in this order:
>> #1 Don't hold a lock while making a synchronous cluster method  
>> call. That's a big no no, especially in pre-2.5 releases. We had  
>> lots of bugs in the clustering code due to such code. Then Brian  
>> cleaned up all of it... :-)
>> #2 The timeout mechanism in JGroups which uses threads. Ugly, and  
>> a hack, and only needed for 2.4. As I argued, this will avoid the  
>> deadlock, but it will constantly time out (assuming some traffic).
>>
>> The root cause of this is #1
>
> Let me look into why we had #1 anyway.  Originally the 1.2.x  
> codebase used a synchronized block on the CacheLoaderInterceptor  
> for this which meant that only one thread could pass through this  
> interceptor at any given time.  I changed this to lock on the Fqn  
> in question so at least if the Fqns didn't overlap multiple threads  
> could go thru this interceptor.
>
> The reason behind it seems to be so that the CacheLoader impl does  
> not have to deal with concurrent calls on the same node, but  
> thinking about it, I feel this is something that should be handled  
> in each CacheLoader impl, which should be thread safe.

Ok, I spent a bit of time investigating this, and the original sync  
blocks in the CacheLoaderInterceptor, superceded by the  
synchronization on an Fqn in subsequent versions are all hacks to get  
around the fact that some cache loader implementations themselves  
aren't thread safe.

I wrote a test to check the thread safety of cache loader impls, and  
most of what we have fail.  I patched the DummyInMemoryCacheLoader to  
be thread safe (a lot of synchronization within the impl) and removed  
the unnecessary locks in the interceptors and things work fine.  I  
tried this with some of the few thread-safe cache loader impls we  
have (BDBJE, for example) and this worked well as well.

So what I propose is this:

1) Patch CLI to NOT use the Fqn locks (in BaseCacheLoaderInterceptor)
2) Make sure we acquire locks (lock() call up the interceptor stack)  
*before* attempting to load node, but after creating temp node
3) Make sure cache loader impls are thread safe
4) Add thread safety test to CacheLoaderTestBase
5) Patch loaders that aren't threadsafe and fail test in (4) -  
FileCacheLoader, JDBCCacheLoader, etc.

I'm going to do this on my local checkout of HEAD and run the  
regression tests overnight.  If this works, I suggest backporting  
this to 1.4.x and regressing there.

WDYT?

Cheers,
--
Manik Surtani

Lead, JBoss Cache
JBoss, a division of Red Hat