[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103

Tue Jun 19 07:32:29 EDT 2007

OK, yes, having remote puts and CCL'induced gets not block on the same 
lock is certainly a much better solution.

What will also help is JGroups 2.5 *should* we still encounter a deadlock.

BTW: I assume using the Multiplexer here is out of the question ? You 
could use a separate stack in the Multiplexer (e.g. "udp-repl-sync") 
which is only used by the CCL, and has no FC in it, and everyone else 
uses the async stack.

But then again, there was a reason we didn't configure the Multiplexer 
to be the defaut in 2.4...

Manik Surtani wrote:
>
> On 19 Jun 2007, at 09:04, Bela Ban wrote:
>
>> Is this going to solve JBCACHE-1103 ? I mean, +1 for the lock 
>> refactoring, sounds like it needs to be done anyway. But, because you 
>> have to be thread safe at one point, aren't you deferring the issue 
>> to a later stage ? At one point you have to acquire a lock to prevent 
>> concurrent loading, and I'm afraid we'll have the same issue.
>>
>> But, wait, if you defer lock acquisition till after the synchronous 
>> cluster method call, you should be fine. Is this what you have in mind ?
>
> There are 2 locks here that need to be done.  One lock is on the Node, 
> via the lock interceptor, which happens *after* the cluster method 
> call.  This is not the problem.
>
> We also had a level of sync in the cache loader interceptor to prevent 
> the same Fqn being loaded twice from the CL (or a simultaneous load 
> and store, etc).  This bit was the problem, since this sync had to 
> happen *before* any loading was done by the CL.
>
> Regarding the other CL impls, some of them already were threadsafe 
> (BDBJE, Jdbm) and some weren't (JDBC, File). I created a StripedLock 
> class which uses a set of ReentrantReadWriteLocks applied to the Fqn 
> in question, and the JDBC and File CLs now use this to achieve thread 
> safety (more efficient than synchronized blocks and better mem usage 
> than one lock per Fqn - usual benefits of lock striping).
>
> You're right that the improvements in locking here are not quite 
> central to solving 1103, but somewhat related.
>
> The real solution to 1103 is in the CCL - remotely originating put()'s 
> shouldn't need a load (and hence not need to wait on the Fqn being 
> loaded) *only if* we're talking about the CCL, since CCLs don't 
> perform a corresponding store (they're read-only).  Therefore this 
> logic cannot be in the CLI as it doesn't apply to all CL impls.
>
>
>>
>> On the same topic: I looked into interrupting clustered method calls 
>> in JGroups (http://jira.jboss.com/jira/browse/JGRP-533), and came to 
>> the conclusion that it isn't feasible. When a thread is blocked on 
>> FC.down(), waiting for credits, then I can interrupt it, but it will 
>> go right back in its loop waiting for credits, therefore blocking 
>> again. Unless I change the loop condition (running & 
>> waiting-for-credits), the thread will always block if there aren't 
>> enough credits available.
>>
>> Thoughts ?
>>
>>
>> Manik Surtani wrote:
>>>
>>>
>>> Ok, I spent a bit of time investigating this, and the original sync 
>>> blocks in the CacheLoaderInterceptor, superceded by the 
>>> synchronization on an Fqn in subsequent versions are all hacks to 
>>> get around the fact that some cache loader implementations 
>>> themselves aren't thread safe.
>>>
>>> I wrote a test to check the thread safety of cache loader impls, and 
>>> most of what we have fail.  I patched the DummyInMemoryCacheLoader 
>>> to be thread safe (a lot of synchronization within the impl) and 
>>> removed the unnecessary locks in the interceptors and things work 
>>> fine.  I tried this with some of the few thread-safe cache loader 
>>> impls we have (BDBJE, for example) and this worked well as well.
>>>
>>> So what I propose is this:
>>>
>>> 1) Patch CLI to NOT use the Fqn locks (in BaseCacheLoaderInterceptor)
>>> 2) Make sure we acquire locks (lock() call up the interceptor stack) 
>>> *before* attempting to load node, but after creating temp node
>>> 3) Make sure cache loader impls are thread safe
>>> 4) Add thread safety test to CacheLoaderTestBase
>>> 5) Patch loaders that aren't threadsafe and fail test in (4) - 
>>> FileCacheLoader, JDBCCacheLoader, etc.
>>>
>>> I'm going to do this on my local checkout of HEAD and run the 
>>> regression tests overnight.  If this works, I suggest backporting 
>>> this to 1.4.x and regressing there.
>>>
>>> WDYT?
>>
>> -- 
>> Bela Ban
>> Lead JGroups / JBoss Clustering team
>> JBoss - a division of Red Hat
>
> -- 
> Manik Surtani
>
> Lead, JBoss Cache
> JBoss, a division of Red Hat
>
>
>

-- 
Bela Ban
Lead JGroups / JBoss Clustering team
JBoss - a division of Red Hat