[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103

Tue Jun 19 10:28:44 EDT 2007

Scratch what I said abt the Bdbje CL - it gets it's knickers in a  
twist as well with enough CPUs/cores - the 4-cpu lab servers got the  
BDBJE engine in a deadlock with concurrent reads, writes and  
removes.  Will have to apply the same StripedLock access check to  
this as well.

On 19 Jun 2007, at 12:32, Bela Ban wrote:

> OK, yes, having remote puts and CCL'induced gets not block on the  
> same lock is certainly a much better solution.
>
> What will also help is JGroups 2.5 *should* we still encounter a  
> deadlock.
>
> BTW: I assume using the Multiplexer here is out of the question ?  
> You could use a separate stack in the Multiplexer (e.g. "udp-repl- 
> sync") which is only used by the CCL, and has no FC in it, and  
> everyone else uses the async stack.
>
> But then again, there was a reason we didn't configure the  
> Multiplexer to be the defaut in 2.4...
>
> Manik Surtani wrote:
>>
>> On 19 Jun 2007, at 09:04, Bela Ban wrote:
>>
>>> Is this going to solve JBCACHE-1103 ? I mean, +1 for the lock  
>>> refactoring, sounds like it needs to be done anyway. But, because  
>>> you have to be thread safe at one point, aren't you deferring the  
>>> issue to a later stage ? At one point you have to acquire a lock  
>>> to prevent concurrent loading, and I'm afraid we'll have the same  
>>> issue.
>>>
>>> But, wait, if you defer lock acquisition till after the  
>>> synchronous cluster method call, you should be fine. Is this what  
>>> you have in mind ?
>>
>> There are 2 locks here that need to be done.  One lock is on the  
>> Node, via the lock interceptor, which happens *after* the cluster  
>> method call.  This is not the problem.
>>
>> We also had a level of sync in the cache loader interceptor to  
>> prevent the same Fqn being loaded twice from the CL (or a  
>> simultaneous load and store, etc).  This bit was the problem,  
>> since this sync had to happen *before* any loading was done by the  
>> CL.
>>
>> Regarding the other CL impls, some of them already were threadsafe  
>> (BDBJE, Jdbm) and some weren't (JDBC, File). I created a  
>> StripedLock class which uses a set of ReentrantReadWriteLocks  
>> applied to the Fqn in question, and the JDBC and File CLs now use  
>> this to achieve thread safety (more efficient than synchronized  
>> blocks and better mem usage than one lock per Fqn - usual benefits  
>> of lock striping).
>>
>> You're right that the improvements in locking here are not quite  
>> central to solving 1103, but somewhat related.
>>
>> The real solution to 1103 is in the CCL - remotely originating put 
>> ()'s shouldn't need a load (and hence not need to wait on the Fqn  
>> being loaded) *only if* we're talking about the CCL, since CCLs  
>> don't perform a corresponding store (they're read-only).   
>> Therefore this logic cannot be in the CLI as it doesn't apply to  
>> all CL impls.
>>
>>
>>>
>>> On the same topic: I looked into interrupting clustered method  
>>> calls in JGroups (http://jira.jboss.com/jira/browse/JGRP-533),  
>>> and came to the conclusion that it isn't feasible. When a thread  
>>> is blocked on FC.down(), waiting for credits, then I can  
>>> interrupt it, but it will go right back in its loop waiting for  
>>> credits, therefore blocking again. Unless I change the loop  
>>> condition (running & waiting-for-credits), the thread will always  
>>> block if there aren't enough credits available.
>>>
>>> Thoughts ?
>>>
>>>
>>> Manik Surtani wrote:
>>>>
>>>>
>>>> Ok, I spent a bit of time investigating this, and the original  
>>>> sync blocks in the CacheLoaderInterceptor, superceded by the  
>>>> synchronization on an Fqn in subsequent versions are all hacks  
>>>> to get around the fact that some cache loader implementations  
>>>> themselves aren't thread safe.
>>>>
>>>> I wrote a test to check the thread safety of cache loader impls,  
>>>> and most of what we have fail.  I patched the  
>>>> DummyInMemoryCacheLoader to be thread safe (a lot of  
>>>> synchronization within the impl) and removed the unnecessary  
>>>> locks in the interceptors and things work fine.  I tried this  
>>>> with some of the few thread-safe cache loader impls we have  
>>>> (BDBJE, for example) and this worked well as well.
>>>>
>>>> So what I propose is this:
>>>>
>>>> 1) Patch CLI to NOT use the Fqn locks (in  
>>>> BaseCacheLoaderInterceptor)
>>>> 2) Make sure we acquire locks (lock() call up the interceptor  
>>>> stack) *before* attempting to load node, but after creating temp  
>>>> node
>>>> 3) Make sure cache loader impls are thread safe
>>>> 4) Add thread safety test to CacheLoaderTestBase
>>>> 5) Patch loaders that aren't threadsafe and fail test in (4) -  
>>>> FileCacheLoader, JDBCCacheLoader, etc.
>>>>
>>>> I'm going to do this on my local checkout of HEAD and run the  
>>>> regression tests overnight.  If this works, I suggest  
>>>> backporting this to 1.4.x and regressing there.
>>>>
>>>> WDYT?
>>>
>>> -- 
>>> Bela Ban
>>> Lead JGroups / JBoss Clustering team
>>> JBoss - a division of Red Hat
>>
>> -- 
>> Manik Surtani
>>
>> Lead, JBoss Cache
>> JBoss, a division of Red Hat
>>
>>
>>
>
> -- 
> Bela Ban
> Lead JGroups / JBoss Clustering team
> JBoss - a division of Red Hat
>

--
Manik Surtani

Lead, JBoss Cache
JBoss, a division of Red Hat