Re: [jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103

Tuesday, 19 June 2007

Scratch what I said abt the Bdbje CL - it gets it's knickers in a  
twist as well with enough CPUs/cores - the 4-cpu lab servers got the  
BDBJE engine in a deadlock with concurrent reads, writes and  
removes.  Will have to apply the same StripedLock access check to  
this as well.

On 19 Jun 2007, at 12:32, Bela Ban wrote:

...
 OK, yes, having remote puts and CCL'induced gets not block on the

 same lock is certainly a much better solution.

 What will also help is JGroups 2.5 *should* we still encounter a  
 deadlock.

 BTW: I assume using the Multiplexer here is out of the question ?  
 You could use a separate stack in the Multiplexer (e.g. "udp-repl- 
 sync") which is only used by the CCL, and has no FC in it, and  
 everyone else uses the async stack.

 But then again, there was a reason we didn't configure the  
 Multiplexer to be the defaut in 2.4...

 Manik Surtani wrote:
>
> On 19 Jun 2007, at 09:04, Bela Ban wrote:
>
>> Is this going to solve JBCACHE-1103 ? I mean, +1 for the lock  
>> refactoring, sounds like it needs to be done anyway. But, because  
>> you have to be thread safe at one point, aren't you deferring the  
>> issue to a later stage ? At one point you have to acquire a lock  
>> to prevent concurrent loading, and I'm afraid we'll have the same  
>> issue.
>>
>> But, wait, if you defer lock acquisition till after the  
>> synchronous cluster method call, you should be fine. Is this what  
>> you have in mind ?
>
> There are 2 locks here that need to be done.  One lock is on the  
> Node, via the lock interceptor, which happens *after* the cluster  
> method call.  This is not the problem.
>
> We also had a level of sync in the cache loader interceptor to  
> prevent the same Fqn being loaded twice from the CL (or a  
> simultaneous load and store, etc).  This bit was the problem,  
> since this sync had to happen *before* any loading was done by the  
> CL.
>
> Regarding the other CL impls, some of them already were threadsafe  
> (BDBJE, Jdbm) and some weren't (JDBC, File). I created a  
> StripedLock class which uses a set of ReentrantReadWriteLocks  
> applied to the Fqn in question, and the JDBC and File CLs now use  
> this to achieve thread safety (more efficient than synchronized  
> blocks and better mem usage than one lock per Fqn - usual benefits  
> of lock striping).
>
> You're right that the improvements in locking here are not quite  
> central to solving 1103, but somewhat related.
>
> The real solution to 1103 is in the CCL - remotely originating put 
> ()'s shouldn't need a load (and hence not need to wait on the Fqn  
> being loaded) *only if* we're talking about the CCL, since CCLs  
> don't perform a corresponding store (they're read-only).   
> Therefore this logic cannot be in the CLI as it doesn't apply to  
> all CL impls.
>
>
>>
>> On the same topic: I looked into interrupting clustered method  
>> calls in JGroups (http://jira.jboss.com/jira/browse/JGRP-533),  
>> and came to the conclusion that it isn't feasible. When a thread  
>> is blocked on FC.down(), waiting for credits, then I can  
>> interrupt it, but it will go right back in its loop waiting for  
>> credits, therefore blocking again. Unless I change the loop  
>> condition (running & waiting-for-credits), the thread will always  
>> block if there aren't enough credits available.
>>
>> Thoughts ?
>>
>>
>> Manik Surtani wrote:
>>>
>>>
>>> Ok, I spent a bit of time investigating this, and the original  
>>> sync blocks in the CacheLoaderInterceptor, superceded by the  
>>> synchronization on an Fqn in subsequent versions are all hacks  
>>> to get around the fact that some cache loader implementations  
>>> themselves aren't thread safe.
>>>
>>> I wrote a test to check the thread safety of cache loader impls,  
>>> and most of what we have fail.  I patched the  
>>> DummyInMemoryCacheLoader to be thread safe (a lot of  
>>> synchronization within the impl) and removed the unnecessary  
>>> locks in the interceptors and things work fine.  I tried this  
>>> with some of the few thread-safe cache loader impls we have  
>>> (BDBJE, for example) and this worked well as well.
>>>
>>> So what I propose is this:
>>>
>>> 1) Patch CLI to NOT use the Fqn locks (in  
>>> BaseCacheLoaderInterceptor)
>>> 2) Make sure we acquire locks (lock() call up the interceptor  
>>> stack) *before* attempting to load node, but after creating temp  
>>> node
>>> 3) Make sure cache loader impls are thread safe
>>> 4) Add thread safety test to CacheLoaderTestBase
>>> 5) Patch loaders that aren't threadsafe and fail test in (4) -  
>>> FileCacheLoader, JDBCCacheLoader, etc.
>>>
>>> I'm going to do this on my local checkout of HEAD and run the  
>>> regression tests overnight.  If this works, I suggest  
>>> backporting this to 1.4.x and regressing there.
>>>
>>> WDYT?
>>
>> -- 
>> Bela Ban
>> Lead JGroups / JBoss Clustering team
>> JBoss - a division of Red Hat
>
> -- 
> Manik Surtani
>
> Lead, JBoss Cache
> JBoss, a division of Red Hat
>
>
>

 -- 
 Bela Ban
 Lead JGroups / JBoss Clustering team
 JBoss - a division of Red Hat

--
Manik Surtani

Lead, JBoss Cache
JBoss, a division of Red Hat

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

Re: [jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103