[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103

Mon Jun 18 19:14:00 EDT 2007

Simply awesome!

On Tue, 2007-06-19 at 00:08 +0100, Manik Surtani wrote:
> On 18 Jun 2007, at 11:48, Manik Surtani wrote:
> 
> >
> > On 15 Jun 2007, at 13:09, Bela Ban wrote:
> >
> >>
> >>
> >> Jason T. Greene wrote:
> >>> That should be ok though because the CCL will still timeout on that
> >>> lock. The original problem was that the same thread with the CCL  
> >>> lock
> >>> was blocking on an FC lock so that the CCL lock would never be  
> >>> released
> >>> (since the FC lock was higher in the stack).
> >>
> >> See my previous reply. Yes, it blocked on FC.down() because it  
> >> didn't receive credits in up(). But up() wasn't called because  
> >> there was a replication message ahead of it in the queue that  
> >> blocked on the FQN held by the CCL.
> >>
> >> So to tackle this, my suggestion were, in this order:
> >> #1 Don't hold a lock while making a synchronous cluster method  
> >> call. That's a big no no, especially in pre-2.5 releases. We had  
> >> lots of bugs in the clustering code due to such code. Then Brian  
> >> cleaned up all of it... :-)
> >> #2 The timeout mechanism in JGroups which uses threads. Ugly, and  
> >> a hack, and only needed for 2.4. As I argued, this will avoid the  
> >> deadlock, but it will constantly time out (assuming some traffic).
> >>
> >> The root cause of this is #1
> >
> > Let me look into why we had #1 anyway.  Originally the 1.2.x  
> > codebase used a synchronized block on the CacheLoaderInterceptor  
> > for this which meant that only one thread could pass through this  
> > interceptor at any given time.  I changed this to lock on the Fqn  
> > in question so at least if the Fqns didn't overlap multiple threads  
> > could go thru this interceptor.
> >
> > The reason behind it seems to be so that the CacheLoader impl does  
> > not have to deal with concurrent calls on the same node, but  
> > thinking about it, I feel this is something that should be handled  
> > in each CacheLoader impl, which should be thread safe.
> 
> Ok, I spent a bit of time investigating this, and the original sync  
> blocks in the CacheLoaderInterceptor, superceded by the  
> synchronization on an Fqn in subsequent versions are all hacks to get  
> around the fact that some cache loader implementations themselves  
> aren't thread safe.
> 
> I wrote a test to check the thread safety of cache loader impls, and  
> most of what we have fail.  I patched the DummyInMemoryCacheLoader to  
> be thread safe (a lot of synchronization within the impl) and removed  
> the unnecessary locks in the interceptors and things work fine.  I  
> tried this with some of the few thread-safe cache loader impls we  
> have (BDBJE, for example) and this worked well as well.
> 
> So what I propose is this:
> 
> 1) Patch CLI to NOT use the Fqn locks (in BaseCacheLoaderInterceptor)
> 2) Make sure we acquire locks (lock() call up the interceptor stack)  
> *before* attempting to load node, but after creating temp node
> 3) Make sure cache loader impls are thread safe
> 4) Add thread safety test to CacheLoaderTestBase
> 5) Patch loaders that aren't threadsafe and fail test in (4) -  
> FileCacheLoader, JDBCCacheLoader, etc.
> 
> I'm going to do this on my local checkout of HEAD and run the  
> regression tests overnight.  If this works, I suggest backporting  
> this to 1.4.x and regressing there.
> 
> WDYT?
> 
> Cheers,
> --
> Manik Surtani
> 
> Lead, JBoss Cache
> JBoss, a division of Red Hat
> 
> 
> 
> _______________________________________________
> jbosscache-dev mailing list
> jbosscache-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/jbosscache-dev
-- 
Jason T. Greene
Lead, POJO Cache
JBoss, a division of Red Hat