[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103
Manik Surtani
manik at jboss.org
Tue Jun 19 10:28:44 EDT 2007
Scratch what I said abt the Bdbje CL - it gets it's knickers in a
twist as well with enough CPUs/cores - the 4-cpu lab servers got the
BDBJE engine in a deadlock with concurrent reads, writes and
removes. Will have to apply the same StripedLock access check to
this as well.
On 19 Jun 2007, at 12:32, Bela Ban wrote:
> OK, yes, having remote puts and CCL'induced gets not block on the
> same lock is certainly a much better solution.
>
> What will also help is JGroups 2.5 *should* we still encounter a
> deadlock.
>
> BTW: I assume using the Multiplexer here is out of the question ?
> You could use a separate stack in the Multiplexer (e.g. "udp-repl-
> sync") which is only used by the CCL, and has no FC in it, and
> everyone else uses the async stack.
>
> But then again, there was a reason we didn't configure the
> Multiplexer to be the defaut in 2.4...
>
> Manik Surtani wrote:
>>
>> On 19 Jun 2007, at 09:04, Bela Ban wrote:
>>
>>> Is this going to solve JBCACHE-1103 ? I mean, +1 for the lock
>>> refactoring, sounds like it needs to be done anyway. But, because
>>> you have to be thread safe at one point, aren't you deferring the
>>> issue to a later stage ? At one point you have to acquire a lock
>>> to prevent concurrent loading, and I'm afraid we'll have the same
>>> issue.
>>>
>>> But, wait, if you defer lock acquisition till after the
>>> synchronous cluster method call, you should be fine. Is this what
>>> you have in mind ?
>>
>> There are 2 locks here that need to be done. One lock is on the
>> Node, via the lock interceptor, which happens *after* the cluster
>> method call. This is not the problem.
>>
>> We also had a level of sync in the cache loader interceptor to
>> prevent the same Fqn being loaded twice from the CL (or a
>> simultaneous load and store, etc). This bit was the problem,
>> since this sync had to happen *before* any loading was done by the
>> CL.
>>
>> Regarding the other CL impls, some of them already were threadsafe
>> (BDBJE, Jdbm) and some weren't (JDBC, File). I created a
>> StripedLock class which uses a set of ReentrantReadWriteLocks
>> applied to the Fqn in question, and the JDBC and File CLs now use
>> this to achieve thread safety (more efficient than synchronized
>> blocks and better mem usage than one lock per Fqn - usual benefits
>> of lock striping).
>>
>> You're right that the improvements in locking here are not quite
>> central to solving 1103, but somewhat related.
>>
>> The real solution to 1103 is in the CCL - remotely originating put
>> ()'s shouldn't need a load (and hence not need to wait on the Fqn
>> being loaded) *only if* we're talking about the CCL, since CCLs
>> don't perform a corresponding store (they're read-only).
>> Therefore this logic cannot be in the CLI as it doesn't apply to
>> all CL impls.
>>
>>
>>>
>>> On the same topic: I looked into interrupting clustered method
>>> calls in JGroups (http://jira.jboss.com/jira/browse/JGRP-533),
>>> and came to the conclusion that it isn't feasible. When a thread
>>> is blocked on FC.down(), waiting for credits, then I can
>>> interrupt it, but it will go right back in its loop waiting for
>>> credits, therefore blocking again. Unless I change the loop
>>> condition (running & waiting-for-credits), the thread will always
>>> block if there aren't enough credits available.
>>>
>>> Thoughts ?
>>>
>>>
>>> Manik Surtani wrote:
>>>>
>>>>
>>>> Ok, I spent a bit of time investigating this, and the original
>>>> sync blocks in the CacheLoaderInterceptor, superceded by the
>>>> synchronization on an Fqn in subsequent versions are all hacks
>>>> to get around the fact that some cache loader implementations
>>>> themselves aren't thread safe.
>>>>
>>>> I wrote a test to check the thread safety of cache loader impls,
>>>> and most of what we have fail. I patched the
>>>> DummyInMemoryCacheLoader to be thread safe (a lot of
>>>> synchronization within the impl) and removed the unnecessary
>>>> locks in the interceptors and things work fine. I tried this
>>>> with some of the few thread-safe cache loader impls we have
>>>> (BDBJE, for example) and this worked well as well.
>>>>
>>>> So what I propose is this:
>>>>
>>>> 1) Patch CLI to NOT use the Fqn locks (in
>>>> BaseCacheLoaderInterceptor)
>>>> 2) Make sure we acquire locks (lock() call up the interceptor
>>>> stack) *before* attempting to load node, but after creating temp
>>>> node
>>>> 3) Make sure cache loader impls are thread safe
>>>> 4) Add thread safety test to CacheLoaderTestBase
>>>> 5) Patch loaders that aren't threadsafe and fail test in (4) -
>>>> FileCacheLoader, JDBCCacheLoader, etc.
>>>>
>>>> I'm going to do this on my local checkout of HEAD and run the
>>>> regression tests overnight. If this works, I suggest
>>>> backporting this to 1.4.x and regressing there.
>>>>
>>>> WDYT?
>>>
>>> --
>>> Bela Ban
>>> Lead JGroups / JBoss Clustering team
>>> JBoss - a division of Red Hat
>>
>> --
>> Manik Surtani
>>
>> Lead, JBoss Cache
>> JBoss, a division of Red Hat
>>
>>
>>
>
> --
> Bela Ban
> Lead JGroups / JBoss Clustering team
> JBoss - a division of Red Hat
>
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
More information about the jbosscache-dev
mailing list