[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103
Manik Surtani
manik at jboss.org
Fri Jun 15 05:16:42 EDT 2007
Yes, it would time out, but that is better than the deadlock that
currently occurs. IMO the timeout is a valid response to such a
call. If the CCL cannot complete a remote call because of a remote
lock, it should time out. When it does, it releases the lock on
Fqn-A, and the update that originated remotely can proceed.
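To make the timeout argument concrete, here is a minimal sketch of the three steps from Bela's example quoted below. It does not use JBoss Cache or JGroups APIs: the class name, the ReentrantLock standing in for the lock on Fqn-A, and the latch standing in for the CCL response are all assumptions made for illustration. The latch is never released, which models the response being stuck behind RM-A in the single incoming queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model only; none of these names exist in JBoss Cache.
class CclTimeoutSketch {

    static String simulate(long timeoutMs) {
        ReentrantLock fqnALock = new ReentrantLock();          // stand-in for the lock on Fqn-A
        CountDownLatch cclResponse = new CountDownLatch(1);    // never counted down: response is queued behind RM-A
        AtomicBoolean rmApplied = new AtomicBoolean(false);

        // Step 2: the replication message RM-A needs the same lock.
        Thread rm = new Thread(() -> {
            fqnALock.lock();                                   // blocks while the CCL holds the lock
            try {
                rmApplied.set(true);
            } finally {
                fqnALock.unlock();
            }
        });

        boolean gotResponse = false;
        fqnALock.lock();                                       // Step 1: the CCL acquires the lock on Fqn-A
        try {
            rm.start();
            // Step 3: wait for the remote result, but only up to the timeout.
            gotResponse = cclResponse.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            fqnALock.unlock();                                 // timing out releases the lock
        }

        try {
            rm.join();                                         // RM-A can now acquire the lock and finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "ccl=" + (gotResponse ? "response" : "timeout")
                + ", rm=" + (rmApplied.get() ? "applied" : "blocked");
    }

    public static void main(String[] args) {
        System.out.println(simulate(100));
    }
}
```

In this model the CCL call always ends in a timeout, but the remote update is applied afterwards instead of both sides waiting on each other forever.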
The problem with making the CCL call without a lock is that it
exposes concurrent loading and overwriting for all cache loaders (the
CacheLoaderInterceptor treats the CCL as a simple cache loader
impl). It also opens up race conditions with eviction. Without locks
in the cache loader interceptor, the following could happen: Thread-1
does a get(), goes through the cache loader interceptor, sees the
requested node in memory and does not load it. The eviction thread
then gets a WL on the same node and evicts it. Thread-1 now reaches
the PessimisticLockInterceptor, cannot find the node and, since this
is a get() call, doesn't create the node but returns null.
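The eviction race can be replayed step by step in a small sketch. Everything here is hypothetical: the class, the two maps (in-memory nodes and the loader's backing store) and the Semaphore used as a non-reentrant stand-in for the node lock are not JBoss Cache code. The interleaving is replayed on a single thread so that the outcome is deterministic; an eviction attempt that cannot get the lock is simply skipped.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

// Hypothetical model only; none of these names exist in JBoss Cache.
class EvictionRaceSketch {
    final Map<String, String> memory = new HashMap<>();  // nodes currently in memory
    final Map<String, String> store = new HashMap<>();   // the cache loader's backing store
    final Semaphore nodeLock = new Semaphore(1);         // non-reentrant stand-in for the node lock

    EvictionRaceSketch() {
        store.put("/a", "v1");
        memory.put("/a", "v1");
    }

    // The eviction thread needs the WL; returns false if the lock is held.
    boolean tryEvict(String fqn) {
        if (!nodeLock.tryAcquire()) {
            return false;
        }
        try {
            memory.remove(fqn);
            return true;
        } finally {
            nodeLock.release();
        }
    }

    // A get() with an eviction attempt interleaved between the two interceptors.
    String get(String fqn, boolean lockInLoaderInterceptor) {
        boolean locked = false;
        if (lockInLoaderInterceptor) {
            nodeLock.acquireUninterruptibly();
            locked = true;
        }
        try {
            // Cache loader interceptor: load only if the node is not in memory.
            if (!memory.containsKey(fqn)) {
                memory.put(fqn, store.get(fqn));
            }
            // The eviction thread runs at exactly this point.
            tryEvict(fqn);
            // Lock interceptor: acquire the lock now if it is not held yet.
            if (!locked) {
                nodeLock.acquireUninterruptibly();
                locked = true;
            }
            return memory.get(fqn);                      // a get() does not create the node
        } finally {
            if (locked) {
                nodeLock.release();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(new EvictionRaceSketch().get("/a", false)); // node evicted in between
        System.out.println(new EvictionRaceSketch().get("/a", true));  // eviction blocked by the lock
    }
}
```

Without the lock in the loader step, the get() returns null even though the backing store still holds the value. With the lock held across both steps, the eviction attempt fails and the value is returned.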
On 15 Jun 2007, at 08:09, Bela Ban wrote:
> I looked at 1103 in a bit more detail and concluded that the change
> in JGroups (http://jira.jboss.com/jira/browse/JGRP-533) would not
> help. The underlying issue is that (a) 2.4.1 has a single incoming
> request queue and (b) the ClusteredCacheLoader holds a lock while
> making a cluster-wide call. Let's look at an example:
>
> 1. The CCL acquires a lock on Fqn-A and makes a cluster-wide call
> 2. We get a replication message for A (RM-A), so someone made an
> update to A and is now trying to commit the change. RM-A tries to
> acquire the lock on A, but is blocked because the CCL holds it.
> 3. Now a result for the CCL call arrives. It is not processed
> (single queue) until RM-A gets processed. However, that's not the
> case until the CCL call completes. In this case, the only way for
> the CCL call to complete is via a timeout, as it will never get
> its results.
>
> So even if I implemented 533, it wouldn't help, as the interleaving
> between CCL calls and RM messages for the same FQNs would lead to
> timeouts.
>
> Now, a possible solution to 1103 is that we make the CCL call
> *without* holding a lock. When we get the result(s), only *then* do
> we acquire a lock and update the FQN. We also need to check whether
> FQN-A was updated in the meantime and then decide which value to
> return (the value set by the RM or the one gotten from the CCL call).
>
> WDYT ?
>
> --
> Bela Ban
> Lead JGroups / JBoss Clustering team
> JBoss - a division of Red Hat
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat