[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCCACHE-1103

Fri Jun 15 07:59:05 EDT 2007

I don't think timeouts are very useful in this scenario. You are liable 
to get updates from other nodes for A. You are liable to read A locally, 
causing the CCL to make a cluster-wide call. Both CCL calls and updates 
from other nodes are likely to be interspersed. Each time this occurs 
you have to wait until the timeout elapses, say 10 seconds. This 
might even cause the remote update(s) to fail. This is compounded when 
you have more than one thread making updates and/or causing CCL calls, 
which is likely with HTTP session replication (not sure, though, 
whether they hit the same region; maybe that only happens with 
multi-frame or AJAX-like apps).
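
To make the stall concrete, here is a minimal, self-contained sketch of the scenario. It is not JBoss Cache or JGroups code: the class and method names are made up for illustration, a single-threaded executor stands in for the single incoming request queue, and a plain ReentrantLock stands in for the lock on Fqn-A.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the CCL / single-queue interaction; not real JBoss Cache code.
public class SingleQueueStall {

    /** Returns true if the simulated CCL call timed out waiting for its response. */
    public static boolean callTimesOut(boolean holdLockDuringCall, long timeoutMs) {
        final ReentrantLock lockOnA = new ReentrantLock();
        // One worker thread plays the role of the single incoming request queue.
        ExecutorService incoming = Executors.newSingleThreadExecutor();
        final CountDownLatch response = new CountDownLatch(1);

        if (holdLockDuringCall)
            lockOnA.lock();                       // the CCL locks A before calling out
        try {
            // RM-A arrives first and needs the lock on A to apply its update.
            incoming.execute(() -> { lockOnA.lock(); lockOnA.unlock(); });
            // The response to the CCL call is queued behind RM-A.
            incoming.execute(response::countDown);
            // The CCL waits for its response (still holding the lock, if requested).
            return !response.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            if (holdLockDuringCall)
                lockOnA.unlock();                 // only now can RM-A proceed
            incoming.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("lock held:     timed out = " + callTimesOut(true, 500));
        System.out.println("lock not held: timed out = " + callTimesOut(false, 500));
    }
}
```

With the lock held, the caller always waits for the full timeout, because the response sits behind a message that needs the very lock the caller is holding. Without the lock, the queued update and the response are both processed right away.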

Yes, deadlocks are avoided, but the system is practically unusable in 
such a case. The fact is that it is bad practice to hold a lock and then 
make a cluster-wide call, as this is liable to deadlock (or 
timeout-lock). I wonder if the lock mechanism in the CacheLoader can be 
overridden in CCL, so that you don't acquire a lock until after the 
remote call has been made?
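
Roughly what I have in mind, as a sketch only: the class, the map and the method names below are hypothetical and are not the CacheLoaderInterceptor API. The "a value already present locally wins" rule is just one possible policy for deciding which value to return, and the sketch does not address the eviction race.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical illustration of "remote call first, lock afterwards".
public class LockAfterRemoteCall {
    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String,String> data = new HashMap<String,String>();

    /** Applies an update that arrived via replication. */
    public void applyReplicated(String fqn, String value) {
        lock.lock();
        try {
            data.put(fqn, value);
        } finally {
            lock.unlock();
        }
    }

    /** remoteCall stands in for the cluster-wide call made by the CCL. */
    public String load(String fqn, Supplier<String> remoteCall) {
        // No lock is held while we potentially block on the cluster.
        String remote = remoteCall.get();

        lock.lock();
        try {
            // Check whether a replication message updated the node in the meantime.
            String current = data.get(fqn);
            if (current != null)
                return current;           // assumed policy: the replicated value wins
            if (remote != null)
                data.put(fqn, remote);    // otherwise store what the remote call returned
            return remote;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        final LockAfterRemoteCall cache = new LockAfterRemoteCall();
        // Simulate a replication update landing while the remote call is in flight.
        String value = cache.load("/a", () -> {
            cache.applyReplicated("/a", "replicated");
            return "remote";
        });
        System.out.println(value);
    }
}
```

Because the lock is only taken after the remote call returns, an incoming replication message for the same node is never blocked behind the cluster-wide call.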

In general, one should never hold a lock while doing something that can 
potentially block, e.g. in JGroups I do the following (edited):

 while(length > lowest_credit && running) {
     boolean rc=credits_available.await(max_block_time, TimeUnit.MILLISECONDS);
     if(rc || length <= lowest_credit || !running)
         break;
     long wait_time=System.currentTimeMillis() - last_credit_request;
     if(wait_time >= max_block_time) {
         last_credit_request=System.currentTimeMillis();
         // we need to send the credit requests down *without* holding
         // the sent_lock, otherwise we might run into the deadlock
         // described in http://jira.jboss.com/jira/browse/JGRP-292
         Map<Address,Long> sent_copy=new HashMap<Address,Long>(sent);
         sent_copy.keySet().retainAll(creditors);
         sent_lock.unlock();
         try {
             for(Map.Entry<Address,Long> entry: sent_copy.entrySet()) {
                 sendCreditRequest(entry.getKey(), entry.getValue());
             }
         }
         finally {
             sent_lock.lock();
         }
     }
 }

Manik Surtani wrote:
> Yes, it would time out, but that is better than the deadlock that 
> currently occurs.  IMO the timeout is a valid response to such 
> a call.  If the CCL cannot complete a remote call because of a remote 
> lock, it should time out.  And when it does, it releases the lock on 
> Fqn-A and the update originating remotely can proceed.
>
> The problem with making the CCL call without a lock is that this 
> exposes concurrent loading and overwriting for all cache loaders (the 
> CCL is treated as a simple cache loader impl by the 
> CacheLoaderInterceptor).  This also has implications for race 
> conditions on eviction.  Without locks in the cache loader interceptor, 
> the following could happen: Thread-1 does a get(), goes through the 
> cache loader interceptor, sees the requested node in memory and does 
> not load.  Eviction-thread gets a WL on the same node and evicts it.  
> Thread-1 now gets to the PessimisticLockInterceptor, cannot find the 
> node, and since this is a get() call, doesn't create the node but 
> returns null.
>
>
> On 15 Jun 2007, at 08:09, Bela Ban wrote:
>
>> I looked at 1103 in a bit more detail and concluded that the change 
>> in JGroups (http://jira.jboss.com/jira/browse/JGRP-533) would not 
>> help. The underlying issue is that (a) 2.4.1 has a single incoming 
>> request queue and (b) the ClusteredCacheLoader holds a lock while 
>> making a cluster-wide call. Let's look at an example:
>>
>>   1. The CCL acquires a lock on Fqn-A and makes a cluster-wide call
>>   2. We get a replication message for A (RM-A), so someone made an
>>      update to A and is now trying to commit the change. RM-A tries to
>>      acquire the lock on A, but is blocked because the CCL holds it.
>>   3. Now a result for the CCL call arrives. It is not processed (single
>>      queue) until RM-A gets processed. However, RM-A cannot be
>>      processed until the CCL call completes and releases the lock. So
>>      the only way for the CCL call to complete is via a timeout, as it
>>      will never get its results.
>>
>> So even if I implemented 533, it wouldn't help, as the interleaving 
>> between CCL calls and RM messages for the same FQNs would lead to 
>> timeouts.
>>
>> Now, a possible solution to 1103 is that we make the CCL call 
>> *without* holding a lock. When we get the result(s), only *then* do 
>> we acquire a lock and update the FQN. We also need to check whether 
>> FQN-A was updated in the meantime and then decide which value to 
>> return (the value set by the RM or the one gotten from the CCL call).
>>
>> WDYT ?
>>
>> -- 
>> Bela Ban
>> Lead JGroups / JBoss Clustering team
>> JBoss - a division of Red Hat
>
> -- 
> Manik Surtani
>
> Lead, JBoss Cache
> JBoss, a division of Red Hat
>
>
>

-- 
Bela Ban
Lead JGroups / JBoss Clustering team
JBoss - a division of Red Hat