[jbosscache-dev] ClusteredCacheLoader deadlocks and JBCACHE-1103

Fri Jun 15 03:09:25 EDT 2007

I looked at 1103 in a bit more detail and concluded that the change in 
JGroups (http://jira.jboss.com/jira/browse/JGRP-533) would not help. The 
underlying issue is that (a) 2.4.1 has a single incoming request queue 
and (b) the ClusteredCacheLoader holds a lock while making a 
cluster-wide call. Let's look at an example:

   1. The CCL acquires a lock on Fqn-A and makes a cluster-wide call
   2. We get a replication message for A (RM-A), so someone made an
      update to A and is now trying to commit the change. RM-A tries to
      acquire the lock on A, but is blocked because the CCL holds it.
   3. Now a result for the CCL call arrives. It cannot be processed
      (single queue) until RM-A has been processed, but RM-A cannot be
      processed until the CCL call completes and releases the lock on
      A. So the only way for the CCL call to complete is via a timeout,
      as it will never get its results.

So even if I implemented 533, it wouldn't help, as the interleaving 
between CCL calls and RM messages for the same FQNs would lead to timeouts.
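
To make the interleaving concrete, here is a minimal sketch in plain Java.
It is not JGroups or JBoss Cache code: the single-threaded executor only
stands in for the single incoming request queue, the ReentrantLock for the
lock on Fqn-A, and all class and method names are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the scenario above; not real JBoss Cache classes.
class SingleQueueDeadlockSketch {

    /**
     * Simulates one CCL call. Returns true if the response was delivered
     * before the timeout, false if the call had to time out.
     */
    static boolean responseArrives(boolean holdLockDuringCall, long timeoutMillis) {
        ReentrantLock lockOnA = new ReentrantLock();
        // One thread, FIFO order: models the single incoming request queue.
        ExecutorService incomingQueue = Executors.newSingleThreadExecutor();
        CountDownLatch response = new CountDownLatch(1);

        if (holdLockDuringCall) {
            lockOnA.lock();                      // step 1: CCL locks Fqn-A
        }
        try {
            // Step 2: RM-A arrives first and needs the lock on A.
            incomingQueue.execute(() -> {
                lockOnA.lock();                  // blocks while the CCL holds it
                lockOnA.unlock();
            });
            // Step 3: the CCL response is queued behind RM-A.
            incomingQueue.execute(response::countDown);

            // The CCL waits for its response (still holding the lock, if any).
            return response.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (holdLockDuringCall) {
                lockOnA.unlock();                // only now can RM-A proceed
            }
            incomingQueue.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("lock held:     " + responseArrives(true, 200));
        System.out.println("lock not held: " + responseArrives(false, 2000));
    }
}
```

With the lock held, the response sits behind the blocked RM-A and the wait
can only end by timeout. Without the lock, RM-A runs straight through and
the response is delivered.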

Now, a possible solution to 1103 is that we make the CCL call *without* 
holding a lock. When we get the result(s), only *then* do we acquire a 
lock and update the FQN. We also need to check whether Fqn-A was updated 
in the meantime and then decide which value to return (the value set by 
the RM or the one obtained from the CCL call).
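
A rough sketch of that idea, again in plain Java with made-up names. The
version counter used to detect an intervening update, and the choice to
prefer the replicated value when one arrived, are assumptions for the sake
of the example. Which value to return is exactly the open question.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical node holder; not the real JBoss Cache node or loader API.
class UnlockedRemoteLoadSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private Object value;
    private long version;   // assumed: bumped on every replicated update

    /** What a replication message (RM) would do: lock, update, unlock. */
    void applyReplicatedUpdate(Object newValue) {
        lock.lock();
        try {
            value = newValue;
            version++;
        } finally {
            lock.unlock();
        }
    }

    /** CCL-style load: the cluster-wide call runs without the lock. */
    Object load(Supplier<Object> clusterWideCall) {
        long seen;
        lock.lock();
        try {
            seen = version;                 // remember the state before the call
        } finally {
            lock.unlock();
        }

        Object remote = clusterWideCall.get();   // no lock held here

        lock.lock();                        // only now acquire the lock
        try {
            if (version != seen) {
                // An RM updated the node in the meantime.
                // Assumed policy: the replicated value wins.
                return value;
            }
            value = remote;
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```

Because no lock is held during the cluster-wide call, an RM for the same
FQN can be processed while the CCL is waiting, so the response queued
behind it is no longer stuck.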

WDYT ?

-- 
Bela Ban
Lead JGroups / JBoss Clustering team
JBoss - a division of Red Hat