[jbosscache-dev] Re: Fundamental problem with pessimistic locking
Bela Ban
bela at jboss.org
Mon Nov 27 12:18:52 EST 2006
Manik Surtani wrote:
> Ok, this seems to work, making things a lot more 'correct'. But
> before I roll this into an official release and start making changes
> en masse, porting this to 1.4.x and 2.0.0, I'd like to step back and
> think about whether this is what we really want. Here is what I've
> effectively done with 1.3.0.SP4, all related to pessimistic locking only:
>
> a) Added a mechanism for not removing nodes when remove() is called,
> and instead storing them in a map which can be referenced by
> concurrent threads and locks attempted. (Mutated version of Brian's
> original fix to JBCACHE-871)
> b) When locking nodes in PLI.lock(), added a mechanism to obtain a WL
> on a node if the next node after it needs to be created or removed.
> (JBCACHE-875)
> c) Modified PLI.lock() to start with Fqn.ROOT rather than Fqn.get(0),
> which applies the same cache-wide locking behaviour to the root as
> well. Prior to this, the root never was locked for anything.
>
> The implications of these, for the sake of accuracy and correctness,
> are possibly:
>
> 1) Performance impact on inspecting nodes in b) to decide on whether
> WLs are needed
> 2) Memory impact on maintaining a map of removed nodes in a)
- Can't we simply mark a node as REMOVED (a special attribute in that
node's hashmap), so we don't need an additional list?
- What happens if you remove an entire subtree (2000 nodes) in a TX? Is
the current list a list of Fqns, or a copy of the nodes?
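To make the first question concrete, here is a minimal sketch of the "mark as REMOVED" idea. The `Node` class, the marker key `_JBOSS_INTERNAL_REMOVED`, and the method names are all hypothetical illustrations, not the actual JBoss Cache API: the point is only that a tombstone attribute lets concurrent threads still find and lock the node, with no separate map of removed nodes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: instead of moving removed nodes into a side map,
// flag them in place with a reserved attribute key (a "tombstone").
class Node {
    // Hypothetical reserved key; not a real JBoss Cache identifier.
    static final String REMOVED = "_JBOSS_INTERNAL_REMOVED";
    final Map<String, Object> data = new ConcurrentHashMap<String, Object>();

    void markRemoved() { data.put(REMOVED, Boolean.TRUE); }
    boolean isRemoved() { return Boolean.TRUE.equals(data.get(REMOVED)); }
}

public class TombstoneDemo {
    public static void main(String[] args) {
        Node n = new Node();
        n.data.put("key1", "val1");
        n.markRemoved(); // remove() would flag the node instead of detaching it
        // The node is still reachable, so concurrent threads can attempt locks on it.
        System.out.println(n.isRemoved());
    }
}
```

For the subtree question, this approach would touch each of the 2000 nodes once to set the flag, but would copy nothing and allocate no per-transaction list.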
> 3) Reduced concurrency due to overall stronger locks in b)
> 4) Much reduced concurrency because of the locking in c)
Can we resume the discussion initiated by Brian Dueck some time ago
about providing the concept of 'multiple roots'? We could create N
buckets, and the hashcode of the first element of an Fqn selects the
bucket (subtree) into which the Fqn goes. The N buckets are NEVER
removed, but they are not visible to the user. Therefore, if we have to
lock the root, we never actually lock "/", but e.g. bucket #24.
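A minimal sketch of that bucket selection, under the assumption that the first path element of the Fqn is hashed into one of N hidden buckets (N = 64 here is an arbitrary choice, and `bucketFor` is a hypothetical helper, not existing JBoss Cache code). Fqns sharing a first element always land in the same bucket, so a "root" lock only ever serializes one bucket's subtree.

```java
// Hypothetical sketch of the 'multiple roots' idea: hash the first Fqn
// element into one of N invisible buckets, so a cache-wide lock hits one
// bucket subtree rather than the real "/".
public class BucketRoots {
    static final int N = 64; // number of hidden buckets (assumption)

    // Pick the bucket index for a string Fqn like "/one/two/three".
    static int bucketFor(String fqn) {
        String first = fqn.split("/")[1];      // first path element, e.g. "one"
        return Math.abs(first.hashCode() % N); // stable bucket index in [0, N)
    }

    public static void main(String[] args) {
        // Same first element => same bucket, so related Fqns stay together.
        System.out.println(bucketFor("/one/two/three") == bucketFor("/one/x"));
        int b = bucketFor("/one/a");
        System.out.println(b >= 0 && b < N);
    }
}
```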
> 5) Potential of more deadlocks/timeouts because of 3) and 4) above.
>
> Of the above, 5) manifests itself in a few unit tests that have now
> started to fail (TxCacheLoaderTest, some state transfer tests, etc.).
> Simple example, taken from one of the failing tests, leads to a deadlock:
>
> 1: mgr.begin();
> 2: Transaction tx=mgr.getTransaction();
> 3: cache1.put("/one/two/three", "key1", "val1");
> 4: assertNull(cache2.get("/one/two/three", "key1"));
> 5: tx.commit();
>
> Line 3 obtains a WL on "/" on cache1, for GTX 1
> Line 4 obtains a WL on "/" on cache2, for GTX 2
> Line 5, on replication, tries to get a WL on "/" on cache2, for GTX 1
>
> Both GTXs relate to the same TX since they are in the same thread.
> Boom, deadlock.
>
> One thing here though, in my opinion, another bug in the original PLI:
>
> When doing a get on a node that doesn't exist, intermediate nodes are
> created. E.g., cache2.get("/one/two/three", "key1") actually ends up
> creating /one/two/three first, and after the JBCACHE-875 fix, /, /one
> and /one/two will be WL'd for a get() on a nonexistent node!!
> Shouldn't the loop just be short-circuited such that at any point, if
> the next node does not exist and the lock_type requested is READ, just
> return a null? Saves us a whole bunch of unnecessary WL's ...
Absolutely, this is a bug! We should never create nodes for a GET or a
REMOVE!
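The short-circuit Manik proposes can be sketched as follows. This is a simplified, hypothetical model of the lock-acquisition walk (the `Node`, `LockType`, and `lockPath` names are illustrative, not the real PLI code): when the requested lock is READ and the next path element does not exist, the walk returns null immediately instead of creating and write-locking intermediate nodes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed fix: a READ on a nonexistent path
// short-circuits to null; only a WRITE creates intermediate nodes.
public class ShortCircuitGet {
    enum LockType { READ, WRITE }

    static class Node {
        final Map<String, Node> children = new HashMap<String, Node>();
    }

    // Walk the path elements from the root, creating children only for WRITE.
    static Node lockPath(Node root, String[] path, LockType type) {
        Node current = root;
        for (String element : path) {
            Node child = current.children.get(element);
            if (child == null) {
                if (type == LockType.READ) {
                    return null; // short-circuit: nothing exists, nothing to lock
                }
                child = new Node();              // WRITE still creates the node
                current.children.put(element, child);
            }
            current = child;
        }
        return current;
    }

    public static void main(String[] args) {
        Node root = new Node();
        // A get() on a node that doesn't exist returns null...
        System.out.println(lockPath(root, new String[]{"one", "two"}, LockType.READ));
        // ...and, crucially, creates no intermediate nodes to write-lock.
        System.out.println(root.children.isEmpty());
    }
}
```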
--
Bela Ban
Lead JGroups / JBoss Clustering team
JBoss - a division of Red Hat