On 27 Nov 2006, at 17:18, Bela Ban wrote:
Manik Surtani wrote:
> Ok, this seems to work, making things a lot more 'correct'. But
> before I roll this into an official release and start making
> changes en masse, porting this to 1.4.x and 2.0.0, I'd like to
> step back and think about whether this is what we really want.
> Here is what I've effectively done with 1.3.0.SP4, all related to
> pessimistic locking only:
>
> a) Added a mechanism for not removing nodes when remove() is
> called, and instead storing them in a map which can be referenced
> by concurrent threads and locks attempted. (Mutated version of
> Brian's original fix to JBCACHE-871)
> b) When locking nodes in PLI.lock(), added a mechanism to obtain
> a WL on a node if the next node after it needs to be created or
> removed. (JBCACHE-875)
> c) Modified PLI.lock() to start with Fqn.ROOT rather than Fqn.get
> (0), which applies the same cache-wide locking behaviour to the
> root as well. Prior to this, the root never was locked for anything.
> The implications of these, for the sake of accuracy and
> correctness, are possibly:
>
> 1) Performance impact on inspecting nodes in b) to decide on
> whether WLs are needed
> 2) Memory impact on maintaining a map of removed nodes in a)
- Can't we simply mark a node as REMOVED (a special attribute in that
node's hashmap), so we don't need an additional list?
The map was the (seemingly) simpler solution at the time, although I
may move away from it. Marking a node as removed would require
clean-up code for nodes marked as such, either on transaction
completion or (in the non-tx case) as the call returns, probably as
part of the PLI. And I'd use a (transient) boolean member variable
rather than a special attribute; that seems cleaner.
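A minimal sketch of that alternative, assuming hypothetical names (Node, RemovalSweeper and their methods are illustrative only, not the actual JBoss Cache API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a node carries a transient "removed" flag, and a
// sweeper runs the deferred clean-up on tx completion (or, in the non-tx
// case, as the call returns).
class Node {
    private transient boolean removed;   // transient: not replicated or persisted

    void markRemoved()  { removed = true; }
    boolean isRemoved() { return removed; }
}

class RemovalSweeper {
    private final List<Runnable> cleanupActions = new ArrayList<>();

    void onRemove(Node node) {
        node.markRemoved();                          // logical removal only
        cleanupActions.add(() -> { /* physically unlink the node here */ });
    }

    void onCompletion() {                            // tx commit/rollback or call return
        cleanupActions.forEach(Runnable::run);
        cleanupActions.clear();
    }
}
```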
- What happens if you remove an entire subtree (2000 nodes) in a
TX? Is the current list a list of Fqns, or a copy of the nodes?
Child nodes are added to this map as well, since they are removed
and would lose their locks in the same way. The map is keyed on Fqn,
with the actual Node as the value, so locks etc. are not lost.
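Roughly, that mechanism could look like this (a sketch only; the Fqn is modeled as a plain String path and all names here are assumptions, not the real implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Removed nodes are parked in a map keyed on Fqn, with the Node object
// itself as the value, so its lock state survives and concurrent threads
// can still find the node and attempt its lock. Removing a subtree parks
// every child as well; the whole subtree is purged on completion.
class RemovedNodeMap {
    private final Map<String, Object> removed = new ConcurrentHashMap<>();

    void park(String fqn, Object node) { removed.put(fqn, node); }

    Object find(String fqn)            { return removed.get(fqn); }

    // purge the subtree rooted at rootFqn once the tx completes
    void purgeSubtree(String rootFqn) {
        removed.keySet().removeIf(
            f -> f.equals(rootFqn) || f.startsWith(rootFqn + "/"));
    }
}
```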
> 3) Reduced concurrency due to overall stronger locks in b)
> 4) Much reduced concurrency because of the locking in c)
Can we resume the discussion initiated by Brian Dueck some time
ago about providing the concept of 'multiple roots'? We could create
N buckets, and the hashcode of the first element of an Fqn selects
the bucket (subtree) into which that Fqn goes. The N buckets are
NEVER removed, but they are not visible to the user. Therefore, if
we have to lock the root, we never actually lock "/", but e.g.
bucket #24.
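The bucket idea could be sketched like this (bucket count, naming scheme and helper names are all assumptions for illustration, not an actual design):

```java
// N fixed, user-invisible buckets under the real root; the hashCode of
// the first Fqn element deterministically picks the bucket, so locking
// "the root" only ever locks one of N internal subtrees.
final class RootBuckets {
    static final int N = 64;  // assumed bucket count

    static int bucketFor(String firstElement) {
        // mask the sign bit so the modulo result is non-negative
        return (firstElement.hashCode() & 0x7fffffff) % N;
    }

    // e.g. "/a/b/c" would internally live under "/_bucket<i>/a/b/c"
    static String internalFqn(String userFqn) {
        String first = userFqn.split("/")[1];  // first path element
        return "/_bucket" + bucketFor(first) + userFqn;
    }
}
```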
Yes, this is something to revisit, but not for a patch on 1.3.0 or
even a micro release (1.4.1). It is similar in some ways to the
hashing I was suggesting the Hibernate guys use to improve
concurrency, except that we could do it implicitly.
> <SNIP>
>
> One thing here though, in my opinion, another bug in the original
> PLI:
>
> When doing a get on a node that doesn't exist, intermediate nodes
> are created. E.g., cache2.get("/one/two/three", "key1") actually
> ends up creating /one/two/three first, and after the JBCACHE-875
> fix, /, /one and /one/two will be WL'd for a get() on a
> nonexistent node!! Shouldn't the loop just be short-circuited
> such that at any point, if the next node does not exist and the
> lock_type requested is READ, just return a null? Saves us a whole
> bunch of unnecessary WL's ...
Absolutely, this is a bug! We should never create nodes for GET or
REMOVE!!
http://jira.jboss.com/jira/browse/JBCACHE-881
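The proposed short-circuit could look roughly like this (a sketch with assumed names, not the actual PLI code): walking the Fqn from the root, a READ that hits a missing child simply returns null, instead of creating intermediate nodes and write-locking them.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified lock walk: READ short-circuits on a missing child;
// only WRITE may create intermediate nodes along the path.
class LockWalk {
    enum LockType { READ, WRITE }

    static Node lock(Node root, String[] path, LockType type) {
        Node current = root;
        for (String element : path) {
            Node child = current.getChild(element);
            if (child == null) {
                if (type == LockType.READ) {
                    return null;   // nothing to read: no node creation, no WLs
                }
                child = current.createChild(element);
            }
            current = child;
        }
        return current;
    }
}

class Node {
    private final Map<String, Node> children = new HashMap<>();
    Node getChild(String name)    { return children.get(name); }
    Node createChild(String name) { return children.computeIfAbsent(name, n -> new Node()); }
}
```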