Ok, this seems to work, making things a lot more 'correct'.
But before I roll this into an official release and start making changes
en masse, porting this to 1.4.x and 2.0.0, I'd like to step back and think
about whether this is what we really want. Here is what I've
effectively done with 1.3.0.SP4, all related to pessimistic locking only:
a) Added a mechanism for not removing nodes when remove() is
called, instead storing them in a map which concurrent threads can
reference and attempt locks on. (A mutated version of Brian's
original fix to JBCACHE-871.)
b) When locking nodes in PLI.lock(), added a mechanism to obtain
a WL on a node if the next node after it needs to be created or
removed. (JBCACHE-875)
c) Modified PLI.lock() to start with Fqn.ROOT rather than
Fqn.get(0), which applies the same cache-wide locking behaviour to the root
as well. Prior to this, the root was never locked for
anything. (A sketch of the resulting lock loop follows below.)
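To make b) and c) concrete, the lock loop now looks roughly like the
following. This is only an illustrative sketch, not the actual PLI code;
acquireReadLock(), acquireWriteLock(), needsCreateOrRemove(),
getOrCreateChild() and the READ/WRITE constants are hypothetical stand-ins
for the real interceptor plumbing:

void lock(Fqn fqn, Object owner, int lockTypeRequested) throws InterruptedException
{
   DataNode current = root;            // c) start at Fqn.ROOT, so the root is locked too
   for (int i = 0; i <= fqn.size(); i++)
   {
      boolean lastNode = (i == fqn.size());
      // b) the parent gets a WL if the next node must be created or removed
      boolean childNeedsWrite = !lastNode && needsCreateOrRemove(current, fqn.get(i));
      if (childNeedsWrite || (lastNode && lockTypeRequested == WRITE))
         acquireWriteLock(current, owner);
      else
         acquireReadLock(current, owner);
      if (!lastNode) current = getOrCreateChild(current, fqn.get(i), owner);
   }
}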
The likely implications of these changes, made for the sake of accuracy
and correctness, are:
1) Performance impact on inspecting nodes in b) to decide on
whether WLs are needed
2) Memory impact on maintaining a map of removed nodes in
a)
3) Reduced concurrency due to overall stronger locks in b)
4) Much reduced concurrency because of the locking in c)
5) Potential for more deadlocks/timeouts because of 3) and 4) above.
Of the above, 5) manifests itself in a few unit tests that have now
started to fail (TxCacheLoaderTest, some state transfer tests, etc.).
A simple example, taken from one of the failing tests, leads to a
deadlock:
1: mgr.begin();
2: Transaction tx = mgr.getTransaction();
3: cache1.put("/one/two/three", "key1", "val1");
4: assertNull(cache2.get("/one/two/three", "key1"));
5: tx.commit();
Line 3 obtains a WL on "/" on cache1, for GTX 1
Line 4 obtains a WL on "/" on cache2, for GTX 2
Line 5, on replication, tries to get a WL on "/" on cache2, for GTX 1.
Both GTXs relate to the same TX since they are in the same thread.
Boom, deadlock.
One more thing here though; in my opinion, this is another bug in the
original PLI:
When doing a get on a node that doesn't exist, intermediate nodes are
created. E.g., cache2.get("/one/two/three", "key1") actually ends
up creating /one/two/three first, and after the JBCACHE-875 fix, /, /one and
/one/two will be WL'd for a get() on a nonexistent node!! Shouldn't
the loop just be short-circuited so that at any point, if the next node
does not exist and the lock_type requested is READ, we just return
null? That would save us a whole bunch of unnecessary WLs.
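For illustration, the read path could look something like this (again a
hypothetical sketch with made-up helper names, not the real code):

// Short-circuit for reads: if the next node does not exist there is
// nothing to create and hence nothing to write-lock.
boolean lockForRead(Fqn fqn, Object owner) throws InterruptedException
{
   DataNode current = root;
   for (int i = 0; i < fqn.size(); i++)
   {
      acquireReadLock(current, owner);
      DataNode next = current.getChild(fqn.get(i));
      if (next == null) return false;  // caller returns null; no nodes created, no WLs
      current = next;
   }
   acquireReadLock(current, owner);
   return true;
}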
Sorry about the long and rambling email. Thoughts and
opinions?
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Telephone: +44 7786 702 706
Yahoo/AIM/Skype: maniksurtani
On 27 Nov 2006, at 10:16, Manik Surtani wrote:
Ok, take away the crap about it being a bug in the
util.concurrent code. It's a bug in JBC, specifically on line 75 in
TreeCache.java:
protected DataNode root = NodeFactory.getInstance().createDataNode(
      NodeFactory.NODE_TYPE_TREENODE, SEPARATOR,
      Fqn.fromString(SEPARATOR), null, null, this);
:-) The root node is initialised when new TreeCache() is
called, well before isolation levels, etc. are set, which causes the root
node to be created with an isolation level of NONE. Hence the insane
behaviour when trying to contend for write locks on the root node.
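The fix is basically an ordering one: don't build the root in a field
initialiser, build it once configuration is complete. Roughly (a
simplified sketch, not the actual patch):

// Before: the field initialiser runs inside new TreeCache(), before any
// isolation level is set, so root's IdentityLock is created with NONE.
// After: defer root creation until the cache is configured, e.g.:
public void start() throws Exception
{
   root = NodeFactory.getInstance().createDataNode(
         NodeFactory.NODE_TYPE_TREENODE, SEPARATOR,
         Fqn.fromString(SEPARATOR), null, null, this);
   // root's lock now picks up the configured isolation level
}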
Just fixed this, running a bunch of regressions now.
Cheers,
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Telephone: +44 7786 702 706
Yahoo/AIM/Skype: maniksurtani
On 26 Nov 2006, at 19:04, Bela Ban wrote:
Forwarding to the entire group.

Manik Surtani wrote:
Ok, boiled it down to a contention issue on locking
Fqn.ROOT, which prior to JBCACHE-875 was never locked, either for
reading or writing. (I do this by changing the loop in the
lock() method in PLI to first consider the root before the individual
Fqn elements.) (This problem is also apparent in
o.j.c.transaction.DeadlockTest on a multi-CPU box.)
2006-11-26 14:58:09,566 DEBUG [Node] (Thread-2) acquiring WL: fqn=/, caller=GlobalTransaction:<null>:2, lock=<unlocked>
<snip>
2006-11-26 14:58:09,572 DEBUG [Node] (Thread-3) acquiring WL: fqn=/, caller=GlobalTransaction:<null>:3, lock=<unlocked>
<snip>
2006-11-26 14:58:09,576 DEBUG [Node] (Thread-2) acquired WL: fqn=/, caller=GlobalTransaction:<null>:2, lock=write owner=GlobalTransaction:<null>:2
<snip>
2006-11-26 14:58:09,581 INFO [TxInterceptor] (Thread-3) There was a problem handling this request
java.lang.IllegalStateException: there is already a writer holding the lock: GlobalTransaction:<null>:2 and caller is GlobalTransaction:<null>:3
   at org.jboss.cache.lock.LockMap.setWriterIfNotNull(LockMap.java:101)
   at org.jboss.cache.lock.IdentityLock.acquireWriteLock(IdentityLock.java:187)
   at org.jboss.cache.Node.acquireWriteLock(Node.java:557)
   at org.jboss.cache.Node.acquire(Node.java:517)
<snip - lots>
2006-11-26 14:58:09,850 DEBUG [Node] (Thread-2) created child: fqn=/, child_name=NODE
I can't understand why concurrent WL acquisition in
IdentityLock.acquireWriteLock() behaves correctly for all nodes except
the root node. As you can see in the log snippet above, both
Thread-2 and Thread-3 call IdentityLock.acquireWriteLock (line 178),
get back 'true', and then one of the threads causes an exception on
line 187.
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Telephone: +44 7786 702 706
Yahoo/AIM/Skype: maniksurtani
On 26 Nov 2006, at 13:54, Manik Surtani wrote:
I didn't want to acquire the WL immediately
since it involved an additional test to check if the next node in
the fqn needed creation. But I went with that algorithm
in the end since the locks had problems with concurrent readers
attempting to upgrade to writers.
So most of the regressions pass, as well as the new tests
introduced, and I am very close to something working, EXCEPT one
very strange problem with the IdentityLock
and ConcurrentCreationDeadlockTest.testLocalCacheLoader2Modifications()
- get the latest on the 1.3.0 branch for this to make any
sense. The problem is between lines 178 and 187 of
IdentityLock, in the acquireWriteLock() method.
Attempting to get hold of a write lock returns true, but
setting the writer throws an exception since another writer
exists. Odd that this happens, since the calling thread should
have the semaphore by then; also odd that this only seems to happen
in this one test, which is meant to test concurrency in the
CacheLoaderInterceptor.
I'm still investigating, but if you have any ideas about how
and why this may happen, I'd love to hear it ... :-)
Cheers,
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Telephone: +44 7786 702 706
Yahoo/AIM/Skype: maniksurtani
On 24 Nov 2006, at 15:25, Bela Ban wrote:
Manik Surtani wrote:
On 24 Nov 2006, at 14:44, Bela Ban wrote:
The first one you mentioned can lead to race conditions,
depending on whether the upgrade on b or the
creation/WL on c happens first. What I've implemented is
more like:
1: Acquire RL on a
2: Acquire RL on b
3: Identify that we need to create c.
4: Upgrade RL on b to WL
5: *now* create c, and acquire WL on it.
So there is a possibility that step 4 may block until other
readers on b release their locks, but no one else could grab the
WL since the current TX will have a RL (sketched below).
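In code terms, the sequence above is roughly this (illustrative only,
with hypothetical helper names):

acquireReadLock(a, tx);           // 1: RL on a
acquireReadLock(b, tx);           // 2: RL on b
if (b.getChild("c") == null)      // 3: identify that c needs creating
{
   upgradeToWriteLock(b, tx);     // 4: may block until other readers of b
                                  //    leave, but our RL stops any other TX
                                  //    from grabbing the WL first
   DataNode c = b.createChild("c");
   acquireWriteLock(c, tx);       // 5: WL on the newly created c
}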
I see. Why don't you acquire a WL
on b (step 2) *immediately* rather than going through the upgrade,
if you know you have to acquire a WL later anyway? You might
still deadlock (a small demo follows below):
2: acquire RL on b
(in the meantime): some other TX acquires a RL on b, possibly upgrades to WL
3: you want to acquire a WL on b and block on the other TX's RL or WL
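That upgrade race is easy to demonstrate with a stock JDK read-write lock
(java.util.concurrent here, purely for illustration; not the
util.concurrent classes JBC actually uses): two threads each take the
read lock on b and then both ask for the write lock, so each blocks
forever waiting for the other reader to leave.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class UpgradeDeadlock
{
   public static void main(String[] args)
   {
      final ReentrantReadWriteLock b = new ReentrantReadWriteLock();
      final CountDownLatch bothReading = new CountDownLatch(2);
      Runnable tx = new Runnable()
      {
         public void run()
         {
            b.readLock().lock();          // 2: acquire RL on b
            bothReading.countDown();
            try { bothReading.await(); }  // wait until both TXs hold RLs
            catch (InterruptedException e) { return; }
            b.writeLock().lock();         // 3: blocks forever on the other RL
            System.out.println("never reached");
         }
      };
      new Thread(tx, "TX-1").start();
      new Thread(tx, "TX-2").start();
   }
}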
--
Bela Ban
Lead JGroups / JBoss Clustering team
JBoss - a division of Red Hat
--
Bela Ban
Lead JGroups / JBoss Clustering team
JBoss - a division of Red Hat