Come to think of it, on the other side of the coin, the correctness of
ROOT node locking only has an impact if you are manipulating an fqn
directly under the root, e.g., "/a", correct? So as long as we advise
users not to do this, the previous "defect" would still be acceptable.
So in that regard, until we come up with a multiple-root-access
approach, why not just document this? My concern is that if someone
does need to lock ROOT, performance will be pretty bad.
But of course, avoiding locking ROOT means never operating on "/a",
i.e., always using multi-level fqns. So there goes my argument then.
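For illustration, the distinction in question, using the JBC 1.x
TreeCache API (the fqns here are made up):

    import org.jboss.cache.TreeCache;

    public class RootContentionExample {
        public static void main(String[] args) throws Exception {
            TreeCache cache = new TreeCache();
            cache.startService();
            // "/a" is a direct child of ROOT: creating or removing it
            // is exactly the case where correct root locking matters
            cache.put("/a", "key", "value");
            // a multi-level fqn keeps add/remove churn away from the
            // level directly below the root
            cache.put("/app/users/42", "name", "val");
            cache.stopService();
        }
    }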
________________________________
From: Brian Stansberry
Sent: Wednesday, November 29, 2006 7:11 AM
To: Manik Surtani; Ben Wang; Bela Ban; Vladimir Blagojevic; Galder Zamarreno; jbosscache-dev@lists.jboss.org
Subject: RE: Fundamental problem with pessimistic locking
Hmm; thinking out loud....
If you're not doing lots of adds/removes to the level below root, it's
not a big cost, right?
And if you are doing lots of adds/removes at that level, it's probably
important to have the correct locking behavior.
So, seems OK to me.
________________________________
From: Manik Surtani [mailto:manik@jboss.org]
Sent: Tuesday, November 28, 2006 4:58 PM
To: Ben Wang; Bela Ban; Brian Stansberry; Vladimir Blagojevic; Galder Zamarreno; jbosscache-dev@lists.jboss.org
Subject: Re: Fundamental problem with pessimistic locking
So we still haven't discussed my biggest concern here, which is
item 5) below in the list of implications. Is this performance
penalty, along with the potential for deadlocks, a small enough price
to pay for the correctness of concurrent access on the root node?
What do people think?
________________________________
From: Manik Surtani [mailto:manik@jboss.org]
Sent: Monday, November 27, 2006 7:19 PM
To: Manik Surtani
Cc: Bela Ban; Ben Wang; Brian Stansberry; Vladimir Blagojevic; Galder Zamarreno
Subject: Re: Fundamental problem with pessimistic locking
Ok, this seems to work, making things a lot more 'correct'. But
before I roll this into an official release and start making changes
en masse, porting this to 1.4.x and 2.0.0, I'd like to step back and
think about whether this is what we really want. Here is what I've
effectively done with 1.3.0.SP4, all related to pessimistic locking
only:
a) Added a mechanism for not removing nodes when remove() is called,
instead storing them in a map so that concurrent threads can still
reference them and attempt locks on them. (A mutated version of
Brian's original fix to JBCACHE-871)
b) When locking nodes in PLI.lock(), added a mechanism to obtain a WL
on a node if the next node after it needs to be created or removed.
(JBCACHE-875; see the sketch below)
c) Modified PLI.lock() to start with Fqn.ROOT rather than Fqn.get(0),
which applies the same cache-wide locking behaviour to the root as
well. Prior to this, the root was never locked for anything.
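To make (b) and (c) concrete, here is a rough sketch of the
traversal; the types and method names are made-up stand-ins for the
real JBC internals, not the actual code:

    import java.util.List;

    // made-up stand-in for the real node/lock types
    interface LockableNode {
        LockableNode getChild(String name); // null if absent
        LockableNode createChild(String name);
        void acquireReadLock(Object owner) throws InterruptedException;
        void acquireWriteLock(Object owner) throws InterruptedException;
    }

    class LockTraversal {
        // Walks from the root itself (change c). A node gets a WL
        // when the next element of the fqn must be created (change b;
        // the removal case is elided here), or when it is the target
        // of a write operation; otherwise it gets an RL.
        static LockableNode lock(LockableNode root, List<String> fqn,
                                 Object owner, boolean writeIntent)
                throws InterruptedException {
            LockableNode current = root;
            for (int i = 0; i <= fqn.size(); i++) {
                boolean isTarget = (i == fqn.size());
                boolean nextNeedsCreation = !isTarget
                        && current.getChild(fqn.get(i)) == null;
                if (nextNeedsCreation || (isTarget && writeIntent)) {
                    current.acquireWriteLock(owner);
                } else {
                    current.acquireReadLock(owner);
                }
                if (isTarget) break;
                current = nextNeedsCreation
                        ? current.createChild(fqn.get(i))
                        : current.getChild(fqn.get(i));
            }
            return current;
        }
    }

This only shows where the read and write locks land; the
removed-nodes map from (a) is a separate piece of bookkeeping.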
The possible implications of these changes, made for the sake of
accuracy and correctness, are:
1) A performance impact from inspecting nodes in b) to decide whether
WLs are needed
2) A memory impact from maintaining a map of removed nodes in a)
3) Reduced concurrency due to overall stronger locks in b)
4) Much reduced concurrency because of the locking in c)
5) The potential for more deadlocks/timeouts because of 3) and 4)
above.
Of the above, 5) manifests itself in a few unit tests that have now
started to fail (TxCacheLoaderTest, some state transfer tests, etc.).
A simple example, taken from one of the failing tests, leads to a
deadlock:
1: mgr.begin();
2: Transaction tx = mgr.getTransaction();
3: cache1.put("/one/two/three", "key1", "val1");
4: assertNull(cache2.get("/one/two/three", "key1"));
5: tx.commit();
Line 3 obtains a WL on "/" on cache1, for GTX 1
Line 4 obtains a WL on "/" on cache2, for GTX 2
Line 5, on replication, tries to get a WL on "/" on
cache2, for GTX 1
Both GTXs relate to the same TX since they are in the
same thread. Boom, deadlock.
One more thing here though, which in my opinion is another bug in the
original PLI:
When doing a get() on a node that doesn't exist, intermediate nodes
are created. E.g., cache2.get("/one/two/three", "key1") actually ends
up creating /one/two/three first, and after the JBCACHE-875 fix, /,
/one and /one/two will be WL'd for a get() on a nonexistent node!!
Shouldn't the loop just be short-circuited such that, at any point,
if the next node does not exist and the lock_type requested is READ,
we simply return null? That would save us a whole bunch of
unnecessary WLs ...
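A minimal sketch of that short-circuit, reusing the made-up
LockableNode interface from the sketch earlier in this mail (it would
sit alongside LockTraversal.lock()):

    // read-only traversal: stop as soon as a child is missing,
    // instead of creating it under a WL on its parent
    static LockableNode lockForRead(LockableNode root,
                                    List<String> fqn, Object owner)
            throws InterruptedException {
        LockableNode current = root;
        current.acquireReadLock(owner);
        for (String element : fqn) {
            LockableNode child = current.getChild(element);
            if (child == null) {
                return null; // nothing to read: no intermediate
                             // nodes created, no WLs taken
            }
            child.acquireReadLock(owner);
            current = child;
        }
        return current;
    }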
Sorry about the long and rambling email. Thoughts and
opinions?
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Email: manik@jboss.org
Telephone: +44 7786 702 706
MSN: manik@surtani.org
Yahoo/AIM/Skype: maniksurtani
On 27 Nov 2006, at 10:16, Manik Surtani wrote:
Ok, forget the crap about it being a bug in the util.concurrent code.
It's a bug in JBC, specifically on line 75 of TreeCache.java:
    protected DataNode root = NodeFactory.getInstance().createDataNode(
            NodeFactory.NODE_TYPE_TREENODE, SEPARATOR,
            Fqn.fromString(SEPARATOR), null, null, this);
:-) The root node is initialised when new TreeCache() is called, well
before isolation levels, etc. are set, which causes the root node to
be created with an isolation level of NONE. Hence the insane
behaviour when trying to contend for write locks on the root node.
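Purely to illustrate the shape of the fix (I haven't checked this
against the actual commit): the root needs to be (re)built once the
isolation level is known, rather than only at field-initialisation
time. Hypothetically, inside TreeCache:

    // hypothetical: recreate the root after configuration, so its
    // lock picks up the configured isolation level rather than the
    // default NONE it got in the field initialiser
    public void setIsolationLevel(IsolationLevel level) {
        this.isolationLevel = level;
        root = NodeFactory.getInstance().createDataNode(
                NodeFactory.NODE_TYPE_TREENODE, SEPARATOR,
                Fqn.fromString(SEPARATOR), null, null, this);
    }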
Just fixed this, running a bunch of regressions
now.
Cheers,
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Email: manik@jboss.org
Telephone: +44 7786 702 706
MSN: manik@surtani.org
Yahoo/AIM/Skype: maniksurtani
On 26 Nov 2006, at 19:04, Bela Ban wrote:
Forwarding to the entire group
Manik Surtani wrote:
Ok, I've boiled it down to a contention issue on locking Fqn.ROOT,
which, prior to JBCACHE-875, was never locked, either for reading or
writing. (I do this by changing the loop in the lock() method in PLI
to first consider the root before the individual Fqn elements.) (This
problem is also apparent in o.j.c.transaction.DeadlockTest on a
multi-cpu box.)
2006-11-26 14:58:09,566 DEBUG [Node] (Thread-2) acquiring WL: fqn=/, caller=GlobalTransaction:<null>:2, lock=<unlocked>
<snip>
2006-11-26 14:58:09,572 DEBUG [Node] (Thread-3) acquiring WL: fqn=/, caller=GlobalTransaction:<null>:3, lock=<unlocked>
<snip>
2006-11-26 14:58:09,576 DEBUG [Node] (Thread-2) acquired WL: fqn=/, caller=GlobalTransaction:<null>:2, lock=write owner=GlobalTransaction:<null>:2
<snip>
2006-11-26 14:58:09,581 INFO [TxInterceptor] (Thread-3) There was a problem handling this request
java.lang.IllegalStateException: there is already a writer holding the lock: GlobalTransaction:<null>:2 and caller is GlobalTransaction:<null>:3
    at org.jboss.cache.lock.LockMap.setWriterIfNotNull(LockMap.java:101)
    at org.jboss.cache.lock.IdentityLock.acquireWriteLock(IdentityLock.java:187)
    at org.jboss.cache.Node.acquireWriteLock(Node.java:557)
    at org.jboss.cache.Node.acquire(Node.java:517)
<snip - lots>
2006-11-26 14:58:09,850 DEBUG [Node] (Thread-2) created child: fqn=/, child_name=NODE
I can't understand why concurrent WL acquisition in
IdentityLock.acquireWriteLock() behaves correctly for all nodes
except the root node. As you can see in the log snippet above, both
Thread-2 and Thread-3 call IdentityLock.acquireWriteLock (line 178)
and get a 'true', and one of the threads then triggers an exception
on line 187.
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Email: manik@jboss.org
Telephone: +44 7786 702 706
MSN: manik@surtani.org
Yahoo/AIM/Skype: maniksurtani
On 26 Nov 2006, at 13:54, Manik Surtani wrote:
I didn't want to acquire the WL
immediately since it involved an additional test to check if the next
node in the fqn needed creation. But I went with that algorithm in the
end since the locks had problems with concurrent readers attempting to
upgrade to writers.
So most of the regressions pass, as well as the new tests introduced,
and I am very close to something working, EXCEPT for one very strange
problem with the IdentityLock and
ConcurrentCreationDeadlockTest.testLocalCacheLoader2Modifications() -
get the latest on the 1.3.0 branch for this to make any sense. The
problem is between lines 178 and 187 of IdentityLock, in the
acquireWriteLock() method.
Attempting to get hold of the write lock returns true, but setting
the writer throws an exception since another writer exists. It's odd
that this happens, since the calling thread should have the semaphore
by then; it's also odd that this only seems to happen in this one
test, which is meant to test concurrency in the
CacheLoaderInterceptor.
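For what it's worth, a toy illustration (not the real IdentityLock)
of how "acquire returned true" can coexist with "another writer
exists": if the underlying semaphore excludes nobody, which, as it
turned out further up this thread, is exactly what an isolation level
of NONE amounts to, then two threads can both pass step 1 and the
second one fails the writer check in step 2:

    // toy model of the failure mode, for illustration only
    final class BrokenLock {
        private Object writer;

        boolean tryAcquire() {
            return true; // NONE isolation: always "succeeds",
                         // excludes nobody
        }

        synchronized void setWriterIfNotNull(Object owner) {
            if (writer != null && writer != owner) {
                throw new IllegalStateException(
                        "there is already a writer holding the lock: "
                        + writer + " and caller is " + owner);
            }
            writer = owner;
        }

        void acquireWriteLock(Object owner) {
            if (tryAcquire()) {            // step 1: both threads
                                           // get 'true'
                setWriterIfNotNull(owner); // step 2: second thread
                                           // throws
            }
        }
    }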
I'm still investigating, but if you have
any ideas about how and why this may happen, I'd love to hear it ... :-)
Cheers,
--
Manik Surtani
Lead, JBoss Cache
JBoss, a division of Red Hat
Email: manik@jboss.org
Telephone: +44 7786 702 706
MSN: manik@surtani.org
Yahoo/AIM/Skype: maniksurtani
On 24 Nov 2006, at 15:25, Bela Ban wrote:
Manik Surtani wrote:
On 24 Nov 2006, at 14:44, Bela Ban wrote:
The first one you mentioned can lead to race conditions, depending on
whether the upgrade on b or the creation/WL on c happens first. What
I've implemented is more like:
1: Acquire RL on a
2: Acquire RL on b
3: Identify that we need to create c.
4: Upgrade RL on b to WL
5: *now* create c, and acquire WL on it.
So there is a possibility that step 4
may block until other readers on b release their locks, but no one else
could grab the WL since the current TX will have a RL.
I see. Why don't you acquire a WL on b (step 2) *immediately*, rather
than going through the upgrade, if you know you have to acquire a WL
later anyway? You might still deadlock:
2: acquire RL on b
(in the meantime): some other TX acquires an RL on b, possibly
upgrades to WL
3: you want to acquire a WL on b and block on the other TX's RL or WL
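To spell the deadlock out with a toy example (the lock API here is
made up, not JBC's): if two TXs both hold an RL on b and both then
try to upgrade, each blocks waiting for the other reader to release,
and neither ever can:

    // made-up upgradable lock, for illustration only
    interface UpgradableLock {
        void readLock(Object tx) throws InterruptedException;
        // blocks until all *other* readers have released
        void upgradeToWriteLock(Object tx) throws InterruptedException;
    }

    class UpgradeDeadlockDemo {
        static void createChildOfB(UpgradableLock a, UpgradableLock b,
                                   Object tx)
                throws InterruptedException {
            a.readLock(tx);           // step 1: RL on a
            b.readLock(tx);           // step 2: RL on b
            // step 3: discover that c must be created under b
            b.upgradeToWriteLock(tx); // step 4: if another TX holding
                                      // an RL on b upgrades at the
                                      // same time, both block here
                                      // forever
            // step 5: create c under b and acquire a WL on it
        }
    }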
--
Bela Ban
Lead JGroups / JBoss Clustering team
JBoss - a division of Red Hat
--
Bela Ban
Lead JGroups / JBoss Clustering team
JBoss - a division of Red Hat