Problems with NodeLocking algorithm
Hi all
I?m fascinated with TreeCache and I think that this is a very good and powerful product.
I?m trying to use it in my application where many concurrent puts/removes occurs ? and
this is not easy task, because there are many unreasonable exceptions raised from
TreeCache under heavy-load.
Because my application must work 24/7 without any unreasonable exceptions raised form any
component I do some investigation and found, that node locking algorithm used by TreeCache
(when using Pessimistic locking scheme) is not good enough.
Main ?work unit? in TreeCache is Node. Nodes have hierarchical structure like folders on
disk. There is root node, root node has children, children have own children etc.
If operation needs modify data in node, lock must be acquired. Physically locks are a part
of Node.
If we put data into /a/b/c/d node (on clear cache), first lock on / (root) node must be
acquired. Then if / doesn?t contains a child, /a node must be created and lock must be
acquired etc.
If all required nodes from path are locked with correct lock type (RL, WL), target
operation can be invoked.
This is very simple and generally this works correctly in single-thread applications. But
what happens when 2 threads are trying work on this same node?
Scenario 1:
There is /a node in cache.
Thread-0 is trying put new value into /a node, Thread-1 is trying remove /a node.
Both threads are in PessimistickLockInterceptor.lock() method, both treads have this same
reference to /a node.
Thread-1 is acquiring WL lock on /a node and wins, Thread-0 is waiting for WL on /a node.
Now Thread-1 removes /a node from cache and until now / doesn?t contains /a node, but
Thread-0 is still waiting for WL to /a node. Thread-1 commits transaction and release lock
on /a node. Thread-0 acquired lock on /a. Because /a dosen?t exists in cache (was removed
by Thread-1) this lock is invalid. This situation is handled by treecache properly,
because he checks after lock acquisition if node still exists in cache
| do{
| lock(fqn, ctx.getGlobalTransaction(), lock_type, recursive, zeroLockTimeout ? 0 :
lock_timeout, createIfNotExists, storeLockedNode, isRemoveData);
| }while (!cache.exists(fqn))
|
But what happens if between lock method and condition check another thread creates /a
node?
Assume that Thread-0 is sleeping before loop condition check (this really happens in
real-applications under heavy load). Thread-1 needs to put some data into previously
removed /a node. Because / (root) dosen?t contains /a, new node /a is created, and WL on
new node is acquired by Thread-1.
Now we have situation, when 2 threads have WL on this same node! This is possible, because
lock table is located into node, and both threads have another instance of /a node.
Now Thread-1 is putting some data into /a node and commit operations, Thread-0 checks loop
condition, and because /a node exists in cache, loop is released. Thread-0 is going into
real _put operation. Now assume, that before invoke _put operation Thread-1 is removing
again /a node. He can do this, because /a node from treecache hasn?t any information about
WL lock (Thread-0 has WL, but in old node instance, different from instance form cache).
Node /a is now removed from cache by Thread-1.
And now if Thread-0 try invoke _put operation, he can?t find /a node, findNode returns
null
and NodeNotExistsException is raised.
This issue was observed and reported as bug
http://jira.jboss.com/jira/browse/JBCACHE-1164
I think, that the simplest solution to this problem is to add additional condition to loop
condition. Not only fqn exists condition should be checked, but node from cache with
locked node should be compared. It?s easy to return locked node from lock() method and
replace cache.exists with cache._get method and compare this 2 objects. If they are not
this same object ? loop can be finished only if this objects are this same object. (or if
result from lock method is null)
I have changed this code and it look works correctly now.
Scenario 2:
Cache is clear. Thread-0 is trying to remove /a node. He checks / (root) and because there
is no any children, no WL lock was created. Because no lock has been created, no
information to current transaction table about removed node is added. (Treecache in
pessimistic locking scheme doesn?t is marking only nodes as removed when real _remove
operation is invoked. Nodes are really removed in commit phase). Tread-0 is going into
real _remove operation. Assume that Thread-0 sleeps before _remove execution (simulate
heavy-load). Now Thread-1 is trying to put new data into /a/b/c/d node. He must checks
whole path /a/b/c/d. because / (root) doesn?t contains any children, /a is created and
then /a/b etc. Now Thread-1 needs to check end loop condition, but assume, that before
this operation Thread-0 invokes _remove operation. Node /a is marked as removed, Thread-0
commit transaction etc. because transaction table doesn?t contains any information about
removed nodes node was not real removed from cache. But this node is still marked as
removed!
This is first problem in this scenario, this node is form now like ghost, because it still
exists in cache, but is marked as removed.
This issue was observed and reported as bug
http://jira.jboss.com/jira/browse/JBCACHE-1168
And we can see now next problem: Thread-1 needs to check end loop condition. Because
exists method from cache default doesn?t returns nodes marked as removed, condition is
true (!cache.exists), and again lock() method is called. Again / (root) is checked, but
because getOrCreateChild doesn?t checks node for markAsRemoved condition, now Thread-1
looks, that / contains /a, /a contains /a/b etc. again WL on /a/b/c/d is acquired and
again end loop condition is checked, again cache.exists returns false, again lock() is
invoked etc. This loop never ends. Only chance to stop this is TimeoutException, but
because Thread-1 has already owned all locks, there is no chance to TimeoutException to be
thrown.
This issue (endless loop) was observed and reported as bug
http://jira.jboss.com/jira/browse/JBCACHE-1165
I think, that because real _remove operation is always invoked (both when node is locked
and when node doesn?t exists and no lock is acquired), the simplest solution is to set
flag createIfNotExists=true on removeNodeMethod. I have added this condition and it look
works correctly.
To break infinitive loop (in another, yet unknown cases), I think that max time to spent
in this loop can?t be greater than lock acquisition timeout, so in loop check condition
should be added.
Scenario 3:
Cache is clear. Similar to scenario 2 but with small, but important difference. First
starts Thread-1. Thread-1 is putting new data into /a/b/c/d. Because WL is acquired only
on target node, on nodes /a /a/b and /a/b/c only RL is acquired. Assume, that Thread-1has
acquired RL on /a/b/c. Assume that Thread-1 sleep for some time now (simulate heavy-load).
Now Thread-0 starts and is trying to remove /a node. Thread-0 needs lock with WL /a and
/a children?s. Thread-0 locks /a, /a/b, and /a/b/c with WL (he can do this, because
Thread-1 has only RL) and because now /a/b/c node doesn?t contains any children (Thread-1
is sleeping). Thread-0 checks end loop condition and is going into real _remove operation.
Now assume that Thread-0 is going to sleep and Thread-1 wakes up. Thread-1 create /a/b/c/d
node and acquire WL on this node, and is going into real _put operation. Now if Thread-0
removes node /a/b/c/d before Thread-1 finds node /a/b/c/d, Thread-1 will throw
NodeNotExistsException from _put operation, because for Thread-1 this node is locked but
couldn?t be found in cache.
I don?t known how this situation could be the best handled. The simplest solution is to
acquire WL on node instead RL when node is created, but I have not yet tested this
solution.
I hope, that this information helps TreeCache team to solve problems with parallel work to
fast as its possible, because now it?s impossible to use Treecache (in PessimistickLock
mode) in multithread applications, where many concurrent put/remove operations from this
same regions occurs.
Jacek Halat
View the original post :
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4082429#...
Reply to the post :
http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&a...