[jboss-user] [JBossCache] - Problems with NodeLocking algorithm

jacek187 do-not-reply at jboss.com
Sun Sep 9 18:45:06 EDT 2007


Problems with NodeLocking algorithm

Hi all

I?m fascinated with TreeCache and I think that this is a very good and powerful product. I?m trying to use it in my application where many concurrent puts/removes occurs ? and this is not easy task, because there are many unreasonable exceptions raised from TreeCache under heavy-load.

Because my application must work 24/7 without any unreasonable exceptions raised form any component I do some investigation and found, that node locking algorithm used by TreeCache (when using Pessimistic locking scheme) is not good enough.

Main ?work unit? in TreeCache is Node. Nodes have hierarchical structure like folders on disk. There is root node, root node has children, children have own children etc.
If operation needs modify data in node, lock must be acquired. Physically locks are a part of Node.
If we put data into /a/b/c/d node (on clear cache), first lock on / (root) node must be acquired. Then if / doesn?t contains a child, /a node must be created and lock must be acquired etc.
If all required nodes from path are locked with correct lock type (RL, WL), target operation can be invoked.
This is very simple and generally this works correctly in single-thread applications. But what happens when 2 threads are trying work on this same node? 

Scenario 1: 
There is /a node in cache.
Thread-0 is trying put new value into /a node, Thread-1 is trying remove /a node.
Both threads are in PessimistickLockInterceptor.lock() method, both treads have this same reference to /a node.
Thread-1 is acquiring WL lock on /a node and wins, Thread-0 is waiting for WL on /a node.
Now Thread-1 removes /a node from cache and until now / doesn?t contains /a node, but Thread-0 is still waiting for WL to /a node. Thread-1 commits transaction and release lock on /a node. Thread-0 acquired lock on /a. Because /a dosen?t exists in cache (was removed by Thread-1) this lock is invalid. This situation is handled by treecache properly, because he checks after lock acquisition if node still exists in cache

  | do{
  | lock(fqn, ctx.getGlobalTransaction(), lock_type, recursive, zeroLockTimeout ? 0 : lock_timeout, createIfNotExists, storeLockedNode, isRemoveData);
  | }while (!cache.exists(fqn))
  | 
But what happens if between lock method and condition check another thread creates /a node?

Assume that Thread-0 is sleeping before loop condition check (this really happens in real-applications under heavy load). Thread-1 needs to put some data into previously removed /a node. Because / (root) dosen?t contains /a, new node /a is created, and WL on new node is acquired by Thread-1.

Now we have situation, when 2 threads have WL on this same node! This is possible, because lock table is located into node, and both threads have another instance of /a node.
Now Thread-1 is putting some data into /a node and commit operations, Thread-0 checks loop condition, and because /a node exists in cache, loop is released. Thread-0 is going into real _put operation. Now assume, that before invoke _put operation Thread-1 is removing again /a node. He can do this, because /a node from treecache hasn?t any information about WL lock (Thread-0 has WL, but in old node instance, different from instance form cache). Node /a is now removed from cache by Thread-1. 
And now if Thread-0 try invoke _put operation, he can?t find /a node, findNode returns null
 and NodeNotExistsException is raised.

This issue was observed and reported as bug http://jira.jboss.com/jira/browse/JBCACHE-1164

I think, that the simplest solution to this problem is to add additional condition to loop condition. Not only fqn exists condition should be checked, but node from cache with locked node should be compared. It?s easy to return locked node from lock() method and replace cache.exists with cache._get method and compare this 2 objects. If they are not this same object ? loop can be finished only if this objects are this same object. (or if result from lock method is null)

I have changed this code and it look works correctly now.

Scenario 2:
Cache is clear. Thread-0 is trying to remove /a node. He checks / (root) and because there is no any children, no WL lock was created. Because no lock has been created, no information to current transaction table about removed node is added. (Treecache in pessimistic locking scheme doesn?t is marking only nodes as removed when real _remove operation is invoked. Nodes are really removed in commit phase). Tread-0 is going into real _remove operation. Assume that Thread-0 sleeps before _remove execution (simulate heavy-load). Now Thread-1 is trying to put new data into /a/b/c/d node. He must checks whole path /a/b/c/d. because / (root) doesn?t contains any children, /a is created and then /a/b etc. Now Thread-1 needs to check end loop condition, but assume, that before this operation Thread-0 invokes _remove operation. Node /a is marked as removed, Thread-0 commit transaction etc. because transaction table doesn?t contains any information about removed nodes node was not real removed from cache. But this node is still marked as removed! 
This is first problem in this scenario, this node is form now like ghost, because it still exists in cache, but is marked as removed.
This issue was observed and reported as bug http://jira.jboss.com/jira/browse/JBCACHE-1168

And we can see now next problem: Thread-1 needs to check end loop condition. Because exists method from cache default doesn?t returns nodes marked as removed, condition is true (!cache.exists), and again lock() method is called. Again / (root) is checked, but because getOrCreateChild doesn?t checks node for markAsRemoved condition, now Thread-1 looks, that / contains /a, /a contains /a/b etc. again WL on /a/b/c/d is acquired and again end loop condition is checked, again cache.exists returns false, again lock() is invoked etc. This loop never ends. Only chance to stop this is TimeoutException, but because Thread-1 has already owned all locks, there is no chance to TimeoutException to be thrown.

This issue (endless loop) was observed and reported as bug http://jira.jboss.com/jira/browse/JBCACHE-1165
I think, that because real _remove operation is always invoked (both when node is locked and when node doesn?t exists and no lock is acquired), the simplest solution is to set flag createIfNotExists=true on removeNodeMethod. I have added this condition and it look works correctly.
To break infinitive loop (in another, yet unknown cases), I think that max time to spent in this loop can?t be greater than lock acquisition timeout, so in loop check condition should be added.

Scenario 3: 
Cache is clear. Similar to scenario 2 but with small, but important difference. First starts Thread-1. Thread-1 is putting new data into /a/b/c/d. Because WL is acquired only on target node, on nodes /a /a/b and /a/b/c only RL is acquired. Assume, that Thread-1has acquired RL on /a/b/c. Assume that Thread-1 sleep for some time now (simulate heavy-load). Now Thread-0 starts and is trying to remove /a node. Thread-0 needs lock with WL  /a and /a children?s. Thread-0 locks /a, /a/b, and /a/b/c with WL (he can do this, because Thread-1 has only RL) and because now /a/b/c node doesn?t contains any children (Thread-1 is sleeping). Thread-0 checks end loop condition and is going into real _remove operation. Now assume that Thread-0 is going to sleep and Thread-1 wakes up. Thread-1 create /a/b/c/d node and acquire WL on this node, and is going into real _put operation. Now if Thread-0 removes node /a/b/c/d before Thread-1 finds node /a/b/c/d, Thread-1 will throw NodeNotExistsException from _put operation, because for Thread-1 this node is locked but couldn?t be found in cache.

I don?t known how this situation could be the best handled. The simplest solution is to acquire WL on node instead RL when node is created, but I have not yet tested this solution.

I hope, that this information helps TreeCache team to solve problems with parallel work to fast as its possible, because now it?s impossible to use Treecache (in PessimistickLock mode) in multithread applications, where many concurrent put/remove operations from this same regions occurs.

Jacek Halat

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4082429#4082429

Reply to the post : http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=4082429




More information about the jboss-user mailing list