Hi,
We have isolated what we think is a synchronization issue during data gravitation across
multiple nodes using buddy replication. We have a unit test demonstrating the issue, which
I can send to anyone interested.
What appears to happen is this: when two nodes are involved in a data gravitation,
multiple data gravitation cleanup commands are sometimes issued, and one of them blocks
the other. The calling node then times out after "buddyCommunicationTimeout"
milliseconds (a timeout which is only logged at debug level?) and returns null, making it
look like the requested data does not exist in the cache. Further investigation reveals
two global transactions on the cache holding the data: one holds an identity lock (write)
on a backup data node and the other is waiting to lock the same node.
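
To illustrate why the timeout is so hard to spot, here is a minimal plain-Java sketch of
the symptom. It does not use JBoss Cache at all: the class, the "/session/42" key, and the
gravitate() method are hypothetical stand-ins, and a ReentrantLock plays the role of the
write lock on the backup node. The only point is that a lock wait which times out and is
reported as null cannot be told apart from a genuine miss by the caller.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class GravitationTimeoutSketch {
    // Hypothetical stand-ins for one backup node: its data and its exclusive lock.
    private final Map<String, String> data = new ConcurrentHashMap<>();
    private final ReentrantLock nodeLock = new ReentrantLock();

    public GravitationTimeoutSketch() {
        data.put("/session/42", "payload");
    }

    /** Returns the stored value, or null if the lock was not acquired in time. */
    public String gravitate(String fqn, long timeoutMillis) {
        try {
            if (!nodeLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return null; // timeout swallowed: looks exactly like "no such data"
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        try {
            return data.get(fqn);
        } finally {
            nodeLock.unlock();
        }
    }

    /** Issues one request while another thread holds the node lock for holdMillis. */
    public static String requestWhileLocked(long holdMillis, long timeoutMillis) {
        GravitationTimeoutSketch cache = new GravitationTimeoutSketch();
        CountDownLatch locked = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            cache.nodeLock.lock();
            try {
                locked.countDown();      // signal that the lock is now held
                Thread.sleep(holdMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                cache.nodeLock.unlock();
            }
        });
        holder.start();
        try {
            locked.await();              // make sure the holder got the lock first
            String result = cache.gravitate("/session/42", timeoutMillis);
            holder.join();
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        // Lock held longer than the timeout: prints null although the data exists.
        System.out.println(requestWhileLocked(500, 50));
        // Lock released before the timeout: prints payload.
        System.out.println(requestWhileLocked(50, 2000));
    }
}
```

The sketch uses lambdas and so needs Java 8 or later. If the real timeout were surfaced as
an exception or a warning-level log entry instead of null, the two cases would be easy to
distinguish.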
Depending on "LockAcquisitionTimeout" the blocked request may continue, but we
have seen several different consequences, depending on whether a user transaction is
involved or not. Sometimes the lock seems to disappear, and sometimes it doesn't, in which
case the application is in effect completely blocked (as the JGroups thread will be
holding a lock on a NakReceiverWindow).
This behavior only occurs (as far as we've seen) when there's quite a bit of
concurrent access, in particular when data is added to one node but accessed immediately
on another, i.e. when addition and gravitation occur back to back. We have tried disabling
auto gravitation and generally playing around with the configuration, but with no effect.
This looks like a synchronization issue, and the unit test I can send along also shows
that it is intermittent: sometimes the test will pass only to fail the next time, and
sometimes a particular test fails when run standalone only to succeed if another test was
run immediately before, making it look like burn-in affects the result (which it of course
may do).
Our unit test tries to model high concurrency in three variations: one with a
simple data gravitation, one with gravitation followed by a modification (a subsequent put
on the cache), and one with gravitation and modification within a user transaction. The
last of these scenarios is basically what our real application is doing.
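
For readers who want a feel for the shape of such a test before seeing the real one, the
following is a rough sketch of the access pattern only, not the actual unit test. The
SimpleCache interface, the Scenario enum, and the in-memory stand-in are all hypothetical
and are not JBoss Cache APIs. Against a real buddy-replicated pair, "writer" and "reader"
would be two cache instances on different nodes, and get() would be what triggers
gravitation. The transactional variation is left out because it needs a transaction
manager; it would wrap the same loop body in a begin/commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class GravitationLoadSketch {
    /** Hypothetical minimal view of a cache instance; not the JBoss Cache API. */
    public interface SimpleCache {
        void put(String fqn, String value);
        String get(String fqn);
    }

    /** Two of the three variations; the user-transaction one is omitted here. */
    public enum Scenario { GET_ONLY, GET_THEN_PUT }

    /** Runs the workload and returns how many reads came back null. */
    public static int run(SimpleCache writer, SimpleCache reader,
                          Scenario scenario, int threads, int opsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger nullReads = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            futures.add(pool.submit(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    String fqn = "/test/" + id + "/" + i;
                    writer.put(fqn, "v0");          // add on one node...
                    String seen = reader.get(fqn);  // ...and read at once on the other
                    if (seen == null) {
                        nullReads.incrementAndGet();
                    } else if (scenario == Scenario.GET_THEN_PUT) {
                        reader.put(fqn, "v1");      // modification after the read
                    }
                }
            }));
        }
        try {
            for (Future<?> f : futures) {
                f.get(30, TimeUnit.SECONDS);        // a hang shows up as a timeout here
            }
        } catch (Exception e) {
            throw new IllegalStateException("worker failed or hung", e);
        } finally {
            pool.shutdownNow();
        }
        return nullReads.get();
    }

    /** In-memory stand-in so the sketch runs without a cluster. */
    public static SimpleCache inMemory(Map<String, String> shared) {
        return new SimpleCache() {
            public void put(String fqn, String value) { shared.put(fqn, value); }
            public String get(String fqn) { return shared.get(fqn); }
        };
    }

    public static void main(String[] args) {
        Map<String, String> shared = new ConcurrentHashMap<>();
        int nulls = run(inMemory(shared), inMemory(shared), Scenario.GET_THEN_PUT, 8, 1000);
        System.out.println("null reads: " + nulls);
    }
}
```

With the in-memory stand-in every read finds its value, so the count is zero. The
interesting number is what the same loop reports when the two SimpleCache instances are
backed by real caches on separate nodes.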
I've spent considerable time looking at this, so feel free to ask questions. Also, tell
me where to send the unit test if you want to have a look at it. The unit test fails
repeatedly on a Java 5 / Linux / dual-core / cache 2.5.0.GA setup.
Regards
/Lars J. Nilsson
www.cubeia.com
View the original post :
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4089200#...