On 27 Jun 2013, at 16:18, William Burns <mudokonman(a)gmail.com> wrote:
First off I apologize for the length.
There have been a few Jiras recently that have identified L1 consistency issues with both
TX and non TX sync caches. Async caches with L1 have their own issues as well, but I only
wanted to talk about sync caches.
https://issues.jboss.org/browse/ISPN-3197
https://issues.jboss.org/browse/ISPN-2965
https://issues.jboss.org/browse/ISPN-2990
I have proposed a solution in
https://github.com/infinispan/infinispan/pull/1922 which
should start L1 consistency down the right track. There are quite a few comments on it if
you want to look into it more, but because of that I am moving this to the dev mailing
list.
The key changes in the PR are the following (non-tx):
1. Concurrent reads for a key that can retrieve a remote value are "corralled"
into a single thread of execution for that given key. This would reduce network traffic
with concurrent gets for the same key. Note the "corralling" only happens on a
per key basis.
2. The single thread that is doing the remote get would update the L1 if able (without
locking) and make available the value to all the requests waiting on the get.
3. Invalidations that are received would first check to see if there is a current remote
get occurring for its keys. If there is, it will attempt to cancel the L1 write(s)
before they occur. If it cannot cancel the L1 write, then it must also wait for the current
remote get to complete and subsequently run the invalidation. Note the cancellation would
only fail when the remote get is done and it is in the middle of updating the L1, so this
would be a very small window.
If the cancellation succeeds, what happens with the threads that were actually doing the
remote get? Would they retry or would their operation fail?
4. Local writes will also do the same thing as the invalidation with
cancelling or waiting. Note that non tx local writes only do L1 invalidations and
don't write the value to the data container. Reasons why I found at
https://issues.jboss.org/browse/ISPN-3214
Do local writes really need to cancel L1 gets as well? Surely the originator would send an
invalidation at a later point in time, when the local write is received. Or is it possible
for this invalidation message to be received before the ongoing get is finished?
5. Writes that require the previous value and don't have it in
the L1 would also do their get operations using the same "corralling"
method.
Items 4 and 5 are not currently implemented in the PR.
This approach would use no locking for non tx caches for all L1 operations. The
synchronization point would be done through the "corralling" method and
invalidations/writes communicating to it.
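To make the non-tx steps above concrete, here is a minimal Java sketch of the "corralling" idea. It is not the code from the PR: the names L1Corral, Flight, remoteGet and inL1 are hypothetical, the L1 is modelled as a plain ConcurrentHashMap, and the remote get is an arbitrary function. The sketch also assumes that when an invalidation cancels the L1 write, the waiting readers still receive the fetched value and only the L1 update is skipped; whether the PR behaves that way is exactly the open question raised above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration only; none of these names come from Infinispan.
final class L1Corral<K, V> {
    // State of the pending L1 write for one in-flight remote get.
    private static final int PENDING = 0, WRITING = 1, CANCELLED = 2;

    private static final class Flight<V> {
        final CompletableFuture<V> value = new CompletableFuture<>();
        final CompletableFuture<Void> finished = new CompletableFuture<>();
        final AtomicInteger l1State = new AtomicInteger(PENDING);
    }

    private final ConcurrentMap<K, Flight<V>> inFlight = new ConcurrentHashMap<>();
    private final ConcurrentMap<K, V> l1 = new ConcurrentHashMap<>();
    private final Function<K, V> remoteGet; // stand-in for the real remote call

    L1Corral(Function<K, V> remoteGet) {
        this.remoteGet = remoteGet;
    }

    public V get(K key) {
        V cached = l1.get(key);
        if (cached != null) {
            return cached;
        }
        Flight<V> mine = new Flight<>();
        Flight<V> leader = inFlight.putIfAbsent(key, mine);
        if (leader != null) {
            // Another thread is already fetching this key: wait for its result.
            return leader.value.join();
        }
        try {
            V v = remoteGet.apply(key);
            // Update the L1 without locking, unless an invalidation cancelled us.
            if (v != null && mine.l1State.compareAndSet(PENDING, WRITING)) {
                l1.put(key, v);
            }
            mine.value.complete(v);
            return v;
        } catch (RuntimeException | Error e) {
            mine.value.completeExceptionally(e);
            throw e;
        } finally {
            inFlight.remove(key, mine);
            mine.finished.complete(null);
        }
    }

    public void invalidate(K key) {
        Flight<V> f = inFlight.get(key);
        // Try to cancel the pending L1 write; if the get is already past that
        // point, wait for it to finish before removing the entry.
        if (f != null && !f.l1State.compareAndSet(PENDING, CANCELLED)) {
            f.finished.join();
        }
        l1.remove(key);
    }

    public boolean inL1(K key) {
        return l1.containsKey(key);
    }
}
```

The only synchronization point is the per-key Flight object: readers meet on its future, and invalidations race with the reader on a single compare-and-set, which keeps the window in which an invalidation has to wait as small as described in item 3.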
Transactional caches would do almost the same thing as non-tx. Note these changes are
not done in any way yet.
1. Gets would now update the L1 immediately after retrieving the value without locking,
but still using the "corralling" technique that non-tx does. Previously the L1
update from a get was transactional. This would actually remedy issue [1].
2. Writes currently acquire the remote lock when committing, which is why tx caches are
able to update the L1 with the value. Writes would do the same cancellation/wait method
as non-tx.
3. Writes that require the previous value and don't have it in the L1 would also do
their get operations using the same method.
4. For tx caches, [2] would also have to be done.
[1] -
https://issues.jboss.org/browse/ISPN-2965?focusedCommentId=12779780&p...
[2] -
https://issues.jboss.org/browse/ISPN-1540
Also rehashing is another issue, but we should be able to acquire the state transfer lock
before updating the L1 on a get, just like when an entry is committed to the data
container.
Any comments/concerns would be appreciated.
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)