[
https://issues.jboss.org/browse/ISPN-4777?page=com.atlassian.jira.plugin....
]
Gustavo Fernandes commented on ISPN-4777:
-----------------------------------------
Hi Anuj, thanks for providing the testcase!
Unfortunately I found an issue with the supplied test: it uses a
{{DistributedSegmentReadLocker}} to acquire a lock on a file that has size 0 and
bufferSize 10, and then uses the same {{DistributedSegmentReadLocker}} to release
that lock. The trouble is that this path is never followed during normal Lucene
Directory execution: a file of size 0 is smaller than the buffer size of 10, so
it is not eligible to be broken into chunks, and no read lock will ever be
acquired or released for it.
But since the test artificially acquires the lock, when the call to
{{deleteOrReleaseReadLock()}} happens, the Lucene directory will ALWAYS delete the file
because it is single chunked, so the test always fails regardless of REPL/DIST mode or
single/multiple threads.
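To illustrate the point, here is a minimal, simplified model of the behaviour described above. It is not the Infinispan code: {{SingleChunkSketch}}, {{FileInfo}}, {{addFile}} and {{openForRead}} are hypothetical names, and the reader counting is reduced to a plain map. The only assumptions taken from the explanation are that a file is chunked only when its size exceeds the buffer size, and that a single-chunk file is deleted outright.

```java
import java.util.HashMap;
import java.util.Map;

class SingleChunkSketch {

    // Hypothetical stand-in for the per-file metadata kept by the directory.
    static final class FileInfo {
        final long size;
        final int bufferSize;

        FileInfo(long size, int bufferSize) {
            this.size = size;
            this.bufferSize = bufferSize;
        }

        // Assumption: a file is split into chunks only when it exceeds the buffer size.
        boolean isMultiChunked() {
            return size > bufferSize;
        }
    }

    private final Map<String, FileInfo> files = new HashMap<>();
    private final Map<String, Integer> readers = new HashMap<>();

    void addFile(String name, long size, int bufferSize) {
        files.put(name, new FileInfo(size, bufferSize));
    }

    boolean exists(String name) {
        return files.containsKey(name);
    }

    // Hypothetical read path: a read lock is taken only for multi-chunk files.
    // Returns true if a read lock was taken.
    boolean openForRead(String name) {
        FileInfo info = files.get(name);
        if (info == null || !info.isMultiChunked()) {
            return false;
        }
        readers.merge(name, 1, Integer::sum);
        return true;
    }

    // Returns true if the file was deleted, false if only a reader was released.
    boolean deleteOrReleaseReadLock(String name) {
        FileInfo info = files.get(name);
        if (info != null && info.isMultiChunked()) {
            int remaining = readers.merge(name, -1, Integer::sum);
            if (remaining > 0) {
                return false; // other readers still hold the file
            }
        }
        // Single-chunk files (and files with no readers left) are removed.
        files.remove(name);
        readers.remove(name);
        return true;
    }

    public static void main(String[] args) {
        SingleChunkSketch dir = new SingleChunkSketch();
        dir.addFile("empty", 0, 10); // size 0, bufferSize 10, as in the supplied test
        System.out.println("lock taken: " + dir.openForRead("empty"));
        System.out.println("deleted: " + dir.deleteOrReleaseReadLock("empty"));
    }
}
```

In this model the size 0 / bufferSize 10 file never takes a read lock and is deleted on the first {{deleteOrReleaseReadLock()}} call, which matches the failure seen with the supplied test.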
Anyway, I pushed an updated test at
https://github.com/gustavonalle/infinispan/commit/e6a2ccd93fc60250d3a5149...
on top of the 7.2.x branch, and I have been trying to reproduce the issue with it.
Replace command not atomic in REPL_SYNC cache mode
--------------------------------------------------
Key: ISPN-4777
URL: https://issues.jboss.org/browse/ISPN-4777
Project: Infinispan
Issue Type: Bug
Affects Versions: 7.2.4.Final
Reporter: Anuj Shah
Assignee: Gustavo Fernandes
Attachments: ReaderLockerTest.java
This problem was discovered using the Lucene InfinispanDirectory with
DistributedSegmentReadLocker. We found after a while of production usage that some Lucene
files were randomly removed from the caches, but remained in the file listing entry, which
resulted in an unusable index.
We managed to replicate the problem in a test that acquires and releases read locks
concurrently and checks for file deletion. We found this fails quickly when using
REPL_SYNC mode, but runs for a while with DIST_SYNC.
Some extra logging indicated that the replace command used to increment the lock counter
across multiple cluster members results in a single increment when called concurrently,
with both calls reporting success. This eventually causes the file deletion, because we
have now miscounted the number of readers. We also observed the opposite effect, with the
counter decrementing by only one when releasing.
Our conclusion is that the replace command fails atomicity in REPL_SYNC mode but
works in the other modes we tried: DIST_SYNC, DIST_ASYNC and REPL_ASYNC.
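For reference, the counter update relies on a compare-and-swap style retry loop. The following is a minimal sketch of that pattern, not the actual {{DistributedSegmentReadLocker}} code: it uses a plain {{java.util.concurrent.ConcurrentHashMap}} in place of a clustered cache, and the class, method and key names are hypothetical. With an atomic {{replace(key, oldValue, newValue)}}, only one of two callers racing on the same old value succeeds and the other retries, so no increment is lost.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;

class ReplaceCounterSketch {

    // Hypothetical helper: optimistic increment built on ConcurrentMap.replace(key, old, new).
    static int incrementReaders(ConcurrentMap<String, Integer> cache, String key) {
        while (true) {
            Integer current = cache.get(key);
            if (current == null) {
                throw new IllegalStateException("no counter for " + key);
            }
            Integer next = current + 1;
            // An atomic replace succeeds for only one caller per observed value;
            // a caller that loses the race re-reads the counter and retries.
            if (cache.replace(key, current, next)) {
                return next;
            }
        }
    }

    // Runs the increment from several threads and returns the final counter value.
    static int runConcurrentIncrements(int threads, int perThread) {
        ConcurrentMap<String, Integer> cache = new ConcurrentHashMap<>();
        String key = "hypothetical-file|readlock"; // illustrative key only
        cache.put(key, 0);
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all workers at once to maximise contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int j = 0; j < perThread; j++) {
                    incrementReaders(cache, key);
                }
            });
            workers[i].start();
        }
        start.countDown();
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return cache.get(key);
    }

    public static void main(String[] args) {
        int threads = 4;
        int perThread = 1000;
        int actual = runConcurrentIncrements(threads, perThread);
        System.out.println("expected=" + (threads * perThread) + " actual=" + actual);
    }
}
```

On a local map the final value always equals the number of calls. What we observed in REPL_SYNC corresponds to two callers both getting {{true}} from the replace for the same old value, which leaves the counter one short.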
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)