Dan Berindei commented on ISPN-9762:
------------------------------------
Looks like the problem is the {{storesMutex}}, which is a read-write lock that favors writers
over readers. Once a writer is queued, no new readers are allowed to acquire the lock
until all the current read-lock holders finish their work, so a reader cannot rely on
another reader being able to run in parallel.
This is exactly what happens in the RocksDB store, which uses a blocking queue to write
expiration metadata to a separate DB. Insertion threads write to the blocking queue while
holding {{storesMutex.readLock}} and assume that the purge thread can acquire
{{storesMutex.readLock}} in parallel and drain the queue. Once the availability-check
thread tries to acquire {{storesMutex.writeLock}}, everything stops.
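The stall can be reproduced in isolation with a fair {{java.util.concurrent.locks.ReentrantReadWriteLock}} standing in for {{storesMutex}} and a bounded queue standing in for the expiration queue. The class and field names are illustrative, and the fair-lock choice is only a stand-in for the writer-preferring behaviour described above; the real implementation may differ:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ExpirationStallSketch {

    // Stand-ins for storesMutex and the store's expiration queue. A fair
    // ReentrantReadWriteLock queues new readers behind a waiting writer.
    static final ReentrantReadWriteLock storesMutex = new ReentrantReadWriteLock(true);
    static final BlockingQueue<String> expiredQueue = new ArrayBlockingQueue<>(1);

    /** Returns true if the purge thread could not acquire the read lock. */
    static boolean purgeThreadStalls() throws Exception {
        // Insertion thread: takes the read lock, then fills the bounded queue.
        // Its next put() would block until the purge thread drains the queue.
        storesMutex.readLock().lock();
        expiredQueue.put("expired-key");

        // Availability check: queues for the write lock behind the held read lock.
        Thread availabilityCheck = new Thread(() -> {
            storesMutex.writeLock().lock();
            storesMutex.writeLock().unlock();
        });
        availabilityCheck.start();
        while (!storesMutex.hasQueuedThreads()) {
            Thread.onSpinWait();
        }

        // Purge thread: needs the read lock to drain expiredQueue, but the
        // queued writer keeps new readers out, so the timed acquire fails.
        // A real insertion thread blocked on put() would now wait forever.
        FutureTask<Boolean> purgeAttempt = new FutureTask<>(() -> {
            boolean acquired = storesMutex.readLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                expiredQueue.clear();
                storesMutex.readLock().unlock();
            }
            return acquired;
        });
        new Thread(purgeAttempt).start();
        boolean purgeAcquired = purgeAttempt.get();

        storesMutex.readLock().unlock();   // insertion thread finally releases
        availabilityCheck.join();
        return !purgeAcquired;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("purge thread stalled = " + purgeThreadStalls());
        // prints: purge thread stalled = true
    }
}
```

Without the queued writer (or with a lock that lets readers barge), the purge thread would acquire the read lock immediately and drain the queue, which is exactly the reader-runs-in-parallel assumption the store makes.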
Luckily, {{storesMutex.writeLock}} is only used during startup, when disabling a store,
and when doing an availability check. Setting {{<persistence
availability-interval="111000">}} should effectively disable the availability
check and work around the issue.
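For reference, a sketch of where that attribute would sit relative to the configuration below, assuming the store element is wrapped in the cache's {{persistence}} element as the schema normally requires (the interval value is the one suggested above, in milliseconds):

```xml
<distributed-cache name="DEVICES" owners="2" segments="256" mode="SYNC">
    <!-- a large availability-interval effectively disables the availability check -->
    <persistence availability-interval="111000">
        <rocksdb-store preload="true" path="/data/rocksdb/devices/data">
            <expiration path="/data/rocksdb/devices/expired"/>
        </rocksdb-store>
    </persistence>
</distributed-cache>
```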
[~ryanemerson] [~william.burns] we need to consider this for non-blocking
stores as well; maybe we can find an alternative that doesn't block the insertion
threads while doing the availability checks. Ideally I'd like to move the expiration
metadata into the main RocksDB database as well and remove the blocking queue.
Cache hangs during rebalancing
------------------------------
Key: ISPN-9762
URL: https://issues.jboss.org/browse/ISPN-9762
Project: Infinispan
Issue Type: Bug
Affects Versions: 9.4.2.Final
Reporter: Sergey Chernolyas
Priority: Blocker
Attachments: hang_node.txt, normal_node.txt, stat_bad_node.png,
stat_good_node.png
I have a cluster with two nodes. One node starts without problems. The second node hangs
while rebalancing the cache DEVICES.
Configuration of the cache:
{code:xml}
<distributed-cache name="DEVICES" owners="2" segments="256" mode="SYNC">
    <state-transfer await-initial-transfer="true" enabled="true"
                    timeout="2400000" chunk-size="2048"/>
    <partition-handling when-split="ALLOW_READ_WRITES"
                        merge-policy="PREFERRED_ALWAYS"/>
    <memory>
        <object size="300000" strategy="REMOVE"/>
    </memory>
    <rocksdb-store preload="true" path="/data/rocksdb/devices/data">
        <expiration path="/data/rocksdb/devices/expired"/>
    </rocksdb-store>
    <indexing index="LOCAL">
        <property name="default.indexmanager">org.infinispan.query.indexmanager.InfinispanIndexManager</property>
        <property name="default.directory_provider">infinispan</property>
        <property name="default.worker.execution">async</property>
        <property name="default.index_flush_interval">500</property>
        <property name="default.indexwriter.merge_factor">30</property>
        <property name="default.indexwriter.merge_max_size">1024</property>
        <property name="default.indexwriter.ram_buffer_size">256</property>
        <property name="default.locking_cachename">LuceneIndexesLocking_devices</property>
        <property name="default.data_cachename">LuceneIndexesData_devices</property>
        <property name="default.metadata_cachename">LuceneIndexesMetadata_devices</property>
    </indexing>
    <expiration max-idle="172800000"/>
</distributed-cache>
{code}
The cache contains 70,000 elements.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)