[infinispan-issues] [JBoss JIRA] (ISPN-9762) Cache hangs during rebalancing

Thu Nov 22 08:20:00 EST 2018

    [ https://issues.jboss.org/browse/ISPN-9762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665166#comment-13665166 ] 

Dan Berindei commented on ISPN-9762:
------------------------------------

Looks like the problem is the {{storesMutex}}, which is a read-write lock that favors writers over readers. Once a writer is queued, no other readers are allowed to acquire the lock until all the current read lock holders finish their work, so a reader cannot rely on another reader being able to run in parallel.
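
The lock behavior described above can be reproduced in isolation. The following is only a minimal sketch, assuming {{storesMutex}} behaves like a plain {{java.util.concurrent.locks.ReentrantReadWriteLock}}; the class, method, and thread names are hypothetical and not taken from the Infinispan code. One reader holds the read lock, a writer queues behind it, and a second reader then tries to acquire the read lock with a timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class WriterPreferenceDemo {

    /**
     * Returns true if a second reader could acquire the read lock while a
     * first reader holds it and a writer is queued behind that reader.
     */
    static boolean secondReaderAcquires() throws InterruptedException {
        // Assumption: a stand-in for storesMutex, not the real field.
        ReentrantReadWriteLock storesMutex = new ReentrantReadWriteLock();
        CountDownLatch firstReaderHoldsLock = new CountDownLatch(1);
        CountDownLatch releaseFirstReader = new CountDownLatch(1);

        // Reader 1 keeps the read lock until the main thread releases it.
        Thread reader1 = new Thread(() -> {
            storesMutex.readLock().lock();
            try {
                firstReaderHoldsLock.countDown();
                releaseFirstReader.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                storesMutex.readLock().unlock();
            }
        });
        reader1.start();
        firstReaderHoldsLock.await();

        // The writer has to queue because reader 1 still holds the read lock.
        Thread writer = new Thread(() -> {
            storesMutex.writeLock().lock();
            storesMutex.writeLock().unlock();
        });
        writer.start();
        while (!storesMutex.hasQueuedThread(writer)) {
            Thread.sleep(10);
        }
        Thread.sleep(50); // small margin so the writer is fully enqueued

        // Reader 2 is the calling thread. The timed tryLock honors the queued
        // writer, whereas the untimed tryLock() would barge past it.
        boolean acquired = storesMutex.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired) {
            storesMutex.readLock().unlock();
        }

        // Clean up: let reader 1 finish so the writer can run and exit.
        releaseFirstReader.countDown();
        reader1.join();
        writer.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("second reader acquired read lock: " + secondReaderAcquires());
    }
}
```

On a standard JDK the second reader is expected to time out, even though the lock is only held in read mode at that point.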

This is exactly what happens in the RocksDB store, which uses a blocking queue to write expiration metadata in a separate DB. Insertion threads write to the blocking queue while holding {{storesMutex.readLock}} and assume that the purge thread can acquire {{storesMutex.readLock}} in parallel and drain the queue. Once the availability check thread tries to acquire {{storesMutex.writeLock}}, everything stops.
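
To make the interaction with the queue concrete, here is a rough sketch of the same three roles. It is not the RocksDB store code: the queue, its capacity of 1, the thread names, and the 500 ms timeout (used only so the demo terminates) are all assumptions for illustration, and the lock is again assumed to behave like {{ReentrantReadWriteLock}}.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ExpiryQueueStallDemo {

    /** Returns true if the insertion thread managed to enqueue its entry. */
    static boolean insertionCompletes(boolean withAvailabilityCheck) throws InterruptedException {
        ReentrantReadWriteLock storesMutex = new ReentrantReadWriteLock();
        // Hypothetical stand-in for the expiration queue, already full.
        BlockingQueue<String> expiryQueue = new ArrayBlockingQueue<>(1);
        expiryQueue.put("pending-entry");

        CountDownLatch inserterHoldsLock = new CountDownLatch(1);
        boolean[] inserted = new boolean[1];

        // Insertion thread: waits for queue space while holding the read lock.
        Thread inserter = new Thread(() -> {
            storesMutex.readLock().lock();
            try {
                inserterHoldsLock.countDown();
                // Bounded wait so the demo ends; an unbounded put() would hang.
                inserted[0] = expiryQueue.offer("new-entry", 500, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                storesMutex.readLock().unlock();
            }
        });
        inserter.start();
        inserterHoldsLock.await();

        // Availability check: queues for the write lock behind the inserter.
        Thread availabilityCheck = new Thread(() -> {
            storesMutex.writeLock().lock();
            storesMutex.writeLock().unlock();
        });
        if (withAvailabilityCheck) {
            availabilityCheck.start();
            while (availabilityCheck.isAlive()
                    && !storesMutex.hasQueuedThread(availabilityCheck)) {
                Thread.sleep(10);
            }
            Thread.sleep(50); // small margin so the writer is fully enqueued
        }

        // Purge thread: needs the read lock before it can drain the queue.
        Thread purger = new Thread(() -> {
            storesMutex.readLock().lock();
            try {
                expiryQueue.clear();
            } finally {
                storesMutex.readLock().unlock();
            }
        });
        purger.start();

        inserter.join();
        purger.join();
        if (withAvailabilityCheck) {
            availabilityCheck.join();
        }
        return inserted[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("without availability check: " + insertionCompletes(false));
        System.out.println("with availability check:    " + insertionCompletes(true));
    }
}
```

Without the queued writer, the purge thread shares the read lock, drains the queue, and the insertion goes through. With the writer queued, the purge thread waits behind it, the writer waits for the insertion thread, and the insertion thread waits for queue space, so the insertion only ends because of the demo's timeout.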

Luckily, {{storesMutex.writeLock}} is only used during startup, when disabling a store, and when doing an availability check. Setting {{<persistence availability-interval="111000">}} should effectively disable the availability check and work around the issue.

[~ryanemerson] [~william.burns] We need to consider this for non-blocking stores as well. Maybe we can find an alternative that doesn't block the insertion threads during the availability checks? Ideally, I'd also like to move the expiration metadata into the main RocksDB database and remove the blocking queue.

> Cache hangs during rebalancing
> ------------------------------
>
>                 Key: ISPN-9762
>                 URL: https://issues.jboss.org/browse/ISPN-9762
>             Project: Infinispan
>          Issue Type: Bug
>    Affects Versions: 9.4.2.Final
>            Reporter: Sergey Chernolyas
>            Priority: Blocker
>         Attachments: hang_node.txt, normal_node.txt, stat_bad_node.png, stat_good_node.png
>
>
> I have a cluster with two nodes. The first node starts without problems. The second node hangs while rebalancing the DEVICES cache.
> Configuration of the cache:
> {code:xml}
> <distributed-cache name="DEVICES" owners="2" segments="256" mode="SYNC">
>     <state-transfer await-initial-transfer="true" enabled="true" timeout="2400000" chunk-size="2048"/>
>     <partition-handling when-split="ALLOW_READ_WRITES" merge-policy="PREFERRED_ALWAYS"/>
>     <memory>
>         <object size="300000" strategy="REMOVE"/>
>     </memory>
>     <rocksdb-store preload="true" path="/data/rocksdb/devices/data">
>         <expiration path="/data/rocksdb/devices/expired"/>
>     </rocksdb-store>
>     <indexing index="LOCAL">
>         <property name="default.indexmanager">org.infinispan.query.indexmanager.InfinispanIndexManager</property>
>         <property name="default.directory_provider">infinispan</property>
>         <property name="default.worker.execution">async</property>
>         <property name="default.index_flush_interval">500</property>
>         <property name="default.indexwriter.merge_factor">30</property>
>         <property name="default.indexwriter.merge_max_size">1024</property>
>         <property name="default.indexwriter.ram_buffer_size">256</property>
>         <property name="default.locking_cachename">LuceneIndexesLocking_devices</property>
>         <property name="default.data_cachename">LuceneIndexesData_devices</property>
>         <property name="default.metadata_cachename">LuceneIndexesMetadata_devices</property>
>     </indexing>
>     <expiration max-idle="172800000"/>
> </distributed-cache>
> {code}
> The cache contains 70 000 elements.
