[JBoss JIRA] (ISPN-9483) TEST_PING doesn't trigger merge after JGroups 4.0.13 upgrade
by Diego Lovison (Jira)
[ https://issues.jboss.org/browse/ISPN-9483?page=com.atlassian.jira.plugin.... ]
Diego Lovison closed ISPN-9483.
-------------------------------
> TEST_PING doesn't trigger merge after JGroups 4.0.13 upgrade
> ------------------------------------------------------------
>
> Key: ISPN-9483
> URL: https://issues.jboss.org/browse/ISPN-9483
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 9.4.0.CR1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Labels: on-hold, testsuite_stability
> Fix For: 9.4.0.CR3
>
>
> In JGroups 4.0.13.Final, {{MERGE3}} started using the {{ASYNC_DISCOVERY_EVENT}} to find other members. {{TEST_PING}} doesn't handle the event correctly, at least when trace logging is enabled, and the merge never happens.
> {{Discovery}} should handle the new event automatically, but it only works if the discovery protocol actively sends out {{GET_MBRS_REQ}} messages and receives {{GET_MBRS_RSP}} messages from other members. {{TEST_PING}} doesn't receive any {{GET_MBRS_RSP}} messages, so {{Discovery.addResponse()}} is never called.
> This causes failures in all the tests that split the cluster and heal it, but for some reason CI isn't reporting the failures:
> {noformat}
> [OK: 70, KO: 1, SKIP: 0] Test failed: org.infinispan.distribution.rehash.RehashAfterPartitionMergeTest.testCachePartition[DIST_SYNC]
> java.lang.RuntimeException: Timed out before caches had changed views ([[RehashAfterPartitionMergeTest[DIST_SYNC]-NodeB-45390], [RehashAfterPartitionMergeTest[DIST_SYNC]-NodeD-46782]]) to contain 2 members
> at org.infinispan.test.TestingUtil.blockUntilViewsChanged(TestingUtil.java:761)
> at org.infinispan.test.TestingUtil.blockUntilViewsChanged(TestingUtil.java:743)
> at org.infinispan.distribution.rehash.RehashAfterPartitionMergeTest.testCachePartition(RehashAfterPartitionMergeTest.java:67)
> {noformat}
> https://ci.infinispan.org/job/Infinispan/job/master/808/consoleFull
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
7 years, 4 months
[JBoss JIRA] (ISPN-9483) TEST_PING doesn't trigger merge after JGroups 4.0.13 upgrade
by Diego Lovison (Jira)
[ https://issues.jboss.org/browse/ISPN-9483?page=com.atlassian.jira.plugin.... ]
Diego Lovison updated ISPN-9483:
--------------------------------
Tester: Diego Lovison
> TEST_PING doesn't trigger merge after JGroups 4.0.13 upgrade
> ------------------------------------------------------------
>
> Key: ISPN-9483
> URL: https://issues.jboss.org/browse/ISPN-9483
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 9.4.0.CR1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Labels: on-hold, testsuite_stability
> Fix For: 9.4.0.CR3
>
>
> In JGroups 4.0.13.Final, {{MERGE3}} started using the {{ASYNC_DISCOVERY_EVENT}} to find other members. {{TEST_PING}} doesn't handle the event correctly, at least when trace logging is enabled, and the merge never happens.
> {{Discovery}} should handle the new event automatically, but it only works if the discovery protocol actively sends out {{GET_MBRS_REQ}} messages and receives {{GET_MBRS_RSP}} messages from other members. {{TEST_PING}} doesn't receive any {{GET_MBRS_RSP}} messages, so {{Discovery.addResponse()}} is never called.
> This causes failures in all the tests that split the cluster and heal it, but for some reason CI isn't reporting the failures:
> {noformat}
> [OK: 70, KO: 1, SKIP: 0] Test failed: org.infinispan.distribution.rehash.RehashAfterPartitionMergeTest.testCachePartition[DIST_SYNC]
> java.lang.RuntimeException: Timed out before caches had changed views ([[RehashAfterPartitionMergeTest[DIST_SYNC]-NodeB-45390], [RehashAfterPartitionMergeTest[DIST_SYNC]-NodeD-46782]]) to contain 2 members
> at org.infinispan.test.TestingUtil.blockUntilViewsChanged(TestingUtil.java:761)
> at org.infinispan.test.TestingUtil.blockUntilViewsChanged(TestingUtil.java:743)
> at org.infinispan.distribution.rehash.RehashAfterPartitionMergeTest.testCachePartition(RehashAfterPartitionMergeTest.java:67)
> {noformat}
> https://ci.infinispan.org/job/Infinispan/job/master/808/consoleFull
[JBoss JIRA] (ISPN-9762) Cache hangs during rebalancing
by Sergey Chernolyas (Jira)
[ https://issues.jboss.org/browse/ISPN-9762?page=com.atlassian.jira.plugin.... ]
Sergey Chernolyas commented on ISPN-9762:
-----------------------------------------
Link to trace log https://yadi.sk/d/5OW7GafEHvXdWQ
> Cache hangs during rebalancing
> ------------------------------
>
> Key: ISPN-9762
> URL: https://issues.jboss.org/browse/ISPN-9762
> Project: Infinispan
> Issue Type: Bug
> Affects Versions: 9.4.2.Final
> Reporter: Sergey Chernolyas
> Priority: Blocker
> Attachments: hang_node.txt, normal_node.txt, stat_bad_node.png, stat_good_node.png
>
>
> I have a cluster with two nodes. One node starts without problems; the second node hangs while rebalancing the DEVICES cache.
> Configuration of the cache:
> {code:xml}
> <distributed-cache name="DEVICES" owners="2" segments="256" mode="SYNC">
>     <state-transfer await-initial-transfer="true" enabled="true" timeout="2400000" chunk-size="2048"/>
>     <partition-handling when-split="ALLOW_READ_WRITES" merge-policy="PREFERRED_ALWAYS"/>
>     <memory>
>         <object size="300000" strategy="REMOVE"/>
>     </memory>
>     <rocksdb-store preload="true" path="/data/rocksdb/devices/data">
>         <expiration path="/data/rocksdb/devices/expired"/>
>     </rocksdb-store>
>     <indexing index="LOCAL">
>         <property name="default.indexmanager">org.infinispan.query.indexmanager.InfinispanIndexManager</property>
>         <property name="default.directory_provider">infinispan</property>
>         <property name="default.worker.execution">async</property>
>         <property name="default.index_flush_interval">500</property>
>         <property name="default.indexwriter.merge_factor">30</property>
>         <property name="default.indexwriter.merge_max_size">1024</property>
>         <property name="default.indexwriter.ram_buffer_size">256</property>
>         <property name="default.locking_cachename">LuceneIndexesLocking_devices</property>
>         <property name="default.data_cachename">LuceneIndexesData_devices</property>
>         <property name="default.metadata_cachename">LuceneIndexesMetadata_devices</property>
>     </indexing>
>     <expiration max-idle="172800000"/>
> </distributed-cache>
> {code}
> The cache contains 70 000 elements.
[JBoss JIRA] (ISPN-9764) Server should utilise the Persistence element
by Ryan Emerson (Jira)
Ryan Emerson created ISPN-9764:
----------------------------------
Summary: Server should utilise the Persistence element
Key: ISPN-9764
URL: https://issues.jboss.org/browse/ISPN-9764
Project: Infinispan
Issue Type: Bug
Components: Server
Affects Versions: 9.4.2.Final
Reporter: Ryan Emerson
Assignee: Ryan Emerson
Fix For: 10.0.0.Alpha2, 9.4.3.Final
Currently the <persistence> element is not required to configure stores in the server. We need to add it so that we can configure PersistenceManager characteristics such as store-availability. Furthermore, with the current config it's possible for two store configurations to cancel out each other's passivation behaviour if one has passivation=true and the other passivation=false.
[JBoss JIRA] (ISPN-9496) Some xsite tests hang during teardown
by Diego Lovison (Jira)
[ https://issues.jboss.org/browse/ISPN-9496?page=com.atlassian.jira.plugin.... ]
Diego Lovison closed ISPN-9496.
-------------------------------
> Some xsite tests hang during teardown
> -------------------------------------
>
> Key: ISPN-9496
> URL: https://issues.jboss.org/browse/ISPN-9496
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 9.4.0.CR1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Labels: on-hold, testsuite_stability
> Fix For: 9.4.0.CR3
>
>
> {noformat}
> Test org.infinispan.xsite.statetransfer.failures.RetryMechanismTest.clearContent has been running for more than 300 seconds. Interrupting the test thread and dumping thread stacks of the test suite process and its children.
> Test org.infinispan.xsite.CacheOperationsTest.destroy has been running for more than 300 seconds. Interrupting the test thread and dumping thread stacks of the test suite process and its children.
> ...
> Killed processes 16913
> The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
> Error occurred in starting fork, check output in log
> Process Exit Code: 143
> Crashed tests:
> org.infinispan.eviction.impl.ExceptionEvictionTest
> org.infinispan.statetransfer.ClusterTopologyManagerTest
> org.infinispan.stream.LocalStreamOffHeapTest
> {noformat}
> The timeouts are very likely caused by the JGRP-2277 changes. Most of our tests run without any FD* protocol to avoid creating an extra socket + thread, so when the coordinator leaves, the 2nd node *must* receive the leave message from the coordinator or it will never install a view with itself as the coordinator.
> This dependency still existed before JGRP-2277, but it appears the view message sent by the coordinator before leaving was somehow more likely to reach the 2nd node than the new leave message.
> The "crashed tests" list only includes tests that we know take a very long time to run, so I am assuming that they're not relevant. And unfortunately the mechanism to interrupt long tests still isn't working as it should, the thread dumps are not included in the artifacts.
[JBoss JIRA] (ISPN-9762) Cache hangs during rebalancing
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-9762?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-9762:
------------------------------------
[~ryanemerson] looks like there's no way to set the availability interval in the server, as the {{<persistence>}} element is missing.
> Cache hangs during rebalancing
> ------------------------------
>
> Key: ISPN-9762
> URL: https://issues.jboss.org/browse/ISPN-9762
> Project: Infinispan
> Issue Type: Bug
> Affects Versions: 9.4.2.Final
> Reporter: Sergey Chernolyas
> Priority: Blocker
> Attachments: hang_node.txt, normal_node.txt, stat_bad_node.png, stat_good_node.png
>
>
> I have a cluster with two nodes. One node starts without problems; the second node hangs while rebalancing the DEVICES cache.
> Configuration of the cache:
> {code:xml}
> <distributed-cache name="DEVICES" owners="2" segments="256" mode="SYNC">
>     <state-transfer await-initial-transfer="true" enabled="true" timeout="2400000" chunk-size="2048"/>
>     <partition-handling when-split="ALLOW_READ_WRITES" merge-policy="PREFERRED_ALWAYS"/>
>     <memory>
>         <object size="300000" strategy="REMOVE"/>
>     </memory>
>     <rocksdb-store preload="true" path="/data/rocksdb/devices/data">
>         <expiration path="/data/rocksdb/devices/expired"/>
>     </rocksdb-store>
>     <indexing index="LOCAL">
>         <property name="default.indexmanager">org.infinispan.query.indexmanager.InfinispanIndexManager</property>
>         <property name="default.directory_provider">infinispan</property>
>         <property name="default.worker.execution">async</property>
>         <property name="default.index_flush_interval">500</property>
>         <property name="default.indexwriter.merge_factor">30</property>
>         <property name="default.indexwriter.merge_max_size">1024</property>
>         <property name="default.indexwriter.ram_buffer_size">256</property>
>         <property name="default.locking_cachename">LuceneIndexesLocking_devices</property>
>         <property name="default.data_cachename">LuceneIndexesData_devices</property>
>         <property name="default.metadata_cachename">LuceneIndexesMetadata_devices</property>
>     </indexing>
>     <expiration max-idle="172800000"/>
> </distributed-cache>
> {code}
> The cache contains 70 000 elements.
[JBoss JIRA] (ISPN-9762) Cache hangs during rebalancing
by Dan Berindei (Jira)
[ https://issues.jboss.org/browse/ISPN-9762?page=com.atlassian.jira.plugin.... ]
Dan Berindei commented on ISPN-9762:
------------------------------------
Looks like the problem is the {{storesMutex}}, which is a read-write lock that favors writers over readers. Once a writer is queued, no new readers are allowed to acquire the lock until all the current read-lock holders finish their work, so a reader cannot rely on another reader being able to run in parallel.
This is exactly what happens in the RocksDB store, which uses a blocking queue to write expiration metadata to a separate DB. Insertion threads write to the blocking queue while holding {{storesMutex.readLock}} and assume that the purge thread can acquire {{storesMutex.readLock}} in parallel and drain the queue. Once the availability check thread tries to acquire {{storesMutex.writeLock}}, everything stops.
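The writer-preference hazard can be reproduced with a plain fair {{ReentrantReadWriteLock}}; this is only an illustrative sketch of the semantics described above, not Infinispan's actual {{storesMutex}} code:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriterPreferenceDemo {
    public static void main(String[] args) throws InterruptedException {
        // Fair mode: once a writer is queued, new readers wait behind it.
        ReentrantReadWriteLock storesMutex = new ReentrantReadWriteLock(true);

        storesMutex.readLock().lock(); // the "insertion thread" holds the read lock

        Thread availabilityCheck = new Thread(() -> {
            storesMutex.writeLock().lock();   // blocks until the reader releases
            storesMutex.writeLock().unlock();
        });
        availabilityCheck.start();
        Thread.sleep(200); // give the "availability check" time to enqueue on the lock

        // The "purge thread" can no longer take the read lock in parallel:
        // the timed tryLock honours fairness and waits behind the queued writer.
        boolean purgeGotIn = storesMutex.readLock().tryLock(100, TimeUnit.MILLISECONDS);
        System.out.println("purge thread acquired read lock: " + purgeGotIn);
        if (purgeGotIn) {
            throw new AssertionError("expected the queued writer to block new readers");
        }

        storesMutex.readLock().unlock(); // only now can the writer proceed
        availabilityCheck.join();
    }
}
```

If the first reader were itself waiting for the purge thread (as the insertion threads do with a full blocking queue), nobody could ever make progress.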
Luckily, {{storesMutex.writeLock}} is only used during startup, when disabling a store, and when doing an availability check. Setting {{<persistence availability-interval="111000">}} should effectively disable the availability check and work around the issue.
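As a configuration sketch of that workaround (the cache definition and the 111000 ms value come from this thread; the placement of {{availability-interval}} on {{<persistence>}} is assumed from the embedded 9.4 schema):

```xml
<distributed-cache name="DEVICES" owners="2" segments="256" mode="SYNC">
    <!-- A very long availability-interval effectively disables the periodic
         availability check, so nothing ever queues for storesMutex.writeLock -->
    <persistence availability-interval="111000">
        <rocksdb-store preload="true" path="/data/rocksdb/devices/data">
            <expiration path="/data/rocksdb/devices/expired"/>
        </rocksdb-store>
    </persistence>
</distributed-cache>
```

Note that, per the comment above and ISPN-9764, the server configuration does not yet expose the {{<persistence>}} element, so this workaround only applies to embedded mode.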
[~ryanemerson] [~william.burns] we need to consider this for non-blocking stores as well; maybe we can find an alternative that doesn't block the insertion threads while the availability check runs? Ideally I'd also like to move the expiration metadata into the main RocksDB database and remove the blocking queue.
> Cache hangs during rebalancing
> ------------------------------
>
> Key: ISPN-9762
> URL: https://issues.jboss.org/browse/ISPN-9762
> Project: Infinispan
> Issue Type: Bug
> Affects Versions: 9.4.2.Final
> Reporter: Sergey Chernolyas
> Priority: Blocker
> Attachments: hang_node.txt, normal_node.txt, stat_bad_node.png, stat_good_node.png
>
>
> I have a cluster with two nodes. One node starts without problems; the second node hangs while rebalancing the DEVICES cache.
> Configuration of the cache:
> {code:xml}
> <distributed-cache name="DEVICES" owners="2" segments="256" mode="SYNC">
>     <state-transfer await-initial-transfer="true" enabled="true" timeout="2400000" chunk-size="2048"/>
>     <partition-handling when-split="ALLOW_READ_WRITES" merge-policy="PREFERRED_ALWAYS"/>
>     <memory>
>         <object size="300000" strategy="REMOVE"/>
>     </memory>
>     <rocksdb-store preload="true" path="/data/rocksdb/devices/data">
>         <expiration path="/data/rocksdb/devices/expired"/>
>     </rocksdb-store>
>     <indexing index="LOCAL">
>         <property name="default.indexmanager">org.infinispan.query.indexmanager.InfinispanIndexManager</property>
>         <property name="default.directory_provider">infinispan</property>
>         <property name="default.worker.execution">async</property>
>         <property name="default.index_flush_interval">500</property>
>         <property name="default.indexwriter.merge_factor">30</property>
>         <property name="default.indexwriter.merge_max_size">1024</property>
>         <property name="default.indexwriter.ram_buffer_size">256</property>
>         <property name="default.locking_cachename">LuceneIndexesLocking_devices</property>
>         <property name="default.data_cachename">LuceneIndexesData_devices</property>
>         <property name="default.metadata_cachename">LuceneIndexesMetadata_devices</property>
>     </indexing>
>     <expiration max-idle="172800000"/>
> </distributed-cache>
> {code}
> The cache contains 70 000 elements.
[JBoss JIRA] (ISPN-9496) Some xsite tests hang during teardown
by Diego Lovison (Jira)
[ https://issues.jboss.org/browse/ISPN-9496?page=com.atlassian.jira.plugin.... ]
Diego Lovison updated ISPN-9496:
--------------------------------
Tester: Diego Lovison
> Some xsite tests hang during teardown
> -------------------------------------
>
> Key: ISPN-9496
> URL: https://issues.jboss.org/browse/ISPN-9496
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 9.4.0.CR1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Labels: on-hold, testsuite_stability
> Fix For: 9.4.0.CR3
>
>
> {noformat}
> Test org.infinispan.xsite.statetransfer.failures.RetryMechanismTest.clearContent has been running for more than 300 seconds. Interrupting the test thread and dumping thread stacks of the test suite process and its children.
> Test org.infinispan.xsite.CacheOperationsTest.destroy has been running for more than 300 seconds. Interrupting the test thread and dumping thread stacks of the test suite process and its children.
> ...
> Killed processes 16913
> The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
> Error occurred in starting fork, check output in log
> Process Exit Code: 143
> Crashed tests:
> org.infinispan.eviction.impl.ExceptionEvictionTest
> org.infinispan.statetransfer.ClusterTopologyManagerTest
> org.infinispan.stream.LocalStreamOffHeapTest
> {noformat}
> The timeouts are very likely caused by the JGRP-2277 changes. Most of our tests run without any FD* protocol to avoid creating an extra socket + thread, so when the coordinator leaves, the 2nd node *must* receive the leave message from the coordinator or it will never install a view with itself as the coordinator.
> This dependency still existed before JGRP-2277, but it appears the view message sent by the coordinator before leaving was somehow more likely to reach the 2nd node than the new leave message.
> The "crashed tests" list only includes tests that we know take a very long time to run, so I am assuming that they're not relevant. And unfortunately the mechanism to interrupt long tests still isn't working as it should, the thread dumps are not included in the artifacts.