[infinispan-issues] [JBoss JIRA] (ISPN-6239) InitialClusterSizeTest.testInitialClusterSizeFail random failures
Dan Berindei (JIRA)
issues at jboss.org
Mon Feb 22 04:47:00 EST 2016
[ https://issues.jboss.org/browse/ISPN-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dan Berindei updated ISPN-6239:
-------------------------------
Description:
The test starts 3 nodes concurrently, but configures Infinispan to wait for a cluster of 4 nodes, and expects that the nodes fail to start in {{initialClusterTimeout}} + 1 second.
However, because of a bug in {{TEST_PING}}, the first 2 nodes see each other as coordinator and send a {{JOIN}} request to each other, and it takes 3 seconds to recover and start the cluster properly.
The bug in {{TEST_PING}} is actually a hack introduced for {{ISPN-5106}}. The problem was that the first node (A) to start would install a view with itself as the single node, but the second node to start (B) would start immediately, and the discovery request from B would reach B's {{TEST_PING}} before it saw the view. That way, B could choose itself as the coordinator based on the order of A's and B's UUIDs, and the cluster would start as 2 partitions. Since most of our tests actually remove {{MERGE3}} from the protocol stack, the partitions would never merge and the test would fail with a timeout.
I fixed this in {{TEST_PING}} by assuming that the sender of the first discovery response is a coordinator, when there is a single response. This worked because all but a few tests start their managers sequentially, however it sometimes introduces this 3 seconds delay when nodes start in parallel.
was:
The test starts 3 nodes concurrently, but configures Infinispan to wait for a cluster of 4 nodes, and expects that the nodes fail to start in {{initialClusterTimeout}} + 1 second.
However, because of a bug in {{TEST_PING}}, the first 2 nodes see each other as coordinator and send a {{JOIN}} request to each other, and it takes 3 seconds to recover and start the cluster properly.
The bug in {{TEST_PING}} is actually a hack introduced for {{ISPN-5106}}. The problem was that the first node (A) to start would install a view with itself as the single node, but the second node to start (B) would start immediately, and the discovery request from B would reach B's {{TEST_PING}} before it saw the view. That way, B could choose itself as the coordinator based on the order of A's and B's UUIDs, and the cluster would start as 2 partitions. Since most of our tests actually remove {{MERGE3}} from the protocol stack, the partitions would never merge and the test would fail with a timeout.
I fixed this in {{TEST_PING}} by assuming that the sender of the first discovery response is a coordinator, when there is a single response. This worked because all but a few tests start their managers sequentially, however it sometimes introduces this 3 seconds delay when nodes start in parallel.
The ISPN-5106 fix changed TEST_PING to assume the other node is coordinator, if there was only one ping response (forceCoordSingleMember).
If 2 nodes start up in parallel and each node thinks the other is coordinator, both will try to send a JOIN request to the other and time out after 3 seconds.
Because the timeout happens before we start waiting for the initial members, it's not included in the initialClusterTimeout, and it causes random failures in InitialClusterSizeTest.testInitialClusterSizeFail.
> InitialClusterSizeTest.testInitialClusterSizeFail random failures
> -----------------------------------------------------------------
>
> Key: ISPN-6239
> URL: https://issues.jboss.org/browse/ISPN-6239
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 8.2.0.Beta2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Labels: testsuite_failure
> Fix For: 8.2.0.CR1
>
>
> The test starts 3 nodes concurrently, but configures Infinispan to wait for a cluster of 4 nodes, and expects that the nodes fail to start in {{initialClusterTimeout}} + 1 second.
> However, because of a bug in {{TEST_PING}}, the first 2 nodes see each other as coordinator and send a {{JOIN}} request to each other, and it takes 3 seconds to recover and start the cluster properly.
> The bug in {{TEST_PING}} is actually a hack introduced for {{ISPN-5106}}. The problem was that the first node (A) to start would install a view with itself as the single node, but the second node to start (B) would start immediately, and the discovery request from B would reach B's {{TEST_PING}} before it saw the view. That way, B could choose itself as the coordinator based on the order of A's and B's UUIDs, and the cluster would start as 2 partitions. Since most of our tests actually remove {{MERGE3}} from the protocol stack, the partitions would never merge and the test would fail with a timeout.
> I fixed this in {{TEST_PING}} by assuming that the sender of the first discovery response is a coordinator, when there is a single response. This worked because all but a few tests start their managers sequentially, however it sometimes introduces this 3 seconds delay when nodes start in parallel.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
More information about the infinispan-issues
mailing list