Tristan Tarrant updated ISPN-6239:
----------------------------------
Status: Resolved (was: Pull Request Sent)
Fix Version/s: 8.2.0.Final
Resolution: Done
InitialClusterSizeTest.testInitialClusterSizeFail random failures
-----------------------------------------------------------------
Key: ISPN-6239
URL: https://issues.jboss.org/browse/ISPN-6239
Project: Infinispan
Issue Type: Bug
Components: Test Suite - Core
Affects Versions: 8.2.0.Beta2
Reporter: Dan Berindei
Assignee: Dan Berindei
Labels: testsuite_failure
Fix For: 8.2.0.CR1, 8.2.0.Final
The test starts 3 nodes concurrently but configures Infinispan to wait for a cluster of
4 nodes, and expects the nodes to fail to start within {{initialClusterTimeout}} + 1
second.
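As a rough illustration of that expectation, the plain-Java sketch below models each node as a thread that waits on a shared latch sized for the expected cluster. It does not use Infinispan or JGroups; the class name {{InitialClusterSizeSketch}}, the method {{startNodes}}, and the 500 ms timeout are assumptions made up for this example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class InitialClusterSizeSketch {
    // Hypothetical model: `starting` nodes announce themselves, then each waits
    // until `expected` nodes have announced or the timeout expires.
    // Returns true only if every node saw the full cluster in time.
    static boolean startNodes(int starting, int expected, long timeoutMillis) {
        CountDownLatch arrived = new CountDownLatch(expected);
        boolean[] formed = new boolean[starting];
        Thread[] nodes = new Thread[starting];
        for (int i = 0; i < starting; i++) {
            final int id = i;
            nodes[i] = new Thread(() -> {
                arrived.countDown(); // this node has joined
                try {
                    formed[id] = arrived.await(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            nodes[i].start();
        }
        try {
            for (Thread t : nodes) {
                t.join(); // join() makes the writes to formed[] visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        for (boolean f : formed) {
            if (!f) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long timeoutMillis = 500; // stands in for initialClusterTimeout
        long begin = System.nanoTime();
        boolean formed = startNodes(3, 4, timeoutMillis);
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - begin);
        // The test expects: not formed, and elapsed within timeout + 1 second.
        System.out.println("formed=" + formed + ", elapsedMillis=" + elapsedMillis);
    }
}
```

In this model the failure arrives right after the timeout. The real test only stays within its bound if nothing else, such as a slow discovery round, delays startup.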
However, because of a bug in {{TEST_PING}}, the first 2 nodes each see the other as
the coordinator and send each other a {{JOIN}} request, and it takes 3 seconds to recover
and start the cluster properly.
The bug in {{TEST_PING}} is actually a hack introduced for {{ISPN-5106}}. The problem was
that the first node (A) to start would install a view with itself as the single node, but
the second node to start (B) would start immediately, and the discovery request from B
would reach B's {{TEST_PING}} before it saw the view. That way, B could choose itself
as the coordinator based on the order of A's and B's UUIDs, and the cluster would
start as 2 partitions. Since most of our tests actually remove {{MERGE3}} from the
protocol stack, the partitions would never merge and the test would fail with a timeout.
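The original race can be sketched in plain Java. This is not the {{TEST_PING}} or JGroups code: the class {{UuidOrderSketch}}, the {{Response}} type, and both methods are hypothetical, and the rule "no coordinator in the responses, so pick the lowest address" is an assumption based on the UUID ordering mentioned above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.UUID;

public class UuidOrderSketch {
    /** Hypothetical discovery response: who answered, and whether it reported itself as coordinator. */
    static final class Response {
        final UUID sender;
        final boolean coordinator;

        Response(UUID sender, boolean coordinator) {
            this.sender = sender;
            this.coordinator = coordinator;
        }
    }

    /** Assumed rule: trust a reported coordinator, otherwise pick the lowest UUID among all known nodes. */
    static UUID chooseCoordinator(UUID self, List<Response> responses) {
        TreeSet<UUID> candidates = new TreeSet<>();
        candidates.add(self);
        for (Response r : responses) {
            if (r.coordinator) {
                return r.sender;
            }
            candidates.add(r.sender);
        }
        return candidates.first();
    }

    /** The coordinator B settles on, given whether A's response already reflects A's single-node view. */
    static UUID coordinatorSeenByB(UUID a, UUID b, boolean aReportsCoordinator) {
        return chooseCoordinator(b, Arrays.asList(new Response(a, aReportsCoordinator)));
    }

    public static void main(String[] args) {
        UUID a = new UUID(0L, 2L);
        UUID b = new UUID(0L, 1L); // B sorts before A

        // Discovery raced the view: A does not report itself as coordinator,
        // so B picks itself while A already leads its own single-node view.
        System.out.println(coordinatorSeenByB(a, b, false).equals(b)); // true

        // Discovery after the view was seen: B joins A and there is one partition.
        System.out.println(coordinatorSeenByB(a, b, true).equals(a)); // true
    }
}
```

When B's UUID happens to sort after A's, the race is harmless in this model, which fits the failure depending on UUID order.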
I fixed this in {{TEST_PING}} by assuming that the sender of the first discovery response
is a coordinator when there is only a single response. This worked because all but a few
tests start their managers sequentially; however, it sometimes introduces this 3-second
delay when nodes start in parallel.
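A minimal sketch of why the single-response heuristic works for sequential starts but backfires for parallel ones. The class {{SingleResponseSketch}} and its methods are hypothetical, nodes are identified by plain strings instead of JGroups addresses, and the multi-responder fallback is an assumption added only to make the method total.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SingleResponseSketch {
    /** Returns the node that `self` treats as coordinator (itself if nobody responded). */
    static String joinTarget(String self, List<String> responders) {
        if (responders.isEmpty()) {
            return self; // alone: become coordinator
        }
        if (responders.size() == 1) {
            return responders.get(0); // heuristic: the single responder is the coordinator
        }
        // Assumed fallback for several responders: lowest identifier wins.
        String lowest = self;
        for (String r : responders) {
            if (r.compareTo(lowest) < 0) {
                lowest = r;
            }
        }
        return lowest;
    }

    /** True if x and y would each send JOIN to the other, leaving no coordinator to answer. */
    static boolean mutualJoin(String x, List<String> xSees, String y, List<String> ySees) {
        return joinTarget(x, xSees).equals(y) && joinTarget(y, ySees).equals(x);
    }

    public static void main(String[] args) {
        List<String> nobody = Collections.emptyList();

        // Sequential start: A finds nobody and leads; B then finds only A and joins it.
        System.out.println(mutualJoin("A", nobody, "B", Arrays.asList("A"))); // false

        // Parallel start: each node's only response comes from the other one.
        System.out.println(mutualJoin("A", Arrays.asList("B"), "B", Arrays.asList("A"))); // true
    }
}
```

In the parallel case neither node acts as coordinator, so both {{JOIN}} requests go unanswered until the nodes retry, which corresponds to the 3-second delay described above.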