[infinispan-issues] [JBoss JIRA] (ISPN-6239) InitialClusterSizeTest.testInitialClusterSizeFail random failures

Wed Mar 16 07:56:00 EDT 2016

     [ https://issues.jboss.org/browse/ISPN-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Berindei reopened ISPN-6239:
--------------------------------

The test is still failing randomly:

{noformat}
10:17:52,124 DEBUG (ForkThread-1,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeF-44973: installing view [InitialClusterSizeTest-NodeF-44973|0] (1) [InitialClusterSizeTest-NodeF-44973]
10:17:52,124 INFO  (ForkThread-1,InitialClusterSizeTest) [JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [InitialClusterSizeTest-NodeF-44973|0] (1) [InitialClusterSizeTest-NodeF-44973]
10:17:52,126 DEBUG (ForkThread-1,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeF-44973: created cluster (first member). My view is [InitialClusterSizeTest-NodeF-44973|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
10:17:52,215 DEBUG (ForkThread-4,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeE-6608: sending JOIN(InitialClusterSizeTest-NodeE-6608) to InitialClusterSizeTest-NodeF-44973
10:17:52,215 DEBUG (ForkThread-2,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeG-58252: sending JOIN(InitialClusterSizeTest-NodeG-58252) to InitialClusterSizeTest-NodeF-44973
10:17:52,762 DEBUG (Incoming-1,InitialClusterSizeTest-NodeF-44973) [GMS] InitialClusterSizeTest-NodeF-44973: installing view [InitialClusterSizeTest-NodeF-44973|1] (3) [InitialClusterSizeTest-NodeF-44973, InitialClusterSizeTest-NodeE-6608, InitialClusterSizeTest-NodeG-58252]
10:17:53,261 DEBUG (ForkThread-4,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeE-6608: installing view [InitialClusterSizeTest-NodeF-44973|1] (3) [InitialClusterSizeTest-NodeF-44973, InitialClusterSizeTest-NodeE-6608, InitialClusterSizeTest-NodeG-58252]
10:17:53,355 DEBUG (ForkThread-2,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeG-58252: installing view [InitialClusterSizeTest-NodeF-44973|1] (3) [InitialClusterSizeTest-NodeF-44973, InitialClusterSizeTest-NodeE-6608, InitialClusterSizeTest-NodeG-58252]
10:17:57,736 ERROR (testng-InitialClusterSizeTest) [UnitTestTestNGListener] Test testInitialClusterSizeFail(org.infinispan.remoting.transport.InitialClusterSizeTest) failed.
org.testng.TestException: 
Expected exception org.infinispan.commons.CacheException but got java.util.concurrent.TimeoutException
{noformat}

Here {{TEST_PING}} seems to be working fine, but it still takes more than 1 second for all the nodes to receive their initial view. Because the {{initialClusterTimeout}} countdown starts only after a node installs its initial view, the nodes that receive the view late also time out more than 1 second after the first node. The test expects every node to time out within {{initialClusterTimeout + 1s}}, so it gives up first and fails with a {{TimeoutException}} instead of the expected {{CacheException}}.
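
To make the arithmetic concrete, here is a minimal sketch of the timing argument. It is not the test's code: the class and method names are made up, the 5 second timeout is a hypothetical value (the real {{initialClusterTimeout}} is not shown in this message), and the view delays are taken from the log above, measured from NodeF's first view at 10:17:52,124 as a stand-in for the test start.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InitialClusterTimingSketch {
    // Hypothetical value: the real initialClusterTimeout is not shown in the log.
    static final long INITIAL_CLUSTER_TIMEOUT_MS = 5000;
    // The test waits initialClusterTimeout + 1s for every node to fail.
    static final long TEST_SLACK_MS = 1000;

    /**
     * A node starts counting initialClusterTimeout only once it has its initial view,
     * so it fails at (viewDelay + timeout), while the test gives up at (timeout + slack).
     */
    static boolean failsBeforeTestGivesUp(long viewDelayMs) {
        long nodeFailsAtMs = viewDelayMs + INITIAL_CLUSTER_TIMEOUT_MS;
        long testGivesUpAtMs = INITIAL_CLUSTER_TIMEOUT_MS + TEST_SLACK_MS;
        return nodeFailsAtMs <= testGivesUpAtMs;
    }

    public static void main(String[] args) {
        // Delays relative to NodeF's first view (10:17:52,124), read off the log.
        Map<String, Long> viewDelaysMs = new LinkedHashMap<>();
        viewDelaysMs.put("NodeF", 0L);     // 10:17:52,124
        viewDelaysMs.put("NodeE", 1137L);  // 10:17:53,261
        viewDelaysMs.put("NodeG", 1231L);  // 10:17:53,355

        for (Map.Entry<String, Long> e : viewDelaysMs.entrySet()) {
            System.out.println(e.getKey() + " fails in time: "
                    + failsBeforeTestGivesUp(e.getValue()));
        }
    }
}
```

Under these assumptions only NodeF fails in time. The check reduces to {{viewDelay <= 1s}} whatever timeout value is chosen, so NodeE and NodeG would miss the window with any {{initialClusterTimeout}}.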

> InitialClusterSizeTest.testInitialClusterSizeFail random failures
> -----------------------------------------------------------------
>
>                 Key: ISPN-6239
>                 URL: https://issues.jboss.org/browse/ISPN-6239
>             Project: Infinispan
>          Issue Type: Bug
>          Components: Test Suite - Core
>    Affects Versions: 8.2.0.Beta2
>            Reporter: Dan Berindei
>            Assignee: Dan Berindei
>              Labels: testsuite_failure
>             Fix For: 8.2.0.CR1, 8.2.0.Final
>
>
> The test starts 3 nodes concurrently, but configures Infinispan to wait for a cluster of 4 nodes, and expects the nodes to fail to start within {{initialClusterTimeout}} + 1 second.
> However, because of a bug in {{TEST_PING}}, the first 2 nodes can each see the other as coordinator and send a {{JOIN}} request to each other, and it then takes 3 seconds to recover and start the cluster properly.
> The bug in {{TEST_PING}} is actually a hack introduced for {{ISPN-5106}}. The problem was that the first node (A) to start would install a view with itself as the single node, but the second node to start (B) would start immediately, and the discovery request from B would reach B's {{TEST_PING}} before it saw the view. That way, B could choose itself as the coordinator based on the order of A's and B's UUIDs, and the cluster would start as 2 partitions. Since most of our tests actually remove {{MERGE3}} from the protocol stack, the partitions would never merge and the test would fail with a timeout.
> I fixed this in {{TEST_PING}} by assuming that, when there is a single discovery response, its sender is the coordinator. This worked because all but a few tests start their managers sequentially. However, it sometimes introduces this 3-second delay when nodes start in parallel.
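
The interaction described in the quoted text can be sketched roughly as follows. This is an illustration only, not the {{TEST_PING}} or JGroups code: the {{Response}} type, the {{pickJoinTarget}} method and the {{singleResponseHack}} flag are hypothetical, and plain string comparison stands in for the UUID ordering.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class CoordinatorGuessSketch {
    /** Hypothetical discovery response: who answered, and whether it already claimed to be coordinator. */
    static final class Response {
        final String sender;
        final boolean coordinator;

        Response(String sender, boolean coordinator) {
            this.sender = sender;
            this.coordinator = coordinator;
        }
    }

    /**
     * Returns the node to send JOIN to, or empty if this node should become coordinator itself.
     * String order stands in for UUID order (an assumption made for this sketch).
     */
    static Optional<String> pickJoinTarget(String self, List<Response> responses,
                                           boolean singleResponseHack) {
        // A responder that already knows it is coordinator always wins.
        for (Response r : responses) {
            if (r.coordinator) {
                return Optional.of(r.sender);
            }
        }
        // The hack: a lone responder is assumed to be the coordinator.
        if (singleResponseHack && responses.size() == 1) {
            return Optional.of(responses.get(0).sender);
        }
        // Otherwise the lowest address among this node and the responders is chosen.
        String lowest = self;
        for (Response r : responses) {
            if (r.sender.compareTo(lowest) < 0) {
                lowest = r.sender;
            }
        }
        return lowest.equals(self) ? Optional.empty() : Optional.of(lowest);
    }

    public static void main(String[] args) {
        // Parallel start: each node sees only the other one, and neither has a view yet.
        List<Response> seenByA = Collections.singletonList(new Response("B", false));
        List<Response> seenByB = Collections.singletonList(new Response("A", false));
        System.out.println("A joins " + pickJoinTarget("A", seenByA, true));
        System.out.println("B joins " + pickJoinTarget("B", seenByB, true));
    }
}
```

With the hack enabled, both nodes in the parallel case pick each other as the {{JOIN}} target, which matches the mutual {{JOIN}} described above. With it disabled, the node with the lower address picks itself even if the other node has already created the cluster, which is the split-start problem the hack was meant to avoid.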
