[infinispan-issues] [JBoss JIRA] (ISPN-6239) InitialClusterSizeTest.testInitialClusterSizeFail random failures
Dan Berindei (JIRA)
issues at jboss.org
Wed Mar 16 07:56:00 EDT 2016
[ https://issues.jboss.org/browse/ISPN-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dan Berindei reopened ISPN-6239:
--------------------------------
The test is still failing randomly:
{noformat}
10:17:52,124 DEBUG (ForkThread-1,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeF-44973: installing view [InitialClusterSizeTest-NodeF-44973|0] (1) [InitialClusterSizeTest-NodeF-44973]
10:17:52,124 INFO (ForkThread-1,InitialClusterSizeTest) [JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [InitialClusterSizeTest-NodeF-44973|0] (1) [InitialClusterSizeTest-NodeF-44973]
10:17:52,126 DEBUG (ForkThread-1,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeF-44973: created cluster (first member). My view is [InitialClusterSizeTest-NodeF-44973|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
10:17:52,215 DEBUG (ForkThread-4,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeE-6608: sending JOIN(InitialClusterSizeTest-NodeE-6608) to InitialClusterSizeTest-NodeF-44973
10:17:52,215 DEBUG (ForkThread-2,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeG-58252: sending JOIN(InitialClusterSizeTest-NodeG-58252) to InitialClusterSizeTest-NodeF-44973
10:17:52,762 DEBUG (Incoming-1,InitialClusterSizeTest-NodeF-44973) [GMS] InitialClusterSizeTest-NodeF-44973: installing view [InitialClusterSizeTest-NodeF-44973|1] (3) [InitialClusterSizeTest-NodeF-44973, InitialClusterSizeTest-NodeE-6608, InitialClusterSizeTest-NodeG-58252]
10:17:53,261 DEBUG (ForkThread-4,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeE-6608: installing view [InitialClusterSizeTest-NodeF-44973|1] (3) [InitialClusterSizeTest-NodeF-44973, InitialClusterSizeTest-NodeE-6608, InitialClusterSizeTest-NodeG-58252]
10:17:53,355 DEBUG (ForkThread-2,InitialClusterSizeTest) [GMS] InitialClusterSizeTest-NodeG-58252: installing view [InitialClusterSizeTest-NodeF-44973|1] (3) [InitialClusterSizeTest-NodeF-44973, InitialClusterSizeTest-NodeE-6608, InitialClusterSizeTest-NodeG-58252]
10:17:57,736 ERROR (testng-InitialClusterSizeTest) [UnitTestTestNGListener] Test testInitialClusterSizeFail(org.infinispan.remoting.transport.InitialClusterSizeTest) failed.
org.testng.TestException:
Expected exception org.infinispan.commons.CacheException but got java.util.concurrent.TimeoutException
{noformat}
Here {{TEST_PING}} seems to be working fine, but it still takes more than 1 second for all the nodes to receive their initial view. Because the {{initialClusterTimeout}} countdown starts only after a node has received its initial view, the nodes that receive it late also time out more than 1 second after the first node. The test expects all the nodes to time out within {{initialClusterTimeout + 1s}}, so it fails.
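The arithmetic can be checked with a small sketch that uses the initial-view timestamps from the log above. This is only an illustration, not code from the test: the class and method names are made up, the test start time is assumed to coincide with the first log line (10:17:52,124), and the 5-second {{initialClusterTimeout}} is a placeholder, since its value cancels out of the result.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper, not part of Infinispan or its test suite.
final class InitialViewDeadlines {
    /**
     * How far past the test's deadline a node times out.
     * A positive result means the node is late and the test fails.
     */
    static Duration lateBy(LocalTime testStart, LocalTime initialView,
                           Duration initialClusterTimeout, Duration slack) {
        // The node's countdown only starts once it has its initial view.
        LocalTime nodeDeadline = initialView.plus(initialClusterTimeout);
        // The test measures from its own start and allows some slack on top.
        LocalTime testDeadline = testStart.plus(initialClusterTimeout).plus(slack);
        return Duration.between(testDeadline, nodeDeadline);
    }

    public static void main(String[] args) {
        LocalTime start = LocalTime.parse("10:17:52.124"); // assumption: test start
        Duration timeout = Duration.ofSeconds(5);          // placeholder, cancels out
        Duration slack = Duration.ofSeconds(1);            // the "+ 1s" in the test
        String[] nodes = {"NodeF", "NodeE", "NodeG"};
        // Time at which each node installed its first view, taken from the log.
        String[] views = {"10:17:52.124", "10:17:53.261", "10:17:53.355"};
        for (int i = 0; i < nodes.length; i++) {
            long ms = lateBy(start, LocalTime.parse(views[i]), timeout, slack).toMillis();
            System.out.println(nodes[i] + ": " + ms + " ms past the test deadline");
        }
    }
}
```

Under these assumptions NodeF finishes 1000 ms before the deadline, while NodeE and NodeG overshoot it by 137 ms and 231 ms respectively, which is enough for the test to see a {{TimeoutException}} instead of the expected {{CacheException}}.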
> InitialClusterSizeTest.testInitialClusterSizeFail random failures
> -----------------------------------------------------------------
>
> Key: ISPN-6239
> URL: https://issues.jboss.org/browse/ISPN-6239
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 8.2.0.Beta2
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Labels: testsuite_failure
> Fix For: 8.2.0.CR1, 8.2.0.Final
>
>
> The test starts 3 nodes concurrently, but configures Infinispan to wait for a cluster of 4 nodes, and expects the nodes to fail to start within {{initialClusterTimeout}} + 1 second.
> However, because of a bug in {{TEST_PING}}, the first 2 nodes see each other as coordinator and send a {{JOIN}} request to each other, and it takes 3 seconds to recover and start the cluster properly.
> The bug in {{TEST_PING}} is actually a hack introduced for {{ISPN-5106}}. The problem was that the first node (A) to start would install a view with itself as the single node, but the second node to start (B) would start immediately, and the discovery request from B would reach B's {{TEST_PING}} before it saw the view. That way, B could choose itself as the coordinator based on the order of A's and B's UUIDs, and the cluster would start as 2 partitions. Since most of our tests actually remove {{MERGE3}} from the protocol stack, the partitions would never merge and the test would fail with a timeout.
> I fixed this in {{TEST_PING}} by assuming that the sender of the first discovery response is a coordinator, when there is a single response. This worked because all but a few tests start their managers sequentially; however, it sometimes introduces this 3-second delay when nodes start in parallel.
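The single-response rule from the description can be illustrated with a toy model. It is a deliberately simplified sketch and not the actual {{TEST_PING}} code: the class, the {{Response}} type, and the {{joinTarget}} method are all hypothetical, and UUID-based election is not modelled.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Toy model only; none of these names exist in JGroups or Infinispan.
final class SingleResponseHeuristic {
    static final class Response {
        final String sender;
        final boolean coordinator;

        Response(String sender, boolean coordinator) {
            this.sender = sender;
            this.coordinator = coordinator;
        }
    }

    /** Picks the node to send JOIN to, given the discovery responses received. */
    static Optional<String> joinTarget(List<Response> responses) {
        // A responder that already knows it is the coordinator always wins.
        for (Response r : responses) {
            if (r.coordinator) {
                return Optional.of(r.sender);
            }
        }
        // The hack: with exactly one response, assume its sender is the coordinator.
        if (responses.size() == 1) {
            return Optional.of(responses.get(0).sender);
        }
        // Otherwise this rule does not decide (regular election is not modelled here).
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Parallel start: A and B each get a single response from the other,
        // and neither has installed a view yet.
        Optional<String> a = joinTarget(Collections.singletonList(new Response("B", false)));
        Optional<String> b = joinTarget(Collections.singletonList(new Response("A", false)));
        System.out.println("A sends JOIN to " + a.orElse("nobody")
                + ", B sends JOIN to " + b.orElse("nobody"));
    }
}
```

In this model a sequential start is harmless, because the only responder really is the coordinator. With a parallel start, each node picks the other, which corresponds to the mutual {{JOIN}} requests described above.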
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)