[JBoss JIRA] (ISPN-6239) InitialClusterSizeTest.testInitialClusterSizeFail random failures

Wednesday, 16 March 2016

    [ https://issues.jboss.org/browse/ISPN-6239?page=com.atlassian.jira.plugin.... ]

Dan Berindei commented on ISPN-6239:
------------------------------------

While trying to reproduce the failure on my machine, I found another failure caused by a
concurrency issue in {{TEST_PING}}:

{noformat}
12:44:36,043 TRACE (ForkThread-4,InitialClusterSizeTest:) [TEST_PING] Discoveries for
DiscoveryKey{clusterName='ISPN',
testName='org.infinispan.remoting.transport.InitialClusterSizeTest'} are : {}
12:44:36,043 TRACE (ForkThread-1,InitialClusterSizeTest:) [TEST_PING] Discoveries for
DiscoveryKey{clusterName='ISPN',
testName='org.infinispan.remoting.transport.InitialClusterSizeTest'} are : {}
12:44:36,043 TRACE (ForkThread-1,InitialClusterSizeTest:) [TEST_PING] Add discovery for
NodeA-45697 to cache.  The cache now contains: {NodeD-30921=TEST_PING@NodeD-30921,
NodeA-45697=TEST_PING@NodeA-45697}
12:44:36,043 TRACE (ForkThread-4,InitialClusterSizeTest:) [TEST_PING] Add discovery for
NodeD-30921 to cache.  The cache now contains: {NodeD-30921=TEST_PING@NodeD-30921,
NodeA-45697=TEST_PING@NodeA-45697}
12:44:36,043 TRACE (ForkThread-3,InitialClusterSizeTest:) [TEST_PING] Discoveries for
DiscoveryKey{clusterName='ISPN',
testName='org.infinispan.remoting.transport.InitialClusterSizeTest'} are :
{NodeD-30921=TEST_PING@NodeD-30921, NodeA-45697=TEST_PING@NodeA-45697}
12:44:36,043 TRACE (ForkThread-3,InitialClusterSizeTest:) [TEST_PING] Add discovery for
NodeC-59583 to cache.  The cache now contains: {NodeD-30921=TEST_PING@NodeD-30921,
NodeA-45697=TEST_PING@NodeA-45697, NodeC-59583=TEST_PING@NodeC-59583}
12:44:36,043 TRACE (ForkThread-2,InitialClusterSizeTest:) [TEST_PING] Discoveries for
DiscoveryKey{clusterName='ISPN',
testName='org.infinispan.remoting.transport.InitialClusterSizeTest'} are :
{NodeD-30921=TEST_PING@NodeD-30921, NodeA-45697=TEST_PING@NodeA-45697,
NodeC-59583=TEST_PING@NodeC-59583}
12:44:36,044 TRACE (ForkThread-2,InitialClusterSizeTest:) [TEST_PING] Add discovery for
NodeB-6005 to cache.  The cache now contains: {NodeD-30921=TEST_PING@NodeD-30921,
NodeA-45697=TEST_PING@NodeA-45697, NodeB-6005=TEST_PING@NodeB-6005,
NodeC-59583=TEST_PING@NodeC-59583}
12:44:36,044 TRACE (ForkThread-4,InitialClusterSizeTest:) [GMS] NodeD-30921: discovery
took 2 ms, members: 1 rsps (0 coords) [done]
12:44:36,044 TRACE (ForkThread-4,InitialClusterSizeTest:) [GMS] NodeD-30921: could not
determine coordinator from rsps 1 rsps (0 coords) [done]
12:44:36,045 TRACE (ForkThread-4,InitialClusterSizeTest:) [GMS] NodeD-30921: nodes to
choose new coord from are: [NodeD-30921, NodeA-45697]
12:44:36,045 TRACE (ForkThread-4,InitialClusterSizeTest:) [GMS] NodeD-30921: I
(NodeD-30921) am the first of the nodes, will become coordinator
12:44:36,045 TRACE (ForkThread-2,InitialClusterSizeTest:) [GMS] NodeB-6005: discovery took
3 ms, members: 3 rsps (0 coords) [done]
12:44:36,045 TRACE (ForkThread-2,InitialClusterSizeTest:) [GMS] NodeB-6005: could not
determine coordinator from rsps 3 rsps (0 coords) [done]
12:44:36,045 TRACE (ForkThread-2,InitialClusterSizeTest:) [GMS] NodeB-6005: nodes to
choose new coord from are: [NodeC-59583, NodeD-30921, NodeB-6005, NodeA-45697]
12:44:36,045 TRACE (ForkThread-2,InitialClusterSizeTest:) [GMS] NodeB-6005: I (NodeB-6005)
am not the first of the nodes, waiting for another client to become coordinator
{noformat}

The cluster starts as 2 partitions with NodeB and NodeD as coordinators, and because the
test doesn't use {{TransportFlags.withMerge()}}, the partitions will never merge.
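
The interleaving in the trace (two threads both log an empty discovery map before either one adds itself) looks like a classic check-then-act race. The sketch below is a simplified model of that race and not the actual {{TEST_PING}} or {{GMS}} code: the class {{DiscoveryRaceSketch}}, the {{read:}}/{{add:}} step strings, and the "lowest name wins" election rule are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DiscoveryRaceSketch {

    // Replays a fixed interleaving of "read:<node>" and "add:<node>" steps against
    // a shared registry and returns the nodes that would elect themselves.
    // Hypothetical election rule: a node becomes coordinator when it sorts first
    // among the nodes it saw in its snapshot plus itself.
    static List<String> simulate(List<String> steps) {
        Set<String> registry = new TreeSet<>();
        Map<String, Set<String>> snapshots = new LinkedHashMap<>();
        for (String step : steps) {
            String[] parts = step.split(":");
            String node = parts[1];
            if (parts[0].equals("read")) {
                // The node copies whatever is registered at this instant.
                snapshots.put(node, new TreeSet<>(registry));
            } else {
                // The node registers itself; earlier snapshots stay stale.
                registry.add(node);
            }
        }
        List<String> coordinators = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : snapshots.entrySet()) {
            TreeSet<String> candidates = new TreeSet<>(entry.getValue());
            candidates.add(entry.getKey());
            if (candidates.first().equals(entry.getKey())) {
                coordinators.add(entry.getKey());
            }
        }
        return coordinators;
    }

    public static void main(String[] args) {
        // Sequential start: B reads after A has registered.
        System.out.println(simulate(Arrays.asList("read:A", "add:A", "read:B", "add:B")));
        // Parallel start: both nodes read before either one registers.
        System.out.println(simulate(Arrays.asList("read:A", "read:B", "add:A", "add:B")));
    }
}
```

In this model the sequential interleaving yields a single coordinator, while the parallel one yields two self-elected coordinators, i.e. two partitions that stay apart unless a merge protocol is enabled.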

...
 InitialClusterSizeTest.testInitialClusterSizeFail random failures
 -----------------------------------------------------------------

                 Key: ISPN-6239
                 URL: https://issues.jboss.org/browse/ISPN-6239
             Project: Infinispan
          Issue Type: Bug
          Components: Test Suite - Core
    Affects Versions: 8.2.0.Beta2
            Reporter: Dan Berindei
            Assignee: Dan Berindei
              Labels: testsuite_failure
             Fix For: 8.2.0.CR1, 8.2.0.Final

 The test starts 3 nodes concurrently, but configures Infinispan to wait for a cluster of
4 nodes, and expects the nodes to fail to start within {{initialClusterTimeout}} + 1
second.
 However, because of a bug in {{TEST_PING}}, the first 2 nodes see each other as
coordinator and send a {{JOIN}} request to each other, and it takes 3 seconds to recover
and start the cluster properly.
 The bug in {{TEST_PING}} is actually a hack introduced for {{ISPN-5106}}. The problem was
that the first node (A) to start would install a view with itself as the single node, but
the second node to start (B) would start immediately, and the discovery request from B
would reach B's {{TEST_PING}} before it saw the view. That way, B could choose itself
as the coordinator based on the order of A's and B's UUIDs, and the cluster would
start as 2 partitions. Since most of our tests actually remove {{MERGE3}} from the
protocol stack, the partitions would never merge and the test would fail with a timeout.
 I fixed this in {{TEST_PING}} by assuming that, when there is a single discovery
response, its sender is the coordinator. This worked because all but a few tests start
their managers sequentially. However, it sometimes introduces this 3-second delay when
nodes start in parallel.
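
The effect of that heuristic can be sketched as follows. This is a toy model under stated assumptions, not the {{TEST_PING}} implementation: the class {{SingleResponseHeuristicSketch}} and its methods are hypothetical, and "who responded to whom" is passed in as a plain map.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class SingleResponseHeuristicSketch {

    // Heuristic from the description: with exactly one discovery response,
    // treat its sender as the coordinator and send the JOIN request there.
    // An empty result means the heuristic does not apply.
    static Optional<String> joinTarget(List<String> responders) {
        return responders.size() == 1
                ? Optional.of(responders.get(0))
                : Optional.empty();
    }

    // True when two nodes would each send their JOIN request to the other,
    // so neither of them is actually acting as coordinator.
    static boolean hasMutualJoin(Map<String, List<String>> respondersByNode) {
        for (Map.Entry<String, List<String>> entry : respondersByNode.entrySet()) {
            Optional<String> target = joinTarget(entry.getValue());
            if (!target.isPresent()) {
                continue;
            }
            List<String> targetResponders = respondersByNode.get(target.get());
            if (targetResponders != null
                    && joinTarget(targetResponders).equals(Optional.of(entry.getKey()))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sequential start: A starts alone, B then discovers only A.
        Map<String, List<String>> sequential = new HashMap<>();
        sequential.put("A", Collections.<String>emptyList());
        sequential.put("B", Arrays.asList("A"));
        System.out.println("sequential mutual JOIN: " + hasMutualJoin(sequential));

        // Parallel start: A and B each receive a single response, from the other.
        Map<String, List<String>> parallel = new HashMap<>();
        parallel.put("A", Arrays.asList("B"));
        parallel.put("B", Arrays.asList("A"));
        System.out.println("parallel mutual JOIN: " + hasMutualJoin(parallel));
    }
}
```

With a sequential start the second node joins a node that really is the coordinator. With a parallel start the first two nodes can each pick the other, which matches the mutual {{JOIN}} requests described above.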

--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
