[JBoss JIRA] (JGRP-2395) LOCAL_PING fails when 2 nodes start at the same time

Monday, 4 November 2019

Dan Berindei created JGRP-2395:
----------------------------------

             Summary: LOCAL_PING fails when 2 nodes start at the same time
                 Key: JGRP-2395
                 URL: https://issues.jboss.org/browse/JGRP-2395
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 4.1.6
            Reporter: Dan Berindei
            Assignee: Bela Ban

We have a test that starts 2 nodes in parallel ({{ConcurrentStartTest}} and it is randomly
failing since we started using {{LOCAL_PING}}.

{noformat}
01:02:11,930 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: discovery
took 3 ms, members: 1 rsps (0 coords) [done]
01:02:11,930 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: discovery
took 3 ms, members: 1 rsps (0 coords) [done]
01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: could not
determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could not
determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to
choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: nodes to
choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: I
(Test-NodeB-43694) am the first of the nodes, will become coordinator
01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: I
(Test-NodeA-29550) am not the first of the nodes, waiting for another client to become
coordinator
01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: discovery
took 0 ms, members: 1 rsps (0 coords) [done]
01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could not
determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to
choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
... 
01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could not
determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to
choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: I
(Test-NodeA-29550) am not the first of the nodes, waiting for another client to become
coordinator
01:02:11,942 WARN  (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: too many
JOIN attempts (10): becoming singleton
01:02:11,942 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550:
installing view [Test-NodeA-29550|0] (1) [Test-NodeA-29550]
01:02:11,977 DEBUG (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: created
cluster (first member). My view is [Test-NodeB-43694|0], impl is
org.jgroups.protocols.pbcast.CoordGmsImpl
01:02:11,977 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: created
cluster (first member). My view is [Test-NodeA-29550|0], impl is
org.jgroups.protocols.pbcast.CoordGmsImpl
{noformat}

The problem seems to be that it takes longer for the coordinator to install the initial
view and update {{LOCAL_PING}}'s {{PingData}} then it takes the other node to retry
the discovery process 10 times.

In some cases there is no retry, because one node starts slightly faster, but it's not
yet coordinator when the 2nd node does its discovery, and both nodes decide they should be
coordinator:

{noformat}
01:13:44,460 INFO  (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-5386: no members
discovered after 3 ms: creating cluster as first member
01:13:44,463 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: discovery
took 1 ms, members: 1 rsps (0 coords) [done]
01:13:44,465 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: could not
determine coordinator from rsps 1 rsps (0 coords) [done]
01:13:44,465 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: nodes to
choose new coord from are: [Test-NodeB-51165, Test-NodeA-5386]
01:13:44,466 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: I
(Test-NodeB-51165) am the first of the nodes, will become coordinator
01:13:44,466 DEBUG (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165:
installing view [Test-NodeB-51165|0] (1) [Test-NodeB-51165]
01:13:44,466 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-5386: installing
view [Test-NodeA-5386|0] (1) [Test-NodeA-5386]
{noformat}

This second failure mode seems to go away if I move the {{discovery}} map access inside
the {{synchronized}} block both in {{findMembers()}} and in {{down()}}.

--
This message was sent by Atlassian Jira
(v7.13.8#713008)

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006