Bela Ban edited comment on JGRP-2395 at 11/6/19 8:52 AM:
---------------------------------------------------------
Hmm, these things cannot be controlled; if multiple nodes are started without an existing
coord, then (like with UDP:PING) LOCAL_PING (and SHARED_LOOPBACK_PING) may end up doing
discovery at exactly the same time and become singletons, only to be merged later.
So, in that sense, both LOCAL_PING and SHARED_LOOPBACK_PING mimic the real world (UDP
& PING).
However, I *can* change this:
* The first node to register for a given cluster becomes _coordinator_ (registration needs
to be atomic)
* When the coord leaves, the next-in-line becomes coord (this is atomic, too, wrt gets)
* When a new view is installed, we make sure that the first node in the view is the coord
(also in the {{discovery}} map). This is because of JGRP-2381.
* This _may_ fail when a user has installed a custom view generation policy; I need to
think about this
The important thing here is that we need to have a coord after the first member registers.
After that, we adjust the {{discovery}} map (who is coord) based on view changes. The
invariant is that there's always exactly *one* coord for a given cluster in the
{{discovery}} map.
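A minimal sketch of this scheme, assuming a plain {{ConcurrentHashMap}} whose per-key
locking makes registration, promotion and gets atomic ({{Entry}}, {{register()}} and the
other names are illustrative, not the actual LOCAL_PING API):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AtomicRegistrationSketch {

    /** Hypothetical stand-in for a per-node entry; not the JGroups PingData class. */
    static final class Entry {
        final String address;
        volatile boolean coord;
        Entry(String address, boolean coord) { this.address = address; this.coord = coord; }
    }

    // One registration list per cluster name.
    private final Map<String, List<Entry>> discovery = new ConcurrentHashMap<>();

    /** The first node to register for a cluster becomes coordinator (atomic per key). */
    public void register(String cluster, String address) {
        discovery.compute(cluster, (k, list) -> {
            if (list == null)
                list = new ArrayList<>();
            list.add(new Entry(address, list.isEmpty())); // first registrant -> coord
            return list;
        });
    }

    /** When the coord leaves, the next-in-line is promoted under the same per-key lock. */
    public void unregister(String cluster, String address) {
        discovery.computeIfPresent(cluster, (k, list) -> {
            boolean wasCoord = list.stream().anyMatch(e -> e.coord && e.address.equals(address));
            list.removeIf(e -> e.address.equals(address));
            if (wasCoord && !list.isEmpty())
                list.get(0).coord = true; // next-in-line becomes coord
            return list.isEmpty() ? null : list;
        });
    }

    /** Gets go through the same per-key lock, so a non-empty list always has one coord. */
    public List<Entry> findMembers(String cluster) {
        List<Entry> copy = new ArrayList<>();
        discovery.computeIfPresent(cluster, (k, list) -> { copy.addAll(list); return list; });
        return copy;
    }
}
{code}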
WDYT?
LOCAL_PING fails when 2 nodes start at the same time
----------------------------------------------------
Key: JGRP-2395
URL: https://issues.jboss.org/browse/JGRP-2395
Project: JGroups
Issue Type: Bug
Affects Versions: 4.1.6
Reporter: Dan Berindei
Assignee: Bela Ban
Priority: Major
Fix For: 4.1.8
We have a test that starts 2 nodes in parallel ({{ConcurrentStartTest}}), and it has been
failing randomly since we started using {{LOCAL_PING}}.
{noformat}
01:02:11,930 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550:
discovery took 3 ms, members: 1 rsps (0 coords) [done]
01:02:11,930 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694:
discovery took 3 ms, members: 1 rsps (0 coords) [done]
01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: could
not determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could
not determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to
choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: nodes to
choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: I
(Test-NodeB-43694) am the first of the nodes, will become coordinator
01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: I
(Test-NodeA-29550) am not the first of the nodes, waiting for another client to become
coordinator
01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550:
discovery took 0 ms, members: 1 rsps (0 coords) [done]
01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could
not determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to
choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
...
01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could
not determine coordinator from rsps 1 rsps (0 coords) [done]
01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to
choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: I
(Test-NodeA-29550) am not the first of the nodes, waiting for another client to become
coordinator
01:02:11,942 WARN (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: too many
JOIN attempts (10): becoming singleton
01:02:11,942 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550:
installing view [Test-NodeA-29550|0] (1) [Test-NodeA-29550]
01:02:11,977 DEBUG (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: created
cluster (first member). My view is [Test-NodeB-43694|0], impl is
org.jgroups.protocols.pbcast.CoordGmsImpl
01:02:11,977 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: created
cluster (first member). My view is [Test-NodeA-29550|0], impl is
org.jgroups.protocols.pbcast.CoordGmsImpl
{noformat}
The problem seems to be that it takes longer for the coordinator to install the initial
view and update {{LOCAL_PING}}'s {{PingData}} than it takes the other node to retry
the discovery process 10 times.
In some cases there is no retry: one node starts slightly faster, but it is not yet
coordinator when the second node runs its discovery, so both nodes decide they should
be coordinator:
{noformat}
01:13:44,460 INFO (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-5386: no
members discovered after 3 ms: creating cluster as first member
01:13:44,463 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165:
discovery took 1 ms, members: 1 rsps (0 coords) [done]
01:13:44,465 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: could
not determine coordinator from rsps 1 rsps (0 coords) [done]
01:13:44,465 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: nodes to
choose new coord from are: [Test-NodeB-51165, Test-NodeA-5386]
01:13:44,466 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: I
(Test-NodeB-51165) am the first of the nodes, will become coordinator
01:13:44,466 DEBUG (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165:
installing view [Test-NodeB-51165|0] (1) [Test-NodeB-51165]
01:13:44,466 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-5386:
installing view [Test-NodeA-5386|0] (1) [Test-NodeA-5386]
{noformat}
This second failure mode seems to go away if I move the {{discovery}} map access inside
the {{synchronized}} block both in {{findMembers()}} and in {{down()}}.
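For illustration, here is a toy stand-in (made-up names, not the actual LOCAL_PING
source) showing why a read of the {{discovery}} map outside the lock loses this race,
while the variant that reads, decides and registers under one lock cannot:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy stand-in for LOCAL_PING's discovery logic; all names are hypothetical.
public class LocalPingRaceSketch {
    private final Map<String, List<String>> discovery = new ConcurrentHashMap<>();
    private final Object lock = new Object();

    // Racy variant: the map is read outside the lock, so two concurrent callers
    // can both see an empty cluster ("0 coords") and both become coordinator.
    public boolean becomeCoordRacy(String cluster, String addr) {
        List<String> members = discovery.get(cluster);      // unsynchronized read
        boolean first = members == null || members.isEmpty();
        synchronized (lock) {                               // too late: decision already made
            discovery.computeIfAbsent(cluster, k -> new ArrayList<>()).add(addr);
        }
        return first;
    }

    // Fixed variant: read, decide and register under one lock, so exactly one
    // caller can ever observe the empty list and become coordinator.
    public boolean becomeCoordFixed(String cluster, String addr) {
        synchronized (lock) {
            List<String> members = discovery.computeIfAbsent(cluster, k -> new ArrayList<>());
            boolean first = members.isEmpty();
            members.add(addr);
            return first;
        }
    }
}
{code}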