[jboss-jira] [JBoss JIRA] (JGRP-2395) LOCAL_PING fails when 2 nodes start at the same time
Dan Berindei (Jira)
issues at jboss.org
Mon Nov 11 08:15:00 EST 2019
[ https://issues.jboss.org/browse/JGRP-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810666#comment-13810666 ]
Dan Berindei commented on JGRP-2395:
------------------------------------
Looks good [~belaban]. I tried the snapshot and parallel start works fine, but I discovered another bug that I'd like fixed in 4.1.8: JGRP-2398.
I wish we could do something similar for production discovery protocols like M/PING, though. I know it's a lot more complicated when you don't have up-to-date information about the other nodes, but it's also really surprising when 2 nodes can see each other at startup and still don't form a single cluster. I guess it's similar to how a node can be included in a view without actually installing the view, with the coordinator just ignoring the missing {{VIEW_ACK}}, so I should already be used to it, but it still surprises me when it happens.
> LOCAL_PING fails when 2 nodes start at the same time
> ----------------------------------------------------
>
> Key: JGRP-2395
> URL: https://issues.jboss.org/browse/JGRP-2395
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 4.1.6
> Reporter: Dan Berindei
> Assignee: Bela Ban
> Priority: Major
> Fix For: 4.1.8
>
>
> We have a test that starts 2 nodes in parallel ({{ConcurrentStartTest}}), and it has been failing randomly since we started using {{LOCAL_PING}}.
> {noformat}
> 01:02:11,930 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: discovery took 3 ms, members: 1 rsps (0 coords) [done]
> 01:02:11,930 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: discovery took 3 ms, members: 1 rsps (0 coords) [done]
> 01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: could not determine coordinator from rsps 1 rsps (0 coords) [done]
> 01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could not determine coordinator from rsps 1 rsps (0 coords) [done]
> 01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
> 01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: nodes to choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
> 01:02:11,931 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: I (Test-NodeB-43694) am the first of the nodes, will become coordinator
> 01:02:11,931 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: I (Test-NodeA-29550) am not the first of the nodes, waiting for another client to become coordinator
> 01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: discovery took 0 ms, members: 1 rsps (0 coords) [done]
> 01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could not determine coordinator from rsps 1 rsps (0 coords) [done]
> 01:02:11,932 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
> ...
> 01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: could not determine coordinator from rsps 1 rsps (0 coords) [done]
> 01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: nodes to choose new coord from are: [Test-NodeB-43694, Test-NodeA-29550]
> 01:02:11,941 TRACE (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: I (Test-NodeA-29550) am not the first of the nodes, waiting for another client to become coordinator
> 01:02:11,942 WARN (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: too many JOIN attempts (10): becoming singleton
> 01:02:11,942 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: installing view [Test-NodeA-29550|0] (1) [Test-NodeA-29550]
> 01:02:11,977 DEBUG (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-43694: created cluster (first member). My view is [Test-NodeB-43694|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
> 01:02:11,977 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-29550: created cluster (first member). My view is [Test-NodeA-29550|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
> {noformat}
> The problem seems to be that it takes longer for the coordinator to install the initial view and update {{LOCAL_PING}}'s {{PingData}} than it takes the other node to retry the discovery process 10 times.
> In some cases there is no retry, because one node starts slightly faster, but it's not yet coordinator when the 2nd node does its discovery, and both nodes decide they should be coordinator:
> {noformat}
> 01:13:44,460 INFO (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-5386: no members discovered after 3 ms: creating cluster as first member
> 01:13:44,463 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: discovery took 1 ms, members: 1 rsps (0 coords) [done]
> 01:13:44,465 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: could not determine coordinator from rsps 1 rsps (0 coords) [done]
> 01:13:44,465 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: nodes to choose new coord from are: [Test-NodeB-51165, Test-NodeA-5386]
> 01:13:44,466 TRACE (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: I (Test-NodeB-51165) am the first of the nodes, will become coordinator
> 01:13:44,466 DEBUG (ForkThread-2,ConcurrentStartTest:[]) [GMS] Test-NodeB-51165: installing view [Test-NodeB-51165|0] (1) [Test-NodeB-51165]
> 01:13:44,466 DEBUG (ForkThread-1,ConcurrentStartTest:[]) [GMS] Test-NodeA-5386: installing view [Test-NodeA-5386|0] (1) [Test-NodeA-5386]
> {noformat}
> This second failure mode seems to go away if I move the {{discovery}} map access inside the {{synchronized}} block both in {{findMembers()}} and in {{down()}}.
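
For illustration only, the idea behind that last observation (do the read of the shared map and the update to it inside the same critical section) can be sketched with a toy model. This is not the LOCAL_PING source: the class, the methods, and the plain list of node names are all hypothetical, and a real discovery protocol stores richer data than this. The only point is that when "am I the first one here?" and "add me" happen under one lock, two nodes starting in parallel cannot both conclude that they are first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: NOT the JGroups LOCAL_PING source. All names are hypothetical.
final class LocalDiscoverySketch {
    // Shared by all nodes in the same JVM: cluster name -> names of registered nodes.
    private static final Map<String, List<String>> discovery = new HashMap<>();

    /**
     * Registers a node and checks whether it is the first member of the cluster,
     * both inside one synchronized block.
     */
    static boolean registerAndCheckFirst(String cluster, String node) {
        synchronized (discovery) {
            List<String> members = discovery.computeIfAbsent(cluster, k -> new ArrayList<>());
            boolean first = members.isEmpty(); // read the shared state ...
            members.add(node);                 // ... and update it under the same lock
            return first;
        }
    }

    /** Starts two nodes in parallel and returns how many of them saw themselves as first. */
    static int startTwoNodes(String cluster) {
        CountDownLatch go = new CountDownLatch(1);
        AtomicInteger firsts = new AtomicInteger();
        Thread[] threads = new Thread[2];
        for (int i = 0; i < threads.length; i++) {
            String node = "Node" + (char) ('A' + i);
            threads[i] = new Thread(() -> {
                try {
                    go.await(); // release both threads at (nearly) the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (registerAndCheckFirst(cluster, node)) {
                    firsts.incrementAndGet();
                }
            });
            threads[i].start();
        }
        go.countDown();
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for nodes", e);
        }
        return firsts.get();
    }

    public static void main(String[] args) {
        for (int run = 0; run < 1000; run++) {
            int firsts = startTwoNodes("cluster-" + run);
            if (firsts != 1) {
                throw new AssertionError("run " + run + ": " + firsts + " nodes saw themselves as first");
            }
        }
        System.out.println("exactly one first member in every run");
    }
}
```

If the emptiness check and the add were done in two separate steps without a common lock, both threads could observe an empty list before either one registered, which is the same shape as the second log above, where both nodes install their own singleton view.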
--
This message was sent by Atlassian Jira
(v7.13.8#713008)