[jboss-jira] [JBoss JIRA] (JGRP-2380) Sometimes cluster members are not discovered when using TCPGOSSIP

Tue Sep 17 05:59:00 EDT 2019

    [ https://issues.jboss.org/browse/JGRP-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785216#comment-13785216 ] 

Bela Ban commented on JGRP-2380:
--------------------------------

I can't reproduce this (on master). I tried to start the GossipRouter with {{-nio false}} and {{-nio true}}, and it worked in both cases.
If you can post instructions on how to reproduce this, that would be helpful.

A few questions/comments/things to do:
* As I said before, try this with the latest stable release (4.1.5, released this week)
* Try to increase GMS.join_timeout, to see if this helps
* You're apparently using FlagsUUID ({{ ROG (flags=2)(_V=4.0)}}), what for? Are you mixing regular UUIDs and FlagsUUIDs?
* I'm seeing view ids of {{65768}} in your logs. This is very unusual and only happens when you have a large cluster or a high rate of joins and leaves.
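
For reference, raising the join timeout could look roughly like the fragment below. This is only a sketch of one element of a stack configuration: it assumes the protocol is configured as {{pbcast.GMS}} in jgroups.xml, and the value of 10000 ms is an arbitrary placeholder, not a recommendation.

```xml
<!-- Fragment of a JGroups stack configuration (other protocols omitted). -->
<!-- join_timeout is in milliseconds; 10000 is a placeholder chosen only
     to illustrate a larger value. -->
<pbcast.GMS join_timeout="10000"/>
```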

> Sometimes cluster members are not discovered when using TCPGOSSIP
> -----------------------------------------------------------------
>
>                 Key: JGRP-2380
>                 URL: https://issues.jboss.org/browse/JGRP-2380
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 4.0.19
>            Reporter: Pavlo Fedyna
>            Assignee: Bela Ban
>            Priority: Minor
>             Fix For: 4.1.5
>
>         Attachments: jgroups.xml, logs_failure.txt, logs_success.txt
>
>
> Sometimes a new member can't join an existing cluster if TCPGOSSIP is used with the use_nio property set to true. In that case the new member creates its own cluster containing only itself. After some time the MERGE3 protocol merges the two clusters into one, but if the min_interval/max_interval values are large, this may take a while.
> For some reason, the first attempt at initial discovery always ends by running into join_timeout. In this case only a few members are discovered, and none of them is a coordinator.
> If we are lucky, GMS prints the following log message: "I (WO-KIT-967-28892) am not the first of the nodes, waiting for another client to become coordinator" and makes a second attempt to join the cluster, which now takes a few milliseconds and succeeds (see logs_success.txt). In case of failure, GMS prints "I (WO-KIT-967-14786) am the first of the nodes, will become coordinator" and creates a new cluster with only one member (see logs_failure.txt).
> The expectation is that the first attempt at initial discovery should not fail due to the timeout and should be as fast as the second one.
> Workaround: set use_nio to false (or just remove it from the stack configuration)
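
The workaround and the MERGE3 intervals mentioned in the description might be expressed along these lines in the stack configuration. This is a hedged sketch rather than the reporter's actual jgroups.xml: the host name, port, and interval values are placeholders, and only the attribute names ({{use_nio}}, {{min_interval}}, {{max_interval}}) come from the description above.

```xml
<!-- Fragment of a JGroups stack configuration (other protocols omitted). -->
<!-- gossip-host[12001] is a placeholder GossipRouter address. -->
<TCPGOSSIP initial_hosts="gossip-host[12001]"
           use_nio="false"/>

<!-- Placeholder intervals in milliseconds; smaller values let MERGE3
     reunite split clusters sooner. -->
<MERGE3 min_interval="10000"
        max_interval="30000"/>
```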

--
This message was sent by Atlassian Jira
(v7.13.5#713005)