[jboss-jira] [JBoss JIRA] (JGRP-2380) Sometimes cluster members are not discovered when using TCPGOSSIP

Pavlo Fedyna (Jira) issues at jboss.org
Tue Sep 10 10:10:01 EDT 2019


Pavlo Fedyna created JGRP-2380:
----------------------------------

             Summary: Sometimes cluster members are not discovered when using TCPGOSSIP
                 Key: JGRP-2380
                 URL: https://issues.jboss.org/browse/JGRP-2380
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 4.0.19
            Reporter: Pavlo Fedyna
            Assignee: Bela Ban
         Attachments: jgroups.xml, logs_failure.txt, logs_success.txt

Sometimes a new member cannot join an existing cluster when TCPGOSSIP is used with the use_nio property set to true. In that case the new member creates its own cluster containing only itself. After some time the MERGE3 protocol merges the two clusters into one, but if the min_interval/max_interval values are large, this may take a while.
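
For illustration only, the MERGE3 intervals mentioned above are set on the protocol element in the stack configuration. The values below are in milliseconds and are assumptions chosen for the example, not taken from the attached jgroups.xml; smaller values shorten the time a split-off member waits before being merged back, at the cost of more merge traffic.

```xml
<!-- Illustrative values only (milliseconds); not from the attached jgroups.xml -->
<MERGE3 min_interval="1000" max_interval="10000"/>
```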

For some reason, the first attempt at initial discovery always ends by running into join_timeout. In this case only a few members are discovered, and no coordinator is among them.
If we are lucky, GMS prints the following log message: "I (WO-KIT-967-28892) am not the first of the nodes, waiting for another client to become coordinator" and makes a second attempt to join the cluster, which now takes only a few milliseconds and succeeds (see logs_success.txt). In the failure case, GMS prints "I (WO-KIT-967-14786) am the first of the nodes, will become coordinator" and creates a new cluster with only one member (see logs_failure.txt).

The expectation is that the first attempt at initial discovery should not fail due to the timeout, and that it should be as fast as the second one.

Workaround: set use_nio to false (or simply remove it from the stack configuration).
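
As a sketch of the workaround, the TCPGOSSIP element of the stack might look like the fragment below. The GossipRouter host and port in initial_hosts are placeholders, not values from the attached jgroups.xml; only the use_nio attribute is relevant here.

```xml
<!-- Fragment of a JGroups stack; the GossipRouter address is a placeholder -->
<TCPGOSSIP initial_hosts="gossip-router.example.com[12001]"
           use_nio="false"/>
```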




