[jboss-jira] [JBoss JIRA] (JGRP-2293) Graceful concurrent leaving of coordinator(s) leaves the cluster with stale views

Wed Feb 6 08:07:05 EST 2019

    [ https://issues.jboss.org/browse/JGRP-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691785#comment-13691785 ] 

Dan Berindei commented on JGRP-2293:
------------------------------------

[~belaban] yeah, the {{.0}} was a typo and I removed it. The command line is exactly the one I used.

Not sure about the non-loopback requirement: the test hard-codes the TCP bind address to {{127.0.0.1}} but doesn't set any property on MPING, meaning it will use the [default mcast address 230.5.6.7|https://github.com/belaban/JGroups/blob/de91ff4e222011d287016a7e732b9cedce587912/src/org/jgroups/protocols/MPING.java#L51] and bind to the [system property {{jgroups.bind_addr}}=localhost|https://github.com/belaban/JGroups/blob/e103fabd73ff783a456036f3071ab9011ce87da4/build.properties.template#L6].

I've enabled trace logs, and both the passing IDE run and the failing Maven run report the same addresses:

{noformat}
mvn: 14:41:47,971 DEBUG (main:[]) [MPING] bind_addr=/127.0.0.1, mcast_addr=/230.5.6.7, mcast_port=7555
ide: 14:39:50,826 DEBUG (main:[]) [MPING] bind_addr=/127.0.0.1, mcast_addr=/230.5.6.7, mcast_port=7555
{noformat}
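
The addresses in the log lines above can be cross-checked outside JGroups with a minimal, JDK-only sketch. It only classifies the addresses (loopback vs. multicast) and does not exercise MPING itself. The class name {{BindAddrCheck}} and its helper methods are hypothetical, and the {{localhost}} fallback is an assumption mirroring the value in {{build.properties.template}}:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of JGroups or of LeaveTest.
public class BindAddrCheck {

    // Resolves a host name or IP literal and reports whether it is a loopback address.
    static boolean isLoopback(String host) throws UnknownHostException {
        return InetAddress.getByName(host).isLoopbackAddress();
    }

    // Resolves a host name or IP literal and reports whether it is an IP multicast address.
    static boolean isMulticast(String host) throws UnknownHostException {
        return InetAddress.getByName(host).isMulticastAddress();
    }

    public static void main(String[] args) throws UnknownHostException {
        // Assumption: fall back to "localhost" when jgroups.bind_addr is not set.
        String bindAddr = System.getProperty("jgroups.bind_addr", "localhost");
        System.out.println("bind_addr=" + bindAddr + " loopback=" + isLoopback(bindAddr));
        System.out.println("mcast_addr=230.5.6.7 multicast=" + isMulticast("230.5.6.7"));
    }
}
```

Running it once with the Maven JVM arguments and once from the IDE (for example with {{-Djgroups.bind_addr=localhost}}) would show whether the property resolves to a loopback address in both environments.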

Unfortunately, there isn't anything else interesting in the logs. I did get one of the test methods to fail in the IDE, but I think the problem is in the test. You may want to get rid of the streams; it looks pretty tough to debug as is:

{noformat}
FAIL: [1] org.jgroups.tests.LeaveTest.testCoordLeave()

java.lang.AssertionError
	at org.jgroups.tests.LeaveTest.testCoordLeave(LeaveTest.java:71)
{noformat}

> Graceful concurrent leaving of coordinator(s) leaves the cluster with stale views
> ---------------------------------------------------------------------------------
>
>                 Key: JGRP-2293
>                 URL: https://issues.jboss.org/browse/JGRP-2293
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 4.0.14
>            Reporter: Radoslav Husar
>            Assignee: Bela Ban
>            Priority: Critical
>             Fix For: 4.0.17
>
>         Attachments: IMG_20190123_124154.jpg
>
>
> JGroups does not handle concurrent leaving of nodes correctly. This is a typical use case in a cloud environment when a cluster is scaled down by an autoscaler or manually, and we need to handle it.
> A simple test can be devised which fails the first n (where n>1) nodes from a cluster; reproducer PR: https://github.com/belaban/JGroups/pull/397
