Ajay Sharma created JGRP-2501:
---------------------------------
Summary: JGroups view not stabilized after upgrading from 3.4.3 to 4.0.10
Key: JGRP-2501
URL: https://issues.redhat.com/browse/JGRP-2501
Project: JGroups
Issue Type: Bug
Affects Versions: 4.0.10
Reporter: Ajay Sharma
Assignee: Bela Ban
Hi,
We have a 15-node cluster. After upgrading JGroups from 3.4.3 to 4.0.10, the system is
unstable and we keep getting the logs below:
2020-09-03 11:47:30.317 WARN org.jgroups.protocols.pbcast.GMS - vmc0198-27827: not member
of view [vmc0208-48939|123]; discarding it
2020-09-03 11:47:32.316 WARN org.jgroups.protocols.pbcast.GMS - vmc0198-27827: failed
to create view from delta-view; dropping view: java.lang.IllegalStateException: the
view-id of the delta view ([vmc0208-48939|123]) doesn't match the current view-id
([vmc0208-48939|122]); discarding delta view [vmc0208-48939|124],
ref-view=[vmc0208-48939|123], joined=[vmc0198-5504]
2020-09-03 11:47:32.323 WARN org.jgroups.protocols.pbcast.GMS - vmc0198-27827: not
member of view [vmc0208-48939|124]; discarding it.
2020-09-03 11:49:07.160 WARN org.jgroups.protocols.pbcast.NAKACK2 - JGRP000011:
vmc0198-63871: dropped message batch from non-member vmc0201-28703
(view=MergeView::[vmc0208-48939|140] (24) [ ***REMOVING MACHINE NAME AND PORT ***] ])
2020-09-03 11:49:07.160 WARN org.jgroups.protocols.pbcast.NAKACK2 - JGRP000011:
vmc0198-23411: dropped message batch from non-member vmc0201-28703 (view=[***REMOVING
MACHINE NAME AND PORT FOR CLEAR VIEW ***] .])
2020-09-05 16:16:07.380 DEBUG org.jgroups.protocols.FD_ALL - haven't received a
heartbeat from vmc0201-55458 for 12541 ms, adding it to suspect list
2020-09-05 16:16:07.535 DEBUG org.jgroups.protocols.FD_SOCK - vmc0198-24881: failed
connecting to vmc0204-45403: connect timed out
2020-09-05 16:16:07.536 DEBUG org.jgroups.protocols.FD_SOCK - vmc0198-24881:
broadcasting suspect(vmc0204-45403)
2020-09-05 16:16:07.536 DEBUG org.jgroups.protocols.FD_SOCK - vmc0198-24881:
pingable_mbrs=[***REMOVING MACHINE NAME AND PORT ***], ping_dest=vmc0204-54485
2020-09-05 16:16:08.513 DEBUG org.jgroups.protocols.pbcast.GMS - vmc0198-52842:
installing view [ ***REMOVING MACHINE NAME AND PORT FOR CLEAR VIEW *** ]
2020-09-05 16:16:08.513 DEBUG org.jgroups.protocols.pbcast.GMS - vmc0198-24881:
installing view [vmc0200-30543|2672] (184) [ ***REMOVING MACHINE NAME AND PORT FOR CLEAR
VIEW *** ]
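
The "failed to create view from delta-view" warning above follows from how a delta view is applied: it only lists joined and left members relative to a reference view, so a member can rebuild the full view only if its current view is that reference view. Below is a minimal sketch of that check in plain Java. The classes View and Delta and the method apply are hypothetical stand-ins for illustration, not the JGroups classes, and the member names A, B and C are placeholders; only the view-ids are taken from the log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified model of delta-view handling. These are NOT the JGroups classes.
final class DeltaViewSketch {

    /** Hypothetical stand-in for a full view: a view-id plus the member list. */
    static final class View {
        final String viewId;
        final List<String> members;

        View(String viewId, List<String> members) {
            this.viewId = viewId;
            this.members = members;
        }
    }

    /** Hypothetical stand-in for a delta view: changes relative to refViewId. */
    static final class Delta {
        final String viewId;
        final String refViewId;
        final List<String> joined;
        final List<String> left;

        Delta(String viewId, String refViewId, List<String> joined, List<String> left) {
            this.viewId = viewId;
            this.refViewId = refViewId;
            this.joined = joined;
            this.left = left;
        }
    }

    /** Rebuilds the full view, or throws if the delta refers to a view we never installed. */
    static View apply(View current, Delta delta) {
        if (!current.viewId.equals(delta.refViewId)) {
            throw new IllegalStateException("ref view-id " + delta.refViewId
                    + " doesn't match current view-id " + current.viewId);
        }
        List<String> members = new ArrayList<>(current.members);
        members.removeAll(delta.left);
        members.addAll(delta.joined);
        return new View(delta.viewId, members);
    }

    public static void main(String[] args) {
        // The member discarded view 123, so it is still on 122 when delta 124 arrives.
        View current = new View("[vmc0208-48939|122]", Arrays.asList("A", "B"));
        Delta delta = new Delta("[vmc0208-48939|124]", "[vmc0208-48939|123]",
                Arrays.asList("C"), Collections.<String>emptyList());
        try {
            apply(current, delta);
        } catch (IllegalStateException e) {
            System.out.println("dropping view: " + e.getMessage());
        }
    }
}
```

In this model, once a member misses or discards one view (123 here), every later delta that references it is dropped too, which is consistent with the repeated warnings in the log.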
===================================
To isolate the issue, we created a small test program for both JGroups 3.4.3 and JGroups
4.0.10.
Both applications take IP addresses and the number of channels as arguments. We ran
both applications in the following matrix and collected view data and timings.
Below are the stats:
Number of members (number of nodes x number of channels)  JGroups 3.4.3      JGroups 4.0.10
--------------------------------------------------------  -----------------  -----------------
225 (15x15) Simultaneous start                            25 - 30 seconds*   15 minutes**
225 (15x15) Rolling start (view after 15th node start)    20 seconds*        10 minutes**
196 (14x14) Simultaneous start                            25 seconds*        4 minutes**
169 (13x13) Simultaneous start                            30 - 31 seconds*   7 minutes**
144 (12x12) Simultaneous start                            27 seconds*        5 minutes**
121 (11x11) Simultaneous start                            22 seconds*        2 minutes**
100 (10x10) Simultaneous start                            20 seconds*        5 minutes**
...
...
9 to 49 channels (3x3) to (7x7)                           almost immediate*  almost immediate*
Note: Even after 15 minutes, the views are not stable; they keep fluctuating.
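
To put the table above in perspective, the slowdown factor per row can be computed as the 4.0.10 time divided by the 3.4.3 time. The sketch below does this in Java, using the upper bound wherever the 3.4.3 column gives a range. The class and method names are made up for illustration; the numbers are the ones reported above.

```java
// Computes how many times slower view formation was on 4.0.10 than on 3.4.3.
final class SlowdownSketch {

    /** Ratio of the 4.0.10 time to the 3.4.3 time, both given in seconds. */
    static double slowdown(double seconds343, double seconds4010) {
        if (seconds343 <= 0) {
            throw new IllegalArgumentException("3.4.3 time must be positive");
        }
        return seconds4010 / seconds343;
    }

    public static void main(String[] args) {
        String[] rows = {
            "225 (15x15) simultaneous", "225 (15x15) rolling", "196 (14x14) simultaneous",
            "169 (13x13) simultaneous", "144 (12x12) simultaneous", "121 (11x11) simultaneous",
            "100 (10x10) simultaneous"
        };
        // 3.4.3 timings in seconds (upper bound of each reported range).
        double[] seconds343 = {30, 20, 25, 31, 27, 22, 20};
        // 4.0.10 timings in minutes, as reported.
        double[] minutes4010 = {15, 10, 4, 7, 5, 2, 5};

        for (int i = 0; i < rows.length; i++) {
            double factor = slowdown(seconds343[i], minutes4010[i] * 60);
            System.out.printf("%-26s %.1fx slower%n", rows[i], factor);
        }
    }
}
```

Since the 4.0.10 views reportedly never fully stabilize, these factors are lower bounds rather than exact measurements.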
=======
Below are the protocols we use, with their properties:
Protocol[] protocolStack = {
    new UDP().setValue("bind_addr", InetAddress.getByName(myBindAddress))
             .setValue("mcast_port", 10600)
             .setValue("bind_port", 10601)
             .setValue("port_range", 100)
             .setValue("diagnostics_bind_interfaces", parInterfaceList)
             .setValue("diagnostics_port", 10599),
    new PING(),
    new MERGE3(),
    new FD_SOCK().setValue("bind_addr", InetAddress.getByName(myBindAddress)),
    new FD_ALL().setValue("timeout", 12000)
                .setValue("interval", 3000),
    new VERIFY_SUSPECT().setValue("bind_addr", InetAddress.getByName(myBindAddress)),
    new BARRIER(),
    new NAKACK2(),
    new UNICAST3(),
    new STABLE(),
    new GMS().setValue("print_local_addr", true),
    new UFC(),
    new MFC(),
    new FRAG2()
};
We also tried changing the following properties, without success:
thread_pool_max_threads = 200 in UDP()
Default values for FD_ALL() (instead of the timeout and interval shown above)
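
For reference, the FD_ALL settings in the stack above (timeout 12000, interval 3000) mean a member is suspected after roughly four heartbeat intervals without a heartbeat. The sketch below is a simplified model of that check, written in plain Java; the class and method names are hypothetical and this is not the FD_ALL implementation. Because the real check runs periodically, the reported gap can exceed the timeout, as with the 12541 ms in the log.

```java
// Simplified model of a heartbeat-timeout check. This is NOT the FD_ALL code.
final class HeartbeatTimeoutSketch {

    /** A member is suspected once no heartbeat has been seen for longer than the timeout. */
    static boolean isSuspect(long lastHeartbeatMs, long nowMs, long timeoutMs) {
        return nowMs - lastHeartbeatMs > timeoutMs;
    }

    /** Number of heartbeat intervals that fit into one timeout window. */
    static long intervalsPerTimeout(long timeoutMs, long intervalMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return timeoutMs / intervalMs;
    }

    public static void main(String[] args) {
        long timeoutMs = 12000;   // FD_ALL "timeout" from the stack above
        long intervalMs = 3000;   // FD_ALL "interval" from the stack above

        System.out.println("intervals per timeout: " + intervalsPerTimeout(timeoutMs, intervalMs));
        // 12541 ms is the gap reported in the FD_ALL log line.
        System.out.println("suspect after 12541 ms: " + isSuspect(0, 12541, timeoutMs));
        System.out.println("suspect after 9000 ms: " + isSuspect(0, 9000, timeoutMs));
    }
}
```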
--
This message was sent by Atlassian Jira
(v7.13.8#713008)