[jboss-jira] [JBoss JIRA] (JGRP-2209) Members leaving the cluster

Swathi Kumar (JIRA) issues at jboss.org
Mon Aug 7 18:31:00 EDT 2017


Swathi Kumar created JGRP-2209:
----------------------------------

             Summary: Members leaving the cluster
                 Key: JGRP-2209
                 URL: https://issues.jboss.org/browse/JGRP-2209
             Project: JGroups
          Issue Type: Bug
    Affects Versions: 3.0
         Environment: Linux
            Reporter: Swathi Kumar
            Assignee: Bela Ban


We recently upgraded the jgroups jars from version 2_5_2/jgroups-all.jar to 3_4_0/jgroups-3.4.0.Alpha2.jar.

Since the upgrade, our clusters have not been stable.
Members leave the cluster for a short time (around 5-6 minutes) and then rejoin on their own.
We initially suspected a network issue and asked the network team to investigate further.
However, after reviewing the network logs, it is clear that the network plays no role in members leaving the cluster. The boxes on which the nodes/members run are healthy, and the network is fast and healthy too.

We have not been able to determine the root cause of the members leaving the clusters.
Please note that we have multiple clusters configured (about 5-6) and we see the same problem on all of them.

We request you to kindly help us in resolving this issue.

Our application uses the JGroups config properties below to create 3 channels (for security reasons, a dummy host name is used here):

jgroups_cluster.property_string=TCP(bind_addr=host_name_A;bind_port=34061):TCPPING(initial_hosts=host_name_A[34061],host_name_A[44061],host_name_A[54061];port_range=1;timeout=5000;num_initial_members=2):MERGE2(min_interval=3000;max_interval=5000):FD_ALL(interval=5000;timeout=20000):FD(timeout=5000;max_tries=48):VERIFY_SUSPECT(timeout=1500):pbcast.NAKACK(retransmit_timeout=100,200,300,600,1200,2400,4800;discard_delivered_msgs=true):pbcast.STABLE(stability_delay=1000;desired_avg_gossip=20000;max_bytes=0):pbcast.GMS(print_local_addr=true;join_timeout=5000)

jgroups_cluster.distribution_property_string=TCP(bind_port= 34060;thread_pool_rejection_policy=run):TCPPING(initial_hosts=host_name_A[34060],host_name_A[44060],host_name_A[54060];port_range=1;timeout=5000;num_initial_members=2):MERGE2(min_interval=3000;max_interval=5000):FD_SOCK:FD(timeout=5000;max_tries=48):VERIFY_SUSPECT(timeout=1500):pbcast.NAKACK(retransmit_timeout=3000;discard_delivered_msgs=true):pbcast.STABLE(stability_delay=1000;desired_avg_gossip=20000;max_bytes=0):pbcast.GMS(join_timeout=5000;print_local_addr=true)

jgroups_cluster.lock.protocolStack=TCP(bind_addr=host_name_A;bind_port=34062):TCPPING(initial_hosts=host_name_A[34062],host_name_A[44062],host_name_A[54062];port_range=1;timeout=5000;num_initial_members=2):MERGE2(min_interval=3000;max_interval=5000):FD_ALL(interval=5000;timeout=20000):FD(timeout=5000;max_tries=48):VERIFY_SUSPECT(timeout=1500):pbcast.NAKACK(retransmit_timeout=100,200,300,600,1200,2400,4800;discard_delivered_msgs=true):pbcast.STABLE(stability_delay=1000;desired_avg_gossip=20000;max_bytes=0):pbcast.GMS(print_local_addr=true;join_timeout=5000)
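
One-line stack strings of this length are easy to mistype, so a quick sanity check before they reach the channel may help rule out configuration typos. The following is only a minimal sketch in plain Java. It does not use the JGroups API; the class and method names (StackStringCheck, splitProtocols, isBalanced) and the shortened sample strings are hypothetical. It assumes protocols are separated by ':' outside parentheses, as in the strings above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JGroups: a plain-text sanity check
// for colon-separated protocol stack strings.
public class StackStringCheck {

    /** Splits a stack string on ':' separators that sit outside parentheses. */
    static List<String> splitProtocols(String stack) {
        List<String> protocols = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // current parenthesis nesting level
        for (char c : stack.toCharArray()) {
            if (c == '(') depth++;
            if (c == ')') depth--;
            if (depth < 0) {
                throw new IllegalArgumentException("unmatched ')' in: " + stack);
            }
            if (c == ':' && depth == 0) {
                // a top-level ':' ends the current protocol entry
                protocols.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("unclosed '(' in: " + stack);
        }
        protocols.add(current.toString());
        return protocols;
    }

    /** Returns true if every '(' in the stack string has a matching ')'. */
    static boolean isBalanced(String stack) {
        try {
            splitProtocols(stack);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Shortened, made-up sample strings in the same style as the config above.
        String ok = "TCP(bind_port=34060):FD_SOCK:VERIFY_SUSPECT(timeout=1500)";
        String broken = "TCP(bind_port=34062:TCPPING(port_range=1)";
        System.out.println(splitProtocols(ok)); // one entry per protocol
        System.out.println(isBalanced(broken)); // false: the first '(' is never closed
    }
}
```

A check like this only catches structural typos such as a missing parenthesis. It says nothing about whether the protocol names or attribute values are valid for a given JGroups version.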

Please let us know if you need any logs from us.
Please treat this as a priority, since the issue occurs daily and is affecting our business.
Many thanks in advance.

Regards
Swathi BN
(IBM)





--
This message was sent by Atlassian JIRA
(v7.2.3#72005)

