[ https://jira.jboss.org/jira/browse/JGRP-1058?page=com.atlassian.jira.plug... ]
Stuart Jensen commented on JGRP-1058:
-------------------------------------
We got more information from our customer, and the description of the problem has changed
slightly. We have found a solution, but it required a change to the JGroups code.
The bottom line was that the customer's network was slow, so we had to increase the amount
of time allowed for the JGroups protocols to shut down. Specifically, in
ProtocolStack.java, in the method stopStack(), we changed the timeout on the call to
stop_promise.getResult() to 10 minutes (10 * 60 * 1000 ms), up from, I believe, 5
seconds. In the customer's environment, that call was taking about 2 minutes to
complete. If we did not allow it to complete successfully, then our subsequent
initialization of a new JChannel() would apparently "take over" usage of some of
the existing connections and JGroups would get confused: it would send messages to
the old connection, never get an answer back, and just continually retry
connecting to the wrong connection.
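The effect of that timeout can be sketched with a stdlib latch standing in for
stop_promise (this is not the actual JGroups Promise class, just an illustration of why
getResult(timeout) returning early leaves the old connections half-alive):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Stdlib-only sketch, not the real JGroups code: a latch stands in for
// stop_promise, and await(timeout) stands in for stop_promise.getResult(timeout).
public class StopTimeoutSketch {

    // Returns true if the stack finished stopping within the timeout.
    static boolean waitForStop(CountDownLatch stopDone, long timeoutMs)
            throws InterruptedException {
        return stopDone.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        // Simulate a slow shutdown: the stop thread needs ~200 ms to finish
        // (about 2 minutes in the customer's environment).
        CountDownLatch stopDone = new CountDownLatch(1);
        new Thread(() -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) {}
            stopDone.countDown();
        }).start();

        // A too-short wait (50 ms here, 5 s in the original code) gives up
        // before shutdown is done, so the caller proceeds while old
        // connections are still live.
        boolean shortWait = waitForStop(stopDone, 50);
        // A generous wait (10 * 60 * 1000 ms in the fix) lets shutdown finish.
        boolean longWait = waitForStop(stopDone, 10 * 60 * 1000);

        System.out.println(shortWait + " " + longWait); // prints "false true"
    }
}
```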
I am not entirely clear on what was happening when the shutdown was not allowed to
complete, but, in a nutshell, we saw the new JChannel talking to old connections. To fix
the problem, the customer had to shut down all of the cluster members and bring them up
again. Once we allowed the shutdown more time to complete, we were able to shut down the
JGroups JChannel and then bring up a new one without any problems. The customer is now
running without problems.
When we were debugging and trying to figure out why the shutdown took so long, it
appeared that the shutdown was always stuck in FD_SOCK, waiting for something (a
connection close?).
Split Cluster Never Recovers
----------------------------
Key: JGRP-1058
URL:
https://jira.jboss.org/jira/browse/JGRP-1058
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6.7
Environment: Suse Linux
Reporter: Stuart Jensen
Assignee: Bela Ban
Priority: Critical
Fix For: 2.6.14
We are using JGroups Version 2.6.7 GA.
When a cluster spans at least two subnets, cluster members become disconnected and the
only way to get them to reconnect to the cluster is to bring all of the processes down and
bring them back up at the same time.
Bouncing one box at a time does not work. We have not seen this issue at all when all of
the cluster members are in the same subnet.
This also happened in JGroups version 2.3 SP1.
This is an intermittent problem. Customers can normally run for several days without
issue. Then the cluster will split and never fix itself. The only solution is to bring
down all boxes.
The configuration that is active when the situation occurs is:
TCP(start_port=7801;external_addr=192.168.218.62):
TCPPING(initial_hosts=192.168.218.62[7801],192.168.128.62[7801];port_range=2;timeout=3500;num_initial_members=2;up_thread=true;down_thread=true):
MERGE2(min_interval=5000;max_interval=10000):
FD_SOCK(bind_addr=192.168.218.62):
FD(shun=true;timeout=2500;max_tries=5;up_thread=true;down_thread=true):
VERIFY_SUSPECT(timeout=2000;down_thread=false;up_thread=false):
pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):
pbcast.STATE_TRANSFER(down_thread=false;up_thread=false):
pbcast.GMS(join_timeout=60000;join_retry_timeout=60000;shun=true;print_local_addr=true;down_thread=true;up_thread=true)
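As a side note on the failure-detection settings above: with FD configured as
timeout=2500 and max_tries=5, an unresponsive member is suspected only after five
consecutive missed heartbeats, and VERIFY_SUSPECT then double-checks for another
2000 ms. A small arithmetic sketch (the formula is my reading of the FD and
VERIFY_SUSPECT parameters, not code taken from JGroups):

```java
public class FdLatency {
    // Worst-case time before an unresponsive member is declared suspect:
    // max_tries consecutive heartbeat timeouts, then the VERIFY_SUSPECT window.
    static long suspicionDelayMs(long fdTimeoutMs, int maxTries, long verifyTimeoutMs) {
        return fdTimeoutMs * maxTries + verifyTimeoutMs;
    }

    public static void main(String[] args) {
        // Values from the stack above: FD(timeout=2500;max_tries=5),
        // VERIFY_SUSPECT(timeout=2000).
        System.out.println(suspicionDelayMs(2500, 5, 2000)); // prints 14500
    }
}
```

So on this configuration a hung member takes roughly 14.5 seconds to be suspected and
verified, which is worth keeping in mind when reading the logs.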
I will be posting logs from the customer's site shortly.
--
This message is automatically generated by JIRA.