[jboss-jira] [JBoss JIRA] Closed: (JGRP-1058) Split Cluster Never Recovers

Fri Feb 12 07:03:10 EST 2010

     [ https://jira.jboss.org/jira/browse/JGRP-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bela Ban closed JGRP-1058.
--------------------------

    Resolution: Cannot Reproduce Bug

I'm closing this issue because I cannot reproduce it. If you can recreate it, please create a new JIRA issue and attach instructions or code showing how to reproduce it.

> Split Cluster Never Recovers
> ----------------------------
>
>                 Key: JGRP-1058
>                 URL: https://jira.jboss.org/jira/browse/JGRP-1058
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.6.7
>         Environment: Suse Linux
>            Reporter: Stuart Jensen
>            Assignee: Bela Ban
>            Priority: Critical
>             Fix For: 2.6.14
>
>
> We are using JGroups version 2.6.7 GA.
> When a cluster spans at least two subnets, cluster members become disconnected and the only way to get them to reconnect to the cluster is to bring all of the processes down and bring them back up at the same time.
> Bouncing one box at a time does not work.  We have not seen this issue at all when all of the cluster members are in the same subnet.
> The same problem also occurred in JGroups version 2.3 SP1.
> This is an intermittent problem. Customers can normally run for several days without issue. Then the cluster will split and never fix itself.  The only solution is to bring down all boxes.
> The configuration that is active when the situation occurs is:
> TCP(start_port=7801;external_addr=192.168.218.62):
> TCPPING(initial_hosts=192.168.218.62[7801],192.168.128.62[7801];port_range=2;timeout=3500;num_initial_members=2;up_thread=true;down_thread=true):
> MERGE2(min_interval=5000;max_interval=10000):
> FD_SOCK(bind_addr=192.168.218.62):
> FD(shun=true;timeout=2500;max_tries=5;up_thread=true;down_thread=true):
> VERIFY_SUSPECT(timeout=2000;down_thread=false;up_thread=false):
> pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
> pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):
> pbcast.STATE_TRANSFER(down_thread=false;up_thread=false):
> pbcast.GMS(join_timeout=60000;join_retry_timeout=60000;shun=true;print_local_addr=true;down_thread=true;up_thread=true)
> I will be posting logs from the customer's site shortly.
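
The stack quoted above is a single colon-separated string of NAME(key=value;...) entries, which makes it awkward to compare between members by eye. The following is a minimal sketch, not part of JGroups, of how such a string could be split into protocols and properties so that per-node settings (for example external_addr, bind_addr and initial_hosts) can be printed side by side. The class name StackInspector and its methods are hypothetical, and the parser assumes that property values contain no ':' or ';' characters, which holds for the configuration in this issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for inspecting a JGroups-style stack string.
// It does not use or depend on the JGroups library.
public class StackInspector {

    /** Splits a stack string on ':' separators that lie outside parentheses. */
    static List<String> splitProtocols(String stack) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        for (char c : stack.toCharArray()) {
            if (c == '(') depth++;
            if (c == ')') depth--;
            if (c == ':' && depth == 0) {
                parts.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) parts.add(current.toString().trim());
        return parts;
    }

    /** Parses "NAME(k1=v1;k2=v2):..." into protocol name -> properties, in stack order. */
    static Map<String, Map<String, String>> parse(String stack) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String proto : splitProtocols(stack)) {
            int open = proto.indexOf('(');
            int close = proto.lastIndexOf(')');
            String name = open < 0 ? proto : proto.substring(0, open);
            Map<String, String> props = new LinkedHashMap<>();
            if (open >= 0 && close > open) {
                for (String kv : proto.substring(open + 1, close).split(";")) {
                    int eq = kv.indexOf('=');
                    if (eq > 0) {
                        props.put(kv.substring(0, eq).trim(), kv.substring(eq + 1).trim());
                    }
                }
            }
            result.put(name.trim(), props);
        }
        return result;
    }

    public static void main(String[] args) {
        // The stack string exactly as quoted in the issue, joined into one line.
        String stack =
            "TCP(start_port=7801;external_addr=192.168.218.62):"
          + "TCPPING(initial_hosts=192.168.218.62[7801],192.168.128.62[7801];port_range=2;"
          + "timeout=3500;num_initial_members=2;up_thread=true;down_thread=true):"
          + "MERGE2(min_interval=5000;max_interval=10000):"
          + "FD_SOCK(bind_addr=192.168.218.62):"
          + "FD(shun=true;timeout=2500;max_tries=5;up_thread=true;down_thread=true):"
          + "VERIFY_SUSPECT(timeout=2000;down_thread=false;up_thread=false):"
          + "pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):"
          + "pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):"
          + "pbcast.STATE_TRANSFER(down_thread=false;up_thread=false):"
          + "pbcast.GMS(join_timeout=60000;join_retry_timeout=60000;shun=true;"
          + "print_local_addr=true;down_thread=true;up_thread=true)";

        Map<String, Map<String, String>> protocols = parse(stack);

        // Print the address-related settings that differ per node.
        System.out.println("TCP.external_addr     = " + protocols.get("TCP").get("external_addr"));
        System.out.println("FD_SOCK.bind_addr     = " + protocols.get("FD_SOCK").get("bind_addr"));
        System.out.println("TCPPING.initial_hosts = " + protocols.get("TCPPING").get("initial_hosts"));
        System.out.println("MERGE2 intervals      = " + protocols.get("MERGE2"));
    }
}
```

Running the same helper against the stack string from each member in each subnet would give a quick way to check that the address settings are what was intended on every node before attaching logs to a new issue.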
