[jboss-jira] [JBoss JIRA] Commented: (JGRP-1058) Split Cluster Never Recovers
Stuart Jensen (JIRA)
jira-events at lists.jboss.org
Fri Oct 30 13:13:06 EDT 2009
[ https://jira.jboss.org/jira/browse/JGRP-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12492356#action_12492356 ]
Stuart Jensen commented on JGRP-1058:
-------------------------------------
We got more information from our customer, and the description of the problem has changed a little. We have found a solution to the problem, but it did require a change to the JGroups code.
The bottom line was that the customer's network was slow, so we had to increase the amount of time allowed for the JGroups protocols to shut down. Specifically, in ProtocolStack.java, in the method stopStack(), we raised the timeout on the call to stop_promise.getResult() to 10 minutes, so the call now reads stop_promise.getResult(10 * 60 * 1000). I believe that was up from 5 seconds. In the customer's environment, that call was taking about 2 minutes to complete. If we did not allow it to complete successfully, then our subsequent initialization of a new JChannel() would apparently "take over" usage of some of the existing connections and JGroups would get confused. It would send messages to the old connection and never get an answer back, so it would just keep retrying against the wrong connection.
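To illustrate why the length of that wait matters, here is a minimal standalone sketch. It does not use the JGroups Promise class or any JGroups code; a java.util.concurrent.CompletableFuture stands in for stop_promise, and the names ShutdownWait and awaitStop are hypothetical, chosen only for this example. It shows a caller giving up on a slow shutdown when the timeout is too short and succeeding when the timeout is long enough.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only: not part of JGroups.
class ShutdownWait {

    // Waits up to timeoutMillis for the shutdown signal.
    // Returns true only if the shutdown finished within the timeout.
    static boolean awaitStop(CompletableFuture<Void> stopped, long timeoutMillis) {
        try {
            stopped.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The shutdown is still in progress; the caller must not
            // assume the old resources have been released.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulated slow shutdown that needs about 300 ms to finish.
        CompletableFuture<Void> slowStop = CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // A timeout shorter than the shutdown gives up early.
        System.out.println("short timeout finished: " + awaitStop(slowStop, 50));
        // A generous timeout lets the same shutdown complete.
        System.out.println("long timeout finished: " + awaitStop(slowStop, 5_000));
    }
}
```

In our case the analogue of the short timeout was the original value in stopStack(), and the shutdown that needed about 2 minutes was the one that got cut off.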
I am not entirely clear about what was happening when the shutdown was not allowed to complete but, in a nutshell, we saw the new JChannel talking to old connections. Before this change, the only way for the customer to recover was to shut down all of the cluster members and bring them up again. Once we allowed the shutdown more time to complete, we were able to shut down the JGroups JChannel and then bring up a new one without any problems. The customer is now running without problems.
When we were debugging and trying to figure out why the shutdown seemed to take so long, it appeared that the shutdown was always stuck in FD_SOCK waiting for something (connection close?).
> Split Cluster Never Recovers
> ----------------------------
>
> Key: JGRP-1058
> URL: https://jira.jboss.org/jira/browse/JGRP-1058
> Project: JGroups
> Issue Type: Bug
> Affects Versions: 2.6.7
> Environment: Suse Linux
> Reporter: Stuart Jensen
> Assignee: Bela Ban
> Priority: Critical
> Fix For: 2.6.14
>
>
> We are using JGroups Version 2.6.7 GA.
> When a cluster spans at least two subnets, cluster members become disconnected and the only way to get them to reconnect to the cluster is to bring all of the processes down and bring them back up at the same time.
> Bouncing one box at a time does not work. We have not seen this issue at all when all of the cluster members are in the same subnet.
> Also happened in JGroups version 2.3 SP1.
> This is an intermittent problem. Customers can normally run for several days without issue. Then the cluster will split and never fix itself. The only solution is to bring down all boxes.
> The configuration that is active when the situation occurs is:
> TCP(start_port=7801;external_addr=192.168.218.62):
> TCPPING(initial_hosts=192.168.218.62[7801],192.168.128.62[7801];port_range=2;timeout=3500;num_initial_members=2;up_thread=true;down_thread=true):
> MERGE2(min_interval=5000;max_interval=10000):
> FD_SOCK(bind_addr=192.168.218.62):
> FD(shun=true;timeout=2500;max_tries=5;up_thread=true;down_thread=true):
> VERIFY_SUSPECT(timeout=2000;down_thread=false;up_thread=false):
> pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
> pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):
> pbcast.STATE_TRANSFER(down_thread=false;up_thread=false):
> pbcast.GMS(join_timeout=60000;join_retry_timeout=60000;shun=true;print_local_addr=true;down_thread=true;up_thread=true)
> I will be posting logs from the customer's site shortly.