[ https://jira.jboss.org/jira/browse/JGRP-1058?page=com.atlassian.jira.plug... ]
Stuart Jensen commented on JGRP-1058:
-------------------------------------
We got more information from our customer, and the description of the problem has changed
slightly. We have found a solution, but it required a change to the JGroups code.
The bottom line was that the customer's network was slow, so we had to increase the amount
of time allowed for the JGroups protocols to shut down. Specifically, in
ProtocolStack.java, in the method stopStack(), we changed the timeout on the call to
stop_promise.getResult() to 10 minutes (10 * 60 * 1000 ms), up from, I believe, 5
seconds. In the customer's environment, that call was taking about 2 minutes to
complete. If we did not allow it to complete successfully, then our subsequent
initialization of a new JChannel() would apparently "take over" usage of some of
the existing connections and JGroups would get confused: it would send messages to
the old connection, never get an answer back, and just continually retry
connecting to the wrong connection.
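The effect of that timeout can be sketched with a stdlib latch standing in for
stop_promise (this is not the actual JGroups Promise class, just an illustration of why
getResult(timeout) returning early leaves the old connections half-alive):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Stdlib-only sketch, not the real JGroups code: a latch stands in for
// stop_promise, and await(timeout) stands in for stop_promise.getResult(timeout).
public class StopTimeoutSketch {

    // Returns true if the stack finished stopping within the timeout.
    static boolean waitForStop(CountDownLatch stopDone, long timeoutMs)
            throws InterruptedException {
        return stopDone.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        // Simulate a slow shutdown: the stop thread needs ~200 ms to finish
        // (about 2 minutes in the customer's environment).
        CountDownLatch stopDone = new CountDownLatch(1);
        new Thread(() -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) {}
            stopDone.countDown();
        }).start();

        // A too-short wait (50 ms here, 5 s in the original code) gives up
        // before shutdown is done, so the caller proceeds while old
        // connections are still live.
        boolean shortWait = waitForStop(stopDone, 50);
        // A generous wait (10 * 60 * 1000 ms in the fix) lets shutdown finish.
        boolean longWait = waitForStop(stopDone, 10 * 60 * 1000);

        System.out.println(shortWait + " " + longWait); // prints "false true"
    }
}
```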
I am not entirely clear on what was happening when the shutdown was not allowed to
complete, but, in a nutshell, we saw the new JChannel talking to old connections. To fix
the problem, the customer had to shut down all of the cluster members and bring them up
again. Once we allowed the shutdown more time to complete, we were able to shut down the
JGroups JChannel and then bring up a new one without any problems. The customer is now
running without problems.
When we were debugging and trying to figure out why the shutdown took so long, it
appeared that the shutdown was always stuck in FD_SOCK, waiting for something (a
connection close?).
Split Cluster Never Recovers
----------------------------
Key: JGRP-1058
URL:
https://jira.jboss.org/jira/browse/JGRP-1058
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6.7
Environment: Suse Linux
Reporter: Stuart Jensen
Assignee: Bela Ban
Priority: Critical
Fix For: 2.6.14
We are using JGroups Version 2.6.7 GA.
When a cluster spans at least two subnets, cluster members become disconnected and the
only way to get them to reconnect to the cluster is to bring all of the processes down and
bring them back up at the same time.
Bouncing one box at a time does not work. We have not seen this issue at all when all of
the cluster members are in the same subnet.
This also happened in JGroups version 2.3 SP1.
This is an intermittent problem. Customers can normally run for several days without
issue. Then the cluster will split and never fix itself. The only solution is to bring
down all boxes.
The configuration that is active when the situation occurs is:
TCP(start_port=7801;external_addr=192.168.218.62):
TCPPING(initial_hosts=192.168.218.62[7801],192.168.128.62[7801];port_range=2;timeout=3500;num_initial_members=2;up_thread=true;down_thread=true):
MERGE2(min_interval=5000;max_interval=10000):
FD_SOCK(bind_addr=192.168.218.62):
FD(shun=true;timeout=2500;max_tries=5;up_thread=true;down_thread=true):
VERIFY_SUSPECT(timeout=2000;down_thread=false;up_thread=false):
pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):
pbcast.STATE_TRANSFER(down_thread=false;up_thread=false):
pbcast.GMS(join_timeout=60000;join_retry_timeout=60000;shun=true;print_local_addr=true;down_thread=true;up_thread=true)
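As a side note on the failure-detection settings above: with FD configured as
timeout=2500 and max_tries=5, an unresponsive member is suspected only after five
consecutive missed heartbeats, and VERIFY_SUSPECT then double-checks for another
2000 ms. A small arithmetic sketch (the formula is my reading of the FD and
VERIFY_SUSPECT parameters, not code taken from JGroups):

```java
public class FdLatency {
    // Worst-case time before an unresponsive member is declared suspect:
    // max_tries consecutive heartbeat timeouts, then the VERIFY_SUSPECT window.
    static long suspicionDelayMs(long fdTimeoutMs, int maxTries, long verifyTimeoutMs) {
        return fdTimeoutMs * maxTries + verifyTimeoutMs;
    }

    public static void main(String[] args) {
        // Values from the stack above: FD(timeout=2500;max_tries=5),
        // VERIFY_SUSPECT(timeout=2000).
        System.out.println(suspicionDelayMs(2500, 5, 2000)); // prints 14500
    }
}
```

So on this configuration a hung member takes roughly 14.5 seconds to be suspected and
verified, which is worth keeping in mind when reading the logs.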
I will be posting logs from the customer's site shortly.
--
This message is automatically generated by JIRA.