[jboss-user] [JBossCache] - wrong coordinator causes join to fail

bruyeron do-not-reply at jboss.com
Tue Nov 7 07:22:25 EST 2006


I recently had a problem in production: one of the instances in the cluster failed and had to be terminated (via kill -9).
That instance is part of a cluster of 4 servers running 12 cache instances in total (1 JVM per server, 3 caches per JVM) in REPL_ASYNC mode.

After the node failure, we restarted one of the JVMs, and then restarted 2 of the remaining JVMs. To keep things simple: we first restarted B, then A and D, and left C running throughout.

We noticed the following message in the logs of A, B and D after the restart:
 06/11/2006 14:10:24  WARN [ClientGmsImpl.java:126] - join(A:32937) sent to B:32955 timed out, retrying

B:32955 was the coordinator before B was killed with kill -9. It seems that C (the only member that was never restarted) incorrectly thinks that B:32955 is still the coordinator. Here's the protocol stack I am using:
UDP(ip_mcast=true;ip_ttl=64;loopback=false;mcast_addr=${treeCache.mcastAddress};mcast_port=${treeCache.mcastPort};mcast_recv_buf_size=80000;mcast_send_buf_size=150000;ucast_recv_buf_size=80000;ucast_send_buf_size=150000;bind_addr=${treeCache.bind_addr}):\
PING(down_thread=false;num_initial_members=3;timeout=2000;up_thread=false):\
MERGE2(max_interval=20000;min_interval=10000):\
FD_SOCK:\
VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):\
pbcast.NAKACK(down_thread=false;gc_lag=50;retransmit_timeout=600,1200,2400,4800;up_thread=false):\
pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):\
UNICAST(down_thread=false;;timeout=600,1200,2400):\
FRAG(down_thread=false;frag_size=8192;up_thread=false):\
pbcast.GMS(join_retry_timeout=2000;join_timeout=5000;print_local_addr=true;shun=true):\
pbcast.STATE_TRANSFER(down_thread=true;up_thread=true)

When I tried to replicate this scenario on my dev system, failure detection worked and a new coordinator was successfully elected, so I think I may have hit a borderline condition in production.
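
To make the expected behaviour concrete, here is a minimal sketch of how I understand coordinator selection. It is plain Java, not JGroups code: the class `ViewSketch` and its methods are hypothetical names, and it assumes the coordinator is simply the first member of the current view. The member "C" stands in for the surviving node, whose real address I have not listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ViewSketch {
    // Assumption: the coordinator is the first member of the ordered view.
    static String coordinator(List<String> view) {
        return view.isEmpty() ? null : view.get(0);
    }

    // Builds the next view by dropping a member that failure detection
    // has suspected and verified as dead.
    static List<String> excludeMember(List<String> view, String suspected) {
        List<String> next = new ArrayList<>(view);
        next.remove(suspected);
        return next;
    }

    public static void main(String[] args) {
        // View held by C before the crash: B:32955 is first, so it coordinates.
        List<String> staleView = Arrays.asList("B:32955", "C");

        // Expected path: B:32955 is excluded and C takes over.
        List<String> newView = excludeMember(staleView, "B:32955");
        System.out.println("after exclusion: " + coordinator(newView));

        // Observed path: the exclusion never happens, so joiners are still
        // pointed at the dead address and their join requests time out.
        System.out.println("stale view:      " + coordinator(staleView));
    }
}
```

If C never installs a view without B:32955, every joiner keeps being told that the dead address is the coordinator, which would match the "timed out, retrying" warning above.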

Any ideas on what could be going on?

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3983716#3983716
