[jboss-user] [JBoss Cache: Core Edition] - Re: Failure Detection when group coordinator dies

Fri Jul 18 03:47:08 EDT 2008

"manik.surtani at jboss.com" wrote : Have you tried using both?

I guess we could try that and see if it helps...

Below is a more detailed example of how our cluster hangs. It isn't exactly the same situation I described above (a new coordinator is assigned when the current one dies), but it still involves a coordinator-related problem.

Nodes in the cluster: A,B,C,D,E,F

1. Node A (the current coordinator) is shut down
 -> Node C becomes the new coordinator

2. Node A is restarted
 -> Node A sees two candidates for the coordinator: itself and C 
 -> Node A's join message to node C times out, A is unable to join the cluster

3. Node E is restarted
 -> Node E sees two candidates for the coordinator: node A (node A is currently dead!) and C 
 -> Node E's join message to node C times out, E is unable to join the cluster

4. Node C is shut down
 -> Node B becomes the new coordinator

5. Nodes A, E and C are restarted
 -> each node is able to join the cluster

Summary of the problems encountered:
- nodes were unable to join the cluster once it had been assigned a new coordinator (C)
- even though the ex-coordinator (A) was down, it was still seen as a candidate for coordinator
- the new coordinator (C) had to be shut down and another coordinator (B) elected in order to get the cluster working again

I'm wondering if the problem might lie in the shutdown of the original coordinator (A). Here are the log messages of node A (10.195.0.121) when it is shut down. The missing ACK is from node C (10.195.0.123), which becomes the new coordinator:

  | 12:00:09,500  INFO [resin-destroy] TreeCache:1616 - stopService(): closing the channel
  | 12:00:11,567  WARN [ViewHandler] GMS:409 - failed to collect all ACKs (5) for view [10.195.0.121:60413|44] [10.195.0.123:48908, 10.195.0.112:35954, 10.195.0.122:54362, 10.195.0.120:38607, 10.195.0.105:54567] after 2000ms, missing ACKs from [10.195.0.123:48908] (received=[10.195.0.112:35954, 10.195.0.120:38607, 10.195.0.105:54567, 10.195.0.121:60413, 10.195.0.122:54362]), local_addr=10.195.0.121:60413
  | 12:00:11,572  INFO [resin-destroy] TreeCache:1622 - stopService(): stopping the dispatcher
  | 12:00:12,073  WARN [resin-destroy] TreeCache:413 - Unexpected error during removal. jboss.cache:service=TreeCache-Invalidation-Cluster-prod
  | javax.management.InstanceNotFoundException: jboss.system:service=ServiceController
  | 	at com.caucho.jmx.AbstractMBeanServer.invoke(AbstractMBeanServer.java:728)
  | 	at org.jboss.system.ServiceMBeanSupport.postDeregister(ServiceMBeanSupport.java:409)
  | 	at com.caucho.jmx.MBeanContext.unregisterMBean(MBeanContext.java:304)
  | 	at com.caucho.jmx.MBeanContext.destroy(MBeanContext.java:565)
  | 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  | 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  | 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  | 	at java.lang.reflect.Method.invoke(Method.java:585)
  | 	at com.caucho.loader.WeakCloseListener.classLoaderDestroy(WeakCloseListener.java:86)
  | 	at com.caucho.loader.Environment.closeGlobal(Environment.java:621)
  | 	at com.caucho.server.resin.ResinServer.destroy(ResinServer.java:653)
  | 	at com.caucho.server.resin.Resin$1.run(Resin.java:639)
  | 
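
To make the WARN line above easier to read: the missing ACK is just the set of addresses expected to acknowledge the view minus the set that actually did. Below is a minimal sketch of that set difference in plain Java. It is not JGroups code; the class and method names (MissingAcks, missing) are made up for illustration, and the addresses are copied from the log line.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

public class MissingAcks {

    // Hypothetical helper: returns the expected addresses that did not ACK,
    // in the order they appear in 'expected'.
    static Set<String> missing(Collection<String> expected, Collection<String> received) {
        Set<String> result = new LinkedHashSet<>(expected);
        result.removeAll(received);
        return result;
    }

    public static void main(String[] args) {
        // Members of view [10.195.0.121:60413|44], as listed in the log line
        Set<String> expected = new LinkedHashSet<>(Arrays.asList(
                "10.195.0.123:48908", "10.195.0.112:35954", "10.195.0.122:54362",
                "10.195.0.120:38607", "10.195.0.105:54567"));
        // The "received=" list from the same log line
        Set<String> received = new LinkedHashSet<>(Arrays.asList(
                "10.195.0.112:35954", "10.195.0.120:38607", "10.195.0.105:54567",
                "10.195.0.121:60413", "10.195.0.122:54362"));

        System.out.println(missing(expected, received)); // prints [10.195.0.123:48908]
    }
}
```

So the only view member that never acknowledged the view sent by A is node C.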

When node A (10.195.0.121) is shut down, nodes B, D, E and F log:

  | 12:00:09,559  INFO [UpHandler (STATE_TRANSFER)] TreeCache:5673 - viewAccepted(): [10.195.0.121:60413|44] [10.195.0.123:48908, 10.195.0.112:35954, 10.195.0.122:54362, 10.195.0.120:38607, 10.195.0.105:54567]
  | 12:00:36,662  INFO [UpHandler (STATE_TRANSFER)] TreeCache:5673 - viewAccepted(): [10.195.0.123:48910|44] [10.195.0.123:48910, 10.195.0.112:35956, 10.195.0.122:54364, 10.195.0.120:38610, 10.195.0.105:54569]
  | 

but node C (10.195.0.123) only logs:

  | 12:00:36,662 INFO [UpHandler (STATE_TRANSFER)] TreeCache:5673 - viewAccepted(): [10.195.0.123:48910|44] [10.195.0.123:48910, 10.195.0.112:35956, 10.195.0.122:54364, 10.195.0.120:38610, 10.195.0.105:54569] 
  | 

And here's what we get when we try to start nodes A and E while C is the coordinator:

  | 16:03:06,847  WARN [DownHandler (GMS)] GMS:339 - there was more than 1 candidate for coordinator: {10.195.0.121:60413=1, 10.195.0.123:48908=3}
  | 16:03:11,887  WARN [DownHandler (GMS)] GMS:127 - join(10.195.0.112:42521) sent to 10.195.0.123:48908 timed out, retrying
  | 16:03:15,907  WARN [DownHandler (GMS)] GMS:339 - there was more than 1 candidate for coordinator: {10.195.0.121:60413=1, 10.195.0.123:48908=3}
  | 16:03:20,920  WARN [DownHandler (GMS)] GMS:127 - join(10.195.0.112:42521) sent to 10.195.0.123:48908 timed out, retrying
  | ...
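
As far as I can tell, the map printed in the "more than 1 candidate" warning is a vote count: each discovery response names the coordinator that the responding node believes in, and the joiner picks the address with the most votes. Below is a minimal sketch of such a tally in plain Java. It is not the actual JGroups code; the class and method names (CoordinatorVotes, tally, winner) are hypothetical, and the four responses are made up only to reproduce the counts in the log.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CoordinatorVotes {

    // Count how many responders named each address as coordinator.
    static Map<String, Integer> tally(List<String> reportedCoordinators) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (String coord : reportedCoordinators) {
            votes.merge(coord, 1, Integer::sum);
        }
        return votes;
    }

    // Return the address with the most votes, or null if there are no votes.
    // On a tie, the address that was seen first wins.
    static String winner(Map<String, Integer> votes) {
        String best = null;
        int most = 0;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > most) {
                best = e.getKey();
                most = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed responses: one responder still names A, three name C.
        List<String> responses = Arrays.asList(
                "10.195.0.121:60413",
                "10.195.0.123:48908",
                "10.195.0.123:48908",
                "10.195.0.123:48908");

        Map<String, Integer> votes = tally(responses);
        if (votes.size() > 1) {
            System.out.println("more than 1 candidate for coordinator: " + votes);
        }
        System.out.println("join would be sent to " + winner(votes));
    }
}
```

With these inputs the majority is 10.195.0.123:48908, which matches the address the join requests in the log are sent to. The single vote for 10.195.0.121:60413 is the stale entry for the ex-coordinator.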

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4165253#4165253