[jboss-user] [JBoss Cache: Core Edition] - JBoss cache instances fail to join cluster after bounce

setatum do-not-reply at jboss.com
Wed May 6 12:02:42 EDT 2009


We are experiencing a problem with a 3-node JBoss Cache setup. All three nodes start up fine and changes propagate as expected. However, if we later restart one of our app servers (or an instance dies), it may fail to rejoin the cluster. When that happens, the only fix I have found is to change the multicast address to something different and then bounce all three servers. Until then, I can restart the app server over and over again and get the same error every time JBoss Cache starts up.

We currently only use the cache for a small amount of information - one node with 153 children, one with 249 children, and one with 266 children. Each child may have one, two, or three name/value pairs added to it. When everything is working, both reads and updates are blazing fast. The only problem is the sometimes complete and utter failure to rejoin the cluster.

The environment details:

O/S: SunOS 5.10 Generic_138888-06 sun4us sparc FJSV,GPUZC-M
Java: Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_10-b03)
App Server: WebLogic Server 9.2 MP2  Mon Jun 25 01:32:01 EDT 2007 952826
JBoss Cache: jbosscache-core-3.0.3GA
Jars installed in WebLogic's domain lib: jboss-common-core.jar jboss-logging-spi.jar jbosscache-core.jar jcip-annotations.jar jgroups.jar (all from 3.0.3GA download)

Some additional IP/routing details on the three instances (just being thorough):


  | server1 (ip 10.16.106.220 netmask 255.255.255.0) netstat -nr output:
  | 
  | Routing Table: IPv4
  |   Destination           Gateway           Flags  Ref     Use     Interface 
  | -------------------- -------------------- ----- ----- ---------- --------- 
  | default              10.16.106.190        UG        1     238875           
  | 10.16.106.0          10.16.106.220        U         1     270565 fjgi0     
  | 224.0.0.0            10.16.106.220        U         1          0 fjgi0     
  | 127.0.0.1            127.0.0.1            UH     1347    8972376 lo0       
  | 
  | server2 (ip 10.16.106.221 netmask 255.255.255.0) netstat -nr output:
  | 
  | Routing Table: IPv4
  |   Destination           Gateway           Flags  Ref     Use     Interface 
  | -------------------- -------------------- ----- ----- ---------- --------- 
  | default              10.16.106.190        UG        1     339798           
  | 10.16.106.0          10.16.106.221        U         1    1248223 fjgi0     
  | 224.0.0.0            10.16.106.221        U         1          0 fjgi0     
  | 127.0.0.1            127.0.0.1            UH     1362   10112605 lo0       
  | 
  | 
  | server3 (ip 10.16.106.222 netmask 255.255.255.0) netstat -nr output:
  | 
  | Routing Table: IPv4
  |   Destination           Gateway           Flags  Ref     Use     Interface 
  | -------------------- -------------------- ----- ----- ---------- --------- 
  | default              10.16.106.190        UG        1     346621           
  | 10.16.106.0          10.16.106.222        U         1     437006 fjgi0     
  | 224.0.0.0            10.16.106.222        U         1          0 fjgi0     
  | 127.0.0.1            127.0.0.1            UH     1186   10364437 lo0       
  | 

Now the jboss-cache.xml config that is used by each of the three instances:


  | <?xml version="1.0" encoding="UTF-8" ?>
  | 
  | <server>
  |    <mbean code="org.jboss.cache.pojo.jmx.PojoCacheJmxWrapper" 
  |           name="jboss.cache:service=PojoCache">
  |       
  |       <depends>jboss:service=TransactionManager</depends>
  | 
  |       <!-- Configure the TransactionManager -->
  |       <attribute name="TransactionManagerLookupClass">
  |          org.jboss.cache.transaction.DummyTransactionManagerLookup
  |       </attribute>
  | 
  |       <!-- Isolation level : SERIALIZABLE
  |                              REPEATABLE_READ (default)
  |                              READ_COMMITTED
  |                              READ_UNCOMMITTED
  |                              NONE
  |       -->
  |       <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
  | 
  |       <!-- Valid modes are LOCAL, REPL_ASYNC and REPL_SYNC -->
  |       <attribute name="CacheMode">REPL_ASYNC</attribute>
  | 
  |       <!-- Name of cluster. Needs to be the same for all caches, 
  |            in order for them to find each other
  |       -->
  |       <attribute name="ClusterName">prodMwCluster</attribute>
  | 
  |       <!-- JGroups protocol stack properties. -->
  |       <attribute name="ClusterConfig">
  |          <config>
  |             <!-- UDP: if you have a multihomed machine, set the bind_addr 
  |                  attribute to the appropriate NIC IP address
  |             -->
  |             <!-- UDP: On Windows machines, because of the media sense feature
  |                  being broken with multicast (even after disabling media sense)
  |                  set the loopback attribute to true
  |             -->
  |             <UDP mcast_addr="228.16.106.2" mcast_port="48863"
  |                  ip_ttl="64" ip_mcast="true"
  |                  mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
  |                  ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
  |                  loopback="false"/>
  |             <PING timeout="2000" num_initial_members="3"/>
  |             <MERGE2 min_interval="10000" max_interval="20000"/>
  |             <FD shun="true"/>
  |             <FD_SOCK/>
  |             <VERIFY_SUSPECT timeout="1500"/>
  |             <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
  |                            max_xmit_size="8192"/>
  |             <UNICAST timeout="600,1200,2400,4800"/>
  |             <pbcast.STABLE desired_avg_gossip="400000"/>
  |             <FC max_credits="2000000" min_threshold="0.10"/>
  |             <FRAG2 frag_size="8192"/>
  |             <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
  |                         shun="true" print_local_addr="true"/>
  |             <pbcast.STATE_TRANSFER/>
  |          </config>
  |       </attribute>
  | 
  |       <!-- Whether or not to fetch state on joining a cluster -->
  |       <attribute name="FetchInMemoryState">true</attribute>
  | 
  |       <!-- The max amount of time (in milliseconds) we wait until the
  |            initial state (ie. the contents of the cache) are retrieved from
  |            existing members in a clustered environment
  |       -->
  |       <attribute name="InitialStateRetrievalTimeout">15000</attribute>
  | 
  |       <!-- Number of milliseconds to wait until all responses for a
  |            synchronous call have been received.
  |       -->
  |       <attribute name="SyncReplTimeout">15000</attribute>
  | 
  |       <!--  Max number of milliseconds to wait for a lock acquisition -->
  |       <attribute name="LockAcquisitionTimeout">10000</attribute>
  |    
  |    </mbean>
  | </server>
  | 
  | 

I created startup/shutdown classes for WebLogic that create the Cache instance and place it in JNDI. I won't post the entire code here, but the cache creation code in the startup class looks like this:


  |    System.out.println("JBossCache - starting up...");
  |    CacheFactory<String, String> factory = new DefaultCacheFactory<String, String>();
  |    // configFile is jboss-cache.xml
  |    Cache<String, String> cache = factory.createCache(configFile, true);
  |    System.out.println("JbossCache - started cache");
  |    // put cache into JNDI...
  | 

The corresponding shutdown class code snippet looks like this:


  |    // grabbed cache out of JNDI and unbound it from there...
  |    cache.stop();
  |    cache.destroy();
  |    System.out.println("JbossCache - stopped cache.");
  | 

Below is an example of the error that occurred this past weekend on the second of the three servers. The server had to be bounced for an unrelated configuration change, and an error was generated when the JBossCacheLoader class fired on startup (this is from the WebLogic system out logs):


  | JBossCache - starting up...
  | 
  | -------------------------------------------------------
  | GMS: address is 10.16.106.221:34622
  | -------------------------------------------------------
  | 
  | 
  | (approximately 10 seconds elapse then)
  | 
  | 
  | <May 1, 2009 9:39:23 PM CDT> <Critical> <WebLogicServer> <BEA-000362> <Server failed. Reason: 
  | 
  | There are 1 nested errors:
  | 
  | org.jboss.cache.CacheException: java.lang.reflect.InvocationTargetException
  |         at org.jboss.cache.util.reflect.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:148)
  |         at org.jboss.cache.factories.ComponentRegistry$PrioritizedMethod.invoke(ComponentRegistry.java:883)
  |         at org.jboss.cache.factories.ComponentRegistry.internalStart(ComponentRegistry.java:680)
  |         at org.jboss.cache.factories.ComponentRegistry.start(ComponentRegistry.java:561)
  |         at org.jboss.cache.invocation.CacheInvocationDelegate.start(CacheInvocationDelegate.java:301)
  |         at org.jboss.cache.DefaultCacheFactory.createCache(DefaultCacheFactory.java:119)
  |         at org.jboss.cache.DefaultCacheFactory.createCache(DefaultCacheFactory.java:94)
  |         at com.company.cache.weblogic.JBossCacheStartup.main(JBossCacheStartup.java:41)
  |         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  |         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  |         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  |         at java.lang.reflect.Method.invoke(Method.java:585)
  |         at weblogic.management.deploy.classdeployment.ClassDeploymentManager.invokeMain(ClassDeploymentManager.java:353)
  |         at weblogic.management.deploy.classdeployment.ClassDeploymentManager.invokeClass(ClassDeploymentManager.java:263)
  |         at weblogic.management.deploy.classdeployment.ClassDeploymentManager.access$000(ClassDeploymentManager.java:54)
  |         at weblogic.management.deploy.classdeployment.ClassDeploymentManager$1.run(ClassDeploymentManager.java:205)
  |         at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
  |         at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:121)
  |         at weblogic.management.deploy.classdeployment.ClassDeploymentManager.invokeClassDeployment(ClassDeploymentManager.java:198)
  |         at weblogic.management.deploy.classdeployment.ClassDeploymentManager.invokeClassDeployments(ClassDeploymentManager.java:177)
  |         at weblogic.management.deploy.classdeployment.ClassDeploymentManager.runStartupsBeforeAppActivation(ClassDeploymentManager.java:151)
  |         at weblogic.management.deploy.internal.DeploymentAdapter$4.activate(DeploymentAdapter.java:166)
  |         at weblogic.management.deploy.internal.AppTransition$2.transitionApp(AppTransition.java:30)
  |         at weblogic.management.deploy.internal.ConfiguredDeployments.transitionApps(ConfiguredDeployments.java:233)
  |         at weblogic.management.deploy.internal.ConfiguredDeployments.activate(ConfiguredDeployments.java:169)
  |         at weblogic.management.deploy.internal.ConfiguredDeployments.deploy(ConfiguredDeployments.java:123)
  |         at weblogic.management.deploy.internal.DeploymentServerService.resume(DeploymentServerService.java:173)
  |         at weblogic.management.deploy.internal.DeploymentServerService.start(DeploymentServerService.java:89)
  |         at weblogic.t3.srvr.SubsystemRequest.run(SubsystemRequest.java:64)
  |         at weblogic.work.ExecuteThread.execute(ExecuteThread.java:209)
  |         at weblogic.work.ExecuteThread.run(ExecuteThread.java:181)
  | Caused by: java.lang.reflect.InvocationTargetException
  |         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  |         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  |         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  |         at java.lang.reflect.Method.invoke(Method.java:585)
  |         at org.jboss.cache.util.reflect.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:144)
  |         ... 30 more
  | Caused by: org.jboss.cache.CacheException: Unable to connect to JGroups channel
  |         at org.jboss.cache.RPCManagerImpl.start(RPCManagerImpl.java:252)
  |         ... 35 more
  | Caused by: org.jgroups.StateTransferException: 10.16.106.221:34622 could not fetch state null from null
  |         at org.jgroups.JChannel.connect(JChannel.java:466)
  |         at org.jboss.cache.RPCManagerImpl.start(RPCManagerImpl.java:242)
  |         ... 35 more
  | Caused by: org.jgroups.StateTransferException: 10.16.106.221:34622 could not fetch state null from null
  |         at org.jgroups.JChannel.connect(JChannel.java:459)
  |         ... 36 more
  | 
  | 
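The line that matters in a trace like this is the innermost "Caused by", here the JGroups StateTransferException. As a minimal sketch (not part of the startup class above), a helper like the following could be used to log only that root cause; the class name RootCause is hypothetical, and the exceptions in main are plain JDK stand-ins for the JBoss Cache and JGroups ones.

```java
class RootCause {
    /** Follows getCause() links until the innermost Throwable is reached. */
    static Throwable of(Throwable t) {
        Throwable cur = t;
        // Stop at the end of the chain; also guard against a self-referencing cause.
        while (cur.getCause() != null && cur.getCause() != cur) {
            cur = cur.getCause();
        }
        return cur;
    }

    public static void main(String[] args) {
        // Stand-in exceptions that mimic the nesting seen in the trace.
        Exception inner = new IllegalStateException("could not fetch state");
        Exception outer = new RuntimeException("Unable to connect to JGroups channel", inner);
        System.out.println(RootCause.of(outer).getMessage());
    }
}
```

In a startup class, the catch block around the cache creation call could pass the caught exception to RootCause.of(...) and print the result before rethrowing.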
Any idea what could be causing this problem, given my configuration? I may try the JGroups probe script to see whether it gives me any more information. Otherwise I am completely at a loss. Sometimes restarts work OK, but it seems that once one fails, restarts keep failing until all three servers are restarted with a new multicast IP.

Also, if server 1 bounces and then fails to rejoin, server 2 will fail the same way if we bounce it. All three then need their config changed and have to be bounced. After that they all talk to each other again and are happy.
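One way to narrow this down independently of JBoss Cache and JGroups would be to check whether plain multicast datagrams still travel between the hosts after a failed restart. The following is a rough sketch using only java.net, not the JGroups probe script; the class name McastCheck and its command-line arguments are hypothetical. It assumes the same multicast group as in the config above (228.16.106.2), but a spare port is probably safer than 48863 so that the test packets do not reach the live JGroups stack.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

class McastCheck {
    /** Returns true if addr parses to an IP multicast address (224.0.0.0/4 for IPv4). */
    static boolean isMulticast(String addr) {
        try {
            return InetAddress.getByName(addr).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    /** Sends one datagram containing text to the multicast group. */
    static void send(String group, int port, String text) throws Exception {
        MulticastSocket sock = new MulticastSocket();
        try {
            sock.setTimeToLive(64); // same value as ip_ttl in the UDP config
            byte[] data = text.getBytes("UTF-8");
            sock.send(new DatagramPacket(data, data.length, InetAddress.getByName(group), port));
        } finally {
            sock.close();
        }
    }

    /** Waits up to timeoutMs for one datagram; returns "sender: text", or null on timeout. */
    static String receive(String group, int port, int timeoutMs) throws Exception {
        MulticastSocket sock = new MulticastSocket(port);
        try {
            // A null interface means "use the default multicast interface".
            sock.joinGroup(new InetSocketAddress(InetAddress.getByName(group), port), null);
            sock.setSoTimeout(timeoutMs);
            byte[] buf = new byte[1024];
            DatagramPacket p = new DatagramPacket(buf, buf.length);
            sock.receive(p);
            return p.getAddress().getHostAddress() + ": "
                    + new String(p.getData(), 0, p.getLength(), "UTF-8");
        } catch (SocketTimeoutException e) {
            return null;
        } finally {
            sock.close();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3 || !isMulticast(args[1])) {
            System.out.println("usage: McastCheck send|recv <multicast-group> <port>");
            return;
        }
        int port = Integer.parseInt(args[2]);
        if ("send".equals(args[0])) {
            send(args[1], port, "hello from " + InetAddress.getLocalHost().getHostName());
        } else {
            String msg = receive(args[1], port, 30000);
            System.out.println(msg == null ? "nothing received within 30 seconds" : msg);
        }
    }
}
```

Running "recv" on the node that fails to rejoin and "send" on one of the healthy nodes (and then the reverse) should show whether multicast itself is still delivered in both directions. If it is, the problem is more likely in the cluster membership or state transfer than in the network.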

Thanks for any insight.

-Scott


View the original post : http://www.jboss.org/index.html?module=bb&op=viewtopic&p=4229099#4229099
