[jboss-user] [JBossCache] - configuration question: how to limit size of NAKACK structur

Fri Dec 1 06:32:24 EST 2006

I am running into a problem when several nodes in the cluster go bad: the surviving node somehow does not evict the troublesome nodes and starts accumulating messages.

The current config looks like this:

  |                 <property name="isolationLevel" value="REPEATABLE_READ" />
  |                 <property name="cacheMode" value="REPL_ASYNC" />
  |                 <property name="clusterName" value="${treeCache.clusterName}" />
  |                 <property name="useReplQueue" value="false" />
  |                 <property name="replQueueInterval" value="0" />
  |                 <property name="replQueueMaxElements" value="0" />
  |                 <property name="fetchInMemoryState" value="true" />
  |                 <property name="initialStateRetrievalTimeout" value="20000" />
  |                 <property name="syncReplTimeout" value="20000" />
  |                 <property name="lockAcquisitionTimeout" value="5000" />
  |                 <property name="useRegionBasedMarshalling" value="false" />
  |                 <property name="clusterProperties"
  |                         value="${treeCache.clusterProperties}" />
  |                 <property name="serviceName">
  |                         <bean class="javax.management.ObjectName">
  |                                 <constructor-arg value="jboss.cache:service=${treeCache.clusterName},name=${treeCache.instanceName}"/>
  |                         </bean>
  |                 </property>
  |                 <property name="evictionPolicyClass" value="org.jboss.cache.eviction.LRUPolicy"/>
  |                 <property name="maxAgeSeconds" value="${treeCache.eviction.maxAgeSeconds}"/>
  |                 <property name="maxNodes" value="${treeCache.eviction.maxNodes}"/>
  |                 <property name="timeToLiveSeconds" value="${treeCache.eviction.timeToLiveSeconds}"/>
  | 

The jgroups stack is this:

  | treeCache.clusterProperties=UDP(ip_mcast=true;ip_ttl=64;loopback=false;mcast_addr=${treeCache.mcastAddress};mcast_port=${treeCache.mcastPort};mcast_recv_buf_size=80000;mcast_send_buf_size=150000;ucast_recv_buf_size=80000;ucast_send_buf_size=150000;bind_addr=${treeCache.bind_addr}):\
  | PING(down_thread=false;num_initial_members=3;timeout=2000;up_thread=false):\
  | MERGE2(max_interval=20000;min_interval=10000):\
  | FD_SOCK(down_thread=false;up_thread=false):\
  | VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):\
  | pbcast.NAKACK(down_thread=false;gc_lag=50;retransmit_timeout=600,1200,2400,4800;up_thread=false):\
  | pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):\
  | UNICAST(down_thread=false;timeout=600,1200,2400):\
  | FRAG(down_thread=false;frag_size=8192;up_thread=false):\
  | pbcast.GMS(join_retry_timeout=2000;join_timeout=5000;print_local_addr=true;shun=true):\
  | pbcast.STATE_TRANSFER(down_thread=true;up_thread=true)
  | 

The cluster has 12 nodes, and I had this situation occur when 3 of the nodes failed, which prompted the ops team to restart 9 of them. The remaining 3 all went OOM quickly. Analysing the heap dump post-mortem, I see this:

org.jgroups.protocols.pbcast.NAKACK retained size=245MB

My first step is to add FD to the stack to address failure detection not working properly in some cases. Then I would like to limit the size of the NAKACK structure (even if this means losing consistency across the cluster): is this possible at all? What are your suggestions?
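
A rough sketch of what that first step might look like, as a fragment of the treeCache.clusterProperties value shown above. Only the FD line is new. Its timeout and max_tries values are untested placeholders (assumptions, not recommendations), and shun=true simply mirrors the existing GMS setting. The comment lines are for explanation only and would not go inside the continued property value.

```properties
# Fragment only: FD placed between the existing FD_SOCK and VERIFY_SUSPECT entries.
# Assumed values: timeout=2500 and max_tries=5 are placeholders, not tuned settings.
FD_SOCK(down_thread=false;up_thread=false):\
FD(down_thread=false;max_tries=5;shun=true;timeout=2500;up_thread=false):\
VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):\
```

With these placeholder values, an unresponsive member would be suspected after roughly timeout x max_tries, i.e. about 12.5 seconds, before VERIFY_SUSPECT double-checks it.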

View the original post : http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3990413#3990413
