I am running into an issue when something goes wrong with several nodes in the cluster:
the surviving nodes somehow do not evict the troublesome members and start accumulating
messages.
The current config looks like this:
| <property name="isolationLevel"
value="REPEATABLE_READ" />
| <property name="cacheMode" value="REPL_ASYNC"
/>
| <property name="clusterName"
value="${treeCache.clusterName}" />
| <property name="useReplQueue" value="false"
/>
| <property name="replQueueInterval" value="0"
/>
| <property name="replQueueMaxElements" value="0"
/>
| <property name="fetchInMemoryState"
value="true" />
| <property name="initialStateRetrievalTimeout"
value="20000" />
| <property name="syncReplTimeout" value="20000"
/>
| <property name="lockAcquisitionTimeout"
value="5000" />
| <property name="useRegionBasedMarshalling"
value="false" />
| <property name="clusterProperties"
| value="${treeCache.clusterProperties}" />
| <property name="serviceName">
| <bean class="javax.management.ObjectName">
| <constructor-arg
value="jboss.cache:service=${treeCache.clusterName},name=${treeCache.instanceName}"/>
| </bean>
| </property>
| <property name="evictionPolicyClass"
value="org.jboss.cache.eviction.LRUPolicy"/>
| <property name="maxAgeSeconds"
value="${treeCache.eviction.maxAgeSeconds}"/>
| <property name="maxNodes"
value="${treeCache.eviction.maxNodes}"/>
| <property name="timeToLiveSeconds"
value="${treeCache.eviction.timeToLiveSeconds}"/>
|
The jgroups stack is this:
| treeCache.clusterProperties=UDP(ip_mcast=true;ip_ttl=64;loopback=false;mcast_addr=${treeCache.mcastAddress};mcast_port=${treeCache.mcastPort};mcast_recv_buf_size=80000;mcast_send_buf_size=150000;ucast_recv_buf_size=80000;ucast_send_buf_size=150000;bind_addr=${treeCache.bind_addr}):\
| PING(down_thread=false;num_initial_members=3;timeout=2000;up_thread=false):\
| MERGE2(max_interval=20000;min_interval=10000):\
| FD_SOCK(down_thread=false;up_thread=false):\
| VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):\
| pbcast.NAKACK(down_thread=false;gc_lag=50;retransmit_timeout=600,1200,2400,4800;up_thread=false):\
| pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):\
| UNICAST(down_thread=false;timeout=600,1200,2400):\
| FRAG(down_thread=false;frag_size=8192;up_thread=false):\
| pbcast.GMS(join_retry_timeout=2000;join_timeout=5000;print_local_addr=true;shun=true):\
| pbcast.STATE_TRANSFER(down_thread=true;up_thread=true)
|
The cluster has 12 nodes, and I had this situation occur when 3 of the nodes failed, which
provoked the ops team into restarting 9 of them. The remaining 3 all went OOM quickly.
Analysing the heap dump post-mortem, I see this:
org.jgroups.protocols.pbcast.NAKACK retained size=245MB
My first step is to add FD to the stack to address the issue of failure detection not
working properly in some cases. Then I would like to limit the size of the NAKACK
structure (even if this means losing consistency across the cluster): is this possible at
all? What are your suggestions?
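For what it's worth, here is a sketch of the change I have in mind. The placement of FD
between FD_SOCK and VERIFY_SUSPECT, the timeout/max_tries values, and the
max_xmit_buf_size attribute on NAKACK (which, as I read the JGroups 2.x docs, bounds the
retransmit buffer at the cost of possibly dropping messages) are assumptions I have not
tested yet:
| FD_SOCK(down_thread=false;up_thread=false):\
| FD(timeout=10000;max_tries=5;shun=true;down_thread=false;up_thread=false):\
| VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):\
| pbcast.NAKACK(down_thread=false;gc_lag=50;max_xmit_buf_size=50000;retransmit_timeout=600,1200,2400,4800;up_thread=false):\
Would that be the right direction, or is there a better way to cap NAKACK's memory use
(for example, max_bytes on pbcast.STABLE to force stability rounds based on the amount of
data received)?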