I'm using JBossCache 1.4.1SP4 embedded into a server based application (ie, not using
JBoss App Server). We recently saw a problem that lead to all updates to the cache to
fail.
We're running two servers with each server running JBossCache with replication and
each server has their own cacheloader configured for a local cache loader using the
BdbjeCacheLoader.
Everything was running fine for some time with many updates when we got an exception when
trying to commit a transaction:
| java.lang.RuntimeException:
| at
org.jboss.cache.interceptors.TxInterceptor$LocalSynchronizationHandler.beforeCompletion(TxInterceptor.java:1173)
| at
org.jboss.cache.interceptors.OrderedSynchronizationHandler.beforeCompletion(OrderedSynchronizationHandler.java:75)
| at
org.objectweb.jotm.SubCoordinator.doBeforeCompletion(SubCoordinator.java:1487)
| at
org.objectweb.jotm.SubCoordinator.commit_one_phase(SubCoordinator.java:416)
| at org.objectweb.jotm.TransactionImpl.commit(TransactionImpl.java:239)
| at org.objectweb.jotm.Current.commit(Current.java:488)
| at
danger.service.tpns.SubscriptionManager.subscribe(SubscriptionManager.java:198)
| at
danger.service.tpns.SubscriptionManager.subscribeDevice(SubscriptionManager.java:150)
| at danger.service.tpns.ThirdPartyCalls.subscribe(ThirdPartyCalls.java:90)
| at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
| at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
| at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
| at java.lang.reflect.Method.invoke(Method.java:585)
| at org.apache.xmlrpc.Invoker.execute(Unknown Source)
| at org.apache.xmlrpc.XmlRpcWorker.invokeHandler(Unknown Source)
| at org.apache.xmlrpc.XmlRpcWorker.execute(Unknown Source)
| at org.apache.xmlrpc.XmlRpcServer.execute(Unknown Source)
| at org.apache.xmlrpc.XmlRpcServer.execute(Unknown Source)
| at org.apache.xmlrpc.WebServer$Connection.run(Unknown Source)
| at org.apache.xmlrpc.WebServer$Runner.run(Unknown Source)
| at java.lang.Thread.run(Thread.java:595)
| Caused by: org.jboss.cache.ReplicationException: rsp=sender=10.253.229.164:44284,
retval=null, received=false, suspected=false
| at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4436)
| at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4358)
| at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4469)
| at
org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:110)
| at
org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:88)
| at
org.jboss.cache.interceptors.ReplicationInterceptor.runPreparePhase(ReplicationInterceptor.java:152)
| at
org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:69)
| at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
| at
org.jboss.cache.interceptors.CacheStoreInterceptor.invoke(CacheStoreInterceptor.java:136)
| at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
| at
org.jboss.cache.interceptors.TxInterceptor.runPreparePhase(TxInterceptor.java:873)
| at
org.jboss.cache.interceptors.TxInterceptor$LocalSynchronizationHandler.beforeCompletion(TxInterceptor.java:1147)
| ... 20 more
| Caused by: org.jboss.cache.lock.TimeoutException: Response timed out:
sender=10.253.229.164:44284, retval=null, received=false, suspected=false
| at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4434)
| ... 31 more
|
Since that exception, all attempts to update the cache in either server replica fails with
the following:
| Cannot add subscription to cache
| java.lang.RuntimeException:
| at
org.jboss.cache.interceptors.TxInterceptor$LocalSynchronizationHandler.beforeCompletion(TxInterceptor.java:1173)
| at
org.jboss.cache.interceptors.OrderedSynchronizationHandler.beforeCompletion(OrderedSynchronizationHandler.java:75)
| at
org.objectweb.jotm.SubCoordinator.doBeforeCompletion(SubCoordinator.java:1487)
| at
org.objectweb.jotm.SubCoordinator.commit_one_phase(SubCoordinator.java:416)
| at org.objectweb.jotm.TransactionImpl.commit(TransactionImpl.java:239)
| at org.objectweb.jotm.Current.commit(Current.java:488)
| at
danger.service.tpns.SubscriptionManager.subscribe(SubscriptionManager.java:198)
| at
danger.service.tpns.SubscriptionManager.subscribeDevice(SubscriptionManager.java:150)
| at danger.service.tpns.ThirdPartyCalls.subscribe(ThirdPartyCalls.java:90)
| at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
| at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
| at java.lang.reflect.Method.invoke(Method.java:585)
| at org.apache.xmlrpc.Invoker.execute(Unknown Source)
| at org.apache.xmlrpc.XmlRpcWorker.invokeHandler(Unknown Source)
| at org.apache.xmlrpc.XmlRpcWorker.execute(Unknown Source)
| at org.apache.xmlrpc.XmlRpcServer.execute(Unknown Source)
| at org.apache.xmlrpc.XmlRpcServer.execute(Unknown Source)
| at org.apache.xmlrpc.WebServer$Connection.run(Unknown Source)
| at org.apache.xmlrpc.WebServer$Runner.run(Unknown Source)
| at java.lang.Thread.run(Thread.java:595)
| Caused by: com.sleepycat.je.DeadlockException: (JE 3.2.13) Lock expired. Locker
716_Thread-7_Txn: waited for lock on database=tpns node=71 type=WRITE grant=WAIT_NEW
timeoutMillis=500 startTime=1190284084053 endTime=1190284084555
| Owners: [<LockInfo locker="645_UpHandler (STATE_TRANSFER)_Txn"
type="WRITE"/>]
| Waiters: []
|
| at com.sleepycat.je.txn.LockManager.lock(LockManager.java:266)
| at com.sleepycat.je.txn.Txn.lockInternal(Txn.java:296)
| at com.sleepycat.je.txn.Locker.lock(Locker.java:257)
| at
com.sleepycat.je.dbi.CursorImpl.lockLNDeletedAllowed(CursorImpl.java:2349)
| at com.sleepycat.je.tree.Tree.insert(Tree.java:2575)
| at com.sleepycat.je.dbi.CursorImpl.putLN(CursorImpl.java:988)
| at com.sleepycat.je.dbi.CursorImpl.putNoOverwrite(CursorImpl.java:1039)
| at com.sleepycat.je.Cursor.putAllowPhantoms(Cursor.java:863)
| at com.sleepycat.je.Cursor.putNoNotify(Cursor.java:798)
| at com.sleepycat.je.Cursor.putInternal(Cursor.java:752)
| at com.sleepycat.je.Cursor.putNoOverwrite(Cursor.java:342)
| at
org.jboss.cache.loader.bdbje.BdbjeCacheLoader.createParentNodes(BdbjeCacheLoader.java:555)
| at
org.jboss.cache.loader.bdbje.BdbjeCacheLoader.doPut(BdbjeCacheLoader.java:392)
| at
org.jboss.cache.loader.bdbje.BdbjeCacheLoader.doPut(BdbjeCacheLoader.java:519)
| at
org.jboss.cache.loader.bdbje.BdbjeCacheLoader.performTransaction(BdbjeCacheLoader.java:764)
| at
org.jboss.cache.loader.bdbje.BdbjeCacheLoader.prepare(BdbjeCacheLoader.java:714)
| at
org.jboss.cache.loader.AsyncCacheLoader.prepare(AsyncCacheLoader.java:262)
| at
org.jboss.cache.interceptors.CacheStoreInterceptor.prepareCacheLoader(CacheStoreInterceptor.java:258)
| at
org.jboss.cache.interceptors.CacheStoreInterceptor.invoke(CacheStoreInterceptor.java:131)
| at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
| at
org.jboss.cache.interceptors.TxInterceptor.runPreparePhase(TxInterceptor.java:873)
| at
org.jboss.cache.interceptors.TxInterceptor$LocalSynchronizationHandler.beforeCompletion(TxInterceptor.java:1147)
| ... 19 more
|
The cache configuration I'm using is as follows
| <?xml version="1.0" encoding="UTF-8"?>
| <server>
| <classpath codebase="/danger/local/lib"
archives="jboss-cache.jar,jgroups-all.jar" />
| <mbean code="org.jboss.cache.aop.PojoCache"
name="jboss.cache:service=PojoCache">
|
| <depends>jboss:service=Naming</depends>
| <depends>jboss:service=TransactionManager</depends>
| <attribute
name="TransactionManagerLookupClass">danger.service.cluster.JotmTransactionManagerLookup</attribute>
| <attribute
name="IsolationLevel">REPEATABLE_READ</attribute>
| <attribute name="CacheMode">REPL_SYNC</attribute>
| <attribute name="UseReplQueue">false</attribute>
| <attribute name="ReplQueueInterval">0</attribute>
| <attribute name="ReplQueueMaxElements">0</attribute>
| <attribute name="ClusterName">tpns</attribute>
|
| <attribute name="ClusterConfig">
| <config>
| <UDP mcast_addr="228.8.8.7" mcast_port="45567"
ip_ttl="64" ip_mcast="true"
| mcast_send_buf_size="150000"
mcast_recv_buf_size="80000"
| ucast_send_buf_size="150000"
ucast_recv_buf_size="80000" loopback="false" />
| <PING timeout="2000" num_initial_members="3"
up_thread="false" down_thread="false" />
| <MERGE2 min_interval="10000"
max_interval="20000" />
| <!-- <FD shun="true" up_thread="true"
down_thread="true" />-->
| <FD_SOCK />
| <VERIFY_SUSPECT timeout="1500"
up_thread="false" down_thread="false" />
| <pbcast.NAKACK gc_lag="50"
retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192"
| up_thread="false"
down_thread="false" />
| <UNICAST timeout="600,1200,2400"
window_size="100" min_threshold="10" down_thread="false"
/>
| <pbcast.STABLE desired_avg_gossip="20000"
up_thread="false" down_thread="false" />
| <FRAG frag_size="8192" down_thread="false"
up_thread="false" />
| <pbcast.GMS join_timeout="5000"
join_retry_timeout="2000" shun="true"
print_local_addr="true" />
| <pbcast.STATE_TRANSFER up_thread="true"
down_thread="true" />
| </config>
| </attribute>
|
| <attribute name="FetchStateOnStartup">true</attribute>
| <attribute
name="InitialStateRetrievalTimeout">5000</attribute>
| <attribute name="SyncReplTimeout">10000</attribute>
| <attribute
name="LockAcquisitionTimeout">15000</attribute>
| <attribute
name="EvictionPolicyClass">org.jboss.cache.aop.eviction.AopLRUPolicy</attribute>
| <attribute name="EvictionPolicyConfig">
| <config>
| <attribute
name="wakeUpIntervalSeconds">5</attribute>
| <region name="/_default_">
| <attribute name="maxNodes">0</attribute>
| <attribute
name="timeToLiveSeconds">0</attribute>
| </region>
| </config>
| </attribute>
|
| <attribute name="CacheLoaderConfiguration">
| <config>
| <passivation>false</passivation>
| <preload>/</preload>
| <shared>false</shared>
|
| <cacheloader>
|
<class>org.jboss.cache.loader.bdbje.BdbjeCacheLoader</class>
| <properties>location=/danger/tpns</properties>
| <async>false</async>
| <fetchPersistentState>true</fetchPersistentState>
| <ignoreModifications>false</ignoreModifications>
| <purgeOnStartup>false</purgeOnStartup>
| </cacheloader>
| </config>
| </attribute>
| </mbean>
| </server>
|
It seems to be the first failure which happen when attempting replication cause some lock
to be obtained in the BDBJE database and never released. Restarting both server seems to
have cleared things up, but I'm very nervous about it happening again. Anyone have
any ideas as to what might be going on or how to fix it?
View the original post :
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4087445#...
Reply to the post :
http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&a...