Tagged 3.0.1.GA
by Manik Surtani
FYI, I have just tagged 3.0.1.GA, which contains some bug fixes and a
performance tweak pertaining to passivation.
* [ JBCACHE-1444 ] ObjectName's validation fails for Jbosscache
3.0 on WAS 6.1 due to ":" char in name.
* [ JBCACHE-1445 ] Data gravitation cleanup does not happen when
using single-phase commits.
* [ JBCACHE-1448 ] Jdbm and BDBJE cache loader incorrectly
reading database name from location String
* [ JBCACHE-1446 ] Optimize activations by minimizing calls to
cacheloader.exists()
This should be released shortly.
Cheers,
--
Manik Surtani
Lead, JBoss Cache
http://www.jbosscache.org
manik(a)jboss.org
Optimize use of CacheLoader.exists() with passivation
by Brian Stansberry
I've been profiling JBoss AS web session replication, and one of the
significant hits I'm seeing is from calls to File.exists() from
FileCacheLoader (see attached). In turn, those calls are due to calls
from LegacyActivationInterceptor, particularly the
removeNodeFromCacheLoader() method, which is called with every
invocation.[1] Many of these calls are on the most critical path[2], so
speeding them up can have large implications for overall cluster
performance.
I'm wondering if we can be smarter here and avoid most calls to
removeNodeFromCacheLoader()?
Is it a correct statement that, logically, with passivation the only
time it makes sense to remove a node from the cache loader is if that
invocation loaded data from the cache loader? Removing the node is
allowed because the in-memory data has become complete, and the only way
previously incomplete data can become complete is if a request has
loaded data. So, if we can pass back to the activation interceptor
information on whether a load has occurred (e.g. via a simple boolean
flag in InvocationContext), we can avoid most calls to
CacheLoader.exists().
Thoughts?
[1] ActivationInterceptor would have the same problem if I were using MVCC.
[2] Up-calls from JGroups. NAKACK only allows one such call per peer at
a time, and FC blocks web request threads based on the handling of these
calls, so the speed of these calls is the most critical path in the
whole system.
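The boolean-flag idea above could be sketched roughly as follows. This is purely illustrative: the class shapes and method names below are hypothetical stand-ins, not the actual JBoss Cache interceptor APIs. The cache-loader interceptor would mark the context when it actually loads data, and the activation interceptor would skip the cache loader entirely otherwise:

```java
// Hypothetical sketch of the proposed optimization. The real
// CacheLoaderInterceptor would call ctx.setDataLoaded(true) whenever it
// actually pulls data out of the store; the activation interceptor then
// only does the expensive exists()/remove round trip when a load occurred.
class InvocationContext {
    private boolean dataLoaded;

    void setDataLoaded(boolean loaded) { dataLoaded = loaded; }

    boolean isDataLoaded() { return dataLoaded; }
}

class ActivationInterceptor {
    int cacheLoaderCalls; // counts expensive cache-loader round trips

    void removeNodeFromCacheLoader(InvocationContext ctx, String fqn) {
        if (!ctx.isDataLoaded()) {
            // No load happened in this invocation, so the in-memory data
            // cannot have just become complete -- skip the cache loader.
            return;
        }
        cacheLoaderCalls++;
        // ... cacheLoader.exists(fqn) / cacheLoader.remove(fqn) as before ...
    }
}
```

With this in place, the common case (no load) never touches CacheLoader.exists() at all.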
--
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat
brian.stansberry(a)redhat.com
Re: [jbosscache-dev] Cache unable to write to cluster
by Manik Surtani
On 12 Nov 2008, at 10:25, Vladimir Blagojevic wrote:
> Manik Surtani wrote:
>>
>> Yes, this was always an issue with the way we used FLUSH - that
>> someone in the group could initiate a FLUSH and then die leaving
>> other members' flushBlockGates closed. TBH, apart from adding
>> timeouts to the flushBlockGate, I can't see how we would get around
>> this.
>
> Me too. I am confused how all these issues started to pop out now.
> How come they slipped for so long?
Yes, because in theory they could have happened with JBC 2.x, 1.x, etc.
>> Vladimir/Bela - in the scenario described (node initiates a FLUSH
>> and then dies) would other nodes still see a view change relating
>> to the node dying?
>
> They would. However, I will add this test case to verify it.
Ok - so we could add an extra check into the view change listener to
force an unblock if a member who initiated a FLUSH dies. We would
also have to record the address of the member initiating the FLUSH in
the flushBlockGate.
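That check could look something like this. A rough sketch under stated assumptions: class and method names are illustrative, not actual JBC code, and ReclosableLatch is reduced to a minimal stand-in. The idea is to remember who initiated the FLUSH and force the gate open when a view change shows that member has left:

```java
import java.util.List;

// Illustrative sketch: track the FLUSH initiator and unblock on view
// change if the initiator has disappeared from the membership.
class FlushTracker {
    private final ReclosableLatch flushBlockGate = new ReclosableLatch();
    private volatile Object flushInitiator; // address of the FLUSH initiator

    void block(Object initiator) {
        flushInitiator = initiator;
        flushBlockGate.close();
    }

    void unblock() {
        flushInitiator = null;
        flushBlockGate.open();
    }

    boolean isBlocked() { return !flushBlockGate.isOpen(); }

    // Called from the membership listener on every view change.
    void viewAccepted(List<Object> newMembers) {
        Object initiator = flushInitiator;
        if (initiator != null && !newMembers.contains(initiator)) {
            // The FLUSH initiator died without unblocking; open the gate
            // so threads waiting in await() are not stuck forever.
            unblock();
        }
    }
}

// Minimal stand-in for the gate: can be closed and reopened.
class ReclosableLatch {
    private boolean open = true;

    synchronized void close() { open = false; }

    synchronized void open() { open = true; notifyAll(); }

    synchronized boolean isOpen() { return open; }
}
```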
Cheers
--
Manik Surtani
Lead, JBoss Cache
manik(a)jboss.org
Re: [jbosscache-dev] Cache unable to write to cluster
by Bela Ban
Can't we unlock the blocked flush gates when a node leaves/crashes?
Vladimir Blagojevic wrote:
> Manik Surtani wrote:
>>
>> Yes, this was always an issue with the way we used FLUSH - that
>> someone in the group could initiate a FLUSH and then die leaving
>> other members' flushBlockGates closed. TBH, apart from adding
>> timeouts to the flushBlockGate, I can't see how we would get around
>> this.
>
> Me too. I am confused how all these issues started to pop out now. How
> come they slipped for so long?
>>
>> Vladimir/Bela - in the scenario described (node initiates a FLUSH and
>> then dies) would other nodes still see a view change relating to the
>> node dying?
>
> They would. However, I will add this test case to verify it.
>
> Cheers.
--
Bela Ban
Lead JGroups / Clustering Team
JBoss - a division of Red Hat
Cache unable to write to cluster
by Brian Stansberry
We just found an intermittent failure in the EJB3 testsuite[1] that's
more a JBC or JGroups issue. This is with JBC 3.0.0.CR4 and JG 2.6.6.
I'm speculating it relates to FLUSH work Vladimir's been doing[2][3].
The issue is an inability to replicate a put:
Caused by: org.jboss.cache.lock.TimeoutException: State retrieval timed
out waiting for flush unblock.
at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:455)
at ....
org.jboss.cache.invocation.CacheInvocationDelegate.put(CacheInvocationDelegate.java:560)
at org.jboss.ha.cachemanager.CacheManagerManagedCache.put(CacheManagerManagedCache.java:285)
at org.jboss.ejb3.cache.tree.StatefulTreeCache.putInCache(StatefulTreeCache.java:511)
at org.jboss.ejb3.cache.tree.StatefulTreeCache.create(StatefulTreeCache.java:123)
... 70 more
Looking at RPCManagerImpl.java:455 we have:
if (channel.flushSupported() &&
    !flushBlockGate.await(configuration.getStateRetrievalTimeout(),
                          TimeUnit.MILLISECONDS))
{
   throw new TimeoutException("State retrieval timed out waiting for flush unblock.");
}
Basically, it's failing on flushBlockGate.await(). Looking at the use of
flushBlockGate, the gate is closed in block() and opened in unblock().
*Assuming* no bug in ReclosableLatch, it seems block() is getting called
here with no subsequent call to unblock(). (Unfortunately, the logs
related to this failure are gone, so I can't prove that.)
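For context, the gate semantics described above amount to a reopenable latch with a timed await. A simplified stand-in (not the actual ReclosableLatch source) behaves like this, and shows exactly how a missing unblock() produces the timeout path:

```java
import java.util.concurrent.TimeUnit;

// Simplified stand-in for the gate semantics: close() is what block()
// does, open() is what unblock() does, and await() returns false on
// timeout if nobody ever reopens the gate -- the TimeoutException path.
class SimpleGate {
    private boolean open = true;

    synchronized void close() { open = false; }

    synchronized void open() { open = true; notifyAll(); }

    // Returns true if the gate opened within the timeout, false otherwise.
    synchronized boolean await(long timeout, TimeUnit unit)
            throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (!open) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) return false;
            TimeUnit.NANOSECONDS.timedWait(this, remaining);
        }
        return true;
    }
}
```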
Questions:
1) Vladimir, could the JGRP-855 issue result in block() getting called
with no subsequent call to unblock(), either on the flush coordinator or
on one of the other nodes? If yes, your JGRP-855 fix will probably fix
this as well.
2) Looking at RPCManagerImpl.start(), it does a connect + state transfer
in a try/catch where any failure should result in a CacheException being
thrown from start(). That CacheException should have prevented
deployment of the EJB; i.e. the call shown in the stack trace above
shouldn't have happened. The only way I can see it happening is if the
node that threw the above exception wasn't the flush coordinator; i.e.
its cache started fine, but a problem on another node led to its block()
being called with no matching unblock(). That's a big issue too, as it
means a failure on one node can take down the entire cluster by leaving
everyone's flushBlockGate closed.
[1] https://jira.jboss.org/jira/browse/EJBTHREE-1580
[2] https://jira.jboss.org/jira/browse/JGRP-855
[3] http://www.jboss.com/index.html?module=bb&op=viewtopic&t=145138
--
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat
brian.stansberry(a)redhat.com