On 10/10/2012 06:47 AM, Dan Berindei wrote:
<smarlow@redhat.com> wrote:
I'm trying to understand more about whether it makes sense for a
DefaultConsistentHash to be created with a non-local owner specified in
the DefaultConsistentHash constructor's "segmentOwners" parameter.
It definitely makes sense for such a DefaultConsistentHash to exist
while the cache is starting. But by the time the cache has started (i.e.
getCache() has returned), it should have been replaced with a
DefaultConsistentHash that contains the local node as well.
During some AS7 cluster testing that I'm running on my machine, I'm
seeing the test stall because we loop endlessly in
KeyAffinityServiceImpl.getKeyForAddress(). We loop because
KeyAffinityServiceImpl.generateKeys() doesn't add any keys.
We don't generate any keys because
DefaultConsistentHash.locatePrimaryOwnerForSegment() returns the address
"node-1/web", which never matches the local node's filter
(KeyAffinityServiceImpl.interestedInAddress() only accepts keys whose
primary owner is the local address, "node-0/web").
http://pastie.org/5027574 shows the call stack for the constructor of
the same DefaultConsistentHash instance that is used above. Judging from
the call stack, the DefaultConsistentHash instance may have been
serialized on the other node and sent over (which would explain why its
owner is "node-1/web"), but I'm still not sure why/how it comes into
play with the local KeyAffinityServiceImpl.generateKeys().
My guess is you're able to access the cache before it has finished
starting, and the KeyAffinityService doesn't know how to deal with a
cache that doesn't have any local state yet. Again, this should not
happen - getCache() should not return that soon - but it could be that
it does happen when multiple threads try to start the same cache in
parallel. Can you post logs with TRACE enabled for org.infinispan and/or
a link to your test code?

I instrumented the DefaultConsistentHash constructor to call
Thread.dumpStack() only when the owner is "node-1/web", so I could track
the origin of the wrong DefaultConsistentHash instance being used.
I also currently have INFO-level logging in the DefaultConsistentHash
constructor that always shows:
"
DefaultConsistentHash ctor this=DefaultConsistentHash{numSegments=1, numOwners=2, members=[node-1/web, node-0/web], segmentOwners={0: 0 1}}, identityHashCode=108706475, segmentOwners[0 of 1] = [node-1/web, node-0/web]
DefaultConsistentHash ctor numSegments=1, numOwners=2
DefaultConsistentHash ctor this.segmentOwners[0][0] = node-1/web
"
Since this run involves multiple tests
(org.jboss.as.test.clustering.cluster.singleton.SingletonTestCase,
org.jboss.as.test.clustering.cluster.web.ReplicationWebFailoverTestCase,
org.jboss.as.test.clustering.cluster.web.GranularWebFailoverTestCase,
org.jboss.as.test.clustering.cluster.web.passivation.SessionBasedSessionPassivationTestCase,
org.jboss.as.test.clustering.cluster.web.passivation.AttributeBasedSessionPassivationTestCase,
org.jboss.as.test.clustering.cluster.web.DistributionWebFailoverTestCase),
it's not surprising that we reach the DefaultConsistentHash constructor
12 times. The segment owners for the 12 constructed instances are:
1. [node-0/web]
2. [node-0/web]
3. [node-0/web, node-1/web]
4. [node-0/web, node-1/web]
5. [node-1/web]
6. [node-1/web]
7. [node-1/web]
8. [node-1/web, node-0/web]
9. [node-1/web, node-0/web]
10. [node-1/web, node-0/web] (we use this one while stuck in the loop)
11. [node-0/web]
12. [node-0/web]
We keep using the #10 DefaultConsistentHash instance for anywhere from
several minutes to an hour (as long as I let the test run) while
KeyAffinityServiceImpl.getKeyForAddress() spins in its loop.
Could there be a problem with the ordering of the segment owners? Or is
it more likely a timing problem, in that we never switch over to
instance #11/#12?

Sure, I can enable TRACE for Infinispan and attach the logs to the
ISPN-2376 jira. I'll also add links to the test code there (as comments).
Also, KeyAffinityServiceImpl.generateKeys() contains:
"
// if we had too many misses, just release the lock and try again
if (missCount < maxMisses) {
"
I tried changing the above to a ">=" check, and also tried removing the check entirely (just calling keyProducerStartLatch.close()); neither had a direct impact on the problem.
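For illustration, here is a toy model of why flipping that comparison cannot help. This is hypothetical code, not Infinispan's actual generateKeys() logic; the class and method names are made up. The point is that the miss-count guard only decides when to give up on the current batch and retry - if the consistent hash never maps a key to the local node, every pass is all misses, so the producer cycles forever regardless of how the guard is written:

```java
// Toy model (NOT Infinispan code): a producer pass tries maxMisses keys;
// if the local node never owns a key, no pass ever adds one, and the outer
// retry loop never terminates on its own (bounded here for demonstration).
public class MissGateSketch {

    // Returns true if a key is added within `rounds` producer passes.
    static boolean producerAddsKey(boolean localOwnerEverMatches,
                                   int maxMisses, int rounds) {
        for (int r = 0; r < rounds; r++) {
            int missCount = 0;
            for (int i = 0; i < maxMisses; i++) {
                if (localOwnerEverMatches) {
                    return true; // a locally-owned key was found
                }
                missCount++;
            }
            // Here the real code checks missCount against maxMisses before
            // releasing the lock and retrying. Whether that guard reads
            // (missCount < maxMisses) or (missCount >= maxMisses), nothing
            // was added this pass, so the outer loop simply runs again.
        }
        return false;
    }

    public static void main(String[] args) {
        // Stale CH: the local node never owns a key -> no key, every round.
        System.out.println("stale CH: keyAdded="
                + producerAddsKey(false, 100, 5));
        // Corrected CH: a key is found on the first pass.
        System.out.println("fixed CH: keyAdded="
                + producerAddsKey(true, 100, 5));
    }
}
```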