[infinispan-dev] Question about ISPN-2376 "KeyAffinityServiceImpl.getKeyForAddress() seems to loop forever when DefaultConsistentHash is created for the non-local node owner"

Scott Marlow smarlow at redhat.com
Wed Oct 10 09:47:43 EDT 2012


On 10/10/2012 06:47 AM, Dan Berindei wrote:
> Hi Scott
>
> On Wed, Oct 10, 2012 at 6:20 AM, Scott Marlow <smarlow at redhat.com
> <mailto:smarlow at redhat.com>> wrote:
>
>     I'm trying to understand more about whether it makes sense for a
>     DefaultConsistentHash to be created with a non-local owner specified in
>     the DefaultConsistentHash constructor "segmentOwners" parameter.
>
>
> It definitely makes sense for such a DefaultConsistentHash to exist
> while the cache is starting. But by the time the cache has started (i.e.
> getCache() has returned), it should have been replaced with a
> DefaultConsistentHash that contains the local node as well.
>
>     During some AS7 cluster testing that I'm running on my machine, I'm
>     seeing the test stall because we loop endlessly in
>     KeyAffinityServiceImpl.getKeyForAddress().  We loop because
>     KeyAffinityServiceImpl.generateKeys() doesn't add any keys.
>
>     We don't generate any keys because
>     DefaultConsistentHash.locatePrimaryOwnerForSegment() returns address
>     "node-1/web", which never matches the local node's filter
>     (KeyAffinityServiceImpl.interestedInAddress() only filters for local
>     owners via "node-0/web").
>
>     http://pastie.org/5027574 shows the call stack for the
>     DefaultConsistentHash constructor that is the same instance that is used
>     above.  If you look at the call stack, it looks like the
>     DefaultConsistentHash instance may have been serialized on the other node
>     and sent over (which would explain why its owner is "node-1/web", but I'm
>     still not sure why/how it comes into play with the local
>     KeyAffinityServiceImpl.generateKeys()).
>
>
> My guess is you're able to access the cache before it has finished
> starting, and the KeyAffinityService doesn't know how to deal with a
> cache that doesn't have any local state yet. Again, this should not

I instrumented the DefaultConsistentHash constructor to call 
Thread.dumpStack() only if the owner is "node-1/web" (so I could track 
the origin of the wrong DefaultConsistentHash instance being used).

Currently, I also have INFO level logging in the DefaultConsistentHash 
ctor that always shows:

"
DefaultConsistentHash ctor this=DefaultConsistentHash{numSegments=1, 
numOwners=2, members=[node-1/web, node-0/web], segmentOwners={0: 0 
1}system identityHashCode=108706475,show segmentOwners[0 of 1] = 
[node-1/web, node-0/web]

DefaultConsistentHash ctor numSegments=1, numOwners=2

DefaultConsistentHash ctor this.segmentOwners[0][0] = node-1/web
"

Since this testing involves multiple tests 
(org.jboss.as.test.clustering.cluster.singleton.SingletonTestCase,
org.jboss.as.test.clustering.cluster.web.ReplicationWebFailoverTestCase,
org.jboss.as.test.clustering.cluster.web.GranularWebFailoverTestCase,
org.jboss.as.test.clustering.cluster.web.passivation.SessionBasedSessionPassivationTestCase,
org.jboss.as.test.clustering.cluster.web.passivation.AttributeBasedSessionPassivationTestCase,
org.jboss.as.test.clustering.cluster.web.DistributionWebFailoverTestCase), 
it's not surprising that we reach the DefaultConsistentHash 
constructor 12 times.  The segment owners passed to the 12 constructor 
calls are the following:

1.  [node-0/web]
2.  [node-0/web]
3.  [node-0/web, node-1/web]
4.  [node-0/web, node-1/web]
5.  [node-1/web]
6.  [node-1/web]
7.  [node-1/web]
8.  [node-1/web, node-0/web]
9.  [node-1/web, node-0/web]
10. [node-1/web, node-0/web] (we use this one when stuck in a loop)
11. [node-0/web]
12. [node-0/web]

We keep using DefaultConsistentHash instance #10 from above for several 
minutes to an hour (if I let the test run that long while 
KeyAffinityServiceImpl.getKeyForAddress() continues to loop).

Could there be a problem with the ordering of the segment owners, or is 
it more likely a timing problem, in that we never switch over to using 
#11/#12?
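
To make the loop concrete, here is a minimal standalone sketch of my 
understanding (this is not Infinispan code; locatePrimaryOwner() is a 
hypothetical stand-in for what CH instance #10 answers for every key):

"
import java.util.ArrayList;
import java.util.List;

public class AffinityLoopSketch {

    // stand-in for CH #10: a single segment whose primary owner is node-1/web
    static String locatePrimaryOwner(Object key) {
        return "node-1/web";
    }

    public static void main(String[] args) {
        String localAddress = "node-0/web";   // the only address this node is interested in
        List<Object> generated = new ArrayList<>();

        for (int i = 0; i < 1000; i++) {
            Object candidate = "key-" + i;
            // mirrors interestedInAddress(): keep the key only if it maps to the local node
            if (locatePrimaryOwner(candidate).equals(localAddress)) {
                generated.add(candidate);
            }
        }

        // stays 0, so getKeyForAddress() keeps waiting for a key that never arrives
        System.out.println("keys generated for " + localAddress + ": " + generated.size());
    }
}
"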

> happen - getCache() should not return that soon - but it could be that
> it does happen when multiple threads try to start the same cache in
> parallel. Can you post logs with TRACE enabled for org.infinispan and/or
> a link to your test code?
>

Sure, I can enable TRACE for Infinispan and attach the logs to the 
ISPN-2376 jira.  I'll add links to the test code there as well (as 
comments).

Also, KeyAffinityServiceImpl.generateKeys() contains:

"
// if we had too many misses, just release the lock and try again
if (missCount < maxMisses) {
"

I tried changing the above to a ">=" check, and I also tried removing the 
check entirely (just calling keyProducerStartLatch.close()); neither had a 
direct impact on the current problem.
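
My (possibly wrong) mental model of why neither change helps, as a 
standalone sketch with made-up names rather than the real producer/consumer 
code: the missCount/maxMisses check only controls when the key producer 
backs off, and either way no key owned by the local node is ever queued, so 
the consumer side keeps polling:

"
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class MissCountSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> localKeys = new LinkedBlockingQueue<>();
        int maxMisses = 3;
        int missCount = 0;

        // producer pass: every candidate maps to node-1/web, so nothing is queued
        while (missCount < maxMisses) {           // flipping this to ">=" only skips the pass
            boolean primaryOwnerIsLocal = false;  // CH #10 always answers node-1/web
            if (primaryOwnerIsLocal) {
                localKeys.offer("some-key");
            } else {
                missCount++;
            }
        }

        // consumer pass: stands in for getKeyForAddress(), which never gets a key
        String key = localKeys.poll(1, TimeUnit.SECONDS);
        System.out.println("key obtained: " + key);   // prints null either way
    }
}
"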


> Cheers
> Dan


