<br><div class="gmail_quote">On Wed, Oct 10, 2012 at 4:47 PM, Scott Marlow <span dir="ltr">&lt;<a href="mailto:smarlow@redhat.com" target="_blank">smarlow@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">On 10/10/2012 06:47 AM, Dan Berindei wrote:<br>

</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">

Hi Scott<br>

<br>

On Wed, Oct 10, 2012 at 6:20 AM, Scott Marlow &lt;<a href="mailto:smarlow@redhat.com" target="_blank">smarlow@redhat.com</a><br></div><div><div class="h5">

&lt;mailto:<a href="mailto:smarlow@redhat.com" target="_blank">smarlow@redhat.com</a>&gt;&gt; wrote:<br>

<br>

    I&#39;m trying to understand more about whether it makes sense for a<br>

    DefaultConsistentHash to be created with a non-local owner specified in<br>

    the DefaultConsistentHash constructor &quot;segmentOwners&quot; parameter.<br>

<br>

<br>

It definitely makes sense for such a DefaultConsistentHash to exist<br>

while the cache is starting. But by the time the cache has started (i.e.<br>

getCache() has returned), it should have been replaced with a<br>

DefaultConsistentHash that contains the local node as well.<br>

<br>

    During some AS7 cluster testing that I&#39;m running on my machine, I&#39;m<br>

    seeing the test stall because we loop endlessly in<br>

    KeyAffinityServiceImpl.getKeyForAddress().  We loop because<br>

    KeyAffinityServiceImpl.generateKeys() doesn&#39;t add any keys.<br>

<br>

    We don&#39;t generate any keys because<br>

    DefaultConsistentHash.locatePrimaryOwnerForSegment() returns address<br>

    &quot;node-1/web&quot;, which never matches the local node&#39;s filter<br>

    (KeyAffinityServiceImpl.interestedInAddress() only filters for local<br>

    owners via &quot;node-0/web&quot;).<br>

<br>

    <a href="http://pastie.org/5027574" target="_blank">http://pastie.org/5027574</a> shows the call stack for the<br>

    DefaultConsistentHash constructor that is the same instance that is used<br>

    above.  If you look at the call stack, it looks like the<br>

    DefaultConsistentHash instance may have been serialized on the other node<br>

    and sent over (which would explain why its owner is &quot;node-1/web&quot;, but I&#39;m<br>

    still not sure why/how it comes into play with the local<br>

    KeyAffinityServiceImpl.generateKeys()).<br>

<br>

<br>

My guess is you&#39;re able to access the cache before it has finished<br>

starting, and the KeyAffinityService doesn&#39;t know how to deal with a<br>

cache that doesn&#39;t have any local state yet. Again, this should not<br>

</div></div></blockquote>

<br>

I instrumented the DefaultConsistentHash constructor to call Thread.dumpStack() only if the owner is &quot;node-1/web&quot; (so I could track the origin of the wrong DefaultConsistentHash instance being used).<br>

<br>

Currently, I also have INFO level logging in the DefaultConsistentHash ctor that always shows:<br>

<br>

&quot;<br>

DefaultConsistentHash ctor this=DefaultConsistentHash{numSegments=1, numOwners=2, members=[node-1/web, node-0/web], segmentOwners={0: 0 1}system identityHashCode=108706475,show segmentOwners[0 of 1] = [node-1/web, node-0/web]<br>


<br>

DefaultConsistentHash ctor numSegments=1, numOwners=2<br>

<br>

DefaultConsistentHash ctor this.segmentOwners[0][0] = node-1/web<br>

&quot;<br></blockquote><div><br>I think I see the problem now... I missed it earlier, but you have configured numSegments = 1, which means there will only be one primary owner for all the keys in the cache. Since node-1 is still alive, it will remain the primary owner for the single segment, and node-0 will never become primary owner for any key. (It will be a backup owner, but KeyAffinityService only looks for primary owners.)<br>

<br>We probably need to add a check to the KeyAffinityServiceImpl constructor and abort if there are not enough segments for each node to be a primary owner (i.e. numSegments &lt; numNodes). In the meantime, I think you can just increase numSegments in your configuration so that it&#39;s greater than the number of nodes and it should work.<br>
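<br>Roughly, the check could look like the sketch below. This is only an illustration: the class and method names are made up and are not part of the Infinispan API, and the value 60 is just an example of a numSegments setting larger than the node count.<br>

```java
// Hypothetical helper, not Infinispan code: a fail-fast guard for the
// "fewer segments than nodes" condition described above.
final class SegmentGuardSketch {

    // Throws if at least one node can never be the primary owner of a segment.
    static void checkEnoughSegments(int numSegments, int numNodes) {
        if (numNodes > numSegments) {
            throw new IllegalStateException(
                "numSegments=" + numSegments + " is smaller than the number of nodes ("
                + numNodes + "); some nodes can never be a primary owner");
        }
    }

    public static void main(String[] args) {
        // Example value only: more segments than nodes, so the guard passes.
        checkEnoughSegments(60, 2);

        // The configuration from this thread: one segment, two nodes.
        try {
            checkEnoughSegments(1, 2);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing at construction time like this would turn the endless loop into an immediate, readable error.<br>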

<br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Since this testing involves multiple tests (org.jboss.as.test.clustering.cluster.singleton.SingletonTestCase, org.jboss.as.test.clustering.cluster.web.ReplicationWebFailoverTestCase, org.jboss.as.test.clustering.cluster.web.GranularWebFailoverTestCase, org.jboss.as.test.clustering.cluster.web.passivation.SessionBasedSessionPassivationTestCase, org.jboss.as.test.clustering.cluster.web.passivation.AttributeBasedSessionPassivationTestCase, org.jboss.as.test.clustering.cluster.web.DistributionWebFailoverTestCase), it&#39;s not surprising to see that we reach the DefaultConsistentHash constructor 12 times.  The segment owners for the 12 constructor calls are as follows:<br>


<br>

1.  [node-0/web]<br>

2.  [node-0/web]<br>

3.  [node-0/web, node-1/web]<br>

4.  [node-0/web, node-1/web]<br>

5.  [node-1/web]<br>

6.  [node-1/web]<br>

7.  [node-1/web]<br>

8.  [node-1/web, node-0/web]<br>

9.  [node-1/web, node-0/web]<br>

10. [node-1/web, node-0/web] (we use this one when stuck in a loop)<br>

11. [node-0/web]<br>

12. [node-0/web]<br>

<br>

The DefaultConsistentHash instance constructed in #10 above stays in use for several minutes to an hour (if I let the test run that long) while KeyAffinityServiceImpl.getKeyForAddress() keeps looping.<br>


<br>

Could there be a problem with the ordering of the segment owners?  Or is the more likely timing problem that we never switch to #11/#12?<div class="im"><br></div></blockquote><div><br>The switch to 11/12 only happens after node-1 is dead. All the nodes in the cluster use the same consistent hash, and as long as you have only one segment for two nodes, there&#39;s always going to be one node that isn&#39;t a primary owner.<br>
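<br>To make that concrete, here is a small stand-alone sketch using the owner list from #10. It assumes the first entry in each segment&#39;s owner list is the primary owner, which matches the segmentOwners[0][0] value in the log output; the class and method names are invented for illustration and are not Infinispan API.<br>

```java
// Hypothetical illustration, not Infinispan code.
final class PrimaryOwnerSketch {

    // Assumption: the first owner listed for a segment is its primary owner.
    static boolean isPrimaryOwnerOfAnySegment(String[][] segmentOwners, String node) {
        for (String[] owners : segmentOwners) {
            if (owners.length == 0) {
                continue; // segment has no owners yet
            }
            if (owners[0].equals(node)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Consistent hash #10 from the list above: one segment, two owners.
        String[][] hash10 = { { "node-1/web", "node-0/web" } };

        System.out.println(isPrimaryOwnerOfAnySegment(hash10, "node-1/web")); // true
        System.out.println(isPrimaryOwnerOfAnySegment(hash10, "node-0/web")); // false
    }
}
```

With a single segment, node-0/web only ever shows up as a backup owner, so a key generator that filters on primary ownership has nothing to hand out for it.<br>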

<br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

happen - getCache() should not return that soon - but it could be that<br>

it does happen when multiple threads try to start the same cache in<br>

parallel. Can you post logs with TRACE enabled for org.infinispan and/or<br>

a link to your test code?<br>

<br>

</blockquote>

<br></div>

Sure, I can enable TRACE for Infinispan and attach the logs to the ISPN-2376 jira.  I&#39;ll add links to the test code there as well (as comments).<br>

<br>

Also, KeyAffinityServiceImpl.generateKeys() contains:<br>

<br>

&quot;<br>

// if we had too many misses, just release the lock and try again<br>

if (missCount &lt; maxMisses) {<br>

&quot;<br>

<br>

I tried changing the above to a &quot;&gt;=&quot; check and also tried removing the check entirely (just calling keyProducerStartLatch.close()); neither had a direct impact on the current problem.<br>

<br></blockquote><div><br>I think it&#39;s very likely you also have a thread stuck in KeyAffinityServiceImpl.getKeyForAddress(), and that thread will re-open the latch - so that line doesn&#39;t really matter. But getKeyForAddress() could throw an exception if it sees there&#39;s no way for the target nodes to ever primary-own a key (i.e. numSegments &lt; numNodes).<br>

<br>Cheers<br>Dan<br><br></div></div>