<div class="gmail_extra"><br><div class="gmail_quote">On Wed, Dec 5, 2012 at 4:20 PM, Sanne Grinovero <span dir="ltr">&lt;<a href="mailto:sanne@infinispan.org" target="_blank">sanne@infinispan.org</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 5 December 2012 14:01, Bela Ban &lt;<a href="mailto:bban@redhat.com">bban@redhat.com</a>&gt; wrote:<br>


&gt;<br>

&gt; On 12/5/12 1:23 PM, Sanne Grinovero wrote:<br>

&gt;&gt; On 5 December 2012 11:02, Galder Zamarreño &lt;<a href="mailto:galder@redhat.com">galder@redhat.com</a>&gt; wrote:<br>

&gt;&gt;&gt; On Dec 4, 2012, at 10:22 AM, Sanne Grinovero &lt;<a href="mailto:sanne@infinispan.org">sanne@infinispan.org</a>&gt; wrote:<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt; On 4 December 2012 09:14, Galder Zamarreño &lt;<a href="mailto:galder@redhat.com">galder@redhat.com</a>&gt; wrote:<br>

&gt;&gt;&gt;&gt;&gt; Hey Dan/Adrian,<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; Re: <a href="https://issues.jboss.org/browse/ISPN-2541" target="_blank">https://issues.jboss.org/browse/ISPN-2541</a><br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; I&#39;m looking at this intermittent failure, and it seems to be caused by the fact that the test does not wait for the cluster to be formed when the new node is started, which can lead to a replication timeout failure from the newly joining node.<br>


&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; The test can easily be fixed by waiting for the cluster to form and then doing the call.<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt; [...]<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt; I don&#39;t think the cache should ever be in an illegal state to be used<br>

&gt;&gt;&gt;&gt; after being started. So Infinispan should not require tests to wait<br>

&gt;&gt;&gt;&gt; for a &quot;cluster to be formed&quot;, I&#39;d rather guarantee that after a cache<br>

&gt;&gt;&gt;&gt; is started it&#39;s usable.<br>

&gt;&gt;&gt; Precisely, which is why I raised the flag instead of going down the easy path.<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt; If this is not possible, then any application would also need to wait<br>

&gt;&gt;&gt;&gt; for that &quot;cluster formed&quot; event, and we should expose an API for that.<br>

&gt;&gt;&gt; The problem is considering when a cluster is formed. How many nodes should you wait for?<br>

&gt;&gt;&gt;<br>

&gt;&gt; Why can&#39;t we rely on JGroups Discovery to know that, as a user I<br>

&gt;&gt; already specified the expected initial group size with<br>

&gt;&gt; num_initial_members<br>

&gt;&gt; Don&#39;t want to repeat that configuration ;-)<br>

&gt;<br>

&gt;<br>

&gt; I don&#39;t understand this discussion: when a new node joins, it&#39;ll return<br>

&gt; from JChannel.connect() once it has received a JOIN response from the<br>

&gt; coordinator, with the current view... or are you guys talking about<br>

&gt; Infinispan&#39;s &#39;service views&#39; ?<br>

<br>

</div></div>+1<br>

<br>

That&#39;s why I&#39;m confused too, and not understanding how it is possible<br>

that a Cache is returned to the application - which doesn&#39;t have a<br>

clue about number of expected nodes - in a state for which the<br>

&quot;cluster is not formed yet&quot;. That should never happen!?<br>

<br></blockquote><div><br>It&#39;s simple: getCache() returns once the joiner has received ownership of some segments (in distributed mode) and once it has received all the data it owns (in both distributed and replicated modes). This does not guarantee that the other nodes see the joiner as a full member by the time getCache() returns.<br>

<br>This doesn&#39;t mean that the cache is not functional. On the contrary, we could return even before the joiner had received the data and the cache would still work. But because some nodes think state transfer is still in progress, the tests do run into state transfer corner cases that aren&#39;t handled properly (they&#39;re getting rarer, but we still have them).<br>

<br> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I never understood why the test framework in Infinispan requires this<br>

to happen in all tests - even in the cases listed by Mircea that the<br>

testsuite is looking for something very specific, I would expect the<br>

wait to be unnecessary. (or more precisely, to have been blocked<br>

already for long enough)<br>

<div class="HOEnZb"><div class="h5"><br></div></div></blockquote><div><br>getCache() only waits long enough for the cache to &quot;work&quot;; it doesn&#39;t wait (and I don&#39;t think it should wait) for all the other nodes to acknowledge the joiner as a full member (i.e. to include it in the &quot;read&quot; consistent hash). Because of this, assertions made on nodes other than the joiner can fail (in addition to the aforementioned corner cases in state transfer).<br>

<br>It&#39;s also possible (and it was quite likely with older JGroups versions) that a joiner would actually form a new cluster by itself instead of joining the existing nodes in a single cluster. When that happens, getCache() definitely returns without the cluster being formed, and we have to wait for the separate clusters to find each other and merge before running our test.<br>
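<br>To make the kind of wait I mean concrete, here is a minimal sketch in plain Java. It is not the actual test framework code: NodeView, sameFullView and waitForCluster are hypothetical names made up for illustration, and a real test would read the membership from each cache manager instead. The idea is only that the test polls until every node reports the same membership of the expected size, which covers both the "other nodes have not seen the joiner yet" case and the "joiner formed its own cluster and has to merge" case.<br>
<pre>
```java
import java.util.Arrays;

final class ClusterFormedWait {
    /** Hypothetical stand-in for one node's view of the cluster; not an Infinispan or JGroups type. */
    interface NodeView {
        String[] members();
    }

    /** Returns a sorted copy so that member order does not matter when comparing views. */
    static String[] sorted(String[] members) {
        String[] copy = members.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** True when every node reports the same membership and that membership has the expected size. */
    static boolean sameFullView(int expectedSize, NodeView... nodes) {
        if (nodes.length == 0) {
            return false;
        }
        String[] first = sorted(nodes[0].members());
        if (first.length != expectedSize) {
            return false;
        }
        for (NodeView node : nodes) {
            if (!Arrays.equals(first, sorted(node.members()))) {
                return false;
            }
        }
        return true;
    }

    /** Polls until the views converge or the timeout elapses; returns whether they converged. */
    static boolean waitForCluster(long timeoutMillis, int expectedSize, NodeView... nodes) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!sameFullView(expectedSize, nodes)) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```
</pre>
A test would call something like waitForCluster(timeout, expectedNodes, ...) right after starting the joiner and only then make assertions on the other nodes. Note that the expected size comes from the test itself, which knows how many nodes it started; an application calling getCache() generally does not have that information.<br>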

<br>Cheers<br>Dan<br><br></div></div></div>