<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Apr 15, 2013 at 1:30 PM, Sanne Grinovero <span dir="ltr"><<a href="mailto:sanne@infinispan.org" target="_blank">sanne@infinispan.org</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I've attached the logs the the JIRA.<br>
<br>
Some replies inline:<br>
<div><div class="h5"><br>
On 15 April 2013 11:04, Dan Berindei <<a href="mailto:dan.berindei@gmail.com">dan.berindei@gmail.com</a>> wrote:<br>
><br>
><br>
><br>
> On Sat, Apr 13, 2013 at 2:42 PM, Sanne Grinovero <<a href="mailto:sanne@infinispan.org">sanne@infinispan.org</a>><br>
> wrote:<br>
>><br>
>> On 13 April 2013 11:20, Bela Ban <<a href="mailto:bban@redhat.com">bban@redhat.com</a>> wrote:<br>
>> ><br>
>> ><br>
>> > On 4/13/13 2:02 AM, Sanne Grinovero wrote:<br>
>> ><br>
>> >> @All, the performance problem seemed to be caused by a problem in<br>
>> >> JGroups, which I've logged here:<br>
>> >> <a href="https://issues.jboss.org/browse/JGRP-1617" target="_blank">https://issues.jboss.org/browse/JGRP-1617</a><br>
>> ><br>
>> ><br>
>> > Almost no information attached to the case :-( If it wasn't you, Sanne,<br>
>> > I'd outright reject the case ...<br>
>><br>
>> I wouldn't blame you, and am sorry for the lack of details: as I said,<br>
>> it was very late; still, I preferred to share the observations we had<br>
>> made so far.<br>
>><br>
>> From all the experiments we made - and some good logs I'll clean up<br>
>> for sharing - it's clear that the thread is not woken up even though<br>
>> the ACK was already received.<br>
>> And of course I wouldn't expect this to fail in a simple test as it<br>
>> wouldn't have escaped you ;-) or at least you would have had earlier<br>
>> reports.<br>
>><br>
>> There are lots of complex moving parts in this scenario: a Muxed<br>
>> JGroups Channel, the Application Server responsible for initializing<br>
>> the stack, and some added magic from CapeDwarf itself: for one, it's<br>
>> not clear to me exactly what configuration is being used.<br>
>><br>
><br>
> Does CD also change the JGroups configuration? I thought it only tweaks the<br>
> Infinispan cache configuration on deployment, and the JGroups channel is<br>
> already started by the time the CD application is deployed.<br>
<br>
</div></div>CD uses a custom AS build and a custom AS configuration, so anything<br>
could be different.<br>
On top of that, some things are reconfigured programmatically by it.<br>
<div class="im"><br></div></blockquote><div><br></div><div>Ales already cleared this out, CD doesn't change the JGroups config at all.<br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">
<br>
>> Without a testcase we might not be 100% sure but it seems likely to be<br>
>> an unexpected behaviour in JGroups, at least under some very specific<br>
>> setup.<br>
>><br>
>><br>
>> I'm glad to help track down more details of what could trigger<br>
>> this, but I'm not too eager to write a full unit test for it, as it<br>
>> involves a lot of other components; even after mocking my own<br>
>> components out I could still reproduce it, so it's not Hibernate<br>
>> Search, and I'll need the help of the field experts.<br>
>><br>
>> Also, I suspect a test would need to depend on many more components:<br>
>> does JGroups have an easy way to manage dependencies nowadays?<br>
>><br>
>> some more inline:<br>
>><br>
>> ><br>
>> > The MessageDispatcher will *not* wait until the timeout kicks in, it'll<br>
>> > return as soon as it has acks from all members of the target set. This<br>
>> > works and is covered with a bunch of unit tests, so a regression would<br>
>> > have been caught immediately.<br>
>><br>
>> I don't doubt the "vanilla scenario", but this is what happens in the<br>
>> more complex case of the CapeDwarf setup.<br>
>><br>
><br>
> My first guess would be that the MuxRpcDispatcher on the second node hasn't<br>
> started yet by the time you call castMessage on the first node. It could be<br>
> that your workaround just delayed the message a little bit, until the<br>
> MuxRpcDispatcher on the other node actually started (because the JChannel is<br>
> already started on both nodes, but as long as the MuxRpcDispatcher isn't<br>
> started on the 2nd node it won't send any responses back).<br>
<br>
</div>Before the point at which Search uses the dispatcher, many more<br>
operations happened successfully and with reasonable timing:<br>
in particular, some transactions on Infinispan stored entries quickly<br>
and without trouble.<br>
<br>
Besides, if such a race condition were possible, I would consider<br>
it a critical bug.<br>
<div class="im"><br></div></blockquote><div><br></div>I looked at the muxer code and I think they actually take care of this already: MuxUpHandler returns a NoMuxHandler response when it can't find an appropriate MuxedRpcDispatcher, and MessageDispatcher counts that response against the number of expected responses. So it must be something else...<br>
<br><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">
<br>
>> > I attached a test program to JGRP-1617 which shows that this feature<br>
>> > works correctly.<br>
>> ><br>
>> > Of course, if you lose an ack (e.g. due to a maxed out incoming / OOB<br>
>> > thread pool), the unicast protocol will have to retransmit the ack until<br>
>> > it has been received. Depending on the unicast protocol you use, this<br>
>> > will be immediate (UNICAST, UNICAST3), or based on a stability interval<br>
>> > (UNICAST2).<br>
>><br>
>> Right, it's totally possible this is a stack configuration problem in<br>
>> the AS.<br>
>> I wouldn't be the best person to ask about that, though: I don't even<br>
>> understand the configuration format.<br>
>><br>
><br>
> You can get the actual JGroups configuration with<br>
> channel.getProtocolStack().printProtocolSpecAsXml(), but I wouldn't expect<br>
> you to find any surprises there: they should use pretty much the JGroups<br>
> defaults.<br>
<br>
</div>Nice tip. I'll add this as a logging option.<br>
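<div><br>Something like this minimal, untested sketch (assuming the backend already holds a reference to the JChannel and a logger, and using the method name you mention above):<br></div>
<pre>
// Dump the effective JGroups stack once, at DEBUG level, when the backend starts.
if (log.isDebugEnabled()) {
    log.debug("Effective JGroups configuration:\n"
            + channel.getProtocolStack().printProtocolSpecAsXml());
}
</pre>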
<div class="im"><br>
><br>
> By default STABLE.desired_avg_gossip is 20s and STABLE.stability_delay is<br>
> 6s, so even if the message was lost it should take < 30s for the message to<br>
> be resent.<br>
<br>
</div>The delay is actually ~10 seconds per RPC, so still <30s.<br>
The reason the overall test takes 60 seconds is that there are 6<br>
operations being performed.<br>
<div><div class="h5"><br></div></div></blockquote><div><br></div><div>Ok, this means my hunch definitely wasn't true, that would have explained only the first request timing out.<br><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div><div class="h5">
<br>
>> >> For the record, the first operation was indeed triggering some lazy<br>
>> >> initialization of indexes, which in turn would trigger a Lucene<br>
>> >> Directory being started, triggering 3 Cache starts which in turn would<br>
>> >> trigger 6 state transfer processes: so indeed the first operation<br>
>> >> would not be exactly "cheap" performance wise, still this would<br>
>> >> complete in about 120 milliseconds.<br>
>> ><br>
>> > This sounds very low for the work you describe above. I don't think 6<br>
>> > state transfers can be completed in 120ms, unless they're async (but<br>
>> > then that means they're not done when you return). Also, cache starts<br>
>> > (wrt JGroups) will definitely take more than a few seconds if you're the<br>
>> > first cluster node...<br>
>><br>
>> It's a unit test: the caches are initially empty and networking is<br>
>> loopback;<br>
>> on the second round some ~6 elements are in the cache, none larger<br>
>> than ~10-character strings.<br>
>> Should be reasonable?<br>
>><br>
><br>
> Yes, I think it's reasonable, if the JChannel was already started before the<br>
> CD application was deployed. Starting the first JChannel would take at least<br>
> 3s, which is the default PING.timeout.<br>
><br>
><br>
>><br>
>> >> Not being sure about the options of depending to a newer JGroups<br>
>> >> release or the complexity of a fix, I'll implement a workaround in<br>
>> >> HSearch in the scope of HSEARCH-1296.<br>
>> ><br>
>> ><br>
>> > If you add more information to JGRP-1617, I'll take a look. This would<br>
>> > be a critical bug in JGroups *if* you can prove that the<br>
>> > MessageDispatcher always runs into the timeout (I don't think you can<br>
>> > though !).<br>
>><br>
>> Considering the easy workaround, and that this definitely needs<br>
>> something special in the configuration, I wouldn't consider it too<br>
>> critical. As far as we know now, it's entirely possible the<br>
>> configuration being used is illegal. But this is exactly where I need<br>
>> your help ;-)<br>
>><br>
><br>
> I'm not sure that your workaround is 100% effective: even if it doesn't<br>
> happen in this test, it's always possible to have the app deployed on some<br>
> of the nodes in the cluster, but not all.<br>
<br>
</div></div>That's right. I would prefer not to apply anything like that, but for<br>
the sake of the experiment it was useful to isolate the problem.<br>
<br></blockquote><div><br></div><div>Looking at your workaround, I think you actually set the response mode to GET_NONE (because that's the default value in RequestOptions), so you're back to sending an asynchronous request.<br>
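<div><br>For comparison, an untested sketch of the difference (JGroups 3.x API; 'dispatcher', 'targets', 'message' and the 10s timeout are just placeholders for whatever the Search backend actually uses, and the GET_NONE default is as noted above):<br></div>
<pre>
// Fire-and-forget: a default-constructed RequestOptions ends up as GET_NONE,
// so castMessage() does not wait for any acks.
RequestOptions asyncOpts = new RequestOptions();

// Truly synchronous: explicitly ask for a response from every target,
// with a 10s timeout, and inspect the RspList that comes back.
RequestOptions syncOpts = new RequestOptions(ResponseMode.GET_ALL, 10000);
RspList rsps = dispatcher.castMessage(targets, message, syncOpts);
</pre>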
</div></div></div></div>