<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div><blockquote type="cite"><div><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">One node might be busy doing GC and stay unresponsive for a whole<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">second or longer, another one might be actually crashed and you didn't<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">know that yet, these are unlikely but possible.<br></blockquote></blockquote><blockquote type="cite">All these are possible but I would rather consider them as exceptional situations, possibly handled by retry logic. We should *not* optimise for these situations IMO.<br></blockquote><blockquote type="cite">Thinking about our last performance results, we have an average of 26k gets per second. Now with numOwners = 2, this means that each node handles 26k *redundant* gets every second: I'm not concerned about the network load, as Bela mentioned in a previous mail the network link should not be the bottleneck, but there's a huge amount of unnecessary activity in OOB threads, which should rather be used for releasing locks or whatever else is needed. 
On top of that, this resource-consuming activity strongly encourages GC pauses, as the effort for a get is practically numOwners times higher than it should be.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><blockquote type="cite">More likely, a rehash is in progress, you could then be asking a node<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">which doesn't yet (or anymore) have the value.<br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">This is a consistency issue and I think we can find a way to handle it some other way.<br></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">All good reasons for which imho it makes sense to send out "a couple"<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">of requests in parallel, but I'd be unlikely to want to send more than 2,<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">and I agree often 1 might be enough.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Maybe it should even optimize for the most common case: send out just<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">one, have a more aggressive timeout and in case of trouble ask for the<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">next node.<br></blockquote></blockquote><blockquote type="cite">+1<br></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">In addition, sending a single request might spare us some Future,<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">await+notify messing in terms of CPU cost of sending the request.<br></blockquote></blockquote><blockquote type="cite">it's the remote OOB thread that's the most costly resource 
imo.<br></blockquote><br>I think I agree on all points, it makes more sense.<br>Just that in a large cluster, let's say<br>1000 nodes, maybe I want 20 owners as a sweet spot for the read/write<br>performance tradeoff, and with such high numbers I guess doing 2-3<br>gets in parallel might make sense, as those "unlikely" events suddenly<br>become almost certain, especially a rehash in progress.<br></div></blockquote><blockquote type="cite"><div>So I'd propose a separate configuration option for the number of parallel get<br>events, and one to define a "try next node" policy. Or this policy<br>should be the whole strategy, and the number of gets one of the options for the<br>default implementation.<br></div></blockquote></div><div>Agreed that having a configurable remote get policy makes sense.&nbsp;</div><div>We already have a JIRA for this [1]; I'll start working on it, as the performance results are haunting me.</div><div>I'd like to have Dan's input on this first as well, as he has worked with remote gets and I still don't know why null results are not considered valid :)</div><div><br></div><div>[1]&nbsp;<a href="https://issues.jboss.org/browse/ISPN-825">https://issues.jboss.org/browse/ISPN-825</a></div></body></html>