[infinispan-dev] DIST.retrieveFromRemoteSource

Mircea Markus mircea.markus at jboss.com
Wed Jan 25 12:43:51 EST 2012


On 25 Jan 2012, at 17:09, Dan Berindei wrote:

> On Wed, Jan 25, 2012 at 4:22 PM, Mircea Markus <mircea.markus at jboss.com> wrote:
>> 
>> One node might be busy doing GC and stay unresponsive for a whole
>> second or longer, another one might be actually crashed and you didn't
>> know that yet, these are unlikely but possible.
>> 
>> All these are possible but I would rather consider them as exceptional
>> situations, possibly handled by retry logic. We should *not* optimise for
>> these situations IMO.
>> 
> 
> As Sanne pointed out, an exceptional situation on a node becomes
> ordinary with 100s or 1000s of nodes.

possible, but still that's not our everyday use case yet. 
For the *default* value I'd rather consider a 4-20 node cluster. The idea is to have numGets configurable, or even dynamic.
> So the default policy should scale the initial number of requests with
> numOwners.
Not sure what you mean by that. 
As you mention, there might be a correlation between the number of nodes to which to send the remote get and the cluster size.
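Such a dynamic policy could be sketched roughly as below. All names here are illustrative, not Infinispan API: the idea is just that the number of parallel remote gets scales with cluster size (one is enough for the 4-20 node default case) and is capped by numOwners.

```java
// Hypothetical policy sketch: scale the number of parallel remote gets
// with cluster size, capped by numOwners. Illustrative only.
public final class RemoteGetPolicy {
    static int parallelGets(int clusterSize, int numOwners) {
        // Small clusters: a single request is usually enough. On very
        // large clusters, "exceptional" per-node events become routine,
        // so issue a couple of requests in parallel.
        int wanted = clusterSize <= 20 ? 1 : (clusterSize <= 200 ? 2 : 3);
        return Math.min(wanted, numOwners);
    }
}
```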

> 
>> 
>> More likely, a rehash is in progress, you could then be asking a node
>> which doesn't yet (or anymore) have the value.
>> 
>> this is a consistency issue and I think we can find a way to handle it some
>> other way.
>> 
> 
> With the current state transfer we always send ClusteredGetCommands to
> the old owners (and only the old owners). If a node didn't receive the
> entire state, it means that state transfer hasn't finished yet and the
> CH will not return it as an owner. But the CH could also return owners
> that are no longer members of the cluster, so we have to check for
> that before picking one owner to send the command to.
> 
> In Sanne's non-blocking state transfer proposal I think a new owner
> may have to ask the old owner for the key value, so it would still
> never return null. But it might be less expensive to ask the old owner
> directly (assuming it's safe from a consistency POV).
> 
>> 
>> All good reasons for which imho it makes sense to send out "a couple"
>> of requests in parallel, but I'd unlikely want to send more than 2,
>> and I agree often 1 might be enough.
>> Maybe it should even optimize for the most common case: send out just
>> one, have a more aggressive timeout and in case of trouble ask for the
>> next node.
>> 
>> +1
>> 
> 
> -1 for aggressive timeouts... you're going to do the same work as you
> do now, except you're going to wait a bit between sending requests. If
> you're really unlucky the first target will return first but you'll
> ignore its response because the timeout already expired.
> 
>> 
>> In addition, sending a single request might spare us some Future,
>> await+notify messing in terms of CPU cost of sending the request.
>> 
>> it's the remote OOB thread that's the most costly resource imo.
>> 
> 
> I don't think the OOB thread is that costly, it doesn't block on
> anything (not even on state transfer!) so the most expensive part is
> reading the key and writing the value. BTW Sanne, we may want to run
> Transactional with a smaller payload size ;)
Yes, besides using the OOB pool unnecessarily, other resources are also consumed. Not sure I agree that OOB thread usage is not costly: this pool is also used for releasing locks, and exhausting it might result in chained performance degradation.
> 
> We could implement our own GroupRequest that sends the requests in
> parallel instead of implementing FutureCollator on top of UnicastRequest,
> and save some of that overhead on the caller.
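The "first response wins" behaviour such a parallel request would give can be sketched with plain futures. This is illustrative only, not the actual JGroups GroupRequest or Infinispan types:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Minimal sketch of parallel unicasts where the first response wins.
// Later responses are discarded for free, because complete() only
// succeeds once on a CompletableFuture.
final class FirstResponse {
    static <T> T firstOf(List<String> targets,
                         Function<String, CompletableFuture<T>> rpc,
                         long timeoutMillis) throws Exception {
        CompletableFuture<T> first = new CompletableFuture<>();
        // Fire all requests up front; whichever target answers first
        // completes the result.
        targets.forEach(t -> rpc.apply(t).thenAccept(first::complete));
        return first.get(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```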
> 
> I think we already have a JIRA to make PutKeyValueCommands return the
> previous value, that would eliminate lots of GetKeyValueCommands and
> it would actually improve the performance of puts - we should probably
> make this a priority.
Not saying that sending requests in parallel doesn't make sense: just questioning whether it makes sense to *always* send them in parallel. 
> 
>> 
>> I think I agree on all points, it makes more sense.
>> Just that in a large cluster, let's say
>> 1000 nodes, maybe I want 20 owners as a sweet spot for read/write
>> performance tradeoff, and with such high numbers I guess doing 2-3
>> gets in parallel might make sense as those "unlikely" events suddenly
>> become almost certain.. especially the rehash in progress.
>> 
>> So I'd propose a separate configuration option for # parallel get
>> events, and one to define a "try next node" policy. Or this policy
>> should be the whole strategy, and the #gets one of the options for the
>> default implementation.
>> 
>> Agreed that having a configurable remote get policy makes sense.
>> We already have a JIRA for this[1], I'll start working on it as the
>> performance results are haunting me.
> 
> I'd rather focus on implementing one remote get policy that works
> instead of making it configurable - even if we make it configurable
> we'll have to focus our optimizations on the default policy.

This *might* make a significant difference in the cluster's performance, so IMO it's worth giving it a try.

> 
> Keep in mind that we also want to introduce eventual consistency - I
> think that's going to eliminate our optimization opportunity here
> because we'll need to get the values from a majority of owners (if not
> all the owners).
I'm sure we can support both approaches if it's worth it :-)
> 
>> I'd like to have Dan's input on this as well first, as he has worked with
>> remote gets and I still don't know why null results are not considered valid
>> :)
> 
> Pre-5.0 during state transfer an owner could return null to mean "I'm
> not sure", so the caller would ignore it unless every target returned
> null.
> That's no longer necessary, but it wasn't broken so I didn't fix it...
Good to know, this should make implementing ISPN-825 quite easy then. I'll give it a spin to check the performance.

