[infinispan-dev] DIST.retrieveFromRemoteSource

Wed Jan 25 07:06:02 EST 2012

On 25 January 2012 11:48, Mircea Markus <mircea.markus at jboss.com> wrote:
>
> On 25 Jan 2012, at 08:51, Dan Berindei wrote:
>
>> Hi Sanne
>>
>> On Wed, Jan 25, 2012 at 1:22 AM, Sanne Grinovero <sanne at infinispan.org> wrote:
>>> Hello,
>>> in the method:
>>> org.infinispan.distribution.DistributionManagerImpl.retrieveFromRemoteSource(Object,
>>> InvocationContext, boolean)
>>>
>>> we have:
>>>
>>>      List<Address> targets = locate(key);
>>>      // if any of the recipients has left the cluster since the
>>> command was issued, just don't wait for its response
>>>      targets.retainAll(rpcManager.getTransport().getMembers());
>>>
>>> But then then we use ResponseMode.WAIT_FOR_VALID_RESPONSE, which means
>>> we're not going to wait for all responses anyway, and I think we might
>>> assume to get a reply by a node which actually is in the cluster.
>>>
>>> So the retainAll method is unneeded and can be removed? I'm wondering,
>>> because it's not safe anyway, actually it seems very unlikely to me
>>> that just between a locate(key) and the retainAll the view is being
>>> changed, so not something we should be relying on anyway.
>>> I'd rather assume that such a get method might be checked and
>>> eventually dropped by the receiver.
>>>
>>
>> The locate method will return a list of owners based on the
>> "committed" cache view, so there is a non-zero probability that one of
>> the owners has already left.
>>
>> If I remember correctly, I added the retainAll call because otherwise
>> ClusteredGetResponseFilter.needMoreResponses() would keep returning
>> true if one of the targets left the cluster. Coupled with the fact
>> that null responses are not considered valid (unless *all* responses
>> are null), this meant that a remote get for a key that doesn't exist
>> would throw a TimeoutException after 15 seconds instead of returning
>> null immediately.
>>
>> We could revisit the decision to make null responses invalid, and then
>> as long as there is still one of the old owners left in the cluster
>> you'll get the null result immediately.
> can't we just directly to a single node for getting a remote value? The main data owner perhaps. If the node is down we can retry to another node. Going to multiple nodes seems like a waist of resources - network in this case.

I agree, we should not ask all replicas for the same information.
Asking only one is the opposite though: I think this should be a
configuration option to ask for any value between (1 and numOwner).
That's because I understand it might be beneficial to ask to more than
one node immediately, but assuming we have many owners it would be
nice to pick only a subset.

A second configuration option should care about which strategy the
subset is selected. In case we ask only one node, I'm not sure if the
first node would be the best option.

Dan, thank you for your explanation, this makes much more sense now.
Indeed I agree as well that we should revisit the interpretation of
"return null", that's not an unlikely case since every put will run a
get first.