[infinispan-dev] The "Triangle" pattern for reducing Put latency

Radim Vansa rvansa at redhat.com
Wed Nov 25 11:12:12 EST 2015


On 11/25/2015 04:43 PM, Dan Berindei wrote:
> On Wed, Nov 25, 2015 at 5:15 PM, Radim Vansa <rvansa at redhat.com> wrote:
>> On 11/25/2015 03:24 PM, Pedro Ruivo wrote:
>>> On 11/25/2015 01:20 PM, Radim Vansa wrote:
>>>> On 11/25/2015 12:07 PM, Sanne Grinovero wrote:
>>>>> On 25 November 2015 at 10:48, Pedro Ruivo <pedro at infinispan.org> wrote:
>>>>>>> An alternative is to wait for all ACKs, but I think this could still
>>>>>>> be optimised in "triangle shape" too by having the Originator only
>>>>>>> wait for the ACKs from the non-primary replicas?
>>>>>>> So backup owners have to send a confirmation message to the
>>>>>>> Originator, while the Primary owner isn't expected to do so.
>>>>>> IMO, we should wait for all ACKs to keep our read design.
>>>> What exactly is our 'read design'?
>>> If we don't wait for all the ACKs, then we have to go to the primary
>>> owner for reads, even if the originator is a Backup owner.
>> I don't think so, but we probably have some miscommunication here. If
>> O = B, we still wait for the reply from B (which is local), which is
>> triggered by receiving the update from P (after applying the change
>> locally). So it goes
>>
>> OB(application thread) [cache.put()]
>>   -(unordered)-> P(worker thread) [applies update]
>>   -(ordered)-> OB(worker thread) [applies update]
>>   -(in-VM)-> OB(application thread) [continues]
> In your example, O still has to receive a message from P with the
> previous value. The previous value may be included in the update sent
> by the primary, or it may be sent in a separate message, but O still
> has to receive the previous value somehow. Including the previous
> value in the backup update command is not necessary in general (except
> for FunctionalMap's commands, maybe?), so I'd rather use a separate
> message.

All right, in case we need the previous value, it really makes sense 
to send it to O directly.
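
To make that concrete, here is a minimal sketch of the primary's side 
under this scheme: apply locally, fire the update to the backups, send 
the previous value straight to O when one is needed, and unlock without 
waiting for any ack. All names here are invented for illustration; this 
is not the actual Infinispan code.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.BiConsumer;

    final class PrimarySketch {
        final Map<String, String> data = new ConcurrentHashMap<>();
        // stand-ins for the real transport: both are fire-and-forget sends
        final BiConsumer<String, String> sendToBackups;   // (key, value): P -> B
        final BiConsumer<Long, String> sendToOriginator;  // (commandId, previous): P -> O

        PrimarySketch(BiConsumer<String, String> toBackups,
                      BiConsumer<Long, String> toOriginator) {
            this.sendToBackups = toBackups;
            this.sendToOriginator = toOriginator;
        }

        void handlePut(long commandId, String key, String value, boolean needsPrevious) {
            // ConcurrentHashMap serializes updates per key here, standing in
            // for the per-entry lock; it is released as soon as put() returns.
            String previous = data.put(key, value);
            sendToBackups.accept(key, value);             // FIFO per sender suffices
            if (needsPrevious) {
                sendToOriginator.accept(commandId, previous); // only when O needs a value
            }
            // note: no waiting for backup acks, the entry is already unlocked
        }
    }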

>
>>>> I think that the source of the optimization is that once the primary
>>>> decides to back up the operation, it can forget about it and unlock
>>>> the entry. So we don't need any ACK from the primary unless it's an
>>>> exception/noop notification (as with conditional ops). If the primary
>>>> waited for an ACK from the backup, we wouldn't save anything.
>>> About the interaction between P -> B, you're right. We don't need to
>>> wait for the ACKs if the messages are sent in FIFO order (and JGroups
>>> guarantees that).
>>>
>>> About the O -> P, IMO, the Originator should wait for the reply from
>>> Backup.
>> I was never claiming otherwise; O always needs to wait for ACKs from the
>> Bs - only then can it successfully report that the value has been
>> written on all owners. What does this have to do with O -> P?
> Right, this is the thing I should have brought up during the
> meeting... if we only wait for the ACK from one B, then P can crash
> after we have confirmed to the application but before all Bs have
> received the update message, and there will be nobody to
> retransmit/retry the command => inconsistency.

We've been mixing the N-owners and 2-owners cases here a bit, so let me 
clarify: anytime I've written that an ack is expected from B, I meant 
from all backups (but not necessarily from the primary).
The case with more backups also shows that when a return value other 
than 'true/false = applied/did not apply the update' is needed, we should 
send the response directly from P, because we don't want to relay it 
through all Bs (or pick one 'special' backup).
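
For illustration, a minimal sketch of the originator-side bookkeeping 
this implies - all names are invented. The operation completes only once 
every backup has acked, and the return value (when one is expected) 
arrives directly from P:

    import java.util.Set;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;

    final class OriginatorAcks {
        final Set<String> pendingBackups = ConcurrentHashMap.newKeySet();
        final CompletableFuture<Void> allBackupsAcked = new CompletableFuture<>();
        final CompletableFuture<String> returnValue = new CompletableFuture<>();

        OriginatorAcks(Set<String> backups, boolean expectsValueFromPrimary) {
            pendingBackups.addAll(backups);
            if (pendingBackups.isEmpty()) {
                allBackupsAcked.complete(null);
            }
            if (!expectsValueFromPrimary) {
                returnValue.complete(null);  // simple success: no P -> O message at all
            }
        }

        void onBackupAck(String backup) {      // B -> O, one per backup owner
            pendingBackups.remove(backup);
            if (pendingBackups.isEmpty()) {
                allBackupsAcked.complete(null);
            }
        }

        void onPrimaryResponse(String value) { // P -> O: previous/custom value or failure
            returnValue.complete(value);
        }

        String await() throws Exception {
            allBackupsAcked.get();    // every backup has applied the update
            return returnValue.get(); // plus the value from P, when one was expected
        }
    }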

>
>>> At least, the Primary would be the only one who needs to return
>>> the previous value (if needed), and it can report whether the operation
>>> succeeded or not.
>> Simple success: no P -> O, B -> O (success)
>> Simple failure/non-modifying operation (as with putIfAbsent/functional
>> call): P -> O (failure/custom value), no B -> O
>> Previous/custom value (as with replace() or a functional call): P -> O
>> (previous/custom value), B -> O (success); an alternative is P -> B
>> (previous/custom value, new value) and B -> O (previous/custom value)
>> Exception on either P or B: send the exception to O
>> Lost/timed-out P -> B: O times out waiting for the ACK from B and
>> throws an exception
>>
> Like I said above, I would prefer it if P would send the previous
> value directly to O (if necessary). Otherwise yeah, I don't see any
> problem with O waiting for replies from P and Bs in parallel.

Agreed.

>
> We've talked several times about removing the replication timeout and
> assuming that a node will always reply in a timely manner to a
> command, unless it's not available. Maybe this time we'll really do it
> :)

That would make sense to me once we have true async calls implemented - 
then, if you want a timeout-able operation, you would just do 
cache.putAsync(...).get(myTimeout). But I wouldn't promote async calls 
as long as they consume a thread from a limited threadpool.
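
For illustration, that pattern would look like this (assuming an 
existing Cache<String, String> cache; the 500 ms timeout is arbitrary):

    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // putAsync() returns a Future, so the caller picks its own timeout;
    // on timeout only the wait is abandoned - the write may still
    // complete on the owners.
    Future<String> pending = cache.putAsync("key", "value");
    try {
        String previous = pending.get(500, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // handle the timeout; the operation's outcome is unknown here
    }
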
>
>>> This way, it would avoid forking the code for each type
>>> of command without any benefit (I'm thinking of sending the reply to
>>> the originator in parallel with the update message to the backups).
>> What forking of the code for each type do you mean? I see that there
>> are two branches depending on whether the command is going to be
>> replicated to B or not.
> I believe Pedro was talking about having P send the previous value
> directly to O, and so having different handling of replies on O based
> on whether we expect a previous value or not. I'm not that worried
> about it, one way to handle the difference would be to use
> ResponseMode.GET_ALL when the previous value is needed, and GET_NONE
> otherwise.

If the first implementation does not support omitting the simple P -> O 
ack, that's fine. But when designing it, please don't block the path to 
this nice optimization.
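
In JGroups terms, Dan's suggestion could look roughly like this 
(ResponseMode and RequestOptions are existing JGroups classes; 
needsPreviousValue and timeoutMillis are just illustrative):

    import org.jgroups.blocks.RequestOptions;
    import org.jgroups.blocks.ResponseMode;

    // Wait for responses only when a value must travel back from P;
    // otherwise fire the command without collecting any response.
    ResponseMode mode = needsPreviousValue ? ResponseMode.GET_ALL
                                           : ResponseMode.GET_NONE;
    RequestOptions options = new RequestOptions(mode, timeoutMillis);
    // ... dispatch the command to the primary with these options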

>
> Anyway, I think instead of jumping into the implementation and fixing
> bugs as they pop up, this time it may be better to build a model and
> validate it first... then we can discuss changing details of the
> model and check them as well. I volunteered to do this with
> Knossos; we'll see how that goes (and when I'll have the time to
> actually work on it...)

No objections :) If you get any interesting results from the model 
checking, I'm all ears.

Radim
>
> Dan
>
>
>> Radim
>>
>>>> The gains are:
>>>> * fewer hops (3 instead of 4 if O != P && O != B)
>>>> * fewer messages (the primary's ACK is transitively implied by the
>>>>   ack from B)
>>>> * shorter lock times (no locking during the P -> B RPC)
>>>>
>>>>>> However, the
>>>>>> Originator needs to wait for the ACK from the Primary because of
>>>>>> conditional operations and the functional API.
>>>>> If the operation is successful, the Primary will have to let the
>>>>> secondaries know so these can reply to the Originator directly: that
>>>>> still saves a hop.
>>> As I said above: "I'm thinking of sending the reply to the originator
>>> in parallel with the update message to the backups"
>>>
>>>>>> In the first case, if the conditional operation fails, the Backups
>>>>>> are not bothered. In the latter case, we may need the return value
>>>>>> from the function.
>>>>> Right, for a failed or rejected operation the secondaries won't even
>>>>> know about it,
>>>>> so the Primary is in charge of letting the Originator know.
>>>>> Essentially you're highlighting that the Originator needs to wait for
>>>>> either the response from secondaries (all of them?)
>>>>> or from the Primary.
>>>>>
>>>>>>> I suspect the tricky part is what happens when the Primary owner rules
>>>>>>> +1 to apply the change, but then the backup owners (all or some of
>>>>>>> them) somehow fail before letting the Originator know. The Originator
>>>>>>> in this case should seek confirmation about its operation's state
>>>>>>> (success?) from the Primary owner; this implies that the Primary owner
>>>>>>> needs to keep track of what it has applied and track failures too, and
>>>>>>> this log needs to be pruned.
>>>> Currently, in case of a lost (timed out) ACK from B to P, we just
>>>> report an exception and don't care about synchronizing P and B - B may
>>>> already have stored the updated value.
>>>> So we don't have to care about a rollback on P if the replication to B
>>>> fails either - we just report that it's broken, sorry.
>>>> A better consolidation API would be nice, though, something like
>>>> cache.getAllVersions().
>>>>
>>>> Radim
>>>>
>>>>
>>>>>>> Sounds pretty nice, or am I missing other difficulties?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sanne
>>
>> --
>> Radim Vansa <rvansa at redhat.com>
>> JBoss Performance Team


-- 
Radim Vansa <rvansa at redhat.com>
JBoss Performance Team


