On 11/25/2015 04:43 PM, Dan Berindei wrote:
On Wed, Nov 25, 2015 at 5:15 PM, Radim Vansa
<rvansa(a)redhat.com> wrote:
> On 11/25/2015 03:24 PM, Pedro Ruivo wrote:
>> On 11/25/2015 01:20 PM, Radim Vansa wrote:
>>> On 11/25/2015 12:07 PM, Sanne Grinovero wrote:
>>>> On 25 November 2015 at 10:48, Pedro Ruivo <pedro(a)infinispan.org> wrote:
>>>>>> An alternative is to wait for all ACKs, but I think this could still
>>>>>> be optimised in "triangle shape" too by having the Originator only
>>>>>> wait for the ACKs from the non-primary replicas?
>>>>>> So backup owners have to send a confirmation message to the
>>>>>> Originator, while the Primary owner isn't expecting to do so.
>>>>> IMO, we should wait for all ACKs to keep our read design.
>>> What exactly is our 'read design'?
>> If we don't wait for all the ACKs, then we have to go to the primary
>> owner for reads, even if the originator is a Backup owner.
> I don't think so, but we probably have some miscommunication. If O = B, we still
> wait for the reply from B (which is local), which is triggered by receiving
> an update from P (after applying the change locally). So it goes:
>
> OB(application thread) [cache.put()] -(unordered)-> P(worker thread)
> [applies update] -(ordered)-> OB(worker thread) [applies update]
> -(in-VM)-> OB(application thread) [continues]
In your example, O still has to receive a message from P with the
previous value. The previous value may be included in the update sent
by the primary, or it may be sent in a separate message, but O still
has to receive the previous value somehow. Including the previous
value in the backup update command is not necessary in general (except
for FunctionalMap's commands, maybe?), so I'd rather use a separate
message.
All right, in the case that we need the previous value it really makes sense
for P to send it to O directly.
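
For concreteness, here is a rough originator-side sketch of the flow above;
Transport, Address and PutCommand are stand-ins for the real plumbing, not
actual Infinispan classes:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hedged sketch: O sends to P unordered, P applies and forwards (ordered)
// to each B, and each B acks O directly after applying the update locally.
class TrianglePutSketch {
    interface Address {}
    interface Transport {
        CompletableFuture<Void> expectAckFrom(Address backup);
        void sendUnordered(Address target, Object command);
    }
    static class PutCommand {
        final Object key, value;
        PutCommand(Object key, Object value) { this.key = key; this.value = value; }
    }

    CompletableFuture<Void> put(Transport transport, Address primary,
                                List<Address> backups, Object key, Object value) {
        // O waits for one ack per backup; P sends no ack on the success path.
        List<CompletableFuture<Void>> acks = backups.stream()
                .map(transport::expectAckFrom)
                .collect(Collectors.toList());
        transport.sendUnordered(primary, new PutCommand(key, value));
        // Done only once the value is confirmed on all owners.
        return CompletableFuture.allOf(acks.toArray(new CompletableFuture[0]));
    }
}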
>>> I think that the source of optimization is that once the primary decides to
>>> backup the operation, it can forget about it and unlock the entry. So,
>>> we don't need any ACK from the primary unless it's an exception/noop
>>> notification (as with conditional ops). If the primary waited for an ACK
>>> from the backup, we wouldn't save anything.
>> About the interaction between P -> B, you're right. We don't need to wait
>> for the ACKs if the messages are sent in FIFO order (and JGroups guarantees that)
>>
>> About the O -> P, IMO, the Originator should wait for the reply from
>> Backup.
> I was never claiming otherwise; O always needs to wait for the ACKs from the Bs -
> only then can it successfully report that the value has been written on all
> owners. What does this have to do with O -> P?
Right, this is the thing I should have brought up during the
meeting... if we only wait for the ACK from one B, then P can crash
after we have confirmed to the application but before all Bs have received
the update message, and there will be nobody to retransmit/retry the
command => inconsistency.
We've been mixing the N-owners and 2-owners case here a bit, so let me
clarify; anytime I've written that an ack is expected from B, I meant
from all backups (but not necessarily from primary).
The case with more backups also shows that when a return value other
than 'true/false=applied/did not apply the update' is needed, we should
send the response directly from P, because we don't want to relay it
through all the Bs (or pick one 'special' B).
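
In code, "an ack from B" would then be a per-command collector over the whole
backup set; a rough sketch with made-up names:

import java.util.Collection;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Completes only after *all* backups have acked; an exception reported by P
// (or a timeout) fails it instead. Address stands in for the member identity.
class BackupAckCollector<Address> {
    private final Set<Address> pending = ConcurrentHashMap.newKeySet();
    private final CompletableFuture<Void> done = new CompletableFuture<>();

    BackupAckCollector(Collection<Address> backups) {
        pending.addAll(backups);
    }

    void onBackupAck(Address backup) {
        if (pending.remove(backup) && pending.isEmpty()) {
            done.complete(null); // written on all owners, safe to report success
        }
    }

    void onPrimaryException(Throwable t) {
        done.completeExceptionally(t); // e.g. a conditional op failed on P
    }

    CompletableFuture<Void> future() {
        return done;
    }
}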
>> At least, the Primary would be the only one who needs to return
>> the previous value (if needed) and it can report whether the operation
>> succeeded or not.
> Simple success: no P -> O, B -> O (success)
> Simple failure/non-modifying operation (as with putIfAbsent/a functional
> call): P -> O (failure/custom value), no B -> O
> Previous/custom value (as with replace() or a functional call): P -> O
> (previous/custom value), B -> O (success); an alternative is P -> B
> (previous/custom value, new value) and B -> O (previous/custom value)
> Exception on either P or B: send the exception to O
> Lost/timed-out P -> B: O times out waiting for the ack from B, throws an exception
>
Like I said above, I would prefer it if P would send the previous
value directly to O (if necessary). Otherwise yeah, I don't see any
problem with O waiting for replies from P and Bs in parallel.
Agreed.
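
To pin down the agreed scheme from P's side, a sketch (the types and helper
names here are made up):

// Who replies to O for each outcome in the matrix above.
abstract class PrimaryResponseSketch {
    interface WriteCommand { boolean needsReturnValue(); }
    // Stand-ins for the real messaging plumbing:
    abstract void replyToOriginator(WriteCommand cmd, Object result);
    abstract void forwardToBackups(WriteCommand cmd);

    void onWriteAtPrimary(WriteCommand cmd, Object prevValue, boolean applied) {
        if (!applied) {
            // Simple failure / non-modifying op: P -> O, the Bs never hear of it.
            replyToOriginator(cmd, prevValue);
            return;
        }
        // Success: P -> B (ordered); each B applies and then acks O directly.
        forwardToBackups(cmd);
        if (cmd.needsReturnValue()) {
            // replace()/functional call: P -> O carries the previous/custom
            // value, in parallel with the backup update; Bs still ack success.
            replyToOriginator(cmd, prevValue);
        }
        // Simple success: no P -> O at all; O completes on backup acks alone.
    }
}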
We've talked several times about removing the replication timeout and
assuming that a node will always reply in a timely manner to a
command, unless it's not available. Maybe this time we'll really do it
:)
That would make sense to me once we have true async calls implemented -
then, if you want a timeout-able operation, you would just do
cache.putAsync(...).get(myTimeout). But I wouldn't promote async calls while
they consume a thread from a limited threadpool.
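
I.e. something along these lines, with an example 500 ms timeout:

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Per-call timeout built on the async API instead of a replication timeout.
try {
    cache.putAsync(key, value).get(500, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
    // Only our wait timed out - the write itself may still complete
    // on the owners.
} catch (InterruptedException | ExecutionException e) {
    throw new RuntimeException(e);
}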
>> This way, it would avoid forking the code for each type
>> of command without any benefit (I'm thinking of sending the reply to the
>> originator in parallel with the update message to the backups).
> What forking of the code for each type do you mean? I see that there are two
> branches depending on whether the command is going to be replicated to B or not.
I believe Pedro was talking about having P send the previous value
directly to O, and so having different handling of replies on O based
on whether we expect a previous value or not. I'm not that worried
about it; one way to handle the difference would be to use
ResponseMode.GET_ALL when the previous value is needed, and GET_NONE
otherwise.
If the first implementation does not support omitting the simple P -> O ack,
that's fine. But when designing, please don't block the path to a nice
optimization.
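
Dan's GET_ALL/GET_NONE split could be as small as this (a sketch;
needsReturnValue() is a made-up helper on the command):

import org.jgroups.blocks.ResponseMode;

// Wait for P's reply only when it must carry a previous/custom value;
// otherwise O relies on the backup acks alone.
ResponseMode modeFor(WriteCommand cmd) {
    return cmd.needsReturnValue() ? ResponseMode.GET_ALL : ResponseMode.GET_NONE;
}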
Anyway, I think instead of jumping into the implementation and fixing bugs
as they pop up, this time it may be better to build a model and
validate it first... then we can discuss changing details in the
model and check them as well. I volunteered to do this with
Knossos; we'll see how that goes (and when I'll have the time to
actually work on it...)
No objections :) If you get any interesting results from model checking,
I am one big ear.
Radim
Dan
> Radim
>
>>> The gains are:
>>> * fewer hops (3 instead of 4 if O != P && O != B)
>>> * fewer messages (the primary's ACK is transitive, implied by the ack from B)
>>> * shorter lock times (no locking during the P -> B RPC)
>>>
>>>>> However, the
>>>>> Originator needs to wait for the ACK from the Primary because of conditional
>>>>> operations and the functional API.
>>>> If the operation is successful, Primary will have to let the
>>>> secondaries know so these can reply to the Originator directly: still
>>>> saves a hop.
>> As I said above: "I'm thinking sending the reply to originator in
>> parallel with the update message to the backups"
>>
>>>>> In the first case, if the conditional operation fails, the Backups are
>>>>> not bothered. In the latter case, we may need the return value from the
>>>>> function.
>>>> Right, for a failed or rejected operation the secondaries won't even
>>>> know about it,
>>>> so the Primary is in charge of letting the Originator know.
>>>> Essentially you're highlighting that the Originator needs to wait for
>>>> either the response from the secondaries (all of them?)
>>>> or from the Primary.
>>>>
>>>>>> I suspect the tricky part is what happens when the Primary owner rules
>>>>>> +1 to apply the change, but then the backup owners (all or some of
>>>>>> them) somehow fail before letting the Originator know. The Originator
>>>>>> in this case should seek confirmation about its operation state
>>>>>> (success?) with the Primary owner; this implies that the Primary owner
>>>>>> needs to keep track of what it's applied and track failures too, and
>>>>>> this log needs to be pruned.
>>> Currently, in case of a lost (timed out) ACK from B to P, we just report
>>> an exception and don't care about synchronizing P and B - B may already
>>> store the updated value.
>>> So we don't have to care about a rollback on P if replication to B fails
>>> either - we just report that it's broken, sorry.
>>> A better consolidation API would be nice, though, something like
>>> cache.getAllVersions().
>>>
>>> Radim
>>>
>>>
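
For illustration, the consolidation API hinted at above is only a hint; a
hypothetical shape (none of these types exist) might be:

import java.util.Map;

// Hypothetical: expose every owner's copy of a key (value + version) so the
// caller can detect and repair divergence after a failed/partial write.
interface ConsolidationCache<K, V> {
    Map<Address, VersionedValue<V>> getAllVersions(K key);

    interface Address {}
    final class VersionedValue<V> {
        public final V value;
        public final long version; // illustrative; could be a vector clock
        public VersionedValue(V value, long version) {
            this.value = value;
            this.version = version;
        }
    }
}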
>>>>>> Sounds pretty nice, or am I missing other difficulties?
>>>>>>
>>>>>> Thanks,
>>>>>> Sanne
--
Radim Vansa <rvansa(a)redhat.com>
JBoss Performance Team