[infinispan-issues] [JBoss JIRA] (ISPN-8182) Asynchronous commands should be retried if topology is outdated

Tuesday, 8 August 2017

    [
https://issues.jboss.org/browse/ISPN-8182?page=com.atlassian.jira.plugin....
] 

Galder Zamarreño edited comment on ISPN-8182 at 8/8/17 5:05 AM:
----------------------------------------------------------------

A couple of IRC discussions we've had so far:

{code}
[15:52:15] > pruivo: dberindei: it'd be interesting to hear your thoughts about
ISPN-8182
[15:53:47] <dberindei> galderz: I don't think it's feasible, the originator
forgets the command 
immediately after sending it to the owners
[15:54:29] <dberindei> galderz: OTOH I don't think the remote nodes should throw
an 
OutdatedTopologyException if the command is async
[15:54:50] <pruivo> dberindei, galderz well, IMO I don't think we should do it.
If you want to 
apply and update async, use the putAsync()
[15:55:29] > rvansa: FYI ^
[15:55:55] <dberindei> pruivo: but you agree that in DIST_ASYNC, remote nodes
throwing OTE 
is a bug, right?
[15:56:05] > dberindei: we already supply custom interceptors for HB 2L, so we could
try to do 
that: not bothering about outdated topologies
[15:57:14] > pruivo: i guess you mean that putAsync() would retry in case of outdated
topology?
[15:57:16] <pruivo> dberindei, more or less. I think it should check if it is an
owner or not. 
throwing the exception definitely isn't needed
[15:57:42] <rvansa> dberindei: I think that the topology check in DI does not care
if the cache is 
async
[15:57:42] <pruivo> galderz, yes, if I'm not mistaken, it is a sync put in a
separate threads (with 
the benefits of sync mode)
[15:57:51] <rvansa> dberindei: if it does not match, it simply throws
[15:58:00] <dberindei> rvansa: ok, that's a bug
[15:58:39] <rvansa> dberindei: it shouldn't ignore it either, IMO... everything
below STI should be 
executed in the same topology, IMO
[15:59:37] > dberindei: what's the bug?
[16:00:32] <rvansa> dberindei: I think that the DI should rather send the command to
proper 
primary owner in the recent topology
[16:00:55] <rvansa> dberindei: if this node is meant as primary
{code}

And:

{code}
<rvansa> galderz: I think that you shouldn't need to catch - the OTE
    does not have to be thrown at all
...
 rvansa: right, assuming we provide our own interceptor for timestamp
    and query, we can just simply ignore any topology checks...
<rvansa> galderz: not only for our own interceptor, it's a general
    infinispan issue
...
 rvansa: both for REPL and DIST? <rvansa> galderz:
yesterday before the meeting I've suggested that
    async commands should just execute if these are still on the owner
<rvansa> galderz: yes
<rvansa> galderz: in async mode, after you call cache.put(k, v2), all
    owners should eventually contain v2
<rvansa> galderz: unless 'error' happens
<rvansa> galderz: topology change (node joining) is not an error
...
 rvansa: makes sense <rvansa> galderz: actually, throwing
and retrying locally might be
    needed - I would prefer to fix topology for a given command below
    STI and retry if it changes in any place we need to consider it
<rvansa> galderz: that's a cozy invariant; not sure if it's really
    needed here
<rvansa> galderz: anyway, regrettably we don't have any plan so far
    how to make the 'eventually' happen
...
 rvansa: but we need something better than what we have now...
<rvansa> galderz: because if a new owner pops up and fetches data from
    node that did not get the update yet, its version would be stale
...
 dberindei: pruivo: we were interrupted yday discussing ISPN-8182
<jbossbot> jira [ISPN-8182] Asynchronous commands should be retried if
    topology is outdated [New (Unresolved) Enhancement, Major, Core,
    Unassigned] https://issues.jboss.org/browse/ISPN-8182
<rvansa> galderz: quick fix would be just not throwing
<rvansa> galderz: + a set of stress tests that will try out this with
    all combos of primary/backup/non-owner transitions to see if
    anything goes wrong
<rvansa> 'wrong' meaning NPEs and such, stale data should be expected
    in thos
<rvansa> those
*** First activity: dberindei joined 33 minutes 16 seconds ago.
<dberindei> galderz rvansa: indeed, if a new node joins OR a node
    leaves, some keys will have new owners, and those owners may or
    may not receive the updated value
...
 rvansa: if stale data is expected, then we're in the same
scenario     as now really
<dberindei> galderz: the fact that we currently check the topology id
    and throw an exception means the update will can be missed on
    owners that aren't new
...
 what do you mean by "aren't new"?
<dberindei> galderz: say in topology 1 k is owned by AB, and in
    topology 2 it's owned by CB
<dberindei> galderz: C would be a new owner, B would be "non-new" :)
...
 dberindei: got it
 dberindei: so, should remote nodes not throw that exception for     async puts? or
should still be thrown and then retried?
...
 dberindei: we're assuming we'd change core for this
<dberindei> galderz: throwing and catching would be nice because the
    only change would be in StateTransferInterceptor (I think)
...
 dberindei: ok
 dberindei: do we have any stress tests where we could add tests for     seeing that
it all works fine for repl async puts?
{code}

was (Author: galder.zamarreno):
A couple of IRC discussions we've had so far:

{code}
[15:52:15] > pruivo: dberindei: it'd be interesting to hear your thoughts about
ISPN-8182
[15:53:47] <dberindei> galderz: I don't think it's feasible, the originator
forgets the command immediately after sending it to the owners
[15:54:29] <dberindei> galderz: OTOH I don't think the remote nodes should throw
an OutdatedTopologyException if the command is async
[15:54:50] <pruivo> dberindei, galderz well, IMO I don't think we should do it.
If you want to apply and update async, use the putAsync()
[15:55:29] > rvansa: FYI ^
[15:55:55] <dberindei> pruivo: but you agree that in DIST_ASYNC, remote nodes
throwing OTE is a bug, right?
[15:56:05] > dberindei: we already supply custom interceptors for HB 2L, so we could
try to do that: not bothering about outdated topologies
[15:57:14] > pruivo: i guess you mean that putAsync() would retry in case of outdated
topology?
[15:57:16] <pruivo> dberindei, more or less. I think it should check if it is an
owner or not. throwing the exception definitely isn't needed
[15:57:42] <rvansa> dberindei: I think that the topology check in DI does not care
if the cache is async
[15:57:42] <pruivo> galderz, yes, if I'm not mistaken, it is a sync put in a
separate threads (with the benefits of sync mode)
[15:57:51] <rvansa> dberindei: if it does not match, it simply throws
[15:58:00] <dberindei> rvansa: ok, that's a bug
[15:58:39] <rvansa> dberindei: it shouldn't ignore it either, IMO... everything
below STI should be executed in the same topology, IMO
[15:59:37] > dberindei: what's the bug?
[16:00:32] <rvansa> dberindei: I think that the DI should rather send the command to
proper primary owner in the recent topology
[16:00:55] <rvansa> dberindei: if this node is meant as primary
{code}

And:

{code}
<rvansa> galderz: I think that you shouldn't need to catch - the OTE
    does not have to be thrown at all
...
 rvansa: right, assuming we provide our own interceptor for timestamp
    and query, we can just simply ignore any topology checks...
<rvansa> galderz: not only for our own interceptor, it's a general
    infinispan issue
...
 rvansa: both for REPL and DIST? <rvansa> galderz:
yesterday before the meeting I've suggested that
    async commands should just execute if these are still on the owner
<rvansa> galderz: yes
<rvansa> galderz: in async mode, after you call cache.put(k, v2), all
    owners should eventually contain v2
<rvansa> galderz: unless 'error' happens
<rvansa> galderz: topology change (node joining) is not an error
...
 rvansa: makes sense <rvansa> galderz: actually, throwing
and retrying locally might be
    needed - I would prefer to fix topology for a given command below
    STI and retry if it changes in any place we need to consider it
<rvansa> galderz: that's a cozy invariant; not sure if it's really
    needed here
<rvansa> galderz: anyway, regrettably we don't have any plan so far
    how to make the 'eventually' happen
...
 rvansa: but we need something better than what we have now...
<rvansa> galderz: because if a new owner pops up and fetches data from
    node that did not get the update yet, its version would be stale
...
 dberindei: pruivo: we were interrupted yday discussing ISPN-8182
<jbossbot> jira [ISPN-8182] Asynchronous commands should be retried if
    topology is outdated [New (Unresolved) Enhancement, Major, Core,
    Unassigned] https://issues.jboss.org/browse/ISPN-8182
<rvansa> galderz: quick fix would be just not throwing
<rvansa> galderz: + a set of stress tests that will try out this with
    all combos of primary/backup/non-owner transitions to see if
    anything goes wrong
<rvansa> 'wrong' meaning NPEs and such, stale data should be expected
    in thos
<rvansa> those
*** First activity: dberindei joined 33 minutes 16 seconds ago.
<dberindei> galderz rvansa: indeed, if a new node joins OR a node
    leaves, some keys will have new owners, and those owners may or
    may not receive the updated value
...
 rvansa: if stale data is expected, then we're in the same
scenario     as now really
<dberindei> galderz: the fact that we currently check the topology id
    and throw an exception means the update will can be missed on
    owners that aren't new
...
 what do you mean by "aren't new"?
<dberindei> galderz: say in topology 1 k is owned by AB, and in
    topology 2 it's owned by CB
<dberindei> galderz: C would be a new owner, B would be "non-new" :)
...
 dberindei: got it
 dberindei: so, should remote nodes not throw that exception for     async puts? or
should still be thrown and then retried?
...
 dberindei: we're assuming we'd change core for this
<dberindei> galderz: throwing and catching would be nice because the
    only change would be in StateTransferInterceptor (I think)
...
 dberindei: ok
 dberindei: do we have any stress tests where we could add tests for     seeing that
it all works fine for repl async puts?
{code}

...
 Asynchronous commands should be retried if topology is outdated
 ---------------------------------------------------------------

                 Key: ISPN-8182
                 URL: https://issues.jboss.org/browse/ISPN-8182
             Project: Infinispan
          Issue Type: Enhancement
          Components: Core
    Affects Versions: 9.1.0.Final
            Reporter: Galder Zamarreño

 If an asynchronous command fails at a remote node, it should be retried.
 I'm not sure how feasible this really is. One possible solution could be this: having
NACK style implementation where by default the originator assumes an asynchronous command
has been executed, but if the receiver tells it that the topology is outdated, the
originator retries?
 This is related to ISPN-8027 where we've discovered that some updates are not applied
when asynchronous commands to update the Hibernate 2L timestamp cache fail as a result of
an outdated topology. 

--
This message was sent by Atlassian JIRA
(v7.2.3#72005)

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

[infinispan-issues] [JBoss JIRA] (ISPN-8182) Asynchronous commands should be retried if topology is outdated