[
https://issues.jboss.org/browse/ISPN-8182?page=com.atlassian.jira.plugin....
]
Galder Zamarreño edited comment on ISPN-8182 at 8/8/17 5:05 AM:
----------------------------------------------------------------
A couple of IRC discussions we've had so far:
{code}
[15:52:15] > pruivo: dberindei: it'd be interesting to hear your thoughts about
ISPN-8182
[15:53:47] <dberindei> galderz: I don't think it's feasible, the originator
forgets the command
immediately after sending it to the owners
[15:54:29] <dberindei> galderz: OTOH I don't think the remote nodes should throw
an
OutdatedTopologyException if the command is async
[15:54:50] <pruivo> dberindei, galderz well, IMO I don't think we should do it.
If you want to
apply and update async, use the putAsync()
[15:55:29] > rvansa: FYI ^
[15:55:55] <dberindei> pruivo: but you agree that in DIST_ASYNC, remote nodes
throwing OTE
is a bug, right?
[15:56:05] > dberindei: we already supply custom interceptors for HB 2L, so we could
try to do
that: not bothering about outdated topologies
[15:57:14] > pruivo: i guess you mean that putAsync() would retry in case of outdated
topology?
[15:57:16] <pruivo> dberindei, more or less. I think it should check if it is an
owner or not.
throwing the exception definitely isn't needed
[15:57:42] <rvansa> dberindei: I think that the topology check in DI does not care
if the cache is
async
[15:57:42] <pruivo> galderz, yes, if I'm not mistaken, it is a sync put in a
separate threads (with
the benefits of sync mode)
[15:57:51] <rvansa> dberindei: if it does not match, it simply throws
[15:58:00] <dberindei> rvansa: ok, that's a bug
[15:58:39] <rvansa> dberindei: it shouldn't ignore it either, IMO... everything
below STI should be
executed in the same topology, IMO
[15:59:37] > dberindei: what's the bug?
[16:00:32] <rvansa> dberindei: I think that the DI should rather send the command to
proper
primary owner in the recent topology
[16:00:55] <rvansa> dberindei: if this node is meant as primary
{code}
And:
{code}
<rvansa> galderz: I think that you shouldn't need to catch - the OTE
does not have to be thrown at all
rvansa: right, assuming we provide our own interceptor for timestamp
and query, we can just simply ignore any topology checks...
<rvansa> galderz: not only for our own interceptor, it's a general
infinispan issue
rvansa: both for REPL and DIST?
<rvansa> galderz:
yesterday before the meeting I've suggested that
async commands should just execute if these are still on the owner
<rvansa> galderz: yes
<rvansa> galderz: in async mode, after you call cache.put(k, v2), all
owners should eventually contain v2
<rvansa> galderz: unless 'error' happens
<rvansa> galderz: topology change (node joining) is not an error
rvansa: makes sense
<rvansa> galderz: actually, throwing
and retrying locally might be
needed - I would prefer to fix topology for a given command below
STI and retry if it changes in any place we need to consider it
<rvansa> galderz: that's a cozy invariant; not sure if it's really
needed here
<rvansa> galderz: anyway, regrettably we don't have any plan so far
how to make the 'eventually' happen
rvansa: but we need something better than what we have now...
<rvansa> galderz: because if a new owner pops up and fetches data from
node that did not get the update yet, its version would be stale
dberindei: pruivo: we were interrupted yday discussing ISPN-8182
<jbossbot> jira [ISPN-8182] Asynchronous commands should be retried if
topology is outdated [New (Unresolved) Enhancement, Major, Core,
Unassigned]
https://issues.jboss.org/browse/ISPN-8182
<rvansa> galderz: quick fix would be just not throwing
<rvansa> galderz: + a set of stress tests that will try out this with
all combos of primary/backup/non-owner transitions to see if
anything goes wrong
<rvansa> 'wrong' meaning NPEs and such, stale data should be expected
in thos
<rvansa> those
*** First activity: dberindei joined 33 minutes 16 seconds ago.
<dberindei> galderz rvansa: indeed, if a new node joins OR a node
leaves, some keys will have new owners, and those owners may or
may not receive the updated value
rvansa: if stale data is expected, then we're in the same
scenario
as now really
<dberindei> galderz: the fact that we currently check the topology id
and throw an exception means the update will can be missed on
owners that aren't new
what do you mean by "aren't new"?
<dberindei> galderz: say in topology 1 k is owned by AB, and in
topology 2 it's owned by CB
<dberindei> galderz: C would be a new owner, B would be "non-new" :)
dberindei: got it
dberindei: so, should remote nodes not throw that exception for
async puts? or
should still be thrown and then retried?
dberindei: we're assuming we'd change core for this
<dberindei> galderz: throwing and catching would be nice because the
only change would be in StateTransferInterceptor (I think)
dberindei: ok
dberindei: do we have any stress tests where we could add tests for
seeing that
it all works fine for repl async puts?
{code}
was (Author: galder.zamarreno):
A couple of IRC discussions we've had so far:
{code}
[15:52:15] > pruivo: dberindei: it'd be interesting to hear your thoughts about
ISPN-8182
[15:53:47] <dberindei> galderz: I don't think it's feasible, the originator
forgets the command immediately after sending it to the owners
[15:54:29] <dberindei> galderz: OTOH I don't think the remote nodes should throw
an OutdatedTopologyException if the command is async
[15:54:50] <pruivo> dberindei, galderz well, IMO I don't think we should do it.
If you want to apply and update async, use the putAsync()
[15:55:29] > rvansa: FYI ^
[15:55:55] <dberindei> pruivo: but you agree that in DIST_ASYNC, remote nodes
throwing OTE is a bug, right?
[15:56:05] > dberindei: we already supply custom interceptors for HB 2L, so we could
try to do that: not bothering about outdated topologies
[15:57:14] > pruivo: i guess you mean that putAsync() would retry in case of outdated
topology?
[15:57:16] <pruivo> dberindei, more or less. I think it should check if it is an
owner or not. throwing the exception definitely isn't needed
[15:57:42] <rvansa> dberindei: I think that the topology check in DI does not care
if the cache is async
[15:57:42] <pruivo> galderz, yes, if I'm not mistaken, it is a sync put in a
separate threads (with the benefits of sync mode)
[15:57:51] <rvansa> dberindei: if it does not match, it simply throws
[15:58:00] <dberindei> rvansa: ok, that's a bug
[15:58:39] <rvansa> dberindei: it shouldn't ignore it either, IMO... everything
below STI should be executed in the same topology, IMO
[15:59:37] > dberindei: what's the bug?
[16:00:32] <rvansa> dberindei: I think that the DI should rather send the command to
proper primary owner in the recent topology
[16:00:55] <rvansa> dberindei: if this node is meant as primary
{code}
And:
{code}
<rvansa> galderz: I think that you shouldn't need to catch - the OTE
does not have to be thrown at all
rvansa: right, assuming we provide our own interceptor for timestamp
and query, we can just simply ignore any topology checks...
<rvansa> galderz: not only for our own interceptor, it's a general
infinispan issue
rvansa: both for REPL and DIST?
<rvansa> galderz:
yesterday before the meeting I've suggested that
async commands should just execute if these are still on the owner
<rvansa> galderz: yes
<rvansa> galderz: in async mode, after you call cache.put(k, v2), all
owners should eventually contain v2
<rvansa> galderz: unless 'error' happens
<rvansa> galderz: topology change (node joining) is not an error
rvansa: makes sense
<rvansa> galderz: actually, throwing
and retrying locally might be
needed - I would prefer to fix topology for a given command below
STI and retry if it changes in any place we need to consider it
<rvansa> galderz: that's a cozy invariant; not sure if it's really
needed here
<rvansa> galderz: anyway, regrettably we don't have any plan so far
how to make the 'eventually' happen
rvansa: but we need something better than what we have now...
<rvansa> galderz: because if a new owner pops up and fetches data from
node that did not get the update yet, its version would be stale
dberindei: pruivo: we were interrupted yday discussing ISPN-8182
<jbossbot> jira [ISPN-8182] Asynchronous commands should be retried if
topology is outdated [New (Unresolved) Enhancement, Major, Core,
Unassigned]
https://issues.jboss.org/browse/ISPN-8182
<rvansa> galderz: quick fix would be just not throwing
<rvansa> galderz: + a set of stress tests that will try out this with
all combos of primary/backup/non-owner transitions to see if
anything goes wrong
<rvansa> 'wrong' meaning NPEs and such, stale data should be expected
in thos
<rvansa> those
*** First activity: dberindei joined 33 minutes 16 seconds ago.
<dberindei> galderz rvansa: indeed, if a new node joins OR a node
leaves, some keys will have new owners, and those owners may or
may not receive the updated value
rvansa: if stale data is expected, then we're in the same
scenario
as now really
<dberindei> galderz: the fact that we currently check the topology id
and throw an exception means the update will can be missed on
owners that aren't new
what do you mean by "aren't new"?
<dberindei> galderz: say in topology 1 k is owned by AB, and in
topology 2 it's owned by CB
<dberindei> galderz: C would be a new owner, B would be "non-new" :)
dberindei: got it
dberindei: so, should remote nodes not throw that exception for
async puts? or
should still be thrown and then retried?
dberindei: we're assuming we'd change core for this
<dberindei> galderz: throwing and catching would be nice because the
only change would be in StateTransferInterceptor (I think)
dberindei: ok
dberindei: do we have any stress tests where we could add tests for
seeing that
it all works fine for repl async puts?
{code}
Asynchronous commands should be retried if topology is outdated
---------------------------------------------------------------
Key: ISPN-8182
URL:
https://issues.jboss.org/browse/ISPN-8182
Project: Infinispan
Issue Type: Enhancement
Components: Core
Affects Versions: 9.1.0.Final
Reporter: Galder Zamarreño
If an asynchronous command fails at a remote node, it should be retried.
I'm not sure how feasible this really is. One possible solution could be this: having
NACK style implementation where by default the originator assumes an asynchronous command
has been executed, but if the receiver tells it that the topology is outdated, the
originator retries?
This is related to ISPN-8027 where we've discovered that some updates are not applied
when asynchronous commands to update the Hibernate 2L timestamp cache fail as a result of
an outdated topology.
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)