[JBoss JIRA] (ISPN-8175) LocalCacheStateTransferTest random failures
by Pedro Ruivo (JIRA)
[ https://issues.jboss.org/browse/ISPN-8175?page=com.atlassian.jira.plugin.... ]
Pedro Ruivo updated ISPN-8175:
------------------------------
Status: Pull Request Sent (was: Open)
Git Pull Request: https://github.com/infinispan/infinispan/pull/5363
> LocalCacheStateTransferTest random failures
> -------------------------------------------
>
> Key: ISPN-8175
> URL: https://issues.jboss.org/browse/ISPN-8175
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 9.1.0.Final
> Reporter: Gustavo Fernandes
> Assignee: Pedro Ruivo
> Labels: testsuite_stability
>
> {noformat}
> [ERROR] testStateTransferWithClusterIdle(org.infinispan.xsite.statetransfer.LocalCacheStateTransferTest) Time elapsed: 0.698 s <<< FAILURE!
> java.lang.AssertionError:
> at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
> at org.testng.AssertJUnit.assertTrue(AssertJUnit.java:24)
> at org.testng.AssertJUnit.assertFalse(AssertJUnit.java:41)
> at org.testng.AssertJUnit.assertFalse(AssertJUnit.java:49)
> at org.infinispan.xsite.statetransfer.LocalCacheStateTransferTest.assertNoStateTransferInReceivingSite(LocalCacheStateTransferTest.java:147)
> at org.infinispan.xsite.statetransfer.LocalCacheStateTransferTest.testStateTransferWithClusterIdle(LocalCacheStateTransferTest.java:96)
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
7 years, 5 months
[JBoss JIRA] (ISPN-8175) LocalCacheStateTransferTest random failures
by Pedro Ruivo (JIRA)
[ https://issues.jboss.org/browse/ISPN-8175?page=com.atlassian.jira.plugin.... ]
Pedro Ruivo reassigned ISPN-8175:
---------------------------------
Assignee: Pedro Ruivo
> LocalCacheStateTransferTest random failures
> -------------------------------------------
>
> Key: ISPN-8175
> URL: https://issues.jboss.org/browse/ISPN-8175
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 9.1.0.Final
> Reporter: Gustavo Fernandes
> Assignee: Pedro Ruivo
> Labels: testsuite_stability
>
> {noformat}
> [ERROR] testStateTransferWithClusterIdle(org.infinispan.xsite.statetransfer.LocalCacheStateTransferTest) Time elapsed: 0.698 s <<< FAILURE!
> java.lang.AssertionError:
> at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
> at org.testng.AssertJUnit.assertTrue(AssertJUnit.java:24)
> at org.testng.AssertJUnit.assertFalse(AssertJUnit.java:41)
> at org.testng.AssertJUnit.assertFalse(AssertJUnit.java:49)
> at org.infinispan.xsite.statetransfer.LocalCacheStateTransferTest.assertNoStateTransferInReceivingSite(LocalCacheStateTransferTest.java:147)
> at org.infinispan.xsite.statetransfer.LocalCacheStateTransferTest.testStateTransferWithClusterIdle(LocalCacheStateTransferTest.java:96)
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
7 years, 5 months
[JBoss JIRA] (ISPN-8176) RemoteCacheStoreIT.testReadOnly random failures
by Ryan Emerson (JIRA)
[ https://issues.jboss.org/browse/ISPN-8176?page=com.atlassian.jira.plugin.... ]
Ryan Emerson updated ISPN-8176:
-------------------------------
Status: Open (was: New)
> RemoteCacheStoreIT.testReadOnly random failures
> -----------------------------------------------
>
> Key: ISPN-8176
> URL: https://issues.jboss.org/browse/ISPN-8176
> Project: Infinispan
> Issue Type: Bug
> Affects Versions: 9.1.0.Final
> Reporter: Gustavo Fernandes
> Assignee: Ryan Emerson
> Labels: testsuite_stability
>
> java.lang.AssertionError: expected null, but was:<v1>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotNull(Assert.java:664)
> at org.junit.Assert.assertNull(Assert.java:646)
> at org.junit.Assert.assertNull(Assert.java:656)
> at org.infinispan.server.test.cs.remote.RemoteCacheStoreIT.testReadOnly(RemoteCacheStoreIT.java:85)
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
7 years, 5 months
[JBoss JIRA] (ISPN-8075) LiveRuningTest fails randomly with CorruptIndexException
by Gustavo Fernandes (JIRA)
[ https://issues.jboss.org/browse/ISPN-8075?page=com.atlassian.jira.plugin.... ]
Gustavo Fernandes commented on ISPN-8075:
-----------------------------------------
Was only able to reproduce by running the test in a loop inside a VirtualBox running on a dual core latop
> LiveRuningTest fails randomly with CorruptIndexException
> --------------------------------------------------------
>
> Key: ISPN-8075
> URL: https://issues.jboss.org/browse/ISPN-8075
> Project: Infinispan
> Issue Type: Bug
> Components: Embedded Querying
> Affects Versions: 9.1.0.Final
> Reporter: Gustavo Fernandes
> Assignee: Gustavo Fernandes
>
> {noformat}
> org.hibernate.search.exception.SearchException: Unable to reopen IndexReader
> at org.hibernate.search.indexes.impl.SharingBufferReaderProvider$PerDirectoryLatestReader.refreshAndGet(SharingBufferReaderProvider.java:242)
> at org.hibernate.search.indexes.impl.SharingBufferReaderProvider.openIndexReader(SharingBufferReaderProvider.java:73)
> at org.hibernate.search.indexes.impl.SharingBufferReaderProvider.openIndexReader(SharingBufferReaderProvider.java:35)
> at org.hibernate.search.reader.impl.ManagedMultiReader.createInstance(ManagedMultiReader.java:69)
> at org.hibernate.search.reader.impl.MultiReaderFactory.openReader(MultiReaderFactory.java:48)
> at org.hibernate.search.query.engine.impl.LuceneHSQuery.buildSearcher(LuceneHSQuery.java:475)
> at org.hibernate.search.query.engine.impl.LuceneHSQuery.queryResultSize(LuceneHSQuery.java:218)
> at org.hibernate.search.query.hibernate.impl.FullTextQueryImpl.doGetResultSize(FullTextQueryImpl.java:269)
> at org.hibernate.search.query.hibernate.impl.FullTextQueryImpl.getResultSize(FullTextQueryImpl.java:260)
> at org.infinispan.hibernate.search.LiveRunningTest.assertView(LiveRunningTest.java:79)
> at org.infinispan.hibernate.search.LiveRunningTest.assertViews(LiveRunningTest.java:70)
> at org.infinispan.hibernate.search.LiveRunningTest.liveRun(LiveRunningTest.java:57)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
> at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
> at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
> at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
> at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
> at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
> at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
> Caused by: java.io.IOException: Read past EOF
> at org.infinispan.lucene.impl.SingleChunkIndexInput.readByte(SingleChunkIndexInput.java:54)
> at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
> at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
> at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
> at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
> at org.apache.lucene.codecs.lucene50.Lucene50SegmentInfoFormat.read(Lucene50SegmentInfoFormat.java:86)
> at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:362)
> at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:493)
> at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:490)
> at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:731)
> at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
> at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:490)
> at org.apache.lucene.index.StandardDirectoryReader.isCurrent(StandardDirectoryReader.java:344)
> at org.apache.lucene.index.StandardDirectoryReader.doOpenNoWriter(StandardDirectoryReader.java:300)
> at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:263)
> at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:251)
> at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:137)
> at org.hibernate.search.indexes.impl.SharingBufferReaderProvider$PerDirectoryLatestReader.refreshAndGet(SharingBufferReaderProvider.java:239)
> ... 37 more
> Suppressed: org.apache.lucene.index.CorruptIndexException: checksum status indeterminate: remaining=0, please run checkindex for more details (resource=BufferedChecksumIndexInput(_b.si))
> at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:370)
> at org.apache.lucene.codecs.lucene50.Lucene50SegmentInfoFormat.read(Lucene50SegmentInfoFormat.java:117)
> ... 49 more
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
7 years, 5 months
[JBoss JIRA] (ISPN-8176) RemoteCacheStoreIT.testReadOnly random failures
by Ryan Emerson (JIRA)
[ https://issues.jboss.org/browse/ISPN-8176?page=com.atlassian.jira.plugin.... ]
Ryan Emerson reassigned ISPN-8176:
----------------------------------
Assignee: Ryan Emerson
> RemoteCacheStoreIT.testReadOnly random failures
> -----------------------------------------------
>
> Key: ISPN-8176
> URL: https://issues.jboss.org/browse/ISPN-8176
> Project: Infinispan
> Issue Type: Bug
> Affects Versions: 9.1.0.Final
> Reporter: Gustavo Fernandes
> Assignee: Ryan Emerson
> Labels: testsuite_stability
>
> java.lang.AssertionError: expected null, but was:<v1>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotNull(Assert.java:664)
> at org.junit.Assert.assertNull(Assert.java:646)
> at org.junit.Assert.assertNull(Assert.java:656)
> at org.infinispan.server.test.cs.remote.RemoteCacheStoreIT.testReadOnly(RemoteCacheStoreIT.java:85)
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
7 years, 5 months
[JBoss JIRA] (ISPN-8182) Asynchronous commands should be retried if topology is outdated
by Galder Zamarreño (JIRA)
[ https://issues.jboss.org/browse/ISPN-8182?page=com.atlassian.jira.plugin.... ]
Galder Zamarreño commented on ISPN-8182:
----------------------------------------
A couple of IRC discussions we've had so far:
{code}
[15:52:15] > pruivo: dberindei: it'd be interesting to hear your thoughts about ISPN-8182
[15:53:47] <dberindei> galderz: I don't think it's feasible, the originator forgets the command immediately after sending it to the owners
[15:54:29] <dberindei> galderz: OTOH I don't think the remote nodes should throw an OutdatedTopologyException if the command is async
[15:54:50] <pruivo> dberindei, galderz well, IMO I don't think we should do it. If you want to apply and update async, use the putAsync()
[15:55:29] > rvansa: FYI ^
[15:55:55] <dberindei> pruivo: but you agree that in DIST_ASYNC, remote nodes throwing OTE is a bug, right?
[15:56:05] > dberindei: we already supply custom interceptors for HB 2L, so we could try to do that: not bothering about outdated topologies
[15:57:14] > pruivo: i guess you mean that putAsync() would retry in case of outdated topology?
[15:57:16] <pruivo> dberindei, more or less. I think it should check if it is an owner or not. throwing the exception definitely isn't needed
[15:57:42] <rvansa> dberindei: I think that the topology check in DI does not care if the cache is async
[15:57:42] <pruivo> galderz, yes, if I'm not mistaken, it is a sync put in a separate threads (with the benefits of sync mode)
[15:57:51] <rvansa> dberindei: if it does not match, it simply throws
[15:58:00] <dberindei> rvansa: ok, that's a bug
[15:58:39] <rvansa> dberindei: it shouldn't ignore it either, IMO... everything below STI should be executed in the same topology, IMO
[15:59:37] > dberindei: what's the bug?
[16:00:32] <rvansa> dberindei: I think that the DI should rather send the command to proper primary owner in the recent topology
[16:00:55] <rvansa> dberindei: if this node is meant as primary
{code}
And:
{code}
<rvansa> galderz: I think that you shouldn't need to catch - the OTE
does not have to be thrown at all
> rvansa: right, assuming we provide our own interceptor for timestamp
and query, we can just simply ignore any topology checks...
<rvansa> galderz: not only for our own interceptor, it's a general
infinispan issue
> rvansa: both for REPL and DIST?
<rvansa> galderz: yesterday before the meeting I've suggested that
async commands should just execute if these are still on the owner
<rvansa> galderz: yes
<rvansa> galderz: in async mode, after you call cache.put(k, v2), all
owners should eventually contain v2
<rvansa> galderz: unless 'error' happens
<rvansa> galderz: topology change (node joining) is not an error
> rvansa: makes sense
<rvansa> galderz: actually, throwing and retrying locally might be
needed - I would prefer to fix topology for a given command below
STI and retry if it changes in any place we need to consider it
<rvansa> galderz: that's a cozy invariant; not sure if it's really
needed here
<rvansa> galderz: anyway, regrettably we don't have any plan so far
how to make the 'eventually' happen
> rvansa: but we need something better than what we have now...
<rvansa> galderz: because if a new owner pops up and fetches data from
node that did not get the update yet, its version would be stale
> dberindei: pruivo: we were interrupted yday discussing ISPN-8182
<jbossbot> jira [ISPN-8182] Asynchronous commands should be retried if
topology is outdated [New (Unresolved) Enhancement, Major, Core,
Unassigned] https://issues.jboss.org/browse/ISPN-8182
<rvansa> galderz: quick fix would be just not throwing
<rvansa> galderz: + a set of stress tests that will try out this with
all combos of primary/backup/non-owner transitions to see if
anything goes wrong
<rvansa> 'wrong' meaning NPEs and such, stale data should be expected
in thos
<rvansa> those
*** First activity: dberindei joined 33 minutes 16 seconds ago.
<dberindei> galderz rvansa: indeed, if a new node joins OR a node
leaves, some keys will have new owners, and those owners may or
may not receive the updated value
> rvansa: if stale data is expected, then we're in the same scenario
as now really
<dberindei> galderz: the fact that we currently check the topology id
and throw an exception means the update will can be missed on
owners that aren't new
> what do you mean by "aren't new"?
<dberindei> galderz: say in topology 1 k is owned by AB, and in
topology 2 it's owned by CB
<dberindei> galderz: C would be a new owner, B would be "non-new" :)
> dberindei: got it
> dberindei: so, should remote nodes not throw that exception for
async puts? or should still be thrown and then retried?
> dberindei: we're assuming we'd change core for this
<dberindei> galderz: throwing and catching would be nice because the
only change would be in StateTransferInterceptor (I think)
> dberindei: ok
> dberindei: do we have any stress tests where we could add tests for
seeing that it all works fine for repl async puts?
{code}
> Asynchronous commands should be retried if topology is outdated
> ---------------------------------------------------------------
>
> Key: ISPN-8182
> URL: https://issues.jboss.org/browse/ISPN-8182
> Project: Infinispan
> Issue Type: Enhancement
> Components: Core
> Affects Versions: 9.1.0.Final
> Reporter: Galder Zamarreño
>
> If an asynchronous command fails at a remote node, it should be retried.
> I'm not sure how feasible this really is. One possible solution could be this: having NACK style implementation where by default the originator assumes an asynchronous command has been executed, but if the receiver tells it that the topology is outdated, the originator retries?
> This is related to ISPN-8027 where we've discovered that some updates are not applied when asynchronous commands to update the Hibernate 2L timestamp cache fail as a result of an outdated topology.
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
7 years, 5 months
[JBoss JIRA] (ISPN-8182) Asynchronous commands should be retried if topology is outdated
by Galder Zamarreño (JIRA)
[ https://issues.jboss.org/browse/ISPN-8182?page=com.atlassian.jira.plugin.... ]
Galder Zamarreño edited comment on ISPN-8182 at 8/8/17 5:05 AM:
----------------------------------------------------------------
A couple of IRC discussions we've had so far:
{code}
[15:52:15] > pruivo: dberindei: it'd be interesting to hear your thoughts about ISPN-8182
[15:53:47] <dberindei> galderz: I don't think it's feasible, the originator forgets the command
immediately after sending it to the owners
[15:54:29] <dberindei> galderz: OTOH I don't think the remote nodes should throw an
OutdatedTopologyException if the command is async
[15:54:50] <pruivo> dberindei, galderz well, IMO I don't think we should do it. If you want to
apply and update async, use the putAsync()
[15:55:29] > rvansa: FYI ^
[15:55:55] <dberindei> pruivo: but you agree that in DIST_ASYNC, remote nodes throwing OTE
is a bug, right?
[15:56:05] > dberindei: we already supply custom interceptors for HB 2L, so we could try to do
that: not bothering about outdated topologies
[15:57:14] > pruivo: i guess you mean that putAsync() would retry in case of outdated topology?
[15:57:16] <pruivo> dberindei, more or less. I think it should check if it is an owner or not.
throwing the exception definitely isn't needed
[15:57:42] <rvansa> dberindei: I think that the topology check in DI does not care if the cache is
async
[15:57:42] <pruivo> galderz, yes, if I'm not mistaken, it is a sync put in a separate threads (with
the benefits of sync mode)
[15:57:51] <rvansa> dberindei: if it does not match, it simply throws
[15:58:00] <dberindei> rvansa: ok, that's a bug
[15:58:39] <rvansa> dberindei: it shouldn't ignore it either, IMO... everything below STI should be
executed in the same topology, IMO
[15:59:37] > dberindei: what's the bug?
[16:00:32] <rvansa> dberindei: I think that the DI should rather send the command to proper
primary owner in the recent topology
[16:00:55] <rvansa> dberindei: if this node is meant as primary
{code}
And:
{code}
<rvansa> galderz: I think that you shouldn't need to catch - the OTE
does not have to be thrown at all
> rvansa: right, assuming we provide our own interceptor for timestamp
and query, we can just simply ignore any topology checks...
<rvansa> galderz: not only for our own interceptor, it's a general
infinispan issue
> rvansa: both for REPL and DIST?
<rvansa> galderz: yesterday before the meeting I've suggested that
async commands should just execute if these are still on the owner
<rvansa> galderz: yes
<rvansa> galderz: in async mode, after you call cache.put(k, v2), all
owners should eventually contain v2
<rvansa> galderz: unless 'error' happens
<rvansa> galderz: topology change (node joining) is not an error
> rvansa: makes sense
<rvansa> galderz: actually, throwing and retrying locally might be
needed - I would prefer to fix topology for a given command below
STI and retry if it changes in any place we need to consider it
<rvansa> galderz: that's a cozy invariant; not sure if it's really
needed here
<rvansa> galderz: anyway, regrettably we don't have any plan so far
how to make the 'eventually' happen
> rvansa: but we need something better than what we have now...
<rvansa> galderz: because if a new owner pops up and fetches data from
node that did not get the update yet, its version would be stale
> dberindei: pruivo: we were interrupted yday discussing ISPN-8182
<jbossbot> jira [ISPN-8182] Asynchronous commands should be retried if
topology is outdated [New (Unresolved) Enhancement, Major, Core,
Unassigned] https://issues.jboss.org/browse/ISPN-8182
<rvansa> galderz: quick fix would be just not throwing
<rvansa> galderz: + a set of stress tests that will try out this with
all combos of primary/backup/non-owner transitions to see if
anything goes wrong
<rvansa> 'wrong' meaning NPEs and such, stale data should be expected
in thos
<rvansa> those
*** First activity: dberindei joined 33 minutes 16 seconds ago.
<dberindei> galderz rvansa: indeed, if a new node joins OR a node
leaves, some keys will have new owners, and those owners may or
may not receive the updated value
> rvansa: if stale data is expected, then we're in the same scenario
as now really
<dberindei> galderz: the fact that we currently check the topology id
and throw an exception means the update will can be missed on
owners that aren't new
> what do you mean by "aren't new"?
<dberindei> galderz: say in topology 1 k is owned by AB, and in
topology 2 it's owned by CB
<dberindei> galderz: C would be a new owner, B would be "non-new" :)
> dberindei: got it
> dberindei: so, should remote nodes not throw that exception for
async puts? or should still be thrown and then retried?
> dberindei: we're assuming we'd change core for this
<dberindei> galderz: throwing and catching would be nice because the
only change would be in StateTransferInterceptor (I think)
> dberindei: ok
> dberindei: do we have any stress tests where we could add tests for
seeing that it all works fine for repl async puts?
{code}
was (Author: galder.zamarreno):
A couple of IRC discussions we've had so far:
{code}
[15:52:15] > pruivo: dberindei: it'd be interesting to hear your thoughts about ISPN-8182
[15:53:47] <dberindei> galderz: I don't think it's feasible, the originator forgets the command immediately after sending it to the owners
[15:54:29] <dberindei> galderz: OTOH I don't think the remote nodes should throw an OutdatedTopologyException if the command is async
[15:54:50] <pruivo> dberindei, galderz well, IMO I don't think we should do it. If you want to apply and update async, use the putAsync()
[15:55:29] > rvansa: FYI ^
[15:55:55] <dberindei> pruivo: but you agree that in DIST_ASYNC, remote nodes throwing OTE is a bug, right?
[15:56:05] > dberindei: we already supply custom interceptors for HB 2L, so we could try to do that: not bothering about outdated topologies
[15:57:14] > pruivo: i guess you mean that putAsync() would retry in case of outdated topology?
[15:57:16] <pruivo> dberindei, more or less. I think it should check if it is an owner or not. throwing the exception definitely isn't needed
[15:57:42] <rvansa> dberindei: I think that the topology check in DI does not care if the cache is async
[15:57:42] <pruivo> galderz, yes, if I'm not mistaken, it is a sync put in a separate threads (with the benefits of sync mode)
[15:57:51] <rvansa> dberindei: if it does not match, it simply throws
[15:58:00] <dberindei> rvansa: ok, that's a bug
[15:58:39] <rvansa> dberindei: it shouldn't ignore it either, IMO... everything below STI should be executed in the same topology, IMO
[15:59:37] > dberindei: what's the bug?
[16:00:32] <rvansa> dberindei: I think that the DI should rather send the command to proper primary owner in the recent topology
[16:00:55] <rvansa> dberindei: if this node is meant as primary
{code}
And:
{code}
<rvansa> galderz: I think that you shouldn't need to catch - the OTE
does not have to be thrown at all
> rvansa: right, assuming we provide our own interceptor for timestamp
and query, we can just simply ignore any topology checks...
<rvansa> galderz: not only for our own interceptor, it's a general
infinispan issue
> rvansa: both for REPL and DIST?
<rvansa> galderz: yesterday before the meeting I've suggested that
async commands should just execute if these are still on the owner
<rvansa> galderz: yes
<rvansa> galderz: in async mode, after you call cache.put(k, v2), all
owners should eventually contain v2
<rvansa> galderz: unless 'error' happens
<rvansa> galderz: topology change (node joining) is not an error
> rvansa: makes sense
<rvansa> galderz: actually, throwing and retrying locally might be
needed - I would prefer to fix topology for a given command below
STI and retry if it changes in any place we need to consider it
<rvansa> galderz: that's a cozy invariant; not sure if it's really
needed here
<rvansa> galderz: anyway, regrettably we don't have any plan so far
how to make the 'eventually' happen
> rvansa: but we need something better than what we have now...
<rvansa> galderz: because if a new owner pops up and fetches data from
node that did not get the update yet, its version would be stale
> dberindei: pruivo: we were interrupted yday discussing ISPN-8182
<jbossbot> jira [ISPN-8182] Asynchronous commands should be retried if
topology is outdated [New (Unresolved) Enhancement, Major, Core,
Unassigned] https://issues.jboss.org/browse/ISPN-8182
<rvansa> galderz: quick fix would be just not throwing
<rvansa> galderz: + a set of stress tests that will try out this with
all combos of primary/backup/non-owner transitions to see if
anything goes wrong
<rvansa> 'wrong' meaning NPEs and such, stale data should be expected
in thos
<rvansa> those
*** First activity: dberindei joined 33 minutes 16 seconds ago.
<dberindei> galderz rvansa: indeed, if a new node joins OR a node
leaves, some keys will have new owners, and those owners may or
may not receive the updated value
> rvansa: if stale data is expected, then we're in the same scenario
as now really
<dberindei> galderz: the fact that we currently check the topology id
and throw an exception means the update will can be missed on
owners that aren't new
> what do you mean by "aren't new"?
<dberindei> galderz: say in topology 1 k is owned by AB, and in
topology 2 it's owned by CB
<dberindei> galderz: C would be a new owner, B would be "non-new" :)
> dberindei: got it
> dberindei: so, should remote nodes not throw that exception for
async puts? or should still be thrown and then retried?
> dberindei: we're assuming we'd change core for this
<dberindei> galderz: throwing and catching would be nice because the
only change would be in StateTransferInterceptor (I think)
> dberindei: ok
> dberindei: do we have any stress tests where we could add tests for
seeing that it all works fine for repl async puts?
{code}
> Asynchronous commands should be retried if topology is outdated
> ---------------------------------------------------------------
>
> Key: ISPN-8182
> URL: https://issues.jboss.org/browse/ISPN-8182
> Project: Infinispan
> Issue Type: Enhancement
> Components: Core
> Affects Versions: 9.1.0.Final
> Reporter: Galder Zamarreño
>
> If an asynchronous command fails at a remote node, it should be retried.
> I'm not sure how feasible this really is. One possible solution could be this: having NACK style implementation where by default the originator assumes an asynchronous command has been executed, but if the receiver tells it that the topology is outdated, the originator retries?
> This is related to ISPN-8027 where we've discovered that some updates are not applied when asynchronous commands to update the Hibernate 2L timestamp cache fail as a result of an outdated topology.
--
This message was sent by Atlassian JIRA
(v7.2.3#72005)
7 years, 5 months