[infinispan-issues] [JBoss JIRA] (ISPN-4575) Map/Reduce incorrect results with a non-shared non-tx intermediate cache
Vladimir Blagojevic (JIRA)
issues at jboss.org
Wed Oct 8 10:53:12 EDT 2014
[ https://issues.jboss.org/browse/ISPN-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Blagojevic updated ISPN-4575:
--------------------------------------
Git Pull Request: https://github.com/infinispan/infinispan/pull/2853
> Map/Reduce incorrect results with a non-shared non-tx intermediate cache
> ------------------------------------------------------------------------
>
> Key: ISPN-4575
> URL: https://issues.jboss.org/browse/ISPN-4575
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Distributed Execution and Map/Reduce
> Affects Versions: 7.0.0.Alpha5
> Reporter: Dan Berindei
> Assignee: Vladimir Blagojevic
> Priority: Blocker
> Labels: testsuite_stability
> Fix For: 7.0.0.CR2
>
>
> In a non-tx cache, if a command is started with topology id {{T}}, and when it is replicated on another node the distribution interceptor sees topology {{T+1}}, it throws an {{OutdatedTopologyException}}. The originator of the command will then retry the command, setting topology {{T+1}}.
> When this happens with a {{PutKeyValueCommand(k, MapReduceManagerImpl.DeltaAwareList)}}, it can lead to duplicate intermediate values.
> Say _A_ is the primary owner of {{k}} in {{T}}, _B_ is a backup owner both in {{T}} and {{T+1}}, and _C_ is the backup owner in {{T}} and the primary owner in {{T+1}} (i.e. _C_ just joined and a rebalance is in progress during {{T}} - see {{NonTxBackupOwnerBecomingPrimaryOwnerTest}}).
> _A_ starts the {{PutKeyValueCommand}} and replicates it to _B_ and _C_. _C_ applies the command, but _B_ already has topology {{T+1}} and throws an {{OutdatedTopologyException}}. _A_ installs topology {{T+1}}, sends the command to _C_ (as the new primary owner), which replicates it to _B_ and then applies it locally a second time.
> This scenario can happen during a M/R task even without nodes joining or leaving. That's because {{CreateCacheCommand}} only calls {{getCache()}} on each member, it doesn't wait for the cache to have a certain number of members or for state transfer to be complete for all the members. The last member to join the intermediate cache is guaranteed to have topology {{T+1}}, but the others may have topology {{T}} by the time the combine phase starts inserting values in the intermediate cache.
> I have seen the {{OutdatedTopologyException}} happen pretty often during the test suite, especially after I removed the duplicate {{invokeRemotely}} call in {{MapReduceTask.executeTaskInit()}}. Most of them were harmless, but there was one failure in CI: http://ci.infinispan.org/viewLog.html?buildId=9811&tab=buildResultsDiv&buildTypeId=bt8
> A short-term fix would be to wait for all the members to finish joining in {{CreateCacheCommand}}. Long-term, M/R tasks should be resilient to topology changes, so we should investigate making {{PutKeyValue(k, DeltaAwareList)}} handle {{OutdatedTopologyException}} s.
--
This message was sent by Atlassian JIRA
(v6.3.1#6329)
More information about the infinispan-issues
mailing list