Tristan Tarrant updated ISPN-5570:
----------------------------------
Fix Version/s: 9.4.7.Final
(was: 9.4.6.Final)
Cross-site: retry backup commands
---------------------------------
Key: ISPN-5570
URL: https://issues.jboss.org/browse/ISPN-5570
Project: Infinispan
Issue Type: Bug
Components: Core, Cross-Site Replication
Affects Versions: 7.2.3.Final
Reporter: Dan Berindei
Priority: Major
Fix For: 9.4.7.Final
There are 3 phases in a backup RPC, and each one can fail in its own way:
1. Sender -> Local site master: failures are caused by the site master shutting down or
crashing, or by a network split.
2. Local site master -> Remote site master:
2.1. Local site master is no longer a site master, e.g. because it's shutting down or
because it's no longer coordinator after a merge.
2.2. Remote site master is no longer a site master.
2.3. Link between local site and remote site is down.
3. Remote site master -> Backup targets
Replication failures in phase 3 are handled by retrying (except for TimeoutExceptions),
because {{BaseBackupReceiver}} uses regular cache methods to perform the updates.
But replication failures in phases 1 and 2 are not handled in any way, except for causing
the remote site to be taken offline after a certain number of replication failures (if
backup is synchronous). We should instead retry backup RPCs when we get a
{{SuspectException}} or {{UnreachableException}}, and perhaps even when we get no response
(2.2?), and only stop when the timeout expires or when the backup is taken offline.
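A rough sketch of the retry loop proposed above, only to make the stop conditions concrete. It is not the actual Infinispan code: {{BackupRetrySketch}}, {{invokeWithRetry}}, {{pauseMillis}} and {{siteOffline}} are made-up names, and the two nested exception classes are local stand-ins for the real {{SuspectException}} and {{UnreachableException}} so that the example compiles on its own.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class BackupRetrySketch {

    // Local stand-ins (assumption): the real exception classes live in Infinispan.
    public static class SuspectException extends RuntimeException {
        public SuspectException(String message) { super(message); }
    }

    public static class UnreachableException extends RuntimeException {
        public UnreachableException(String message) { super(message); }
    }

    /**
     * Invokes the backup RPC, retrying on SuspectException or UnreachableException
     * until it succeeds, the timeout expires, or the backup site is taken offline.
     * Any other exception is propagated immediately, without a retry.
     */
    public static <T> T invokeWithRetry(Callable<T> backupRpc,
                                        long timeoutMillis,
                                        long pauseMillis,
                                        BooleanSupplier siteOffline) throws Exception {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            // Stop condition 1: the backup site was taken offline in the meantime.
            if (siteOffline.getAsBoolean()) {
                throw new IllegalStateException("Backup site is offline, giving up");
            }
            try {
                return backupRpc.call();
            } catch (SuspectException | UnreachableException e) {
                long remainingNanos = deadline - System.nanoTime();
                // Stop condition 2: the overall timeout has expired.
                if (remainingNanos <= 0) {
                    throw e;
                }
                // Pause briefly before the next attempt, but never past the deadline.
                long remainingMillis = Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
                Thread.sleep(Math.min(pauseMillis, remainingMillis));
            }
        }
    }
}
```

The sketch does not cover the "no response" case (2.2?); that would need a per-attempt timeout shorter than the overall one, which is left open here.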
Async backup probably needs retrying as well, and perhaps even a more sophisticated
approach like I-RAC (ISPN-2634).