I agree that there should be a configuration option which determines after
how many SITE-UNREACHABLE events (combined with a timeout) a site is
declared offline. (The count should be reset when an RPC to the remote
site succeeds.)
Once a site is taken offline, no RPCs are sent to it until it is brought
online again (manually, by a sysadmin) and the state transfer has
completed.
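To make the counting rule concrete, here is a minimal sketch in Java. It is not the Infinispan implementation; the class SiteStatus, its parameters afterFailures and minWaitMillis, and all method names are hypothetical and only illustrate "N SITE-UNREACHABLE events combined with a timeout, reset on success, back online only manually".

```java
/** Hypothetical sketch: tracks consecutive failures for one backup site. */
final class SiteStatus {
    private final int afterFailures;   // assumed setting: failures needed before going offline
    private final long minWaitMillis;  // assumed setting: minimum time since the first failure
    private int failureCount;
    private long firstFailureMillis;
    private boolean offline;

    SiteStatus(int afterFailures, long minWaitMillis) {
        this.afterFailures = afterFailures;
        this.minWaitMillis = minWaitMillis;
    }

    /** A successful RPC to the remote site resets the failure count. */
    synchronized void onSuccess() {
        failureCount = 0;
    }

    /** Called for each SITE-UNREACHABLE event; nowMillis is the caller's clock. */
    synchronized void onUnreachable(long nowMillis) {
        if (offline) {
            return; // already offline, nothing more to count
        }
        if (failureCount == 0) {
            firstFailureMillis = nowMillis; // start of the current failure streak
        }
        failureCount++;
        boolean enoughFailures = failureCount >= afterFailures;
        boolean enoughTime = nowMillis - firstFailureMillis >= minWaitMillis;
        if (enoughFailures && enoughTime) {
            offline = true;
        }
    }

    synchronized boolean isOffline() {
        return offline;
    }

    /** Manual operation by a sysadmin, to be invoked once state transfer has completed. */
    synchronized void bringOnline() {
        offline = false;
        failureCount = 0;
    }
}
```

Passing the timestamp in explicitly keeps the sketch independent of any particular clock; a real implementation would presumably read the time itself.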
Example: lon={A,B,C}, sfo={X,Y,Z}.
- We're in London (lon), sfo acts as the backup site to lon
- An RPC in lon includes A,B and SiteMaster(sfo) as targets
- Before the RPC hits X, X crashes
- JGroups retries X a few times (with a timeout smaller than the "timeout"
configured in <backup site=.../>)
- Y takes over
- JGroups re-routes the RPC to Y
- The caller completes the RPC successfully
- Now connectivity to sfo goes down
- A caller in lon invokes an RPC on A,C and SiteMaster(sfo)
- The call fails after 16s
- Another RPC fails after 16s
- After N failed RPCs, Infinispan in lon marks sfo as down (offline)
- The next RPC has B and C as targets, but not SiteMaster(sfo) anymore,
until sfo is brought online (manually)
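The last step of the example (dropping SiteMaster(sfo) from the target list once sfo is offline) could look roughly like the sketch below. The class RpcTargets and its select method are hypothetical, and sites and members are plain strings here rather than real JGroups addresses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: builds the target list for one RPC. */
final class RpcTargets {
    /**
     * Returns the local targets plus the site master of every backup site
     * that is not currently marked offline.
     */
    static List<String> select(List<String> localTargets,
                               List<String> backupSites,
                               Set<String> offlineSites) {
        List<String> targets = new ArrayList<>(localTargets);
        for (String site : backupSites) {
            if (!offlineSites.contains(site)) {
                targets.add("SiteMaster(" + site + ")");
            }
        }
        return targets;
    }
}
```

With sfo online, select([B, C], [sfo], {}) yields B, C and SiteMaster(sfo); with sfo in the offline set, it yields only B and C.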
On 9/17/12 3:21 AM, Erik Salter wrote:
Hi all,
For the X-Site pull request, Bela, Mircea and I had a design review. One of
the items that came up was the ability to mark a site as being “down” –
where a site has been unreachable for a period of time. This mostly applies
to the synchronous replication case where the backup failure policy has been
configured as “FAIL”, i.e:
<namedCache name="importantCache">
  <sites>
    <backups>
      <backup site="NYC" strategy="SYNC"
              backupFailurePolicy="FAIL" timeout="160000"/>
    </backups>
  </sites>
</namedCache>
The current implementation would be to fail all requests until an SA realizes
the site is offline and marks it through a JMX operation (provided in this
release?). Since I cannot afford a 100% failure rate until somebody gets
called, I think we need to take it a step further and add an element to mark
a site as offline after a period of time. (Note, though, a site can only
be brought back online manually.)
Mircea talked about adding an element in the configuration for a custom
callback implementation. However, I think this is useful enough -- not only
for me -- but for other ISPN/JDG users as well. (Not to mention we can't
add configuration for callbacks.)
--
Bela Ban, JGroups lead (http://www.jgroups.org)