[
https://issues.redhat.com/browse/ISPN-11176?page=com.atlassian.jira.plugi...
]
Dan Berindei commented on ISPN-11176:
-------------------------------------
{quote}
I thought I remember hearing something about all nodes can be site masters now or
something? I really don't know much about xsite/IRAC details though.
{quote}
In theory, yes. In practice, I think we only set one.
{quote}
But if a single node failure can cause a site to become offline, that would not be good.
{quote}
It's a bit trickier... a site will be taken offline based on the
{{TakeOfflineConfiguration}}, which has two attributes: {{after-failures}} and
{{min-wait}}. By default they are both 0, so only the administrator can take the site
offline manually. If you set both, then the site will be taken offline if
{{after-failures}} consecutive backup operations fail (I assume we'll want to ignore
{{check last access}} and {{touch}} commands here) and at least {{min-wait}} millis passed
since the first of those consecutive failures. This means you can have sites that are
present in the bridge cluster view but are not yet taken offline, sites that are offline
and yet they're in the bridge cluster view, and sites that are both offline for some
caches and online for other caches. You can also have operations that still wait for backup
responses after the remote site was taken offline, and RPCs that time out and fail without
the site being taken offline.
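As a rough illustration of the rule described above, here is a minimal Java sketch. It is not Infinispan's actual implementation: the class {{TakeOfflineTracker}} and all of its methods are hypothetical, and it only models the two cases mentioned here (both attributes 0, or both set). Treating any other combination as manual-only is an assumption made to keep the sketch short.

```java
/**
 * Hypothetical per-cache, per-backup-site failure tracker.
 * Illustrative only; names and structure are assumptions, not Infinispan code.
 */
final class TakeOfflineTracker {
    private final int afterFailures;   // mirrors the after-failures attribute
    private final long minWaitMillis;  // mirrors the min-wait attribute
    private int consecutiveFailures;
    private long firstFailureMillis;
    private boolean offline;

    TakeOfflineTracker(int afterFailures, long minWaitMillis) {
        this.afterFailures = afterFailures;
        this.minWaitMillis = minWaitMillis;
    }

    /** A successful backup operation ends the current streak of failures. */
    void onSuccess() {
        consecutiveFailures = 0;
    }

    /** Records a failed backup operation observed at the given time. */
    void onFailure(long nowMillis) {
        if (consecutiveFailures == 0) {
            firstFailureMillis = nowMillis; // start of a new streak
        }
        consecutiveFailures++;
        // Assumption: unless both attributes are set, only an administrator
        // can take the site offline.
        if (afterFailures <= 0 || minWaitMillis <= 0) {
            return;
        }
        boolean enoughFailures = consecutiveFailures >= afterFailures;
        boolean waitedLongEnough = nowMillis - firstFailureMillis >= minWaitMillis;
        if (enoughFailures && waitedLongEnough) {
            offline = true;
        }
    }

    /** Manual administrator action. */
    void takeOfflineManually() {
        offline = true;
    }

    /** Brings the site back online and clears the failure streak. */
    void bringOnline() {
        offline = false;
        consecutiveFailures = 0;
    }

    boolean isOffline() {
        return offline;
    }
}
```

Note that nothing in this sketch looks at the bridge cluster view, which is why "offline" and "unreachable" can disagree in the ways listed above.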
I'd love it if we could improve the take-offline story, so it's more similar to
how failure detection works in a JGroups cluster (and global instead of per-cache, and
more friendly for active-active setups), but for now we need to be careful with
terminology: a site being taken offline is different from a site becoming unreachable
(because it doesn't have any node in the bridge cluster view), and "an entire
site is lost" could mean either.
XSite Max Idle
--------------
Key: ISPN-11176
URL: https://issues.redhat.com/browse/ISPN-11176
Project: Infinispan
Issue Type: Enhancement
Components: Cross-Site Replication, Expiration
Reporter: Will Burns
Assignee: Will Burns
Priority: Major
Fix For: 12.0.0.Final
Max idle expiration currently doesn't work with xsite. For example, suppose an entry is
written and replicated to both sites, but only one of the sites ever reads it. If a client
later needs to read the value from the site that never read it, the entry will already be
expired there (assuming the max idle time has elapsed).
There are a few ways we can do this.
1. Keep access times local to every site. When a site finds an entry is expired, it asks
the other site(s) whether they have a more recent access. If a site is known to have gone
down we should touch all entries, since they may not have updated access times. Requires
very little additional xsite communication.
2. Batch touch commands and only send them every so often. Has a window of loss, but it
should be small. Requires more xsite communication. Wouldn't work for really low max idle
times, as an entry could expire before the touch command is replicated.
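To make option 1 more concrete, here is a minimal Java sketch of the check. Everything in it is hypothetical ({{SiteAccessTimes}}, {{MaxIdleCheck}}, and in-memory maps standing in for real cross-site calls); it is not how Infinispan implements or would implement this. Recording the remote access time locally after a successful check is also an assumption of the sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical holder for one site's local last-access times. */
final class SiteAccessTimes {
    static final long NEVER = -1L; // sentinel: key was never accessed on this site

    private final Map<String, Long> lastAccess = new ConcurrentHashMap<>();

    /** Records an access, keeping the most recent timestamp seen. */
    void touch(String key, long timeMillis) {
        lastAccess.merge(key, timeMillis, Long::max);
    }

    long lastAccess(String key) {
        return lastAccess.getOrDefault(key, NEVER);
    }
}

/** Hypothetical max-idle check following option 1. */
final class MaxIdleCheck {
    private static boolean fresh(long lastAccess, long nowMillis, long maxIdleMillis) {
        return lastAccess != SiteAccessTimes.NEVER && nowMillis - lastAccess < maxIdleMillis;
    }

    static boolean isExpired(String key, long nowMillis, long maxIdleMillis,
                             SiteAccessTimes local, List<SiteAccessTimes> remotes) {
        // Common case: the entry is fresh locally, so no xsite communication is needed.
        if (fresh(local.lastAccess(key), nowMillis, maxIdleMillis)) {
            return false;
        }
        // Locally expired: ask the other site(s) for a more recent access.
        // In a real system each lookup would be a cross-site RPC.
        long latest = SiteAccessTimes.NEVER;
        for (SiteAccessTimes remote : remotes) {
            latest = Math.max(latest, remote.lastAccess(key));
        }
        if (fresh(latest, nowMillis, maxIdleMillis)) {
            local.touch(key, latest); // assumption: remember the remote access locally
            return false;
        }
        return true;
    }
}
```

The remote lookup only happens once an entry looks expired locally, which is where the "very little additional xsite communication" in option 1 comes from.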