[infinispan-issues] [JBoss JIRA] (ISPN-11176) XSite Max Idle

Tue Aug 11 04:52:00 EDT 2020

    [ https://issues.redhat.com/browse/ISPN-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379906#comment-14379906 ] 

Dan Berindei commented on ISPN-11176:
-------------------------------------

{quote}
I thought I remember hearing something about all nodes can be site masters now or something? I really don't know much about xsite/IRAC details though.
{quote}

In theory, yes. In practice, I think we only set one.

{quote}
But if a single node failure can cause a site to become offline, that would not be good.
{quote}

It's a bit trickier... a site will be taken offline based on the {{TakeOfflineConfiguration}}, which has two attributes: {{after-failures}} and {{min-wait}}. By default they are both 0, so only the administrator can take the site offline manually. If you set both, then the site will be taken offline if {{after-failures}} consecutive backup operations fail (I assume we'll want to ignore {{check last access}} and {{touch}} commands here) and at least {{min-wait}} millis have passed since the first of those consecutive failures. This means you can have sites that are absent from the bridge cluster view but not yet taken offline, sites that are offline and yet are in the bridge cluster view, and sites that are offline for some caches and online for other caches. You can also have operations that still wait for backup responses after the remote site was taken offline, and RPCs that time out and fail without the site being taken offline.
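
To make the rule above concrete, here is a minimal sketch of the take-offline decision for one backup site and one cache. It is not Infinispan's code: the class {{TakeOfflineTracker}} and its methods are hypothetical names, and it assumes automatic take-offline is only active when both attributes are positive, since that is the only case described above.

```java
/**
 * Hypothetical sketch of the take-offline rule for a single backup site and cache.
 * Not Infinispan's implementation; names and structure are assumptions.
 */
final class TakeOfflineTracker {
    private final int afterFailures;   // 0 means: never take offline automatically
    private final long minWaitMillis;  // 0 means: never take offline automatically
    private int consecutiveFailures;
    private long firstFailureMillis;
    private boolean offline;

    TakeOfflineTracker(int afterFailures, long minWaitMillis) {
        this.afterFailures = afterFailures;
        this.minWaitMillis = minWaitMillis;
    }

    /** A backup operation succeeded: the run of consecutive failures is broken. */
    void onBackupSuccess() {
        consecutiveFailures = 0;
    }

    /** A backup operation failed at the given time (milliseconds). */
    void onBackupFailure(long nowMillis) {
        if (consecutiveFailures == 0) {
            firstFailureMillis = nowMillis; // start of a new run of failures
        }
        consecutiveFailures++;
        // Assumption: both attributes must be set for the automatic rule to apply.
        boolean automatic = afterFailures > 0 && minWaitMillis > 0;
        if (automatic
                && consecutiveFailures >= afterFailures
                && nowMillis - firstFailureMillis >= minWaitMillis) {
            offline = true;
        }
    }

    /** Manual action by the administrator. */
    void takeOfflineManually() {
        offline = true;
    }

    /** Manual action by the administrator; also clears the failure run. */
    void bringOnline() {
        offline = false;
        consecutiveFailures = 0;
    }

    boolean isOffline() {
        return offline;
    }
}
```

With {{after-failures=3}} and {{min-wait=1000}}, three failures within 200 ms leave the site online in this sketch; only a later failure, at least 1000 ms after the first one, takes it offline. Nothing here looks at the bridge cluster view, which is why "offline" and "unreachable" can disagree.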

I'd love it if we could improve the take-offline story, so it's more similar to how failure detection works in a JGroups cluster (and global instead of per-cache, and more friendly for active-active setups), but for now we need to be careful with terminology: a site being taken offline is different from a site becoming unreachable (because it doesn't have any node in the bridge cluster view), and "an entire site is lost" could mean either.

> XSite Max Idle
> --------------
>
>                 Key: ISPN-11176
>                 URL: https://issues.redhat.com/browse/ISPN-11176
>             Project: Infinispan
>          Issue Type: Enhancement
>          Components: Cross-Site Replication, Expiration
>            Reporter: Will Burns
>            Assignee: Will Burns
>            Priority: Major
>             Fix For: 12.0.0.Final
>
>
> Max idle expiration currently doesn't work with xsite. That is, if an entry was written and replicated to both sites, and one site reads the value while the other never does, then a later read from the site that never read it will find the entry expired (assuming the max idle time has elapsed).
> There are a few ways we can do this.
> 1. Keep access times local to every site. When a site finds an entry is expired, it asks the other site(s) if they have a more recent access. If a site is known to have gone down we should touch all entries, since they may not have updated access times. Requires very little additional xsite communication.
> 2. Batch touch commands and only send them every so often. Has a window of loss, but it should be small. Requires more xsite communication. Wouldn't work for really low max idle times, as an entry could expire before the touch command is replicated.
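
Option 1 in the description above could be sketched roughly as follows. This is only an illustration, not the design that was implemented: {{XSiteMaxIdleCheck}} is a hypothetical name, each remote site is modelled as a function returning its last access time for a key (or -1 if it has none), and the sketch assumes timestamps are non-negative and comparable across sites.

```java
import java.util.List;
import java.util.function.ToLongFunction;

/** Hypothetical sketch of option 1: ask other sites only when the entry looks expired locally. */
final class XSiteMaxIdleCheck {
    /**
     * @param remoteSites one function per remote site, returning that site's last
     *                    access time in millis for the key, or -1 if it has none
     *                    (an assumed interface, not an Infinispan API)
     * @return true if no site has accessed the entry within maxIdleMillis
     */
    static boolean isExpired(String key, long localLastAccessMillis, long maxIdleMillis,
                             long nowMillis, List<ToLongFunction<String>> remoteSites) {
        if (nowMillis - localLastAccessMillis < maxIdleMillis) {
            return false; // recently used locally: no xsite communication needed
        }
        for (ToLongFunction<String> site : remoteSites) {
            long remoteLastAccess = site.applyAsLong(key);
            if (remoteLastAccess >= 0 && nowMillis - remoteLastAccess < maxIdleMillis) {
                return false; // another site accessed the entry more recently
            }
        }
        return true;
    }
}
```

The remote call only happens on the locally-expired path, which is why this option needs very little additional xsite communication. It does not cover the case where a site has gone down and its access times are lost.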

--
This message was sent by Atlassian Jira
(v7.13.8#713008)