[infinispan-dev] X-Site: Site Unreachable vs. Site Down

Erik Salter an1310 at hotmail.com
Tue Sep 18 12:02:37 EDT 2012


FYI:  https://issues.jboss.org/browse/ISPN-2319

-----Original Message-----
From: infinispan-dev-bounces at lists.jboss.org
[mailto:infinispan-dev-bounces at lists.jboss.org] On Behalf Of Bela Ban
Sent: Monday, September 17, 2012 2:42 AM
To: infinispan-dev at lists.jboss.org
Subject: Re: [infinispan-dev] X-Site: Site Unreachable vs. Site Down

I agree that there should be a configuration option that determines how many
SITE-UNREACHABLE events (combined with a timeout) it takes before a site is
declared offline. (The count should be reset whenever an RPC to the remote
site succeeds.)

Once a site is taken offline, no RPCs are sent to it until it is brought
online again (manually, by a sysadmin) and the state transfer has
completed.
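
A minimal Java sketch of that bookkeeping, to make the rule concrete. It is
not Infinispan code: the class name, the method names and the reading of
"combined with a timeout" as "the failures must also span a minimum time
window" are all assumptions made for illustration. The clock value is passed
in so the logic is easy to exercise.

```java
/**
 * Hypothetical sketch (not Infinispan API): decides when a backup site
 * should be treated as offline, based on consecutive failed RPCs.
 */
final class SiteStatusTracker {
    private final int maxFailures;      // assumed: N failed RPCs before going offline
    private final long minWindowNanos;  // assumed: failures must span at least this long
    private int failures;
    private long firstFailureNanos;
    private boolean offline;

    SiteStatusTracker(int maxFailures, long minWindowMillis) {
        this.maxFailures = maxFailures;
        this.minWindowNanos = minWindowMillis * 1_000_000L;
    }

    /** A successful RPC to the remote site resets the failure count. */
    synchronized void rpcSucceeded() {
        failures = 0;
    }

    /** Records a SITE-UNREACHABLE event observed at the given clock value. */
    synchronized void rpcFailed(long nowNanos) {
        if (offline) {
            return; // already offline, nothing more to track
        }
        if (failures == 0) {
            firstFailureNanos = nowNanos; // start of the current failure streak
        }
        failures++;
        if (failures >= maxFailures && nowNanos - firstFailureNanos >= minWindowNanos) {
            offline = true;
        }
    }

    synchronized boolean isOffline() {
        return offline;
    }

    /** Manual step by a sysadmin; the site never comes back on its own. */
    synchronized void bringOnline() {
        offline = false;
        failures = 0;
    }
}
```

A caller would check isOffline() before adding the remote site to an RPC and
would call bringOnline() only once state transfer has completed.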

Example: lon={A,B,C}, sfo={X,Y,Z}.
- We're in London (lon), sfo acts as the backup site to lon
- An RPC in lon includes A,B and SiteMaster(sfo) as targets
- Before the RPC hits X, X crashes
- JGroups retries X (a few times, timeout < "timeout" configured in <backup
site=.../>)
- Y takes over
- JGroups re-routes the RPC to Y
- The caller completes the RPC successfully

- Now connectivity to sfo goes down
- A caller in lon invokes an RPC on A,C and SiteMaster(sfo)
- The call fails after 16s
- Another RPC fails after 16s
- After N failed RPCs, Infinispan in lon marks sfo as down (offline)
- The next RPC has B and C as targets, but not SiteMaster(sfo) anymore,
until sfo is brought online (manually)
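
The last two steps of the example can be sketched in Java as follows. This is
only an illustration of the target selection, not Infinispan or JGroups code:
BackupTargetSelector and its methods are hypothetical names, and targets are
plain strings to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch (not Infinispan API): picks the targets of an RPC. */
final class BackupTargetSelector {
    private final Set<String> offlineSites = new HashSet<>();

    /** Called once N RPCs to the site have failed. */
    void takeOffline(String site) {
        offlineSites.add(site);
    }

    /** Called manually by a sysadmin. */
    void bringOnline(String site) {
        offlineSites.remove(site);
    }

    /** Local targets, plus SiteMaster(site) for every backup site still online. */
    List<String> targets(List<String> localTargets, List<String> backupSites) {
        List<String> result = new ArrayList<>(localTargets);
        for (String site : backupSites) {
            if (!offlineSites.contains(site)) {
                result.add("SiteMaster(" + site + ")");
            }
        }
        return result;
    }
}
```

With sfo online, an RPC from lon to B and C also goes to SiteMaster(sfo);
after takeOffline("sfo") the same call yields only B and C, until
bringOnline("sfo") is invoked.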


On 9/17/12 3:21 AM, Erik Salter wrote:
> Hi all,
>
> For the X-Site pull request, Bela, Mircea and I had a design review.  One of
> the items that came up was the ability to mark a site as being "down" -
> where a site has been unreachable for a period of time.  This mostly applies
> to the synchronous replication case where the backup failure policy has been
> configured as "FAIL", i.e.:
>
> <namedCache name="importantCache">
>   <sites>
>     <backups>
>       <backup site="NYC" strategy="SYNC" backupFailurePolicy="FAIL"
>               timeout="160000"/>
>     </backups>
>   </sites>
> </namedCache>
>
> The current implementation would be to fail all requests until an SA
> realizes the site is offline and marks it through a JMX operation (provided
> in this release?).  Since I cannot afford a 100% failure rate until somebody
> gets called, I think we need to take it a step further and add an element to
> mark a site as offline after a period of time.  (Note, though, a site can
> only be brought back online manually.)
>
> Mircea talked about adding an element in the configuration for a custom
> callback implementation.  However, I think this is useful enough -- not only
> for me -- but for other ISPN/JDG users as well.  (Not to mention we can't
> add configuration for callbacks.)


-- 
Bela Ban, JGroups lead (http://www.jgroups.org)