[infinispan-dev] X-Site: Site Unreachable vs. Site Down
Mircea Markus
mircea.markus at jboss.com
Tue Sep 18 12:08:06 EDT 2012
Thanks.
On 18 Sep 2012, at 19:02, Erik Salter wrote:
> FYI: https://issues.jboss.org/browse/ISPN-2319
>
> -----Original Message-----
> From: infinispan-dev-bounces at lists.jboss.org
> [mailto:infinispan-dev-bounces at lists.jboss.org] On Behalf Of Bela Ban
> Sent: Monday, September 17, 2012 2:42 AM
> To: infinispan-dev at lists.jboss.org
> Subject: Re: [infinispan-dev] X-Site: Site Unreachable vs. Site Down
>
> I agree that there should be a configuration which determines after how many
> SITE-UNREACHABLE events (combined with a timeout), a site is declared as
> offline. (The count should be reset when there is a successful RPC to the
> remote site).
>
> Once a site is taken offline, then no RPCs would be sent to it, until it is
> taken online again (manually, by a sysadmin), and the state transfer has
> completed.
>
> Example: lon={A,B,C}, sfo={X,Y,Z}.
> - We're in London (lon), sfo acts as the backup site to lon
> - An RPC in lon includes A,B and SiteMaster(sfo) as targets
> - Before the RPC hits X, X crashes
> - JGroups retries X (a few times, timeout < "timeout" configured in
>   <backup site=.../>)
> - Y takes over
> - JGroups re-routes the RPC to Y
> - The caller completes the RPC successfully
>
> - Now connectivity to sfo goes down
> - A caller in lon invokes an RPC on A,C and SiteMaster(sfo)
> - The call fails after 16s
> - Another RPC fails after 16s
> - After N failed RPCs, Infinispan in lon marks sfo as down (offline)
> - The next RPC has B and C as targets, but not SiteMaster(sfo) anymore,
> until sfo is brought online (manually)
>
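The take-offline rule described above (N failed RPCs combined with a timeout, the count reset by any successful RPC, and a manual step to bring the site back) can be sketched as below. This is only an illustration under stated assumptions: the class name BackupSiteStatus, its methods, and the two thresholds are hypothetical and are not part of the Infinispan or JGroups API. The caller passes in timestamps so the logic stays deterministic.

```java
/**
 * Hypothetical helper (not Infinispan API): tracks failed RPCs to one
 * backup site and decides when the site should be treated as offline.
 */
final class BackupSiteStatus {
    private final int maxFailures;     // assumed setting: N consecutive failed RPCs
    private final long minWaitMillis;  // assumed setting: minimum time since the first failure
    private int failureCount;
    private long firstFailureAt = -1L;
    private boolean offline;

    BackupSiteStatus(int maxFailures, long minWaitMillis) {
        this.maxFailures = maxFailures;
        this.minWaitMillis = minWaitMillis;
    }

    /** A successful RPC to the remote site resets the failure count. */
    synchronized void recordSuccess() {
        failureCount = 0;
        firstFailureAt = -1L;
    }

    /** Called when an RPC to the site fails (e.g. a site-unreachable event). */
    synchronized void recordFailure(long nowMillis) {
        if (offline) {
            return; // already offline; nothing more to count
        }
        if (failureCount == 0) {
            firstFailureAt = nowMillis;
        }
        failureCount++;
        // Both conditions must hold: enough failures AND enough elapsed time.
        if (failureCount >= maxFailures
                && nowMillis - firstFailureAt >= minWaitMillis) {
            offline = true;
        }
    }

    /** While this returns true, the site is left out of the RPC targets. */
    synchronized boolean isOffline() {
        return offline;
    }

    /** Manual step performed by a sysadmin; the site never comes back on its own. */
    synchronized void bringOnline() {
        offline = false;
        failureCount = 0;
        firstFailureAt = -1L;
    }
}
```

In the lon/sfo example, the caller in lon would check isOffline() for sfo before adding SiteMaster(sfo) to the RPC targets, and call recordFailure or recordSuccess once the RPC to sfo completes.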
>
> On 9/17/12 3:21 AM, Erik Salter wrote:
>> Hi all,
>>
>> For the X-Site pull request, Bela, Mircea and I had a design review. One of
>> the items that came up was the ability to mark a site as being "down",
>> where a site has been unreachable for a period of time. This mostly applies
>> to the synchronous replication case where the backup failure policy has been
>> configured as "FAIL", i.e.:
>>
>> <namedCache name="importantCache">
>>   <sites>
>>     <backups>
>>       <backup site="NYC" strategy="SYNC" backupFailurePolicy="FAIL" timeout="160000"/>
>>     </backups>
>>   </sites>
>> </namedCache>
>>
>> The current implementation would fail all requests until a sysadmin realizes
>> the site is offline and marks it as such through a JMX operation (provided in
>> this release?). Since I cannot afford a 100% failure rate until somebody gets
>> called, I think we need to take it a step further and add an element to mark
>> a site as offline after a period of time. (Note, though, that a site can only
>> be brought back online manually.)
>>
>> Mircea talked about adding an element in the configuration for a custom
>> callback implementation. However, I think this is useful enough -- not only
>> for me -- but for other ISPN/JDG users as well. (Not to mention we can't
>> add configuration for callbacks.)
>
>
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)
> _______________________________________________
> infinispan-dev mailing list
> infinispan-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
Cheers,
--
Mircea Markus
Infinispan lead (www.infinispan.org)