[Hawkular-dev] Availability revisited

Wed Jul 15 21:45:42 EDT 2015

> On Jul 13, 2015, at 8:36 AM, Heiko W.Rupp <hrupp at redhat.com> wrote:
> 
> Hey,
> 
> We did talk about Availability and computed state in the past.
> 
> Now triggered by https://issues.jboss.org/browse/HAWKULAR-401
> and also https://issues.jboss.org/browse/HAWKULAR-407
> we need to revisit this and finally start including it in the code base.
> 
> In -407 we have the issue that the server currently cannot detect that
> a feed is down. For the WF-agent, this is likely to be solved with the
> new feed-comm system, which can see disconnect messages [1] and act
> accordingly

Are there any docs, notes, etc. on the feed-comm system? I am not familiar with this.

> (i.e. the server side adds a synthetic "down" event into the
> availability data stream).
> Of course other feeds can also use that mechanism.
> 
> A generic feed, though, that is sending availability records from time
> to time is most probably not sending a "down" event when it is going
> down or crashing. So we need a periodic job looking for feeds that have
> not talked to us for a longer period of time.
> This also implies that at least the in-memory state for feed
> availability needs to be updated with a last-seen record, as Micke
> described some time ago (that last-seen record should probably be
> flushed to C* from time to time).
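A minimal sketch of the periodic check described above, assuming a hypothetical in-memory `FeedLivenessTracker` on the server (not existing Hawkular code). Every incoming availability record refreshes the feed's last-seen time; a scheduled job (for instance one driven by `ScheduledExecutorService.scheduleAtFixedRate`) calls `findStaleFeeds`, and the caller injects a synthetic "down" event for each feed returned. The timeout is an assumed parameter: with the once-per-minute heartbeat proposed for generic feeds, something like three minutes would tolerate a couple of missed heartbeats.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory last-seen tracker for feeds (illustrative, not Hawkular code). */
class FeedLivenessTracker {
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();
    // Feeds already reported as down, so the synthetic event is emitted once per outage.
    private final Map<String, Boolean> markedDown = new ConcurrentHashMap<>();
    private final Duration timeout;

    FeedLivenessTracker(Duration timeout) {
        this.timeout = timeout;
    }

    /** Called for every availability record (heartbeat) received from a feed. */
    void recordSeen(String feedId, Instant when) {
        lastSeen.put(feedId, when);
        markedDown.remove(feedId);
    }

    /**
     * Called by the periodic job. Returns feeds silent for longer than the timeout
     * that were not already reported; the caller adds a synthetic "down" event for each.
     */
    List<String> findStaleFeeds(Instant now) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastSeen.entrySet()) {
            boolean expired = Duration.between(entry.getValue(), now).compareTo(timeout) > 0;
            if (expired && markedDown.putIfAbsent(entry.getKey(), Boolean.TRUE) == null) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }

    /** Copy of the last-seen records that another job could flush to C* from time to time. */
    Map<String, Instant> snapshot() {
        return Collections.unmodifiableMap(new HashMap<>(lastSeen));
    }
}
```

Keeping the map in memory makes the periodic scan cheap; only the snapshot needs to reach C*, and losing a few seconds of last-seen data on a server restart is tolerable.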

Why do we need to store the last seen availability in memory?

> We would also need to require "generic" feeds to send heartbeats by
> reporting their availability at least once per minute.
> 
> Now for -401, which is trickier. If e.g. a WildFly server is in state
> 'reload-needed', it is technically up, but its configuration has
> pending changes.
> 
> So we would need "up" availability, and then another (sub-)state
> indicating the pending change.
> And then we may have a state like "maintenance mode", where a resource
> may be up or down without impacting e.g. alerting or any SLA
> computation.
> 
> From those raw input variables we would then compute the resource state
> http://lists.jboss.org/pipermail/hawkular-dev/2015-March/000413.html
> 
> While this could be up/down/unknown (plus mixed for groups), it will
> also mean that we need to convey the other information to the user. If
> e.g. a resource is in maintenance mode, the user should be informed why
> alerts on the resource do not fire.
> Likewise for reload-needed: the user needs to know why the recent
> changes he or she made did not change the way the app server works.
> Treating reload-needed as just "down" is wrong, as the server continues
> to work and serve requests.
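To make the "computed state" idea concrete, here is a sketch assuming hypothetical `SubState`, `ResourceState` and `ResourceStateCalculator` types that are not part of Hawkular. Reload-needed leaves the headline availability untouched and only adds an explanation for the user; maintenance mode is an orthogonal flag that suppresses alerting; and a group is reported as mixed whenever its members disagree (an assumed aggregation rule).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Raw availability as stored today. */
enum Availability { UP, DOWN, UNKNOWN }

/** Group-level result; MIXED exists only for groups. */
enum GroupAvailability { UP, DOWN, UNKNOWN, MIXED }

/** Hypothetical sub-state qualifying an UP resource, e.g. a WildFly in reload-needed. */
enum SubState { NONE, RELOAD_NEEDED }

/** Computed state: headline availability plus what the user should be told. */
final class ResourceState {
    final Availability availability;
    final boolean alertingSuppressed;
    final List<String> notes;

    ResourceState(Availability availability, boolean alertingSuppressed, List<String> notes) {
        this.availability = availability;
        this.alertingSuppressed = alertingSuppressed;
        this.notes = Collections.unmodifiableList(new ArrayList<>(notes));
    }
}

final class ResourceStateCalculator {
    private ResourceStateCalculator() {}

    static ResourceState compute(Availability raw, SubState subState, boolean maintenance) {
        List<String> notes = new ArrayList<>();
        if (subState == SubState.RELOAD_NEEDED) {
            // The server still serves requests, so the raw availability is kept (not forced
            // to DOWN); the note explains why recent config changes have no effect yet.
            notes.add("Configuration changes pending: reload needed to apply them");
        }
        if (maintenance) {
            // Orthogonal flag: availability is still shown, but alerts and SLA are skipped.
            notes.add("Maintenance mode: alerts and SLA computation suspended");
        }
        return new ResourceState(raw, maintenance, notes);
    }

    /** Assumed rule: a group is UP/DOWN/UNKNOWN only if all members agree, else MIXED. */
    static GroupAvailability aggregate(List<Availability> members) {
        if (members.isEmpty()) {
            return GroupAvailability.UNKNOWN;
        }
        Availability first = members.get(0);
        for (Availability member : members) {
            if (member != first) {
                return GroupAvailability.MIXED;
            }
        }
        return GroupAvailability.valueOf(first.name());
    }
}
```

The `notes` list is what a UI could show next to the availability icon, which addresses the "why don't my alerts fire" and "why didn't my change take effect" questions without overloading up/down.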

If you are talking about correlation, then I am +1. When I think about RHQ, the user could easily see an availability state change, but would have to go hunting around to see what precipitated it.

> 
> The above of course has an impact on storage. Right now we only store
> up/down/unknown (as text) for availability, but we would certainly need
> to also store the sub-state.
> Maintenance mode is orthogonal to all of the above and should probably
> be a "flag" on a graph of resources.
> 
>   Heiko
> 
> [1] @OnClose is called with a code of 1006 on client crash/abnormal
> termination. See http://tools.ietf.org/html/rfc6455#section-7.4
> _______________________________________________
> hawkular-dev mailing list
> hawkular-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hawkular-dev
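A sketch of the server-side disconnect handling from footnote [1], assuming hypothetical `FeedDisconnectHandler` and `AvailabilityEvent` classes. To stay dependency-free it takes the raw close code; in a `javax.websocket` `@OnClose` method that value would come from `closeReason.getCloseCode().getCode()`. Per RFC 6455, 1006 is never sent in a Close frame; the endpoint reports it locally when the connection dropped without one, which is why it signals a crash. Mapping every disconnect to "down" is an assumption here; the reason text keeps the crash and clean-shutdown cases apart.

```java
/** Hypothetical synthetic availability record the server injects for a feed. */
final class AvailabilityEvent {
    final String feedId;
    final String availability; // stored as text, as today: "up", "down" or "unknown"
    final long timestampMillis;
    final String reason;

    AvailabilityEvent(String feedId, String availability, long timestampMillis, String reason) {
        this.feedId = feedId;
        this.availability = availability;
        this.timestampMillis = timestampMillis;
        this.reason = reason;
    }
}

final class FeedDisconnectHandler {
    // Close codes defined in RFC 6455, section 7.4.1.
    static final int NORMAL_CLOSURE = 1000;
    static final int GOING_AWAY = 1001;
    static final int ABNORMAL_CLOSURE = 1006; // set locally, never sent in a Close frame

    private FeedDisconnectHandler() {}

    /** Builds the synthetic "down" event for a feed whose connection just closed. */
    static AvailabilityEvent onFeedClosed(String feedId, int closeCode, long nowMillis) {
        String reason;
        switch (closeCode) {
            case NORMAL_CLOSURE:
            case GOING_AWAY:
                reason = "feed disconnected cleanly (close code " + closeCode + ")";
                break;
            case ABNORMAL_CLOSURE:
                reason = "connection lost without a Close frame; feed likely crashed";
                break;
            default:
                reason = "feed disconnected with close code " + closeCode;
                break;
        }
        // In every case the feed stopped reporting, so record it as down.
        return new AvailabilityEvent(feedId, "down", nowMillis, reason);
    }
}
```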