[Hawkular-dev] Availability revisited

Mon Jul 13 08:36:56 EDT 2015

Hey,

We talked about availability and computed state in the past.

Now, triggered by https://issues.jboss.org/browse/HAWKULAR-401
and also https://issues.jboss.org/browse/HAWKULAR-407,
we need to revisit this and finally start including it in the code base.

In -407 we have the issue that the server currently cannot detect that
a feed is down. For the WF-agent, this is likely to be solved with the new
feed-comm system, which can see disconnect messages [1] and act accordingly
(i.e. the server side adds a synthetic "down" event to the availability
data stream).
Of course other feeds can also use that mechanism.
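
To make the idea concrete, here is a minimal sketch of the server-side
reaction to an abnormal close. All class and method names are made up for
illustration and are not taken from the Hawkular code base. In a real
JSR-356 endpoint the close code would come from
CloseReason.getCloseCode().getCode() inside the @OnClose method. The sketch
also assumes that a feed which closes the connection cleanly has reported
its own state, so only code 1006 produces a synthetic record; that policy
is an assumption, not something we have decided.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names come from the Hawkular code base.
class FeedDisconnectHandler {

    enum Avail { UP, DOWN, UNKNOWN }

    // One availability record; 'synthetic' marks records the server created itself.
    static final class AvailRecord {
        final String feedId;
        final Avail avail;
        final Instant timestamp;
        final boolean synthetic;

        AvailRecord(String feedId, Avail avail, Instant timestamp, boolean synthetic) {
            this.feedId = feedId;
            this.avail = avail;
            this.timestamp = timestamp;
            this.synthetic = synthetic;
        }
    }

    // RFC 6455, section 7.4.1: 1006 signals that the connection was closed
    // abnormally, i.e. without a Close frame being received.
    static final int ABNORMAL_CLOSURE = 1006;

    // Stand-in for the availability data stream.
    private final List<AvailRecord> stream = new ArrayList<>();

    // Intended to be called from the endpoint's @OnClose method.
    // Assumption: only an abnormal close leads to a synthetic "down" record.
    void onFeedClosed(String feedId, int closeCode, Instant now) {
        if (closeCode == ABNORMAL_CLOSURE) {
            stream.add(new AvailRecord(feedId, Avail.DOWN, now, true));
        }
    }

    List<AvailRecord> stream() {
        return Collections.unmodifiableList(stream);
    }
}
```

Marking the record as synthetic keeps it distinguishable from a "down" that
the feed reported itself, which may matter later when the state is shown to
the user.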

A generic feed, though, that only sends availability records from time to time
will most probably not send a "down" event when it is going
down or crashing. So we need a periodic job looking for feeds
that have not talked to us for a longer period of time.
This also implies that at least the in-memory state for feed availability
needs to be updated with a last-seen record, as Micke described some time
ago (that last-seen record should probably be flushed to C* from time to
time).
We would also need to require "generic" feeds to send heartbeats by
reporting their availability at least once per minute.
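
A rough sketch of the last-seen bookkeeping and the periodic check could
look like the following. The class, the method names and the timeout are
hypothetical; the timeout would presumably be a small multiple of the
one-minute heartbeat, but the exact value is an assumption. Flushing the
last-seen records to C* is left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; class and method names are not from the Hawkular code base.
class FeedLastSeenTracker {

    // In-memory last-seen timestamp per feed id.
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();
    private final Duration timeout;

    FeedLastSeenTracker(Duration timeout) {
        this.timeout = timeout;
    }

    // Call for every availability record (heartbeat) a feed sends.
    void heartbeat(String feedId, Instant now) {
        lastSeen.put(feedId, now);
    }

    // Body of the periodic job: returns the feeds that have been silent for
    // longer than the timeout and forgets them, so each one is reported only
    // once until it talks to us again.
    List<String> collectStaleFeeds(Instant now) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Instant> e : lastSeen.entrySet()) {
            boolean expired = Duration.between(e.getValue(), now).compareTo(timeout) > 0;
            // Conditional remove: skip the feed if a heartbeat arrived in the meantime.
            if (expired && lastSeen.remove(e.getKey(), e.getValue())) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }
}
```

The job itself could be driven by a ScheduledExecutorService via
scheduleAtFixedRate, and for every feed returned the server would add the
same kind of synthetic "down" event as for a disconnect.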

Now for -401, which is trickier. If e.g. a WildFly is in state 'reload-needed',
it is technically up, but its configuration has pending changes.

So we would need "up" availability, and then another (sub-)state indicating
the pending change.
And then we may have a state like "maintenance mode", where a resource
may be up or down without impacting e.g. alerting or any SLA computation.

From those raw input variables we would then compute the resource state:
http://lists.jboss.org/pipermail/hawkular-dev/2015-March/000413.html

While this could be up/down/unknown/(mixed for groups), it also means
that we need to convey the other information to the user. If e.g. a resource
is in maintenance mode, the user should be informed why alerts on the
resource do not fire.
Likewise for reload-needed: the user needs to know why the recent changes
he or she made did not change the way the appserver works.
Treating reload-needed as just "down" is wrong, as the server continues to
work and serve requests.
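
As an illustration of how the raw inputs could be combined, here is a small
sketch. The enum values, the ComputedState type and the wording of the notes
are assumptions made for this example only; the real rules still need to be
agreed on.

```java
// Hypothetical sketch; names and rules are illustrative assumptions.
class ResourceStateCalculator {

    enum Avail { UP, DOWN, UNKNOWN }

    enum SubState { NONE, RELOAD_NEEDED }

    // Result of the computation: the availability plus what the user needs to know.
    static final class ComputedState {
        final Avail avail;
        final boolean alertingSuppressed;
        final String note;

        ComputedState(Avail avail, boolean alertingSuppressed, String note) {
            this.avail = avail;
            this.alertingSuppressed = alertingSuppressed;
            this.note = note;
        }
    }

    static ComputedState compute(Avail avail, SubState subState, boolean maintenance) {
        StringBuilder note = new StringBuilder();
        if (maintenance) {
            // Maintenance mode does not change up/down, it only mutes alerting.
            note.append("in maintenance mode, alerts are suppressed");
        }
        if (avail == Avail.UP && subState == SubState.RELOAD_NEEDED) {
            // reload-needed stays "up"; the sub-state is only passed on as a note.
            if (note.length() > 0) {
                note.append("; ");
            }
            note.append("reload needed, pending configuration changes are not active yet");
        }
        return new ComputedState(avail, maintenance, note.toString());
    }
}
```

With this, compute(Avail.UP, SubState.RELOAD_NEEDED, false) still yields an
availability of UP, and the note carries the explanation for the user.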

The above of course has an impact on storage. Right now we only store
up/down/unknown (as text) for availability, but we would certainly need
to store the sub-state as well.
Maintenance mode is orthogonal to all of the above and should
probably be a "flag" on a graph of resources.

   Heiko

[1] @OnClose is called with a code of 1006 on client crash/abnormal termination.
See http://tools.ietf.org/html/rfc6455#section-7.4