On Jul 13, 2015, at 8:36 AM, Heiko W. Rupp <hrupp(a)redhat.com> wrote:
Hey,
we talked about availability and computed state in the past.
Now triggered by
https://issues.jboss.org/browse/HAWKULAR-401
and also
https://issues.jboss.org/browse/HAWKULAR-407
we need to revisit this and finally start including it in the code base.
In -407 we have the issue that the server currently cannot detect that
a feed is down. For the WF-agent, this is likely to be solved with the
new feed-comm system, which can see disconnect messages [1] and act
accordingly
Are there any docs, notes, etc. on the feed-comm system? I am not familiar with this.
(i.e. the server side adds a synthetic "down" event to the
availability data stream).
Of course other feeds can also use that mechanism.
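
A minimal sketch of the synthetic "down" event idea, in Python for brevity. Everything here is hypothetical and not taken from the feed-comm code: the names AvailRecord and on_feed_close, the "reason" field, and the choice to treat every disconnect (not only code 1006) as "down" are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

# RFC 6455 section 7.4.1: 1006 marks a connection that closed abnormally,
# i.e. without a Close frame (for example when the client crashed).
ABNORMAL_CLOSURE = 1006


@dataclass
class AvailRecord:
    """Hypothetical availability record; not the real Hawkular data model."""
    feed_id: str
    timestamp: float
    state: str               # "up", "down" or "unknown"
    synthetic: bool = False  # True when the server, not the feed, produced it
    reason: str = ""


def on_feed_close(feed_id: str, close_code: int, now: float,
                  stream: List[AvailRecord]) -> AvailRecord:
    """Append a server-generated "down" record when a feed disconnects.

    Assumption: any disconnect counts as "down"; the close code is only
    used to record why the connection went away.
    """
    reason = "abnormal-closure" if close_code == ABNORMAL_CLOSURE else "closed"
    record = AvailRecord(feed_id, now, "down", synthetic=True, reason=reason)
    stream.append(record)
    return record
```

Marking the record as synthetic keeps it distinguishable from a "down" that a feed reported itself, which may matter later when someone looks for the cause of a state change.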
A generic feed, though, that only sends availability records from time to
time will most probably not send a "down" event when it is going
down or crashing. So we need a periodic job that looks for feeds
that have not talked to us for a longer period of time.
This also implies that at least the in-memory state for feed availability
needs to be updated with a last-seen record, as Micke described some time
ago (that last-seen record should probably be flushed to C* from time to
time).
Why do we need to store the last seen availability in memory?
Also, we would need to require "generic" feeds to send heartbeats by
reporting their availability at least once per minute.
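
The last-seen record, the periodic job and the heartbeat could fit together roughly as sketched below. This is only an illustration: LastSeenTracker and its methods are made-up names, and the grace period of three missed heartbeats is an assumption; only the one-minute heartbeat comes from the text above.

```python
import time
from typing import Dict, List, Optional

HEARTBEAT_INTERVAL_S = 60.0               # feeds report at least once per minute
STALE_AFTER_S = 3 * HEARTBEAT_INTERVAL_S  # assumed grace period, not from the thread


class LastSeenTracker:
    """Hypothetical in-memory map of feed id -> time of the last message."""

    def __init__(self, stale_after_s: float = STALE_AFTER_S) -> None:
        self.stale_after_s = stale_after_s
        self._last_seen: Dict[str, float] = {}

    def record(self, feed_id: str, now: Optional[float] = None) -> None:
        """Call for every availability record or heartbeat from a feed."""
        self._last_seen[feed_id] = time.time() if now is None else now

    def stale_feeds(self, now: Optional[float] = None) -> List[str]:
        """Feeds silent for longer than the grace period (the periodic job's check)."""
        now = time.time() if now is None else now
        return sorted(feed for feed, seen in self._last_seen.items()
                      if now - seen > self.stale_after_s)

    def snapshot(self) -> Dict[str, float]:
        """Copy of the current state, e.g. for an occasional flush to storage."""
        return dict(self._last_seen)
```

A scheduler would call stale_feeds() every minute or so and mark each returned feed as "down" or "unknown"; which of the two is the right state is left open here.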
Now for -401, which is trickier. If e.g. a WildFly is in state
'reload-needed', it is technically up, but its configuration has pending
changes. So we would need "up" availability, and then another (sub) state
indicating the pending change.
And then we may have a state like "maintenance mode", where a resource
may be up or down without impacting e.g. alerting or any SLA computation.
From those raw input variables we would then compute the resource state:
http://lists.jboss.org/pipermail/hawkular-dev/2015-March/000413.html
While this could be up/down/unknown/(mixed for groups), it will also mean
that we need to convey the other information to the user. If e.g. a resource
is in maintenance mode, the user should be informed why alerts on the
resource do not fire.
Likewise for reload-needed: the user needs to know why the recent changes
he or she made did not change the way the app server works.
Treating reload-needed as just "down" is wrong, as the server continues to
work and serve requests.
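
One possible shape for computing the state from those raw inputs, as a sketch only. The names ResourceState and compute_state, the fields and the wording of the notes are all hypothetical; the only things taken from the discussion are that reload-needed stays "up" and that maintenance mode is an independent flag that silences alerting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceState:
    """Hypothetical computed state that is shown to the user."""
    availability: str          # "up", "down" or "unknown"
    sub_state: Optional[str]   # e.g. "reload-needed"
    maintenance: bool
    alerts_enabled: bool
    note: str                  # explanation to convey to the user


def compute_state(availability: str,
                  sub_state: Optional[str] = None,
                  maintenance: bool = False) -> ResourceState:
    """Combine raw availability, sub-state and the maintenance flag."""
    notes = []
    if maintenance:
        # Maintenance is orthogonal: it does not change up/down,
        # it only explains why alerts stay quiet.
        notes.append("in maintenance mode: alerts are suppressed")
    if availability == "up" and sub_state == "reload-needed":
        # Still "up": the server keeps serving requests.
        notes.append("reload needed: pending configuration changes are not active yet")
    return ResourceState(availability, sub_state, maintenance,
                         alerts_enabled=not maintenance,
                         note="; ".join(notes))
```

For example, compute_state("up", sub_state="reload-needed") keeps the availability at "up" and carries the explanation in the note, rather than folding the pending change into "down".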
If you are talking about correlation, then I am +1. When I think about RHQ, the user could
easily see an availability state change, but he would have to go hunting around to see what
precipitated it.
The above of course has an impact on storage. Right now we only store
up/down/unknown (as text) for availability, but we certainly would need
to also store sub-state.
For the maintenance mode, this is orthogonal to all of the above and should
probably be a "flag" on a graph of resources.
Heiko
[1] @OnClose is called with a code of 1006 on client crash/abnormal termination.
See http://tools.ietf.org/html/rfc6455#section-7.4