[Hawkular-dev] RfC: Availability

Lukas Krejci lkrejci at redhat.com
Fri Mar 20 08:55:48 EDT 2015


Great write-up Heiko, let me try to intersperse the outcomes of yesterday's 
watercooler discussion into it so that we get the discussion started.

tl;dr - your ideas match almost exactly what we discussed at the meeting ;)

The basic outcome of the meeting IMHO was that availability is, in its most 
generic form, a computed value and that the feeds might not be able to decide 
whether something is up or down. Judging such things for "logical" resources 
like "applications" must be handled at some other layer: either a completely 
3rd-party "reporter" polls for other data on a schedule and reports the 
findings back into Hawkular, or such reporters are hooked on the bus to 
receive "realtime" info about events and compute the avail reactively.

Also, it was agreed that even the availability states are not something we can 
decide upfront. Maybe we'd provide a number of avail values with undefined 
meaning that only an avail "reporter" would be able to define, or we could 
understand avail as "health" ranging from 0-100% and again let the avail 
reporter decide on a value.

On the metrics level, the agreement was to keep the "availability" endpoints 
so that availability can be distinguished from "normal" metrics. But there 
were also suggestions to just mark some metrics as run-length encoded and 
consider avail an ordinary metric.

Other subsystems need to distinguish between avail and other metrics. 
Especially inventory needs to be told which metric corresponds to the avail 
value for a particular resource.

On Friday, March 20, 2015 11:20:17 Heiko W.Rupp wrote:
> Hey,
> 
> there was apparently some watercooler discussion yesterday without any
> minutes, so the following
> will not be able to refer to it in any way.
> 
> Hawkular needs to have a way to store, retrieve and display availability
> of a resource or a bunch of them [1].
> 
> While we have some short term goals, for the longer run we need to
> better identify what needs to be done.
> I think we need to separately look at the following concerns:
> 
> * availability reporting
>    * api
>    * values
> * availability computation
> * availability storage
> * availability retrieval
> * alerting on availability
> * computed resource state
> 
> The basic assumption here is that availability is something relatively
> stable. Meaning that usually
> the same state (hopefully "UP") is reported each time in a row for a
> veeery long period of time (there
> are some servers with uptimes >> 1 year).
> 
> == reporting
> 
> Feeds report availability to Hawkular, where the data may be further
> processed and stored.
> The reported values are probably in the range of "UP", "DOWN". I can
> also imagine that e.g.
> an application server that starts shutting down could send a
> "GOING_DOWN" value.
> 

If the feed is able to report on the "state" of something, I imagine we need at 
least 3 states: UP, DOWN and UNKNOWN. I imagine that a feed in and of itself 
wouldn't be able to report more elaborate states like "degraded", "half-
functional", "up but REST API under heavy load" or the like.

I imagine either a reporter that would compute such more involved states 
somewhere "up in the pipeline", or some default simple translation like 
UP = 100%, DOWN = 0%, UNKNOWN = UNKNOWN (btw. I think "unknown" should be 
implicitly possible on any metric when I think about it, because it is 
distinctly different from "not collected").

The problem with GOING_DOWN is that it seems to me to be a one-time event 
rather than a value of a metric.

> On the API side, we need to be able to receive (a list of) tuples
> `< resource id, report time, state >`
> In case of full Hawkular, the _resource id_ needs to be a valid one from
> Inventory.

The inventory has metrics and resources decoupled from each other with just an 
m:n relationship between them. As such, I think inventory will just mark one 
such relationship with an "avail" flag.

The tuples therefore will just be

`< avail_id, report_time, state >`

> _Report time_ is the local time on the resource / agent when that state
> was retrieved,
> represented in ms since the epoch UTC and
> then finally the _state_ which would be an Enum of "UP", "DOWN" and
> potentially some other
> values. While I have described them as string here, the representation
> on the wire may be
> implemented differently like 1 and 0 or true and false.
> 
> 
> == computed availability
> 
> In addition to above reporting we may have feeds that either are not
> able to deliver availability or
> where the availability is delivered as a numeric value - see e.g. the
> pinger, where a <rid>.status.code
> is delivered as metric value representing the http status codes.
> Here we need to apply a mapping from return code -> availability.
> 
>      f(code) ->  code < 400 ? "UP" : "DOWN"
> 
> and then further proceed with that computed availability value.
>

Yes, exactly. Further to this, I think we may want (on a case-by-case basis) 
to actually NOT store the reported value but only store the computed one.
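To make this concrete, here is a minimal sketch of the pinger-style translation and the simple UP/DOWN/UNKNOWN-to-health mapping mentioned earlier. The function and constant names are mine for illustration, not an actual Hawkular API:

```python
# Illustrative sketch only -- names are made up, not a Hawkular API.

def avail_from_status_code(code):
    """Map an HTTP status code (e.g. from the pinger) to an avail state."""
    return "UP" if code < 400 else "DOWN"

# Simple default translation to a 0-100% "health" value.
# None models UNKNOWN, which is distinctly different from 0%.
HEALTH = {"UP": 100, "DOWN": 0, "UNKNOWN": None}
```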
 
> See also [2] and [3]
> 
> === "Backfill"
> 
> As feeds may not report back all the time, we may want to have a
> watchdog which adds
> a transition into "UNKNOWN" state.
>

+1
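A minimal sketch of such a watchdog; the grace period and the names are assumptions of mine:

```python
# Hypothetical backfill watchdog: if a feed has been silent for longer than
# a grace period, synthesize an UNKNOWN transition instead of trusting the
# last reported state.

def backfill_state(last_report_ms, now_ms, grace_period_ms=60_000):
    """Return 'UNKNOWN' if the feed went silent, otherwise None."""
    if now_ms - last_report_ms > grace_period_ms:
        return "UNKNOWN"
    return None
```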
 
> 
> === Admin-down
> 
> A feed may discover resources that report their state as DOWN but where
> this is not an issue and rather an
> administrative decision. Take a network card as an example where the card
> has 8 ports, but only 4 of them
> are connected. So the other 4 will be reported as DOWN, but in fact they
> are DOWN on purpose.
> The admin may mark those interfaces as ADMIN_DOWN, which also implies
> that further incoming
> DOWN reports (what about UP, UNKNOWN?) can be ignored until the
> admin re-enables the
> interface.
> This admin-down probably also needs to be marked in inventory.
>

I am not sure we even want to have ADMIN_DOWN as an explicit avail state. All 
we need is some way of recording that, right now, a flag called 
ON_PURPOSE = true holds for the resource (this can be just another metric 
coming from an actual user for example, or as you say it can be a property on 
the resource or even on the avail metric itself in inventory (every entity in 
inventory can store arbitrary key-value pairs)).

We still can have avail UP or DOWN reported, but because we know that this is 
ON_PURPOSE, alerts may choose to react differently, for example.

> === Maintenance mode
> 
> On top of the availability we also have maintenance mode which is
> orthogonal to availability and is more meant for alert suppression and
> SLA computation. Maintenance mode should not overwrite the recorded or
> computed availability.
> We still want to record the original state no matter what the maintenance
> mode is.
> 

+1

As with ADMIN_DOWN, this can be modeled in a number of ways.

> == Storage
> 
> As I wrote earlier, the base assumption is that availability is supposed
> to stay the same for
> long periods of time. For that reason run-length encoded storage is
> advised
> 
>      < resource id, state, from , to >
> 
> The fields are more or less self-explanatory - to would be "null" if the
> current state continues.
> 
> This is also sort of what we have done in RHQ, where we have also been
> running into some issues,
> (especially as we had a very db-bound approach). One issue is that if
> you have a transition from UP to DOWN
> 

This implies a read on each store of new incoming data. Logically, it makes 
sense, but I assume that internally, metrics will do this differently, for 
example by actually storing the raw values for a short-ish period of time and 
periodically aggregating them into run-length encoded form.

> the DB situation looks like this:
> 
> Start:
>      <rid , UP,  from1 , null >
> 
> up-> Down at time = from2
> 
> find tuple <rid, ??, ??, null > and update to
>      <rid, UP, from1, from2>
> append new tuple
>       <rid, DOWN, from2, null>
> 

I think this is not precise. Let me illustrate with another example:

Incoming data:
1) <rid, t1, UP>
2) <rid, t2, UP>
3) <rid, t3, DOWN>

I think this should be encoded as:

<rid, UP, t1, t2>
<rid, UNKNOWN, t2, t3>
<rid, DOWN, t3, t3>


I.e. I would explicitly encode the information about the impossibility of 
knowing the precise time of going down. This also means that you know the last 
time you saw a resource up.

I would also not store the "null" for the last run-length encoded piece, 
because you actually don't know that the resource is still up. Instead, if 
someone asks for avail at time t4 (before the next avail report comes after 
t3), I would report "unknown, last seen 'down' at t3".

(We could also just not store the UNKNOWN altogether and only interpret it 
from the lack of data for that period of time)
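The encoding above can be sketched as follows. This is a toy illustration of the proposal, not metrics code; the tuple layout <rid, state, from, to> follows the mail:

```python
# Toy run-length encoder for avail reports. Consecutive identical reports
# collapse into one interval; the gap between the last report of one state
# and the first report of the next becomes an explicit UNKNOWN interval.

def run_length_encode(reports):
    """reports: list of (rid, time, state) for one rid, sorted by time."""
    intervals = []  # list of (rid, state, from, to)
    for rid, t, state in reports:
        if intervals and intervals[-1][1] == state:
            # same state continues: extend the "to" of the open interval
            rid_, st, frm, _ = intervals[-1]
            intervals[-1] = (rid_, st, frm, t)
        else:
            if intervals:
                # we cannot know when the transition happened, so the
                # period between the two reports is explicitly UNKNOWN
                intervals.append((rid, "UNKNOWN", intervals[-1][3], t))
            intervals.append((rid, state, t, t))
    return intervals
```

Feeding it the three reports from the example yields exactly the three intervals listed above, including the explicit UNKNOWN gap.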

> The other issue is to get the current availability (for display in UI
> and/or in the previous transition)
> 
> find tuple <rid, ??, ??, null>
> 
> which is expensive.
> 
> The retrieval of the current availability for a resource can be improved
> by introducing a cache that stores
> as minimal information <rid, last state>.
> 

+1

> Another issue that Yak pointed out is that if availability is recorded
> infrequently and at random points in time,
> just recording when a transition from UP to DOWN or even UNKNOWN
> happened may not be enough, as there are scenarios when it is still
> important to know when we heard the last UP report.
> 
> So above storage (and cache) tuple needs to be extended to contain the
> _last heard_ time:
> 
>      < resource id, state, from, to, last_heard >
>

If we encode the transition from up to down as I outlined above, the last 
heard time is the start time of the unknown interval between the up and down.
 
> In this case, as we do not want to update that record for each incoming
> availability report, we need to really
> cache this information and have either some periodic write back to the
> store or at least when a shutdown listener indicates that Hawkular is
> going down. In case that we have multiple API endpoints that receive
> avail reports, this may need to be a distributed cache.
> 
> 
> == Retrieval
> 
> Retrieval of availability information may actually be a bit more tricky
> than returning the current availability state,
> as there will be more information to convey:
> 
> We have two basic cases
> * return current availability / resource state : this can probably be
> answered directly from above mentioned cache
> * return a timeline between some arbitrary start and end times. Here we
> need to go out and return all records
> that satisfy something like ( start_time < requested_end && end_time >
> requested_start )
> 
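As a sketch (mine, not Hawkular code), this is the classic interval-overlap filter: a record belongs to the answer iff it starts before the requested end and ends after the requested start:

```python
# Toy timeline retrieval over run-length encoded records
# (rid, state, start_time, end_time).

def timeline(records, req_start, req_end):
    """Return records overlapping the half-open window [req_start, req_end)."""
    return [
        (rid, state, s, e)
        for rid, state, s, e in records
        if s < req_end and e > req_start
    ]
```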
> === application / group of resources
> 
> For applications the situation becomes more complicated as we need to
> retrieve the state (records) for each involved resource and then compute
> the total state of the application.
> 
> Take an app with a load balancer, 3 app servers and a DB then this
> computation may go like
> 
>      avail ( app  ) :=
>           UP  if all resources are UP
>           MIXED  if one app server  is not UP
>           DOWN otherwise
> 
> Actually this may even contain a time component
> 
>      avail ( app , time of day ) :=
>           if (business_hours (time of day) )
>               UP  if all resources are UP
>               MIXED  if one app server  is not UP
>               DOWN otherwise
>           else
>               UP  if all resources are UP
>               MIXED  if two app servers are not UP
>               DOWN otherwise
> 
> 
> It may be a good idea to not compute that on the fly at retrieval time,
> but to add the result as synthetic availability records for the
> computation into the normal availability processing stream as indicated
> earlier in the "computed availability" section. This way, the computed
> information is also available for alerting as input
>

+1
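A literal reading of the first example rule, as a sketch (the names and the treatment of the load balancer and DB as hard dependencies are my assumptions):

```python
# Toy computation of an app's avail from its parts, following the example
# rule: UP if everything is UP, MIXED if exactly one app server is not UP,
# DOWN otherwise.

def app_avail(lb, app_servers, db):
    """lb/db: 'UP' or 'DOWN'; app_servers: list of states."""
    if all(s == "UP" for s in [lb, db] + app_servers):
        return "UP"
    servers_not_up = sum(1 for s in app_servers if s != "UP")
    if lb == "UP" and db == "UP" and servers_not_up == 1:
        return "MIXED"
    return "DOWN"
```

The result of such a function would then be fed back into the avail stream as a synthetic record, per the section above.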
 
> == Alerting on availability
> 
> Alerting will need to see the (computed) availability data and also the
> maintenance mode information to be able to
> alert on
> * is UP/DOWN/...  ( for X time )
> * goes UP/DOWN/...
> 

I personally am a fan of avail being a percentage rather than a simple 
boolean. That makes "goes up/down" less intuitive, but if we understand the 
percentage as "health" it might not be that bad to reason about:

alert on
   health < 50%
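A sketch of what such a condition could look like; the threshold and the handling of unknown values are my assumptions:

```python
# Toy health-based alert condition. health is 0-100, or None for UNKNOWN;
# unknown is deliberately not treated as unhealthy here.

def should_alert(health, threshold=50):
    if health is None:
        return False
    return health < threshold
```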

> With the above I think that alerting should not need to do complex
> availability calculations on its own, but rather
> work on the stream of incoming (computed) availability data.
>

+1
 
> 
> [1] https://issues.jboss.org/browse/HWKMETRICS-35
> [2] http://lists.jboss.org/pipermail/hawkular-dev/2015-March/000413.html
> [3] http://lists.jboss.org/pipermail/hawkular-dev/2015-March/000402.html
> _______________________________________________
> hawkular-dev mailing list
> hawkular-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hawkular-dev


