[Hawkular-dev] RfC: Availability

Fri Mar 20 06:20:17 EDT 2015

Hey,

there was apparently some watercooler discussion yesterday without any 
minutes, so the following
will not be able to refer to it in any way.

Hawkular needs to have a way to store, retrieve and display availability 
of a resource or a bunch of them [1].

While we have some short term goals, for the longer run we need to 
better identify what needs to be done.
I think we need to separately look at the following concerns:

* availability reporting
   * api
   * values
* availability computation
* availability storage
* availability retrieval
* alerting on availability
* computed resource state

The basic assumption here is that availability is something relatively 
stable. Meaning that usually
the same state (hopefully "UP") is reported each time in a row for a 
veeery long period of time (there
are some servers with uptimes >> 1 year).

== reporting

Feeds report availability to Hawkular, where the data may be further 
processed and stored.
The reported values are probably in the range of "UP", "DOWN". I can 
also imagine that e.g.
an application server that starts shutting down could send a 
"GOING_DOWN" value.

On the API side, we need to be able to receive (a list of) tuples
`< resource id, report time, state >`
In case of full Hawkular, the _resource id_ needs to be a valid one from 
Inventory.
_Report time_ is the local time on the resource / agent when that state
was retrieved, represented in ms since the epoch UTC.
Finally, the _state_ would be an Enum of "UP", "DOWN" and potentially
some other values. While I have described them as strings here, the
representation on the wire may be implemented differently, like 1 and 0
or true and false.
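
As a rough illustration of the tuple above, here is a minimal Python
sketch. The names AvailState, AvailReport and parse_reports, as well as
the dict keys of the payload, are assumptions made for this example and
not an agreed Hawkular API.

```python
from dataclasses import dataclass
from enum import Enum


class AvailState(Enum):
    UP = "UP"
    DOWN = "DOWN"
    GOING_DOWN = "GOING_DOWN"  # optional extra value mentioned above


@dataclass(frozen=True)
class AvailReport:
    resource_id: str
    report_time_ms: int  # ms since the epoch, UTC
    state: AvailState


def parse_reports(payload):
    """Turn a list of wire-level dicts into AvailReport tuples."""
    reports = []
    for item in payload:
        raw = item["state"]
        # Assumption: 1/0 and true/false are accepted as wire encodings.
        if raw in (True, 1):
            state = AvailState.UP
        elif raw in (False, 0):
            state = AvailState.DOWN
        else:
            state = AvailState(raw)  # raises ValueError on unknown strings
        reports.append(
            AvailReport(item["resource_id"], int(item["report_time_ms"]), state)
        )
    return reports
```

Validation of the resource id against Inventory is left out here on
purpose.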

== computed availability

In addition to the above reporting we may have feeds that either are not
able to deliver availability or
where the availability is delivered as a numeric value - see e.g. the
pinger, where a <rid>.status.code
is delivered as metric value representing the HTTP status code.
Here we need to apply a mapping from return code -> availability.

     f(code) ->  code < 400 ? "UP" : "DOWN"

and then further proceed with that computed availability value.

See also [2] and [3]
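
The mapping for the pinger case could be sketched in Python as follows.
The function name avail_from_metric and the way the resource id is cut
out of the metric id are assumptions for illustration only.

```python
SUFFIX = ".status.code"


def avail_from_metric(metric_id, value):
    """Map a <rid>.status.code metric to a (rid, state) pair.

    Returns None for metrics that do not carry a status code.
    """
    if not metric_id.endswith(SUFFIX):
        return None
    rid = metric_id[: -len(SUFFIX)]
    # Same rule as f(code) above: everything below 400 counts as UP.
    return rid, ("UP" if value < 400 else "DOWN")
```

The resulting pair can then be fed into the same processing as a
directly reported availability.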

=== "Backfill"

As feeds may not report back all the time, we may want to have a 
watchdog which adds
a transition into "UNKNOWN" state.
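
A minimal sketch of such a watchdog pass in Python, assuming we keep a
map of the last time each resource was heard from. The function name,
the two dicts and the timeout parameter are all hypothetical.

```python
def backfill_unknown(last_heard_ms, current_state, now_ms, timeout_ms):
    """Return UNKNOWN transitions for resources that have gone silent.

    last_heard_ms: rid -> time of the last report (ms since the epoch)
    current_state: rid -> last known state
    """
    transitions = []
    for rid in sorted(last_heard_ms):
        silent_for = now_ms - last_heard_ms[rid]
        # Only emit a transition once; skip resources already UNKNOWN.
        if silent_for > timeout_ms and current_state.get(rid) != "UNKNOWN":
            transitions.append((rid, now_ms, "UNKNOWN"))
    return transitions
```

The watchdog would run this periodically and push the returned tuples
into the normal availability processing.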

=== Admin-down

A feed may discover resources that report their state as DOWN but where
this is not an issue and rather an
administrative decision. Take a network card as an example, where the
card has 8 ports, but only 4 of them
are connected. The other 4 will be reported as DOWN, but in fact they
are DOWN on purpose.
The admin may mark those interfaces as ADMIN_DOWN, which also implies
that further incoming
DOWN reports (what about UP, UNKNOWN?) can be ignored until the
admin re-enables the
interface.
This admin-down probably also needs to be marked in inventory.
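
One possible shape of that filter, as a Python sketch. It assumes that
only DOWN reports are dropped for admin-down resources and that UP and
UNKNOWN still pass through; that is just one answer to the open question
above, not a decision. The function name is made up.

```python
def filter_admin_down(reports, admin_down):
    """Drop DOWN reports for resources that are marked ADMIN_DOWN.

    reports: iterable of (rid, time_ms, state) tuples
    admin_down: set of resource ids marked by the admin
    """
    return [
        (rid, time_ms, state)
        for rid, time_ms, state in reports
        if not (rid in admin_down and state == "DOWN")
    ]
```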

=== Maintenance mode

On top of the availability we also have maintenance mode which is 
orthogonal to availability and is more meant for alert suppression and 
SLA computation. Maintenance mode should not overwrite the recorded or 
computed availability.
We still want to record the original state regardless of whether
maintenance mode is on or off.

== Storage

As I wrote earlier, the base assumption is that availability is supposed 
to stay the same for
long periods of time. For that reason run-length encoded storage is
advised:

     < resource id, state, from , to >

The fields are more or less self-explanatory - _to_ would be "null" as
long as the current state continues.

This is also sort of what we have done in RHQ, where we have also been
running into some issues
(especially as we had a very db-bound approach). One issue is that if
you have a transition from UP to DOWN,
the DB situation looks like this:

Start:
     <rid , UP,  from1 , null >

UP -> DOWN at time = from2

find tuple <rid, ??, ??, null > and update to
     <rid, UP, from1, from2>
append new tuple
      <rid, DOWN, from2, null>

The other issue is getting the current availability (for display in the
UI and/or for the transition handling above), which needs the same lookup

find tuple <rid, ??, ??, null>

Both the lookup and the update are expensive.

The retrieval of the current availability for a resource can be improved 
by introducing a cache that stores
as minimal information  <rid, last state>.
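
To make the transition handling and the cache more concrete, here is a
small in-memory Python sketch. The class AvailStore and its methods are
hypothetical; a real implementation would sit on top of the actual
datastore. The cache here points at the open record instead of holding
only <rid, last state>, which is a choice made for this example.

```python
class AvailStore:
    def __init__(self):
        self.records = []  # run-length records: [rid, state, from_ms, to_ms]
        self.current = {}  # cache: rid -> index of the open record

    def report(self, rid, state, time_ms):
        """Apply one report; return True if a transition was recorded."""
        idx = self.current.get(rid)
        if idx is not None:
            open_rec = self.records[idx]
            if open_rec[1] == state:
                return False  # same state: the run continues, no write
            open_rec[3] = time_ms  # close the open run: to = from2
        # Append the new open run <rid, state, from2, null>.
        self.records.append([rid, state, time_ms, None])
        self.current[rid] = len(self.records) - 1
        return True

    def current_state(self, rid):
        """Answer the current state from the cache, without a scan."""
        idx = self.current.get(rid)
        return None if idx is None else self.records[idx][1]
```

With the cache in place, neither the current-state lookup nor the
transition needs a search for the tuple with to = null.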

Another issue that Yak pointed out is that if availability is recorded
infrequently and at random points in time,
just recording when a transition from UP to DOWN or even UNKNOWN
happened may not be enough, as there are scenarios where it is still
important to know when we heard the last UP report.

So the above storage (and cache) tuple needs to be extended to contain
the _last heard_ time:

     < resource id, state, from , to, last_heard >

In this case, as we do not want to update that record for each incoming
availability report, we really need to
cache this information and write it back to the store either
periodically or at least when a shutdown listener indicates that
Hawkular is going down. In case we have multiple API endpoints that
receive availability reports, this may need to be a distributed cache.
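
A single-node Python sketch of such a write-back cache. The class name
LastHeardCache and the write callback are assumptions; the distributed
case is not covered.

```python
class LastHeardCache:
    def __init__(self):
        self.last_heard = {}  # rid -> last report time in ms
        self.dirty = set()    # rids changed since the last flush

    def touch(self, rid, time_ms):
        """Record a report in memory only; ignores out-of-order reports."""
        if time_ms > self.last_heard.get(rid, -1):
            self.last_heard[rid] = time_ms
            self.dirty.add(rid)

    def flush(self, write):
        """Write dirty entries back via write(rid, time_ms).

        Meant to be called periodically and from a shutdown listener.
        Returns the number of entries written.
        """
        flushed = 0
        for rid in sorted(self.dirty):
            write(rid, self.last_heard[rid])
            flushed += 1
        self.dirty.clear()
        return flushed
```

Many incoming reports for the same resource then collapse into one
store update per flush interval.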

== Retrieval

Retrieval of availability information may actually be a bit more tricky
than just returning the current availability state,
as there will be more information to convey.

We have two basic cases:
* return current availability / resource state : this can probably be 
answered directly from above mentioned cache
* return a timeline between some arbitrary start and end times. Here we
need to go out and return all records
that overlap the requested window, i.e. that satisfy something like
( start_time <= requested_end ) && ( end_time is null || end_time >=
requested_start )
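
The timeline case can be written as a plain interval-overlap test. The
Python sketch below is one common formulation, not necessarily the query
we would end up with; it treats to = null as "still open" and therefore
also returns records that start inside the window and have not ended
yet. Function and parameter names are made up.

```python
def overlaps(rec_from, rec_to, req_start, req_end):
    """True if the record [rec_from, rec_to] intersects the window."""
    end = float("inf") if rec_to is None else rec_to
    return rec_from <= req_end and end >= req_start


def timeline(records, rid, req_start, req_end):
    """records: iterable of (rid, state, from_ms, to_ms) tuples."""
    return [
        r for r in records
        if r[0] == rid and overlaps(r[2], r[3], req_start, req_end)
    ]
```

Whether a record that ends exactly at the requested start should be
included is a boundary decision; this sketch includes it.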

=== application / group of resources

For applications the situation becomes more complicated as we need to 
retrieve the state (records) for each involved resource and then compute 
the total state of the application.

Take an app with a load balancer, 3 app servers and a DB; then this
computation may go like:

     avail ( app  ) :=
          UP  if all resources are UP
          MIXED  if one app server  is not UP
          DOWN otherwise

Actually this may even contain a time component:

     avail ( app , time of day ) :=
          if (business_hours (time of day) )
              UP  if all resources are UP
              MIXED  if one app server  is not UP
              DOWN otherwise
          else
              UP  if all resources are UP
              MIXED  if two app servers are not UP
              DOWN otherwise
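
The rules above could be sketched in Python like this. It assumes that
the load balancer and the DB are both mandatory, and it reads "two app
servers are not UP" as "up to two" outside business hours; both are
interpretations for the sake of the example, and the function name is
hypothetical.

```python
def app_avail(lb, db, app_servers, business_hours):
    """Compute the app state from its resources' states.

    lb, db: state strings such as "UP" or "DOWN"
    app_servers: list of state strings, one per app server
    business_hours: True if the evaluation time is within business hours
    """
    if lb != "UP" or db != "UP":
        return "DOWN"
    not_up = sum(1 for state in app_servers if state != "UP")
    if not_up == 0:
        return "UP"
    # Assumption: tolerate one failed app server during business hours,
    # two outside of them.
    tolerated = 1 if business_hours else 2
    return "MIXED" if not_up <= tolerated else "DOWN"
```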

It may be a good idea not to compute that on the fly at retrieval time,
but to feed the result of the computation as synthetic availability
records into the normal availability processing stream, as indicated
earlier in the "computed availability" section. This way, the computed
information is also available as input for alerting.

== Alerting on availability

Alerting will need to see the (computed) availability data and also the 
maintenance mode information to be able to
alert on
* is UP/DOWN/...  ( for X time )
* goes UP/DOWN/...
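
The two condition types, together with maintenance-mode suppression,
might look like this in Python. All three function names are
hypothetical, and how alerting actually obtains the maintenance flag is
left open.

```python
def goes(prev_state, new_state, target):
    """'goes <target>': fires on the transition into the target state."""
    return prev_state != target and new_state == target


def is_for(state, since_ms, now_ms, target, min_duration_ms):
    """'is <target> for X time': state has held for at least X ms."""
    return state == target and now_ms - since_ms >= min_duration_ms


def should_alert(condition_met, in_maintenance):
    """Maintenance mode suppresses the alert, not the recorded state."""
    return condition_met and not in_maintenance
```

The since_ms value is the _from_ field of the open run-length record, so
no extra computation is needed on the alerting side.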

With the above I think that alerting should not need to do complex 
availability calculations on its own, but rather
work on the stream of incoming (compute

[1] https://issues.jboss.org/browse/HWKMETRICS-35
[2] http://lists.jboss.org/pipermail/hawkular-dev/2015-March/000413.html
[3] http://lists.jboss.org/pipermail/hawkular-dev/2015-March/000402.html