[Hawkular-dev] Computed resource state

Fri Mar 6 04:30:07 EST 2015

But in fact (and we were discussing that already), if the above URL "ping" were done from two different sites (e.g. US and EU) and one returned 200 while the other timed out, then the real availability would be UP, as the URL is reachable (*1). Here a single feed (a pinger in one location) is no longer able to determine the availability on its own.

Also, the status code alone may not be enough to determine availability: a 200 that arrives after 2 minutes is, for the end customer, equivalent to down.

And then we found out in RHQ that just having availability states of "UP" and "DOWN" is not enough: individual resources may be down on purpose, or the feed may just not report anything. Or when you look at a group of resources (or a composite resource) such as an application consisting of multiple services, the total availability of my shop may be up but degraded (e.g. slow response time). Or it may be up and fast, but one of the 3 servers in the cluster is down.

This is why I am proposing a) to have a more differentiated set of "resource states" and b) to have this state be a function of several input parameters.

About a): this is a list of possible resource states, where UP and DOWN correspond to the classical binary availability terms.

UP: Resource is available and working normally
DEGRADED: Resource is available but not at full performance
DOWN: Resource is at fault and not working normally
MAINTENANCE: There is a scheduled maintenance period, availability may be UP or DOWN
MISSING: The resource was recorded in inventory, but does not exist in reality (e.g. was deleted on the file system)
ADMIN_DOWN/DISABLED: The resource exists, but was disabled by the admin (e.g. a network interface on an 8-port card where only 1 cable is connected)
UNKNOWN: Resource state cannot be determined
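
The states listed above could be captured as a simple enumeration. This is only a sketch: the name `ResourceState` is made up for illustration and is not existing Hawkular or RHQ code, and ADMIN_DOWN/DISABLED is folded into a single member `ADMIN_DOWN`.

```python
from enum import Enum


class ResourceState(Enum):
    """Proposed resource states (hypothetical type, values taken from the list above)."""
    UP = "Resource is available and working normally"
    DEGRADED = "Resource is available but not at full performance"
    DOWN = "Resource is at fault and not working normally"
    MAINTENANCE = "Scheduled maintenance period, availability may be UP or DOWN"
    MISSING = "Recorded in inventory, but does not exist in reality"
    ADMIN_DOWN = "Exists, but was disabled by the admin"
    UNKNOWN = "Resource state cannot be determined"


# Print every state with its description.
for state in ResourceState:
    print(f"{state.name}: {state.value}")
```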

Aggregated state

A state of “MIXED” can be added for groups or applications (e.g. 3 servers in a cluster, one server is down, 2 are up).
For groups, the aggregated state could be computed as follows (but see below):
All UP: Group is UP
All DOWN: Group is DOWN
Otherwise: Group is MIXED
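
The group rule above can be written down as a small function. This is a minimal sketch, not existing code: the name `aggregate_group_state` is hypothetical, states are plain strings, and returning UNKNOWN for an empty group is my assumption, since the rule does not cover that case.

```python
def aggregate_group_state(member_states):
    """Aggregate the member states ("UP"/"DOWN") of a group into one group state.

    Assumption: an empty group is reported as "UNKNOWN"; the rule in the
    text does not say what to do in that case.
    """
    states = list(member_states)
    if not states:
        return "UNKNOWN"
    if all(s == "UP" for s in states):
        return "UP"
    if all(s == "DOWN" for s in states):
        return "DOWN"
    return "MIXED"


# 3 servers in a cluster, one of them down
print(aggregate_group_state(["UP", "UP", "DOWN"]))
```

With only UP and DOWN as member states this matches the three rules exactly; any other member state (e.g. DEGRADED) falls through to MIXED in this sketch.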

Wrt b) computation of state

For the example of the URL ping, the resource state could be computed as:

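def url_ping_state(results, threshold):
    # results: list of (code, time) pairs, one per pinging location
    for code, time in results:
        # a 200 only counts if it came back faster than the threshold
        if code == 200 and time < threshold:
            return "UP"
    # no location got a timely 200
    return "DOWN"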

This is already partially what alerting is doing right now, and we could use this in a rectified way:

[input values]----> [  resource state processor ]  ---(+) 

and then at the (+) point we expose the resource state to e.g. the UI and other services,
where one of the services is the alert engine:

(+)----> [  alert engine  ] ---->  [ notification handlers ]

The alert engine then decides, based on the computed states, whether alerting needs to be done and in what way.
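
To make the flow in the two diagrams more concrete, here is a rough sketch of how the pieces could be wired together. All class and method names (`ResourceStateProcessor`, `AlertEngine`, `subscribe`, `on_state`) are hypothetical, and the "alert only on DOWN" policy is just an assumption for the example, not a proposal.

```python
class ResourceStateProcessor:
    """Computes a state from input values and exposes it at the (+) point."""

    def __init__(self, compute_state):
        self._compute_state = compute_state  # function: input values -> state
        self._subscribers = []               # e.g. the UI, the alert engine

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def process(self, resource_id, input_values):
        state = self._compute_state(input_values)
        # (+): hand the computed state to every interested service
        for callback in self._subscribers:
            callback(resource_id, state)
        return state


class AlertEngine:
    """Decides from the computed state whether to call the notification handlers."""

    def __init__(self, notification_handlers):
        self._handlers = notification_handlers

    def on_state(self, resource_id, state):
        # Assumed policy for illustration only: notify on DOWN.
        if state == "DOWN":
            for handler in self._handlers:
                handler(f"{resource_id} is {state}")


sent = []  # stands in for a real notification handler (mail, SMS, ...)
engine = AlertEngine([sent.append])
processor = ResourceStateProcessor(
    lambda codes: "UP" if 200 in codes else "DOWN")
processor.subscribe(engine.on_state)

processor.process("shop-url", [200, 504])  # computed UP, nothing sent
processor.process("shop-url", [504, 504])  # computed DOWN, one notification
print(sent)
```

The point of the split is that the alert engine never sees the raw input values, only the computed state, in the same way as the UI or any other subscriber.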

*1) Of course we still need to flag the timeout, as the timeout may have an impact on customers being able to reach the shop.
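
One way to keep the timeout visible, as footnote *1 asks, would be to return the flagged locations next to the state. Again only a sketch under assumptions: the function name `state_with_flags`, the per-location dictionary, and the use of `None` as the code for a timeout are all invented for this example.

```python
def state_with_flags(results, threshold):
    """Return (state, timed_out_locations).

    results maps a location name to a (code, time) pair; a timeout is
    represented here as code None (an assumption of this sketch).
    """
    timed_out = sorted(loc for loc, (code, _) in results.items() if code is None)
    # UP as soon as one location saw a 200 within the threshold
    reachable = any(code == 200 and time < threshold
                    for code, time in results.values())
    return ("UP" if reachable else "DOWN"), timed_out


state, flagged = state_with_flags(
    {"US": (200, 0.3), "EU": (None, None)}, threshold=2.0)
print(state, flagged)
```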