But in fact (and we were discussing this already), if the above URL "ping" were
done from two different sites (e.g. US and EU) and one returned 200 and the other a
timeout, then the real availability would be UP, as the resource is reachable (*1). Here a
single feed (a pinger in one location) is no longer able to determine the availability alone.
Also, the status code alone may not be enough to determine availability, as a 200 after 2
minutes is equivalent to down for the end customer.
And then we found out in RHQ that just having availability states of "UP" and
"DOWN" is not enough: individual resources may be down on purpose, or the feed
may just not report anything. Or when you look at a group of resources (or a composite
resource) like an application consisting of multiple services, the total availability of
my shop may be up, but degraded (e.g. slow response time). Or it may be up and fast, but
one of the 3 servers in the cluster is down.
This is why I am proposing a) to have a more differentiated set of "resource
states" and b) to have this state be a function of several input parameters.
About a): this is a list of possible resource states, where UP and DOWN correspond to the
classical binary availability terms.
UP: Resource is available and working normally
DEGRADED: Resource is available but not at full performance
DOWN: Resource is at fault and not working normally
MAINTENANCE: There is a scheduled maintenance period, availability may be UP or DOWN
MISSING: The resource was recorded in inventory, but does not exist in reality (e.g. was
deleted from the file system)
ADMIN_DOWN/DISABLED: The resource exists, but was disabled by the admin (e.g. a network
interface on an 8-port card where only 1 cable is connected)
UNKNOWN: Resource state cannot be determined
Aggregated state
A state of “MIXED” can be added for groups or applications (e.g. 3 servers in a cluster,
one server is down, 2 are up).
For groups, the aggregated state could be computed as follows (but see b) below):
All UP: Group is UP
All DOWN: Group is DOWN
Otherwise: Group is MIXED
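
The three group rules above could be sketched roughly like this in Python. This is only an illustration, not RHQ code: the `State` enum and the `aggregate` function are hypothetical names, and returning UNKNOWN for an empty group is an assumption, since the rules do not cover that case.

```python
from enum import Enum


class State(Enum):
    UP = "UP"
    DEGRADED = "DEGRADED"
    DOWN = "DOWN"
    MAINTENANCE = "MAINTENANCE"
    MISSING = "MISSING"
    DISABLED = "DISABLED"
    UNKNOWN = "UNKNOWN"
    MIXED = "MIXED"  # only meaningful for groups or applications


def aggregate(member_states):
    """Apply the three group rules to the states of the group members."""
    states = list(member_states)
    if not states:
        # Assumption: an empty group has no determinable state.
        return State.UNKNOWN
    if all(s is State.UP for s in states):
        return State.UP
    if all(s is State.DOWN for s in states):
        return State.DOWN
    # Anything else, including DEGRADED or MAINTENANCE members, is MIXED.
    return State.MIXED
```

With these rules, a cluster of 3 servers where one is DOWN and two are UP comes out as MIXED.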
Wrt b), computation of state:
For the example of the URL ping, the resource state could be computed as
  function(list< code, time > results) {
    for (< code, time > in results) {
      // UP as soon as one location answers 200 fast enough
      if (code == 200 && time < threshold) {
        return UP;
      }
    }
    return DOWN;
  }
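
A runnable version of the same idea, as a minimal Python sketch. The names `PingResult` and `url_state`, the use of code 0 for a timeout, and the default threshold of 2 seconds are all assumptions made for the example; nothing here is an existing API.

```python
from enum import Enum
from typing import Iterable, NamedTuple


class State(Enum):
    UP = "UP"
    DOWN = "DOWN"


class PingResult(NamedTuple):
    code: int       # HTTP status code; 0 stands for "timed out" in this sketch
    seconds: float  # response time in seconds


def url_state(results: Iterable[PingResult], threshold: float = 2.0) -> State:
    """UP if at least one location got a 200 faster than the threshold."""
    for result in results:
        if result.code == 200 and result.seconds < threshold:
            return State.UP
    return State.DOWN
```

For the US/EU example, `url_state([PingResult(200, 0.3), PingResult(0, 30.0)])` gives `State.UP`, while a single 200 that took 120 seconds gives `State.DOWN`.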
Alerting already does part of this right now, and we could use this in a rectified
way:
[input values]----> [ resource state processor ] ---(+)
and then at the (+) point we expose the resource state to e.g. the UI and other services,
where one of the services is the alert engine
(+)----> [ alert engine ] ----> [ notification handlers ]
The alert engine then decides, based on the computed states, whether alerting needs to be
done and in what way.
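
The wiring in the two diagrams could look roughly like the following Python sketch. `StateProcessor`, `AlertEngine`, and the policy of alerting only on a transition into DOWN are hypothetical choices made for illustration; the proposal itself does not fix any of them.

```python
class StateProcessor:
    """[input values] ----> [resource state processor] ---(+)"""

    def __init__(self, compute):
        self._compute = compute  # function: input values -> state
        self._listeners = []     # consumers attached at the (+) point
        self.current = None      # last computed state, e.g. for the UI

    def subscribe(self, listener):
        self._listeners.append(listener)

    def feed(self, inputs):
        state = self._compute(inputs)
        previous, self.current = self.current, state
        # Expose the new state to every subscribed service.
        for listener in self._listeners:
            listener(previous, state)
        return state


class AlertEngine:
    """(+) ----> [alert engine] ----> [notification handlers]"""

    def __init__(self, notify):
        self._notify = notify  # notification handler, e.g. a mail sender

    def on_state(self, previous, current):
        # Assumed policy: alert only when the resource newly goes DOWN.
        if current == "DOWN" and previous != "DOWN":
            self._notify(f"resource went {previous} -> DOWN")
```

The alert engine is just one subscriber among others here, so the UI can read `current` or subscribe to the same processor without going through alerting.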
*1) Of course we still need to flag the timeout, as the timeout may have an impact on
customers being able to reach the shop.