[Hawkular-dev] Availability (Service and UI)

Mon Feb 16 04:49:12 EST 2015

On 02/13/2015 11:18 PM, Michael Burman wrote:
> Hi,
>
> I disagree. I have seen this pattern first hand, and in one case it
> resulted in a large penalty payment (inability to prove that monitoring
> had reported "up" at a certain time when the system wasn't responsive).
> If there's a need to monitor availability, there must be a recorded
> event flow, e.g.:
>
> Up 11:00
> Up 11:05
> Up 11:10
> Down 11:15
> Down 11:20
> Up 11:25
>
> And why? If we only record:
>
> Up 11:00 - 11:10
> Down 11:15 - 11:20
> Up 11:25 - ..
>
> There's a difference. Did the monitoring system report at 11:05, or was
> the monitoring system down at 11:05? In many cases there is a need to
> prove that the system was in a certain state at certain times,
> especially when there are SLA disagreements. Even more importantly,
> what happened between 11:10 - 11:15 and 11:20 - 11:25?

Recording all datapoints doesn't give you more information on what
happened between two recordings though.

But I get your point on the "inability to prove that monitoring reported
up on certain time when it wasn't responsive".
I didn't see availability recordings as SLA proof though.
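
To make the two views concrete, here is a minimal sketch (Python, not
Hawkular code) of how a recorded event flow could be collapsed into
periods without losing the "was the monitor silent?" information. The
function name to_periods, the "unknown" status label and the max_gap
parameter are assumptions made for this illustration only.

```python
from datetime import datetime, timedelta

def to_periods(samples, max_gap):
    """Collapse (timestamp, status) samples into (status, start, end) periods.

    A period only spans timestamps that were actually reported. A hole
    longer than max_gap between two consecutive samples becomes an
    explicit "unknown" period instead of being read as "still up".
    """
    periods = []
    for ts, status in sorted(samples):
        if periods:
            last_status, start, end = periods[-1]
            if ts - end > max_gap:
                # The monitor was silent for too long: record the hole.
                periods.append(("unknown", end, ts))
            elif status == last_status:
                # Same status confirmed again: extend the open period.
                periods[-1] = (last_status, start, ts)
                continue
        # First sample, status change, or first sample after a hole.
        periods.append((status, ts, ts))
    return periods
```

With the 11:00 - 11:25 example above and max_gap set to 5 minutes, a
missing 11:05 report would show up as an "unknown" period from 11:00 to
11:10 rather than as part of one long "up" period.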

> The SLA behaviour requires that some system noticed something happening
> at a certain point in time. And if there are issues with system
> responsiveness, say at 11:05, there needs to be an event showing that
> the system really did report up.

You can't prove that a system is up and running at a specific point in
time; you can only prove it over a period of time, if you made enough
checks (check interval shorter than the period of time of the SLA).
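
As a rough sketch of that condition (Python, not Hawkular code): an
availability figure for a window is only returned when the checks leave
no hole longer than an allowed interval. The names covered and
availability, the max_interval parameter, and the choice to count the
fraction of "up" checks (rather than weighting by time) are all
assumptions made for this illustration.

```python
from datetime import datetime, timedelta

def covered(check_times, window_start, window_end, max_interval):
    """True if no hole longer than max_interval exists inside the window."""
    inside = sorted(t for t in check_times if window_start <= t <= window_end)
    points = [window_start, *inside, window_end]
    return all(b - a <= max_interval for a, b in zip(points, points[1:]))

def availability(checks, window_start, window_end, max_interval):
    """Fraction of "up" checks in the window, or None if it cannot be claimed.

    checks is a list of (timestamp, is_up) pairs. None is returned when
    there are no checks in the window or when the checks are too sparse.
    """
    inside = [(t, up) for t, up in checks if window_start <= t <= window_end]
    if not inside:
        return None
    times = [t for t, _ in inside]
    if not covered(times, window_start, window_end, max_interval):
        return None
    return sum(1 for _, up in inside if up) / len(inside)
```

Returning None instead of a percentage keeps "we did not check often
enough" separate from "the system was down".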

Thomas

>
>     - Micke
>
>
> On 13.02.2015 14:01, Thomas Heute wrote:
>>
>> Getting back to availability discussion...
>>
>> To me availability is a set of periods, not so much "time series" and
>> we should just record change of status (closing the previous event and
>> opening a new one).
>>
>>     - Server is up from 8:00am to 11:30am
>>     - Server is down from 11:30am to 11:32am
>>     - Server is unknown from 11:32am to 12:00pm (an agent running on a
>> machine can tell if a server is up or down, if the agent dies then we
>> don't know if the server is up or down)
>>     - Server is in erratic state from 12:00pm to 12:30pm (agent
>> reports down every few requests)
>>
>> We were discussing the best way to represent availability over time in
>> a graph. The representation in RHQ [1] is very decent IMO, and can be
>> extended with more colors to reflect how often/long the website was
>> down for each "brick" (if the line represents a year with 52 blocks,
>> 1 block can be more or less red depending on how long it was down
>> during the week).
>>
>> But thinking about it more, an availability graph is not that
>> interesting by itself IMO; it is more interesting in the context of
>> other values. I attached a mockup of what I meant: a red area is
>> displayed on the response time graph, meaning that the system is down;
>> obviously no response time is reported in that period. Earlier there
>> is an erratic area, which seems related to higher response time ;)
>> The rest of the time the system is just up and running...
>>
>> Additionally I would want to see reports of availability:
>>     - overall availability over a period of time (a day, a month, a
>> year...). "99.99% available in the past month"
>>     - lists of the down periods with start dates and duration for a
>> particular resource or set of resources (filtering options)
>>
>> Thoughts ?
>>
>> [1]
>> http://3.bp.blogspot.com/-0MsmG5h5i5E/TfjTMZlvx3I/AAAAAAAAABU/6PKDs0RlzuI/s1600/ProblemManagement-RHQ.png
>>
>> Thomas
>>
>>
>> _______________________________________________
>> hawkular-dev mailing list
>> hawkular-dev at lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/hawkular-dev