[Hawkular-dev] Availability (Service and UI)

Fri Feb 13 17:18:37 EST 2015

Hi,

I disagree, having seen this pattern first hand result in a large penalty
payment in one case (an inability to prove that monitoring reported up at a
certain time when it wasn't responsive). If there's a need to monitor
availability, there must be a recorded event flow, e.g.:

Up 11:00
Up 11:05
Up 11:10
Down 11:15
Down 11:20
Up 11:25

And why? If we only record:

Up 11:00 - 11:10
Down 11:15 - 11:20
Up 11:25 - ..

There's a difference. Did the monitoring system report at 11:05, or was the
monitoring system itself down at 11:05? In many cases there is a need to
prove that the system was in a certain state at certain times, especially
when there are SLA disagreements. Even more importantly, what happened
between 11:10 and 11:15, and between 11:20 and 11:25?

SLA handling requires that some system noticed something happening at a
certain point in time. And if there are issues with system responsiveness
at, say, 11:05, there needs to be an event showing that the system really
did report up.
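
To illustrate, here is a rough Python sketch of what the recorded event flow
makes possible. It is not Hawkular code: the function name find_gaps, the
(timestamp, status) sample layout and the 5 minute interval are assumptions
made for this example. It uses the flow above with the 11:05 report missing,
which is exactly the case the period form cannot distinguish.

```python
from datetime import datetime, timedelta


def find_gaps(samples, interval):
    """Return (previous, next) timestamp pairs further apart than `interval`.

    Hypothetical helper: `samples` is a list of (timestamp, status) tuples,
    one per report actually received from the monitoring system.
    """
    ordered = sorted(samples)
    gaps = []
    for (t_prev, _), (t_next, _) in zip(ordered, ordered[1:]):
        # A spacing larger than the reporting interval means at least one
        # expected report never arrived.
        if t_next - t_prev > interval:
            gaps.append((t_prev, t_next))
    return gaps


def at(hour, minute):
    # Assumed date, taken from the date of this thread.
    return datetime(2015, 2, 13, hour, minute)


# The event flow from above, with the 11:05 report missing.
samples = [
    (at(11, 0), "Up"),
    (at(11, 10), "Up"),
    (at(11, 15), "Down"),
    (at(11, 20), "Down"),
    (at(11, 25), "Up"),
]

gaps = find_gaps(samples, timedelta(minutes=5))
for start, end in gaps:
    print(f"no report between {start:%H:%M} and {end:%H:%M}")
```

With only "Up 11:00 - 11:10" stored there is nothing to run such a check
against, because the individual reports are gone.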

    - Micke

On 13.02.2015 14:01, Thomas Heute wrote:
>
> Getting back to availability discussion...
>
> To me availability is a set of periods, not so much "time series" and 
> we should just record change of status (closing the previous event and 
> opening a new one).
>
>     - Server is up from 8:00am to 11:30am
>     - Server is down from 11:30am to 11:32am
>     - Server is unknown from 11:32am to 12:00pm (an agent running on a 
> machine can tell if a server is up or down, if the agent dies then we 
> don't know if the server is up or down)
>     - Server is in erratic state from 12:00pm to 12:30pm (agent 
> reports down every few requests)
>
> We were discussing the best way to represent availability over time in
> a graph, representation in RHQ [1] is very decent IMO, can be extended
> with more colors to reflect how often/long the website was down for
> each "brick" (if the line represents a year with 52 blocks, 1 block can
> be more or less red depending on how long it was down during the week).
>
> But thinking of it more, availability graph is not that interesting by 
> itself IMO and more interesting in the context of other values.
> I attached a mockup of what I meant, a red area is displayed on 
> response time graph, that means that the system is down, obviously 
> there is no response time reported anymore in that period. Earlier 
> there is an erratic area, seems related to higher response time ;) 
> Rest of the time the system is just up and running...
>
> Additionally I would want to see reports of availability:
>     - overall availability over a period of time (a day, a month, a 
> year...). "99.99% available in the past month"
>     - lists of the down periods with start dates and duration for a 
> particular resource or set of resources (filtering options)
>
> Thoughts ?
>
> [1] 
> http://3.bp.blogspot.com/-0MsmG5h5i5E/TfjTMZlvx3I/AAAAAAAAABU/6PKDs0RlzuI/s1600/ProblemManagement-RHQ.png
>
> Thomas
>
>
> _______________________________________________
> hawkular-dev mailing list
> hawkular-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/hawkular-dev
