[metrics] Consistent error replies from REST API
by Thomas Segismont
Hi,
A growing portion of the application code in the metrics REST API
returns errors (with whatever HTTP status) in the following form:
====
{"errorMsg": "blah"}
====
Whether this format is nice or suitable is not the purpose of this
message. I just want to draw your attention to the fact that being
consistent regarding error reporting is important.
In this regard, I've added a new ExceptionMapper for
org.jboss.resteasy.spi.ReaderException[1] instances. So now not only do
application errors conform to the format above, but so do errors coming
from payload parsing.
Regards,
Thomas
[1]
http://docs.jboss.org/resteasy/docs/3.0.9.Final/userguide/html_single/#bu...
9 years, 9 months
Computed resource state
by Heiko W. Rupp
But in fact (and we were already discussing this), if the URL "ping" above were done from two different sites (e.g. US and EU) and one returned 200 and the other a timeout, then the real availability would be UP, as the site is reachable (*1). Here a single feed (a pinger in one location) is no longer able to determine the availability alone.
Also, it may not be enough to determine availability by status code alone, as a 200 after 2 minutes is, for the end customer, equivalent to down.
And then we found out in RHQ that just having availability states of "UP" and "DOWN" is not enough, as individual resources may be down on purpose, or the feed may just not report anything. Or when you look at a group of resources (or a composite resource) like an application consisting of multiple services, the total availability of my shop may be up, but degraded (e.g. slow response time). Or it may be up and fast, but one of the 3 servers in the cluster is down.
This is why I am proposing a) to have a more differentiated set of "resource states" and b) to have this state be a function of several input parameters.
About a): this is a list of possible resource states, where UP and DOWN correspond to the classical binary availability terms.
UP: Resource is available and working normally
DEGRADED: Resource is available but not at full performance
DOWN: Resource is at fault and not working normally
MAINTENANCE: There is a scheduled maintenance period, availability may be UP or DOWN
MISSING: The resource was recorded in inventory, but does not exist in reality (e.g. was deleted on file system)
ADMIN_DOWN/DISABLED: The resource exists, but was disabled by the admin (e.g. a network interface on an 8-port card where only 1 cable is connected)
UNKNOWN: Resource state cannot be determined
Aggregated state
A state of "MIXED" can be added for groups or applications (e.g. 3 servers in a cluster, one server is down, 2 are up).
For groups, the aggregated state could be computed as follows (but see below):
All UP: Group is UP
All DOWN: Group is DOWN
Otherwise: Group is MIXED
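As a rough sketch of the aggregation rule above, in Java: the names GroupState, State and aggregate are made up for illustration and are not part of any existing RHQ or Hawkular API, and reporting an empty group as UNKNOWN is an assumption, since the rule does not cover that case.

```java
import java.util.List;

// Sketch only: names are illustrative, not taken from any existing API.
final class GroupState {

    // Resource states from the list above, plus MIXED for groups.
    enum State { UP, DEGRADED, DOWN, MAINTENANCE, MISSING, ADMIN_DOWN, UNKNOWN, MIXED }

    // All UP -> UP, all DOWN -> DOWN, anything else -> MIXED.
    // Assumption: an empty group is reported as UNKNOWN.
    static State aggregate(List<State> members) {
        if (members.isEmpty()) {
            return State.UNKNOWN;
        }
        boolean allUp = members.stream().allMatch(s -> s == State.UP);
        boolean allDown = members.stream().allMatch(s -> s == State.DOWN);
        if (allUp) {
            return State.UP;
        }
        if (allDown) {
            return State.DOWN;
        }
        return State.MIXED;
    }

    public static void main(String[] args) {
        // Cluster of 3 servers, one of them down.
        System.out.println(aggregate(List.of(State.UP, State.UP, State.DOWN)));
    }
}
```

With this simple rule a group containing a DEGRADED or MAINTENANCE member also ends up as MIXED, which is one reason the rule probably needs refinement.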
Wrt b) computation of state
For the example of the url ping, the resource state could be computed as
function(list<code, time>, threshold) {
    // UP as soon as one location got a 200 within the time threshold
    for (<code, time> in list) {
        if (code == 200 && time < threshold) {
            return UP;
        }
    }
    // no location got a timely 200
    return DOWN;
}
This is already partly what alerting is doing right now, and we could use this in a rectified way:
[input values]----> [ resource state processor ] ---(+)
and then at the (+) point we expose the resource state to e.g. the UI and other services,
where one of the services is the alert engine:
(+)----> [ alert engine ] ----> [ notification handlers ]
The alert engine then decides, based on the computed states, whether alerting needs to be done and in what way.
*1) Of course we still need to flag the timeout, as the timeout may have an impact on customers being able to reach the shop.
9 years, 9 months
Availability vs uptime for URL "ping"
by Thomas Heute
In the most simplistic form of monitoring we're looking at pinging a
website and reporting up/down and response time as seen by the initiator
of the HTTP HEAD/GET.
We've discussed a bit about availability vs uptime.
First question: Do we need to distinguish? Is it important for someone
who wants to know if his website is accessible to really separate the 2
concepts? (Details vs simplicity)
Second question: If we separate the 2, how do we distinguish? A
suggestion:
* HTTP Code 2xx and 3xx -> URL is up and available
* HTTP Code 4xx -> The server may reject the request (it may not like
bots, user entered a wrong url (should be checked upfront), or resource
has been deleted)... Server is up, availability is unknown
* HTTP Code 5xx -> URL is up but not available
* Timeout -> URL is down and not available
* Couldn't resolve host 'www.fffffffffefwefdwdf.com' -> Domain name is
deleted: URL is down and not available
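The suggested mapping could be sketched in Java roughly as follows. All names (UrlPingClassifier, Availability, Result, classify) are hypothetical. Two assumptions are not in the suggestion itself: a probe that got no response at all (timeout or unresolvable host) is passed in as a status code of 0 or less, and codes outside 2xx-5xx are treated as "up, availability unknown".

```java
// Sketch only: class, enum and method names are illustrative assumptions.
final class UrlPingClassifier {

    enum Availability { AVAILABLE, NOT_AVAILABLE, UNKNOWN }

    // Outcome of one probe: is the server up, and is the URL available?
    static final class Result {
        final boolean up;
        final Availability availability;

        Result(boolean up, Availability availability) {
            this.up = up;
            this.availability = availability;
        }

        @Override
        public String toString() {
            return (up ? "up" : "down") + "/" + availability;
        }
    }

    // statusCode is the HTTP status of the response. Assumption: a value <= 0
    // means no response was received (timeout or host could not be resolved).
    static Result classify(int statusCode) {
        if (statusCode <= 0) {
            return new Result(false, Availability.NOT_AVAILABLE);
        }
        if (statusCode >= 200 && statusCode < 400) {
            return new Result(true, Availability.AVAILABLE);
        }
        if (statusCode >= 400 && statusCode < 500) {
            return new Result(true, Availability.UNKNOWN);
        }
        if (statusCode >= 500 && statusCode < 600) {
            return new Result(true, Availability.NOT_AVAILABLE);
        }
        // Assumption: anything else (e.g. 1xx) is up with unknown availability.
        return new Result(true, Availability.UNKNOWN);
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 301, 404, 503, 0}) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```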
4xx is likely the most debatable: it's a client issue and likely needs
either a code fix or user intervention... (And unfortunately we can't
expect servers to fully respect the codes.)
Thoughts?
Thomas
9 years, 9 months
infinispan + cassandra
by John Mazzitelli
I'm looking at Infinispan and how it can be configured in WildFly (there was talk about us needing a clustered cache in the kettle, so this is what started me looking at this).
Since we already have C* in use by metrics, and there is talk that inventory is going to use C* under the covers, I was wondering what people thought about utilizing C* as the persistent backend for Infinispan:
http://infinispan.org/docs/cachestores/cassandra/
This provides persistence across a clustered Infinispan cache.
Is this something we'd be interested in or would want to use?
9 years, 9 months
Forecasting of bad things
by Heiko W. Rupp
Something I was talking about for a while ...
I saw this today in Android 5 - they give you (when the device is running on battery) not only a
chart about past battery consumption (dark green), but also about estimated future depletion (grey-green)
and on top of the chart the current state (40%) + an estimate that the battery will last ~4 more days at the
same usage level.
--
Heiko Rupp hwr(a)pilhuhn.de
Blog: http://pilhuhn.blogspot.com @pilhuhn
9 years, 9 months
Notification messages
by Gary Brown
Hi
Started looking at Hawkular alerts with an eye on RTGov eventually using the notification mechanism to represent what we currently store/display as 'situations'. Had a couple of questions:
1) The notification message currently has a notifierId - so does this mean an alert trigger will only have a single notifier? Or could a single alert (notification message) potentially be sent to multiple notifiers (e.g. twitter, email, etc.) - and if so, would it require a notifierId list?
2) The only other field is currently a description, which is fine for targets such as email, twitter etc., but in RTGov 'situations' are also used to hold other information that can be used to understand the source of the problem, and tie it back to the originating business transaction. Will it be possible to add such fields to the notification message, even though they may not be relevant for the email/sms/twitter type notifiers?
Regards
Gary
9 years, 10 months