On 17.02.2015 at 10:21, Thomas Heute <theute(a)redhat.com> wrote:
IMHO one would typically be happy with a check every 20s/30s or even every minute if the cost
is "low", i.e. the cost of the monitoring infrastructure is a small fraction of
the infrastructure itself.
We need to consider this at different levels.
An agent may be able to determine process availability every second cheaply. This does not
imply that it needs to forward that information to the server every second. And even if it
did, it does not imply that we need to store a data point for every second.
One thing we never really fit into classical RHQ is caching.
A Map<Long,Byte> (*) can easily hold the last known availability state for a
huge number of resources. This way we would not need to query the database for each
incoming data point; instead we'd look the value up in the cache and only write to the
store when the state differs.
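A minimal sketch of that server-side dedup cache, assuming a simple byte encoding of availability (the class and method names here are illustrative, not an existing RHQ/Hawkular API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical server-side cache: remember the last known availability
// state per resource and only signal a store when it actually changes.
public class AvailabilityCache {
    // resourceId -> last known availability state (e.g. 0 = DOWN, 1 = UP)
    private final Map<Long, Byte> lastKnown = new ConcurrentHashMap<>();

    /**
     * Records an incoming data point. Returns true when the state differs
     * from the cached one (or the resource is unseen), i.e. the point must
     * be written to the data store; false when it can be dropped.
     */
    public boolean report(long resourceId, byte state) {
        Byte previous = lastKnown.put(resourceId, state);
        return previous == null || previous != state;
    }
}
```

With a one-byte value per resource, even millions of resources stay well within a single JVM heap.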
Expiring cache entries that stop receiving data points, and informing other subsystems
(alerting) when that happens, needs a bit more investigation and modelling, but it is not
too heavy to implement.
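One possible shape for that expiry logic, sketched under the assumption of a timestamp per entry and a periodic sweep; the names (`ExpiringAvailabilityCache`, `onExpired`) are invented for illustration:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongConsumer;

// Hypothetical cache variant that tracks when each resource last reported,
// so a periodic sweep can expire silent entries and notify e.g. alerting.
public class ExpiringAvailabilityCache {
    private static final class Entry {
        byte state;
        long lastSeenMillis;
        Entry(byte state, long lastSeenMillis) {
            this.state = state;
            this.lastSeenMillis = lastSeenMillis;
        }
    }

    private final Map<Long, Entry> cache = new ConcurrentHashMap<>();
    private final long timeoutMillis;
    private final LongConsumer onExpired; // callback: "availability now unknown"

    public ExpiringAvailabilityCache(long timeoutMillis, LongConsumer onExpired) {
        this.timeoutMillis = timeoutMillis;
        this.onExpired = onExpired;
    }

    /** Records a data point; returns true when the state changed and must be stored. */
    public boolean report(long resourceId, byte state, long nowMillis) {
        Entry e = cache.get(resourceId);
        if (e == null) {
            cache.put(resourceId, new Entry(state, nowMillis));
            return true;
        }
        boolean changed = e.state != state;
        e.state = state;
        e.lastSeenMillis = nowMillis;
        return changed;
    }

    /** Periodic sweep: drop entries without recent data points and notify subscribers. */
    public void expireStale(long nowMillis) {
        Iterator<Map.Entry<Long, Entry>> it = cache.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Entry> me = it.next();
            if (nowMillis - me.getValue().lastSeenMillis > timeoutMillis) {
                it.remove();
                onExpired.accept(me.getKey());
            }
        }
    }
}
```

An expired resource that reports again is simply treated as new, so the next data point is stored and the cache heals itself.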
The agent can do similar caching for its resources, forwarding state only when it changes.
This way we can scale to a lot of checks on the agent side without overwhelming the server
and the data store.
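On the agent side the same pattern could look like the following sketch: probe every resource each pass, but invoke the send callback only on state changes. The loop driver and the `sendToServer` callback are assumptions for illustration, not an existing agent API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

// Hypothetical agent-side throttle: cheap 1s-interval checks locally,
// but traffic to the server only when a resource's state flips.
public class AgentAvailabilityLoop {
    private final Map<Long, Byte> lastSent = new ConcurrentHashMap<>();

    /** One pass over freshly probed states; sendToServer fires only on changes. */
    public void runOnce(Map<Long, Byte> probedStates, BiConsumer<Long, Byte> sendToServer) {
        probedStates.forEach((resourceId, state) -> {
            Byte previous = lastSent.put(resourceId, state);
            if (previous == null || !previous.equals(state)) {
                sendToServer.accept(resourceId, state);
            }
        });
    }
}
```

For steady-state infrastructure this reduces per-second checks to near-zero network traffic, since almost every pass probes the same states as the last one.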
*) Yes, I know we are currently using string IDs in many places. We may reconsider that
for resource IDs.