On 03/13/2015 09:28 AM, Michael Burman wrote:
On 13.03.2015 09:57, Thomas Heute wrote:
> For uptime/downtime for instance, the embedded in-process agent can only
> say "I'm up"; if not up -> it's down or unknown (network issue?). From a
> separate process you can tell if it's down, so a separate process is better
> in that case (but it may not be installed, so do we fall back on embedded
> process info?).
A network issue will affect the embedded and the separate-process agent
just as likely; there's no difference in that case.
The difference is that:
- if embedded, it can only say "I'm up", "I'm up", "I'm up"; if you
receive nothing, you can't differentiate between being down and not
receiving the up message
- if not embedded, it can say "Resource is up", "Resource is down"; if
you receive nothing, you can tell that the agent is messed up and the
state of the resource is unknown.
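To make that concrete, here is a minimal sketch (hypothetical names, not
actual Hawkular code, Java only as illustration) of how a server could
interpret the two reporting styles:

import java.time.Duration;
import java.time.Instant;

public class AvailabilityInterpreter {

    enum Availability { UP, DOWN, UNKNOWN }

    // Embedded agent: the only signal is an "I'm up" heartbeat. When it
    // stops, we cannot tell "resource down" apart from "heartbeat lost".
    static Availability fromEmbeddedHeartbeat(Instant last, Instant now, Duration timeout) {
        if (last != null && Duration.between(last, now).compareTo(timeout) <= 0) {
            return Availability.UP;
        }
        return Availability.UNKNOWN; // down? network issue? we can't say
    }

    // Separate-process agent: it reports UP or DOWN explicitly. Silence
    // means the *agent* is in trouble and the resource state is unknown.
    static Availability fromExternalAgent(Availability reported, Instant last, Instant now, Duration timeout) {
        if (last == null || Duration.between(last, now).compareTo(timeout) > 0) {
            return Availability.UNKNOWN; // agent itself is unreachable
        }
        return reported; // explicit UP or DOWN
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Duration timeout = Duration.ofSeconds(30);
        // Embedded heartbeat went silent two minutes ago: best answer is UNKNOWN.
        System.out.println(fromEmbeddedHeartbeat(now.minusSeconds(120), now, timeout));
        // External agent reported DOWN ten seconds ago: we actually know DOWN.
        System.out.println(fromExternalAgent(Availability.DOWN, now.minusSeconds(10), now, timeout));
    }
}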
Also, a separate process can't really tell if something is down or not;
the example mentioned here was that the CPU of the process is overloaded
and can't report anything, but the system agent can see the pid is up.
This is definitely not the case usually: having a PID up doesn't mean
the software is still alive (who here hasn't killed their Cassandra with
kill -9 more than once?).
There are various ways to tell if something is down; we can find many
exceptions, but let's not let the exceptions state the rule.
If there is no process running, an agent can tell that it is down, but
the process can't tell that itself.
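On the PID point, just for illustration (made-up pid and port values, not
project code): a PID being present is a much weaker signal than the
service actually answering, which is what an external agent can check.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class LivenessProbe {

    // "The PID is up": the OS still knows the process...
    static boolean pidExists(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    // ...but only a real probe tells us the software still answers, e.g.
    // trying to open the port the service is supposed to listen on.
    static boolean portAnswers(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long pid = 12345;                                         // hypothetical Cassandra pid
        boolean answering = portAnswers("localhost", 9042, 2000); // 9042 = default CQL port
        if (pidExists(pid) && !answering) {
            System.out.println("PID up but not responding - hung, not healthy");
        } else if (!pidExists(pid)) {
            System.out.println("No process at all -> an external agent can report DOWN");
        } else {
            System.out.println("Process present and answering");
        }
    }
}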
As an example of an agent system, BMC's Control-M handles this differently
than we're planning. While it's a job monitoring system, it has two
statuses: agent status and job status. If the agent cannot be contacted
for a certain period of time, the agent is marked down (and alerted),
while the job itself is marked with an unknown state. If the agent is up
but can't read the job's status, the job is again marked unknown. Only if
it really knows something has failed is the job marked as failed.
So they still have an agent.
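Roughly, that two-level model could look like this (a sketch of the idea
described above, not Control-M code):

public class TwoLevelStatus {

    enum AgentStatus { UP, DOWN }
    enum JobStatus { OK, FAILED, UNKNOWN }

    // FAILED is only used when a failure is positively known; everything
    // else degrades to UNKNOWN, and the agent gets its own alert when down.
    static JobStatus resolveJobStatus(AgentStatus agent, Boolean jobFailedReport) {
        if (agent == AgentStatus.DOWN) {
            return JobStatus.UNKNOWN;   // alert on the agent, not the job
        }
        if (jobFailedReport == null) {
            return JobStatus.UNKNOWN;   // agent up but couldn't read the job's status
        }
        return jobFailedReport ? JobStatus.FAILED : JobStatus.OK;
    }

    public static void main(String[] args) {
        System.out.println(resolveJobStatus(AgentStatus.DOWN, null));       // UNKNOWN
        System.out.println(resolveJobStatus(AgentStatus.UP, null));         // UNKNOWN
        System.out.println(resolveJobStatus(AgentStatus.UP, Boolean.TRUE)); // FAILED
    }
}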
> Agent-less works *if* the network is open enough to allow it...
>
> Also for the embedded one, we may be bound to product releases, unless
> we instrument the server ourselves and update as we wish.
>
I don't think the product releases are an issue if we have working
versioning in our APIs. We should just log the version of the agents in
our UI. As long as the API versioning is handled correctly, we should be
able to support older versions quite fine. Sure, those services wouldn't
get new features, but I don't think that is an issue if the new features
are marked as product features (e.g. EAP monitoring features) instead of
our updated features.
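As a sketch of what that could look like (all names and version numbers
made up, not an existing Hawkular API): the agent reports its API version
when it connects, the server logs it and keeps serving older versions,
and only the newer features are gated on it.

public class AgentRegistry {

    static final int CURRENT_API_VERSION = 3;
    static final int OLDEST_SUPPORTED_VERSION = 1;

    static void register(String agentId, int apiVersion) {
        if (apiVersion < OLDEST_SUPPORTED_VERSION) {
            throw new IllegalArgumentException("Agent " + agentId + " speaks an unsupported API version");
        }
        // Surface the version in the UI/logs instead of refusing the agent.
        System.out.println("Registered agent " + agentId + " speaking API v" + apiVersion);
        if (apiVersion < CURRENT_API_VERSION) {
            System.out.println("  -> newer (e.g. EAP monitoring) features disabled for this agent");
        }
    }

    public static void main(String[] args) {
        register("eap-host-01", 2); // older agent, still supported
        register("eap-host-02", 3); // current agent, all features
    }
}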
I raised it because it has been a big issue for JON updates. So even if
it's not a blocker, it is a concern and needs to be taken into account.
My wish is that we would support multiple approaches.
We need to be careful: multiple approaches = increased QE effort, and
since that is not infinite, this leads to decreased quality.
If we have a platform agent, it should take care of connecting to these
product 'agents' and dispatching stuff to the server, but otherwise there
could be smaller agents doing just a simple job. It's not unheard of that
some enterprise wants to monitor infra with different tools than what's
used to monitor running applications, as they could be monitored by
different departments with different responsibilities. For containers,
we'll probably want to do something like cAdvisor, so running a container
monitoring other containers.
I am not sure I understand. In any case, "smaller" agents need to be
easily manageable.
Just one wish - a very low overhead agent. From my past we had a test
machine with slightly reduced resources, yet the infra team first installed
IBM's TSM agent (Storage Manager), which had increasing memory usage based
on how many files there were, then BMC's Patrol with plugins for TSM etc.,
and of course we needed the CTM agent to run our jobs. In the end the
machine couldn't actually run any tests because it had no memory left
for those jobs. Sounds like a horror story but it's very real (in the end
I solved the issue by killing every agent except CTM before running the
tests and then restarting them after the tests, as I knew the root
password and the infra's SLA).
I think we'll all agree on the req, but we need to agree on what that
means for the implementation...
There will be a tradeoff to make between low overhead and some
capabilities (like generating alerts itself)... Will likely need to be
configurable...
Thomas
- Micke
_______________________________________________
hawkular-dev mailing list
hawkular-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hawkular-dev