[Hawkular-dev] scope of the agent design

Fri Mar 13 04:28:58 EDT 2015

On 13.03.2015 09:57, Thomas Heute wrote:
> For uptime/downtime for instance, the embedded in-process, can only say
> "I'm up", if not up -> it's down or unknown (network issue ?). From a
> separate process you can tell if it's down, separate process is better
> in that case (but it may not be installed, so do we fallback on embedded
> process info ?).
A network issue is just as likely to affect the embedded agent as the 
separate-process agent, so there's no difference in that case. Also, a 
separate process can't really tell whether something is down. The 
example mentioned here was that the process's CPU is overloaded and it 
can't report anything, but the system agent can see the PID is up. That 
is usually not the case: having a PID up doesn't mean the software is 
still alive (who here hasn't booted their Cassandra with kill -9 more 
than once?).
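
To make the "PID up is not the same as alive" point concrete, here is a 
rough sketch. It assumes a POSIX system, and the `probe` callable is a 
hypothetical application-level health check (for example a request the 
service has to answer), not anything Hawkular provides.

```python
import os
from typing import Callable


def pid_exists(pid: int) -> bool:
    """POSIX only: signal 0 tests for existence without delivering a signal.

    Pass a positive PID; 0 and negative values address process groups.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process
    except PermissionError:
        return True    # process exists but belongs to another user
    return True


def classify(pid: int, probe: Callable[[], bool]) -> str:
    """Return "down", "hung" or "alive" for a monitored process.

    `probe` is a hypothetical health check supplied by the caller.
    """
    if not pid_exists(pid):
        return "down"
    try:
        healthy = probe()
    except Exception:
        healthy = False  # a failing probe counts as not alive
    return "alive" if healthy else "hung"
```

A system agent that only does the `pid_exists` part would report "up" 
for a process that `classify` calls "hung".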

As an example of an agent system, BMC's Control-M handles this 
differently than we're planning. While it's a job monitoring system, it 
has two statuses: agent status and job status. If the agent cannot be 
contacted for a certain period of time, the agent is marked down (and an 
alert is raised), while the job itself is marked with an unknown state. 
If the agent is up but can't read the job's status, the job is again 
marked unknown. Only if it really knows something has failed is the job 
marked as failed.
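
The two-status model described above could be sketched roughly like 
this. The names (`AgentStatus`, `JobStatus`, `evaluate`) and the 
timeout parameter are my own for illustration; this is not Control-M's 
actual API or data model.

```python
from enum import Enum
from typing import Optional, Tuple


class AgentStatus(Enum):
    UP = "up"
    DOWN = "down"


class JobStatus(Enum):
    OK = "ok"
    FAILED = "failed"
    UNKNOWN = "unknown"


def evaluate(seconds_since_contact: float,
             timeout: float,
             job_result: Optional[bool]) -> Tuple[AgentStatus, JobStatus]:
    """Derive both statuses from what is actually known.

    job_result: True = known success, False = known failure,
    None = the agent could not read the job's status.
    """
    if seconds_since_contact > timeout:
        # Agent unreachable: alert on the agent, claim nothing about the job.
        return (AgentStatus.DOWN, JobStatus.UNKNOWN)
    if job_result is None:
        # Agent reachable, but the job's state could not be read.
        return (AgentStatus.UP, JobStatus.UNKNOWN)
    job = JobStatus.OK if job_result else JobStatus.FAILED
    return (AgentStatus.UP, job)
```

The point of the design is that "failed" is only ever reported on 
positive evidence; everything else degrades to "unknown".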

> Agent-less works *if* the network is open enough to allow it...
>
> Also for embedded one, we may be bound to product releases, unless we
> instrument ourself the server and update as we wish.
>
I don't think the product releases are an issue if we have working 
versioning in our APIs. We should just log the version of the agents in 
our UI. As long as the API versioning is handled correctly, we should 
be able to support older versions quite fine. Sure, those services 
wouldn't get new features, but I don't think this is an issue if the new 
features are marked as product features (e.g. EAP monitoring features) 
instead of our updated features.
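
As a rough illustration of what I mean by versioning, assume each agent 
reports an integer API version when it connects. The version numbers 
and feature names below are made up for the example and do not describe 
any existing Hawkular API.

```python
from typing import Dict, FrozenSet

# Hypothetical mapping from agent API version to the features it can use.
FEATURES_BY_API_VERSION: Dict[int, FrozenSet[str]] = {
    1: frozenset({"availability"}),
    2: frozenset({"availability", "metrics"}),
    3: frozenset({"availability", "metrics", "eap-monitoring"}),
}


def features_for(agent_api_version: int) -> FrozenSet[str]:
    """Return the features available to an agent, or reject unknown versions."""
    try:
        return FEATURES_BY_API_VERSION[agent_api_version]
    except KeyError:
        raise ValueError(
            f"unsupported agent API version: {agent_api_version}"
        ) from None


def describe(agent_name: str, agent_api_version: int) -> str:
    """Line a UI could show so users can see which agent version is running."""
    features = ", ".join(sorted(features_for(agent_api_version)))
    return f"{agent_name} (API v{agent_api_version}): {features}"
```

An older embedded agent keeps working with the feature set of its own 
version; it just doesn't see what later versions added.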

My wish is that we would support multiple approaches. If we have a 
platform agent, it should take care of connecting to these product 
'agents' and dispatching stuff to the server, but otherwise there could 
be smaller agents doing just one simple job. It's not unheard of for an 
enterprise to monitor infra with different tools than those used to 
monitor running applications, as they could be monitored by different 
departments with different responsibilities. For containers, we'll 
probably want to do something like cAdvisor, that is, run a container 
that monitors other containers.

Just one wish: a very low overhead agent. In a past job we had a test 
machine with slightly reduced resources, yet infra first installed 
IBM's TSM (Storage Manager) agent, whose memory usage grew with the 
number of files, then BMC's Patrol with plugins for TSM etc., and of 
course we needed the CTM agent to run our jobs. In the end the machine 
couldn't actually run any tests because it had no memory left for those 
jobs. Sounds like a horror story, but it's very real (in the end I 
solved the issue by killing every agent except CTM before running the 
tests and restarting them afterwards, as I knew the root password and 
infra's SLA).

   -  Micke