[Hawkular-dev] scope of the agent design

Thomas Heute theute at redhat.com
Fri Mar 13 06:02:42 EDT 2015


On 03/13/2015 09:28 AM, Michael Burman wrote:
> On 13.03.2015 09:57, Thomas Heute wrote:
>> For uptime/downtime, for instance, the embedded in-process agent can
>> only say "I'm up"; if it is not up, the resource is down or unknown
>> (network issue?). From a separate process you can tell if it's down, so
>> a separate process is better in that case (but it may not be installed,
>> so do we fall back on embedded process info?).
> A network issue will affect the embedded and the separate-process agent
> just as likely; there's no difference in that case.
The difference is that:
  - if embedded, it can only say "I'm up", "I'm up", "I'm up"; if you 
receive nothing, you can't differentiate the resource being down from the 
"up" message simply not arriving
  - if not embedded, it can say "Resource is up" or "Resource is down"; if 
you receive nothing, you can tell that the agent is messed up and the 
state of the resource is unknown.
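To make that concrete, here is a rough sketch of how a server would derive 
state in each case (names and timeouts are purely illustrative, nothing 
here is an actual Hawkular API):

    import java.time.Duration;
    import java.time.Instant;

    public class StateDerivation {

        enum ResourceState { UP, DOWN, UNKNOWN }

        static final Duration TIMEOUT = Duration.ofSeconds(30); // illustrative

        // Embedded agent: the only signal is a heartbeat. Silence is
        // ambiguous -- the resource may be down, or the message may simply
        // not have arrived.
        static ResourceState fromHeartbeat(Instant lastHeartbeat, Instant now) {
            if (Duration.between(lastHeartbeat, now).compareTo(TIMEOUT) < 0) {
                return ResourceState.UP;
            }
            return ResourceState.UNKNOWN; // down OR lost message, we can't tell
        }

        // External agent: it reports the resource state explicitly, so
        // silence now points at the agent, not the resource.
        static ResourceState fromAgentReport(boolean reportedUp,
                                             Instant lastReport, Instant now) {
            if (Duration.between(lastReport, now).compareTo(TIMEOUT) >= 0) {
                return ResourceState.UNKNOWN; // the agent itself is in trouble
            }
            return reportedUp ? ResourceState.UP : ResourceState.DOWN;
        }

        public static void main(String[] args) {
            Instant now = Instant.now();
            Instant stale = now.minusSeconds(60);
            System.out.println(fromHeartbeat(stale, now));         // UNKNOWN
            System.out.println(fromAgentReport(false, now, now));  // DOWN
            System.out.println(fromAgentReport(true, stale, now)); // UNKNOWN
        }
    }

With only heartbeats, silence always maps to "unknown"; with explicit 
reports, silence tells you about the agent while "down" remains a real, 
positive observation.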

>   Also, a separate process can't
> really tell if something is down or not. The example mentioned here was
> that the CPU of the process is overloaded and it can't report anything,
> but the system agent can see the pid is up. This is usually not the
> case: having a PID up doesn't mean the software is still alive (who here
> hasn't booted their Cassandra with kill -9 more than once?).
There are various ways to tell if something is down; we can find many 
exceptions, but let's not let the exceptions dictate the rule.

If there is no process running, an external agent can tell that it is 
down, but the process can't tell that about itself.

> As an example of an agent system, BMC's Control-M handles this differently
> than we're planning. While it's a job monitoring system, it has two
> statuses, agent status and job status. If the agent cannot be contacted
> for a certain period of time, the agent is marked down (and alerted),
> while the job itself is marked with unknown state. If the agent is up
> but can't read the job's status, the job is again marked unknown. Only if
> it really knows something has failed is the job marked as failed.
So they still have an agent.
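That two-status model is worth sketching, since it is close to what we 
would need either way (purely illustrative, not Control-M's or Hawkular's 
actual API):

    import java.time.Duration;
    import java.time.Instant;

    public class TwoStatusModel {

        enum AgentStatus { UP, DOWN }
        enum JobStatus { OK, FAILED, UNKNOWN }

        static final Duration AGENT_TIMEOUT = Duration.ofMinutes(5); // illustrative

        static AgentStatus agentStatus(Instant lastContact, Instant now) {
            // No contact for too long -> the agent itself is marked down
            // (and alerted).
            return Duration.between(lastContact, now).compareTo(AGENT_TIMEOUT) >= 0
                    ? AgentStatus.DOWN : AgentStatus.UP;
        }

        // jobFailed is null when the agent could not read the job's status.
        static JobStatus jobStatus(AgentStatus agent, Boolean jobFailed) {
            if (agent == AgentStatus.DOWN || jobFailed == null) {
                return JobStatus.UNKNOWN; // never guess: unreachable or unreadable
            }
            // Only a positively known failure is marked FAILED.
            return jobFailed ? JobStatus.FAILED : JobStatus.OK;
        }

        public static void main(String[] args) {
            Instant now = Instant.now();
            AgentStatus a = agentStatus(now.minusSeconds(600), now);
            System.out.println(a + " / " + jobStatus(a, null)); // DOWN / UNKNOWN
        }
    }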
>
>> Agent-less works *if* the network is open enough to allow it...
>>
>> Also, for the embedded one, we may be bound to product releases, unless
>> we instrument the server ourselves and update as we wish.
>>
> I don't think the product releases are an issue if we have working
> versioning in our APIs. We should just log the version of the agents in
> our UI. As long as the API versioning is handled correctly, we should be
> able to support older versions quite fine. Sure, those services wouldn't
> get new features, but I don't think this is an issue if the new
> features are marked as product features (e.g. EAP monitoring features)
> instead of our updated features.
I raised it because it has been a big issue for JON updates. So even if 
it is not a blocker, it is a concern and needs to be taken into account.
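For reference, the kind of version gating Michael describes could be as 
simple as this (hypothetical feature names and version numbers, just to 
show the idea):

    import java.util.HashMap;
    import java.util.Map;

    public class VersionGate {

        // Minimum agent API version required per feature (made-up values).
        static final Map<String, Integer> FEATURE_MIN_VERSION = new HashMap<>();
        static {
            FEATURE_MIN_VERSION.put("metrics", 1);
            FEATURE_MIN_VERSION.put("avail-reporting", 2); // newer feature
        }

        // Agents report their API version when they register; the server
        // withholds newer features from older agents instead of rejecting them.
        static boolean supports(int agentApiVersion, String feature) {
            Integer min = FEATURE_MIN_VERSION.get(feature);
            return min != null && agentApiVersion >= min;
        }

        public static void main(String[] args) {
            int oldAgent = 1; // e.g. an agent shipped with an older product release
            System.out.println(supports(oldAgent, "metrics"));         // true
            System.out.println(supports(oldAgent, "avail-reporting")); // false, still supported
        }
    }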

> My wish is that we would support multiple approaches.
We need to be careful: multiple approaches mean increased QE effort, and 
since QE capacity is not infinite, this leads to decreased quality.
>   If we have a
> platform agent, it should take care of connecting to these product
> 'agents' and dispatching stuff to the server, but otherwise there could
> be smaller agents doing just a simple job. It's not unheard of that an
> enterprise wants to monitor infra with different tools than what's used
> to monitor running applications, as they could be monitored by different
> departments with different responsibilities. For containers, we'll
> probably want to do something like cAdvisor, i.e. run a container that
> monitors other containers.
I am not sure I understand. In any case, "smaller" agents need to be 
easily manageable.
> Just one wish - a very low overhead agent. In my past we had a test
> machine with slightly reduced resources, yet the infra installed first
> IBM's TSM agent (Storage Manager), whose memory usage grew with the
> number of files, then BMC's Patrol with plugins for TSM etc., and of
> course we needed the CTM agent to run our jobs. In the end the machine
> couldn't actually run any tests because it had no memory left for those
> jobs. It sounds like a horror story, but it's very real (in the end I
> solved the issue by killing every agent except CTM before running the
> tests and restarting them after the tests, as I knew the root password
> and the infra's SLA).
I think we'll all agree on the requirement, but we need to agree on what 
it means for the implementation...
There will be a tradeoff to make between low overhead and some 
capabilities (like the agent generating alerts itself)... It will likely 
need to be configurable...
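Something along these lines, say (the property names and defaults are made 
up, just to show the shape of the knob):

    import java.util.Properties;

    public class AgentConfig {

        final boolean evaluateAlertsLocally; // more capable, more CPU/memory
        final int metricBufferSize;          // smaller buffer = lower footprint

        AgentConfig(Properties p) {
            // Default to the low-overhead profile; opt in to heavier features.
            this.evaluateAlertsLocally =
                Boolean.parseBoolean(p.getProperty("agent.alerts.local", "false"));
            this.metricBufferSize =
                Integer.parseInt(p.getProperty("agent.metrics.buffer", "128"));
        }

        public static void main(String[] args) {
            Properties p = new Properties();
            p.setProperty("agent.alerts.local", "true"); // opt in on big hosts
            AgentConfig cfg = new AgentConfig(p);
            System.out.println("local alerting: " + cfg.evaluateAlertsLocally
                    + ", buffer: " + cfg.metricBufferSize);
        }
    }

The deployer on a starved test box keeps the defaults; the deployer who 
wants alerts evaluated at the edge pays the footprint knowingly.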

Thomas
>
>     -  Micke