On 03/13/2015 09:28 AM, Michael Burman wrote:
On 13.03.2015 09:57, Thomas Heute wrote:
> For uptime/downtime for instance, the embedded in-process agent can only
> say "I'm up"; if not up -> it's down or unknown (network issue?). From a
> separate process you can tell if it's down, so a separate process is better
> in that case (but it may not be installed, so do we fall back on embedded
> process info?).
A network issue will affect the embedded and the separate-process agent
just as likely; there's no difference in that case.
The difference is that:
- if embedded, it can only say "I'm up", "I'm up", "I'm up"; if you
receive nothing, you can't differentiate between being down and not
receiving the up message
- if not embedded, it can say "Resource is up", "Resource is down"; if
you receive nothing, you can tell that the agent is messed up and the
state of the resource is unknown.
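To make that concrete, here is a minimal sketch (hypothetical names, not
actual Hawkular code, Java only as illustration) of how a server could
interpret the two reporting styles:

import java.time.Duration;
import java.time.Instant;

public class AvailabilityInterpreter {

    enum Availability { UP, DOWN, UNKNOWN }

    // Embedded agent: the only signal is an "I'm up" heartbeat. When it
    // stops, we cannot tell "resource down" apart from "heartbeat lost".
    static Availability fromEmbeddedHeartbeat(Instant last, Instant now, Duration timeout) {
        if (last != null && Duration.between(last, now).compareTo(timeout) <= 0) {
            return Availability.UP;
        }
        return Availability.UNKNOWN; // down? network issue? we can't say
    }

    // Separate-process agent: it reports UP or DOWN explicitly. Silence
    // means the *agent* is in trouble and the resource state is unknown.
    static Availability fromExternalAgent(Availability reported, Instant last, Instant now, Duration timeout) {
        if (last == null || Duration.between(last, now).compareTo(timeout) > 0) {
            return Availability.UNKNOWN; // agent itself is unreachable
        }
        return reported; // explicit UP or DOWN
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Duration timeout = Duration.ofSeconds(30);
        // Embedded heartbeat went silent two minutes ago: best answer is UNKNOWN.
        System.out.println(fromEmbeddedHeartbeat(now.minusSeconds(120), now, timeout));
        // External agent reported DOWN ten seconds ago: we actually know DOWN.
        System.out.println(fromExternalAgent(Availability.DOWN, now.minusSeconds(10), now, timeout));
    }
}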
Also, a separate process can't really tell if something is down or not;
the example mentioned here was that the CPU of the process is overloaded
and can't report anything, but the system agent can see the pid is up.
This is definitely not the case usually: having a PID up doesn't mean
the software is still alive (who here hasn't killed their Cassandra with
kill -9 more than once?).
There are various ways to tell if something is down; we can find many
exceptions, but let's not let the exceptions state the rule.
If there is no process running, an agent can tell that it is down, but
the process can't tell that itself.
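On the PID point, just for illustration (made-up pid and port values, not
project code): a PID being present is a much weaker signal than the
service actually answering, which is what an external agent can check.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class LivenessProbe {

    // "The PID is up": the OS still knows the process...
    static boolean pidExists(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    // ...but only a real probe tells us the software still answers, e.g.
    // trying to open the port the service is supposed to listen on.
    static boolean portAnswers(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long pid = 12345;                                         // hypothetical Cassandra pid
        boolean answering = portAnswers("localhost", 9042, 2000); // 9042 = default CQL port
        if (pidExists(pid) && !answering) {
            System.out.println("PID up but not responding - hung, not healthy");
        } else if (!pidExists(pid)) {
            System.out.println("No process at all -> an external agent can report DOWN");
        } else {
            System.out.println("Process present and answering");
        }
    }
}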
As an example of an agent system, BMC's Control-M handles this differently
than we're planning. While it's a job monitoring system, it has two
statuses: agent status and job status. If the agent cannot be contacted
for a certain period of time, the agent is marked down (and alerted),
while the job itself is marked with an unknown state. If the agent is up
but can't read the job's status, the job is again marked unknown. Only if
it really knows something has failed is the job marked as failed.
So they still have an agent.
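Roughly, that two-level model could look like this (a sketch of the idea
described above, not Control-M code):

public class TwoLevelStatus {

    enum AgentStatus { UP, DOWN }
    enum JobStatus { OK, FAILED, UNKNOWN }

    // FAILED is only used when a failure is positively known; everything
    // else degrades to UNKNOWN, and the agent gets its own alert when down.
    static JobStatus resolveJobStatus(AgentStatus agent, Boolean jobFailedReport) {
        if (agent == AgentStatus.DOWN) {
            return JobStatus.UNKNOWN;   // alert on the agent, not the job
        }
        if (jobFailedReport == null) {
            return JobStatus.UNKNOWN;   // agent up but couldn't read the job's status
        }
        return jobFailedReport ? JobStatus.FAILED : JobStatus.OK;
    }

    public static void main(String[] args) {
        System.out.println(resolveJobStatus(AgentStatus.DOWN, null));       // UNKNOWN
        System.out.println(resolveJobStatus(AgentStatus.UP, null));         // UNKNOWN
        System.out.println(resolveJobStatus(AgentStatus.UP, Boolean.TRUE)); // FAILED
    }
}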
> Agent-less works *if* the network is open enough to allow it...
>
> Also for the embedded one, we may be bound to product releases, unless
> we instrument the server ourselves and update as we wish.
>
I don't think the product releases are an issue if we have working
versioning in our APIs. We should just log the version of the agents in
our UI. As long as the API versioning is handled correctly, we should be
able to support older versions quite fine. Sure, those services wouldn't
get new features, but I don't think that is an issue if the new features
are marked as product features (e.g. EAP monitoring features) instead of
our updated features.
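As a sketch of what that could look like (all names and version numbers
made up, not an existing Hawkular API): the agent reports its API version
when it connects, the server logs it and keeps serving older versions,
and only the newer features are gated on it.

public class AgentRegistry {

    static final int CURRENT_API_VERSION = 3;
    static final int OLDEST_SUPPORTED_VERSION = 1;

    static void register(String agentId, int apiVersion) {
        if (apiVersion < OLDEST_SUPPORTED_VERSION) {
            throw new IllegalArgumentException("Agent " + agentId + " speaks an unsupported API version");
        }
        // Surface the version in the UI/logs instead of refusing the agent.
        System.out.println("Registered agent " + agentId + " speaking API v" + apiVersion);
        if (apiVersion < CURRENT_API_VERSION) {
            System.out.println("  -> newer (e.g. EAP monitoring) features disabled for this agent");
        }
    }

    public static void main(String[] args) {
        register("eap-host-01", 2); // older agent, still supported
        register("eap-host-02", 3); // current agent, all features
    }
}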
I raised it because it has been a big issue for JON updates. So even if
it's not a blocker, it is a concern and needs to be taken into account.
My wish is that we would support multiple approaches.
We need to be careful: multiple approaches = increased QE effort, and
since that is not infinite, this leads to decreased quality.
If we have a platform agent, it should take care of connecting to these
product 'agents' and dispatching stuff to the server, but otherwise there
could be smaller agents doing just a simple job. It's not unheard of that
some enterprise wants to monitor infra with different tools than what's
used to monitor running applications, as they could be monitored by
different departments with different responsibilities. For containers,
we'll probably want to do something like cAdvisor, so running a container
monitoring other containers.
I am not sure I understand. In any case, "smaller" agents need to be
easily manageable.
Just one wish - a very low overhead agent. From my past we had a test
machine with slightly reduced resources, yet the infra team first installed
IBM's TSM agent (Storage Manager), which had increasing memory usage based
on how many files there were, then BMC's Patrol with plugins for TSM etc.,
and of course we needed the CTM agent to run our jobs. In the end the
machine couldn't actually run any tests because it had no memory left
for those jobs. Sounds like a horror story but it's very real (in the end
I solved the issue by killing every agent except CTM before running the
tests and then restarting them after the tests, as I knew the root
password and the infra's SLA).
I think we'll all agree on the req, but we need to agree on what that
means for the implementation...
There will be a tradeoff to make between low overhead and some
capabilities (like generating alerts itself)... Will likely need to be
configurable...
Thomas
- Micke
_______________________________________________
hawkular-dev mailing list
hawkular-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/hawkular-dev