[Hawkular-dev] scope of the agent design

Wed Mar 11 17:27:11 EDT 2015

On Wednesday, March 11, 2015 16:12:52 John Mazzitelli wrote:
> > > So, what do we plan on having the agent do? What is its scope?
> > 
> > I take it that the agent is a "metrics collector". In my mind, it has
> > no power to influence the state of the application or machine: it just
> > collects and reports.
> 
> There was talk some time ago to perhaps think about the agent also doing
> some preliminary alert processing (which brings up more questions - for
> example, how does the agent get alert definitions from the server?)

Alerting has a CRUD interface for it, doesn't it? Since we assume only
agent->server traffic, the agent will have to know the server's address one way
or another.

Also, I am not sure that we want to consider the agent a "monolith" doing 
"agenty" stuff and *also* something else like alert preprocessing. For the 
agent, I think we should also think about how we could slice it into individual 
(semi-)independent pieces that could potentially live on their own (imagine 
this topology: agent -> preprocessor -> inventory
                                     -> metrics
                                     -> alerting).
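
To make the topology above a bit more concrete, here is a minimal sketch of a
preprocessor that fans agent messages out to independent components. Everything
in it is an assumption for illustration: the `Preprocessor` class, the message
`kind` values and the list-based stand-ins for inventory, metrics and alerting
are hypothetical and not part of any existing Hawkular API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


@dataclass
class Preprocessor:
    """Routes each agent message to the components registered for its kind."""
    routes: Dict[str, List[Handler]] = field(default_factory=dict)

    def register(self, kind: str, handler: Handler) -> None:
        self.routes.setdefault(kind, []).append(handler)

    def dispatch(self, message: dict) -> int:
        """Deliver the message and return how many handlers received it."""
        handlers = self.routes.get(message["kind"], [])
        for handler in handlers:
            handler(message)
        return len(handlers)


# Plain lists stand in for the three downstream components (hypothetical).
inventory_seen: List[dict] = []
metrics_seen: List[dict] = []
alerts_seen: List[dict] = []

pre = Preprocessor()
pre.register("resource", inventory_seen.append)
pre.register("metric", metrics_seen.append)
# Alerting is assumed to evaluate metric data, so it subscribes to it as well.
pre.register("metric", alerts_seen.append)

pre.dispatch({"kind": "resource", "id": "eap-1"})
pre.dispatch({"kind": "metric", "id": "eap-1.heap", "value": 42})
```

The agent only ever talks to `dispatch`, so each downstream piece could in
principle be replaced or run on its own.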

> In
> addition, are we going to have the agent be able to perform operations such
> as "start" or "stop" or "restart" on managed resources? Can an agent
> reconfigure a managed resource?
>

Yeah, I am not sure if operations and config management have been scoped in. 
If they end up in, though, the agent is an active player in the environment.

> So an agent is potentially more than just a "metrics collector". If an agent
> is going to be nothing more than a metrics collector, that would be
> something important to decide now because the agent will be much easier to
> implement if so.

+1.

Also, because the agent's metrics collection is inherently polling-based, we 
need to figure out how we are going to define how often the metrics are going
to be collected.
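
One possible shape for this, purely as a sketch: each metric carries its own
interval and the agent asks a schedule which metrics are due. The `PollSchedule`
class, the metric names and the idea of per-metric intervals in seconds are all
assumptions here, not something that has been decided.

```python
import heapq
from typing import Dict, List, Tuple


class PollSchedule:
    """Tracks when each metric is next due, given a per-metric interval (seconds)."""

    def __init__(self, intervals: Dict[str, float], start: float = 0.0) -> None:
        if any(seconds <= 0 for seconds in intervals.values()):
            raise ValueError("intervals must be positive")
        self.intervals = dict(intervals)
        # Min-heap of (next_due_time, metric_name); everything is due at start.
        self._queue: List[Tuple[float, str]] = [
            (start, name) for name in sorted(intervals)
        ]
        heapq.heapify(self._queue)

    def due(self, now: float) -> List[str]:
        """Return the metrics due at `now` and reschedule each one interval later."""
        ready = []
        while self._queue and self._queue[0][0] <= now:
            _, name = heapq.heappop(self._queue)
            ready.append(name)
            heapq.heappush(self._queue, (now + self.intervals[name], name))
        return ready


schedule = PollSchedule({"cpu.usage": 10, "disk.free": 60})
first = schedule.due(0)    # ['cpu.usage', 'disk.free']
second = schedule.due(10)  # ['cpu.usage']
```

The clock is passed in rather than read inside the class, which keeps the
schedule easy to test and leaves the actual timer loop to the agent.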

> > Collecting to me means: take a snapshot of the moment and store it
> > somewhere locally.
> > 
> > Reporting to me means: take the stored snapshots and send it to a
> > server somewhere. Once the server ACKs that it received, those
> > snapshots are then removed from the local storage.
> 
> This is roughly how we do envision this. Collecting means spooling the
> metrics until it's time to send the data to the server (what you call
> "reporting"). I originally envisioned this as being metric messages we send
> to the server over the bus - but since it appears no one likes having a bus
> infrastructure as the core messaging infrastructure between components, it
> would have to be a REST request (with a load balancer in front to support
> scalability :-D). But that's getting into implementation details.
>
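
The collect/report cycle described above can be sketched roughly as follows.
This is an in-memory illustration only: the `Spool` class is hypothetical, and
`send` is a stand-in for whatever transport is chosen (for example a REST
request) that returns True when the server ACKs the batch.

```python
from collections import OrderedDict
from typing import Callable, List


class Spool:
    """Holds collected snapshots until the server acknowledges them."""

    def __init__(self) -> None:
        self._pending: "OrderedDict[int, dict]" = OrderedDict()
        self._next_id = 0

    def collect(self, snapshot: dict) -> int:
        """Store one snapshot locally and return its spool id."""
        snapshot_id = self._next_id
        self._next_id += 1
        self._pending[snapshot_id] = snapshot
        return snapshot_id

    def report(self, send: Callable[[List[dict]], bool]) -> int:
        """Send all pending snapshots; remove them only if `send` reports an ACK."""
        if not self._pending:
            return 0
        batch_ids = list(self._pending)
        if not send([self._pending[i] for i in batch_ids]):
            return 0  # no ACK: keep everything for the next attempt
        for i in batch_ids:
            del self._pending[i]
        return len(batch_ids)

    def __len__(self) -> int:
        return len(self._pending)


spool = Spool()
spool.collect({"metric": "cpu.usage", "value": 0.42})
spool.collect({"metric": "disk.free", "value": 1024})

failed = spool.report(lambda batch: False)     # server unreachable, nothing dropped
kept_after_failure = len(spool)
delivered = spool.report(lambda batch: True)   # ACK received, spool emptied
```

A real agent would presumably spool to disk so that snapshots survive a
restart; the removal-on-ACK logic would stay the same.
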
> > > How much knowledge of the inventory does the agent need to know?
> > > How does it gain this knowledge of its inventory?
> > 
> > Is "inventory" also a definition from RHQ? Or can I assume that
> > "inventory" is "a thing that can be measured"? Like, "CPU" is an item
> > on the inventory, as well as "Memory" and "Web Application 'Acme
> > Travel System'".
> 
> Yes, "inventory" is a definition we used from RHQ-land.
> 

Also, it is a name of a lesser-known Hawkular component ;)

> Inventory just means those resources that are known to be installed/running
> and are being managed/monitored by Hawkular. They can be CPUs, file
> systems, or things like EAP instances, or other running products like that.
> 
> It can get very complicated to ensure agents synchronize the known inventory
> that the server has.

I don't think this needs to be too complex in Hawkular though. The agent just 
needs to be told what NOT to collect.
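
As a sketch of what "told what NOT to collect" could look like, the agent could
filter whatever it sees against a list of exclusion patterns handed down by the
server. The function name, the resource naming scheme and the use of shell-style
wildcards are assumptions made for this example.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def collectable(discovered: Iterable[str], excluded_patterns: Iterable[str]) -> List[str]:
    """Keep every discovered resource that matches no exclusion pattern."""
    patterns = list(excluded_patterns)
    return [
        resource
        for resource in discovered
        if not any(fnmatchcase(resource, pattern) for pattern in patterns)
    ]


discovered = ["cpu/0", "cpu/1", "fs/tmp", "eap/standalone"]
kept = collectable(discovered, ["fs/*"])
# kept == ['cpu/0', 'cpu/1', 'eap/standalone']
```

With this shape the agent holds no copy of the server's inventory, only a
small deny-list.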

> > > Does the agent auto-discover resources? How?
> > 
> > Yes.

In RHQ ;) It is both a cool feature and a curse because discovery can 
literally bring the machine to its knees with the RHQ agent and extremely 
"densely populated" machines. We used to have problems with EAP5 for which the 
discovery was so heavyweight (not RHQ's fault actually, but hey) that we could 
spin the CPU at 100% for tens of minutes just to discover the 20 EAP5s running 
on the box.

> > 
> > For web applications, via a Java Agent, as this would allow to detect
> > "everything" that runs on the JVM. With that, it would be possible to
> > detect deployments and its capabilities (JAX-RS, EJBs, Batch jobs,
> > ...) and apply specific metrics to specific facets. For instance, for
> > a JAX-RS application, there could be a collection of metrics per
> > endpoint (average/mean/max/min response times). For EJBs, the same but
> > per "business method". For JPA, it could collect the time spent on the
> > DB (and perhaps even aggregate that by prepared statement).
> > 
> > For system resources, I think it would need to be something that would
> > run with (high?) privileges, so that it can take data like disk space
> > in all partitions, CPU/memory/swap usage, network traffic, ...
> > 

You just described RHQ ;)

> > > Can the agent be told about resources that it cannot
> > > auto-discover?
> > 
> > Do you have a sample use case for this? What I can imagine is: web
> > application 'Acme Travel System' is deployed but the agent can't
> > detect it automatically. Something tells the agent that there's a web
> > application deployed.
> 
> Exactly. That's right.
> 
> > But still, if it cannot see it, how can it measure it?
> 
> A prime example is we know an EAP instance is there, but we don't have the
> credentials to probe it. So the agent could know an EAP instance is there,
> but what's the admin password? Without it, we can't get that EAP's metrics;
> with it, we can.
> 
> Or, another example is that there is an EAP instance but it is not running.
> Agent has no idea where it is (or even that it exists). 

That is true unless we do discovery by filesystem scan (which I DO NOT 
propose, just pointing out the possibility).

> We can tell the
> agent "There is an EAP in directory '/my/app/foo' --- and oh, by the way,
> its admin credentials are 'admin/foo'".
> 
> Or, another example is that it's running on a non-default port that the agent
> wasn't expecting and therefore didn't see. So we'd be able to say "Acme
> Travel System is installed on port 9091 (not the default 8080)."
> 
> So, there could be many reasons why an agent cannot auto-detect something
> but, with a little help from a human administrator to give it some hints,
> the agent can be told about a resource and be able to manage that
> resource.
>
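
The three cases above (missing credentials, an instance that is not running,
a non-default port) could all be handled by merging administrator hints into
what the agent discovered. A minimal sketch, assuming a hypothetical `Resource`
record and hint format; the values are the ones from the examples above.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Resource:
    name: str
    path: Optional[str] = None
    port: Optional[int] = None
    username: Optional[str] = None
    password: Optional[str] = None

    @property
    def has_credentials(self) -> bool:
        return self.username is not None and self.password is not None


def apply_hints(discovered: Dict[str, Resource], hints: Dict[str, dict]) -> Dict[str, Resource]:
    """Overlay administrator-supplied fields; unknown names become new resources."""
    merged = dict(discovered)
    for name, fields in hints.items():
        base = merged.get(name, Resource(name=name))
        merged[name] = replace(base, **fields)
    return merged


# Auto-discovery found one running EAP but has no way to log in to it.
discovered = {"eap-1": Resource("eap-1", port=9990)}

hints = {
    "eap-1": {"username": "admin", "password": "foo"},
    # Not running, so discovery never saw it at all.
    "eap-2": {"path": "/my/app/foo", "username": "admin", "password": "foo"},
    # Running on a port the agent was not expecting.
    "acme-travel": {"port": 9091},
}

resources = apply_hints(discovered, hints)
```
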
> > > Does the agent need to be pluggable so it can load and interface
> > > with plugins to handle monitoring of different resources? What are
> > > these plugins?
> > 
> > Not sure I understand the question, but I'd say yes: as an application
> > developer, I want to be able to add custom metrics, such as "# of
> > logged in users", "# of registrations", "# of checkouts" and so on.
> 
> And not just add new, custom metrics, but add entirely new resource types,
> too. For example, we'd like Hawkular to be able to monitor all JBoss
> middleware products. Suppose we do that - Hawkular 1.0 is a success and
> people are using it to monitor every JBoss middleware product in existence.
> But the following year, JBoss might introduce a completely new "JBoss
> Awesome Product version 1.0" - we'd like an easy way to plug in new
> capabilities to an existing and running Hawkular system so it can quickly
> start managing JBoss Awesome Product instances without having to upgrade
> the entire Hawkular system.

Well, in here I think we need to answer a crucial question. Do we want to 
continue the RHQ path of "writing agent plugin for X" or should we rather try 
to focus on "providing X with easy to use tools to report to Hawkular", i.e. 
easily consumable REST API, possibly some other niceties for Java-based 
applications, like annotations for declarative monitoring, or some kind of 
Wildfly extension that could be configured to inject itself into deployments and 
collect configured metrics (byteman-based bytecode enhancements?), etc.
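
To illustrate the "declarative monitoring" idea, here is a rough sketch in which
a Python decorator stands in for a Java annotation. The `monitored` decorator,
the `FEED` buffer and the metric name are hypothetical; in a real feed the
buffered values would be sent to Hawkular rather than kept in a dict.

```python
import functools
import time
from typing import Callable, Dict, List

# In-process buffer standing in for data that would be reported to the server.
FEED: Dict[str, List[float]] = {}


def monitored(metric: str) -> Callable:
    """Record the wall-clock duration of each call under the given metric name."""
    def decorate(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even if the call raises, so failures are timed too.
                FEED.setdefault(metric, []).append(time.perf_counter() - started)
        return wrapper
    return decorate


@monitored("checkout.duration")
def checkout(cart: list) -> int:
    """Application code stays unaware of how the metric is collected."""
    return len(cart)


items = checkout(["book", "pen"])
```

The developer of X only declares what to measure; how and when the data
reaches the server is left to the feed.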

The latter I think is favorable if for nothing else than for the fact that it 
creates potentially less work for us and lets the "domain expert", i.e. the 
developer of X think about what and how to report instead of us learning X in 
haste and second guessing.

IMHO the agent exists only for the cases where embedding the "feed" into the 
managed software itself is not possible.

Note that the embedded feed has other benefits, too. For one, it is (or at 
least has a chance of being) in-process with the data, and more importantly, 
if it is in-process, it doesn't create any security concerns.
