How do we explain the drops in requests in the following example:
Shouldn't it always be the same load?
Do we know if it's an issue in APM, or is it that the load balancing
"loses" some requests as we scale down a server? Or something else?
With the changes that are now going to include Prometheus, how do we want
to deploy this in OpenShift?
We can have a few options:
We put both Hawkular Services and Prometheus in the same container.
- easy to deploy in plain docker (but this doesn't appear to be a use case
we are targeting anyway)
- shares the same network connection (even localhost) and ip address
(though the two services are on different ports).
- Doesn't require any special wiring of components.
- Can share the same volume mount
- version of components can't get out of sync.
- workflow doesn't work nicely. Docker containers are meant to only run a
single application and running two can cause problems. Eg lifecycle events
would become tricky and require some hacks to get around things.
- can't independently deploy things
- can't reuse or share any existing Prometheus docker containers.
Hawkular Services and Prometheus are in their own containers, but they are
both deployed within the same pod.
- shares the same network connection.
- bound to the same machine (useful if sharing the same hostpath pv) and
we don't need to worry about external network configurations (eg firewalls
between OpenShift nodes)
- pvs can be shared or separate.
- lifecycle events will work properly.
- readiness checks will mean that both containers will have to pass before
the pod will enter the ready state. So if Prometheus is failing for some
reason, Hawkular Services will not be available under the service.
- cannot independently update one container. If we need to deploy a new
container we will need to bring down the whole pod.
- are stuck with a 1:1 ratio between Hawkular Services and Prometheus
Hawkular Services and Prometheus have their own separate pods.
- can independently run components and each component has its own separate
- if in the future we want to cluster Hawkular Services, this will make it
a lot easier and will also allow for running an n:m ratio between Hawkular
Services and Prometheus
- probably the more 'correct' way to deploy things as we don't have a
strong requirement for Hawkular Services and Prometheus to run together.
- more complex wiring. We will need to have extra services and routes
created to handle this. This means more things running and more chances for
things to go wrong. Also more things to configure.
- reusing a PV between Hawkular Services and Prometheus could be more
challenging (especially if we are using hostpath pvs). Updating the
Prometheus scrape endpoint may require a new component and container.
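To make the second option (separate containers, same pod) more concrete,
here is a rough sketch that builds such a pod manifest as a Python dict and
prints it as JSON. The image names, ports (8080 and 9090) and mount paths
are assumptions for illustration only, not what we would actually ship.

```python
import json


def two_container_pod(name, services_image, prometheus_image):
    """Build a Pod manifest that runs both components side by side.

    Both containers share the pod's network namespace (so they can talk
    over localhost) and, in this sketch, a single shared volume.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "hawkular-services",
                    "image": services_image,
                    "ports": [{"containerPort": 8080}],  # assumed port
                    "volumeMounts": [
                        # hypothetical mount path
                        {"name": "data", "mountPath": "/var/lib/hawkular"}
                    ],
                },
                {
                    "name": "prometheus",
                    "image": prometheus_image,
                    "ports": [{"containerPort": 9090}],  # assumed port
                    "volumeMounts": [
                        # hypothetical mount path
                        {"name": "data", "mountPath": "/prometheus"}
                    ],
                },
            ],
            # emptyDir keeps the sketch simple; a real deployment would
            # more likely use a persistent volume claim here.
            "volumes": [{"name": "data", "emptyDir": {}}],
        },
    }


if __name__ == "__main__":
    # Image names below are placeholders.
    pod = two_container_pod(
        "hawkular", "example/hawkular-services", "example/prometheus"
    )
    print(json.dumps(pod, indent=2))
```

For the third option the two entries under "containers" would simply move
into two separate pod definitions, each with its own service.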
In the context of tasks of HAWKULAR-1275 I think that moving those config
files inside manageiq-providers-hawkular may make sense as probably I need
to split them per type (EAP6 might have different metrics than EAP7, for
I guess it shouldn't be a problem as our provider is the only user for this.
Does anyone see any issue if I perform this change ?
For Hawkular Services, we want to be able to handle monitoring EAP
instances no matter where they are running.
So we could have some eap instances running on bare metal, running in a vm,
running as docker images somewhere, running in various OpenShift or
For baremetal and vm instances, this should be similar to how we have
handled them in the past.
For OpenShift or Kubernetes, I am not sure if we have figured out how this
should function. Particularly with metric endpoints that need to be
accessed from outside of the OpenShift cluster.
If we are running Hawkular Services in an OpenShift cluster and monitoring
eap pods within that cluster, by default Hawkular Services should be able
to communicate with all the eap pods in the cluster by their ip address. So
this is not much of an issue.
But, if the ovs-multitenant SDN plugin is enabled instead, then only pods
within the same project can communicate with each other. So if we are
running Hawkular Services in one project we cannot reach the metric
endpoint of eap instances running in another project. Running Hawkular
Services in the 'default' project (vnid0) gives it special privileges to
read from any pod, but this also means that only admins will be able to
There is also the new ovs-networkpolicy plugin, which allows for Kubernetes
network policy. And this may further limit communication between pods.
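The reachability rules described above can be written down as a small
decision function. This is only a model of what is described in this mail
(it does not talk to a cluster), and the function name and return
convention are made up for the example. "ovs-subnet" is the default plugin.

```python
def can_scrape(plugin, source_project, target_project):
    """Return True/False if the rules above decide it, or None if unknown.

    plugin          -- name of the OpenShift SDN plugin in use
    source_project  -- project where Hawkular Services runs
    target_project  -- project where the EAP pod runs
    """
    if plugin == "ovs-subnet":
        # Default flat network: every pod can reach every other pod.
        return True
    if plugin == "ovs-multitenant":
        # Pods only see their own project, except that the 'default'
        # project (VNID 0) is allowed to reach pods in any project.
        return source_project == "default" or source_project == target_project
    if plugin == "ovs-networkpolicy":
        # Depends entirely on the NetworkPolicy objects that are defined.
        return None
    raise ValueError("unknown SDN plugin: %s" % plugin)


if __name__ == "__main__":
    print(can_scrape("ovs-multitenant", "monitoring", "shop"))
    print(can_scrape("ovs-multitenant", "default", "shop"))
```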
If we move Hawkular Services outside of the OpenShift cluster, then this
can get tricky and I don't know what we can really do here. Even if we were
to have Hawkular Services run with the same network setup as OpenShift (so
it can access the pod endpoints) I don't think we can do this with multiple
Normally, if you want to expose something outside of an OpenShift cluster,
you would do so using a route. But this is not going to work for individual
pods in a replica set.
There is also the API proxy that could be used to access individual pod
endpoints, but I think this could cause a performance problem. And the
agent may not know the endpoint to tell p8s to start scraping from.
Has anyone started to look into this yet?
We are currently doing both push (inventory) and pull (metrics), which
means we are going to have to deal with configuring things on both ends,
and handling security here might get interesting.
For push, we need to pass to the agent:
- the url for Hawkular Services
- the username & password
- the CA certificate (optional; if Hawkular Services is using tls with
And we need to make sure that Hawkular Services is signed with a
certificate valid for its hostname and make sure it's easy to export the CA
certificate so that it's easy to pass on to the agents.
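As a rough illustration of the three push settings listed above (url,
username/password, optional CA certificate), here is a sketch using only
the Python standard library. It assumes HTTP Basic authentication and a
JSON payload, neither of which is decided in this mail, and the function
names are made up for the example. It is not how the agent is implemented.

```python
import base64
import ssl
import urllib.request


def build_push_request(url, username, password, payload):
    """Prepare a POST request carrying the credentials (Basic auth assumed)."""
    raw = ("%s:%s" % (username, password)).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": "Basic " + token,
            "Content-Type": "application/json",  # assumed payload format
        },
    )


def build_tls_context(ca_file=None):
    """Create a verifying TLS context.

    With ca_file=None the system trust store is used; otherwise the given
    CA certificate (exported from Hawkular Services) is trusted as well.
    Hostname checking stays enabled, which is why the server certificate
    must be valid for the hostname in the url.
    """
    return ssl.create_default_context(cafile=ca_file)
```

The agent side would then send the request with something like
urllib.request.urlopen(request, context=context).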
For pull, this might get a bit tricky.
To access a pod's metric endpoint we will need to do so using its ip
address, and to do this properly the certificate used for the metric
endpoint must be valid for that ip address. Since the ip address of a pod
is not known before a pod is created, this means we need something to
dynamically generate a certificate for us which we fetch at startup. This
also means we cannot have a common secret containing the certificate that
can be shared across replica sets.
To do this properly with pods may require a lot of extra effort. With
'pets' it's a lot easier.
Even if we have properly signed certificates, there is also a question of
how we get the CA for those certificates into Prometheus.
Do we really need to have p8s trust the certificate for the endpoint which
is being exposed? Or could we configure p8s to trust any certificate
without validating it first? There is no extra verification if someone
decides to use a non-https endpoint for instance.
I see a few options here, but I might be missing other options as well:
1) by default we check for certificate validation, but we allow an override
to disable it. If someone really wants to use certificate validation with
pods, then they can figure out on their own how to get the right
certificates into the pod to be used by the agent.
2) we provide some service which when an agent registers with inventory, we
generate a certificate and key they can use (signed by our own CA). The
metrics endpoint then uses this certificate.
3) we do something like not expose an http endpoint at the agent, but
tunnel this to Hawkular Services. P8s could then read the metric endpoints
directly from Hawkular Services.
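For option 1, the override could end up as a flag on the scrape job that
Hawkular Services generates for p8s. The sketch below builds such a job as
a Python dict; scheme, tls_config, ca_file and insecure_skip_verify are
existing Prometheus scrape settings, but the function, its parameters and
the target address are hypothetical.

```python
import json


def scrape_job(job_name, targets, ca_file=None, skip_verify=False):
    """Build one entry for the scrape_configs list in prometheus.yml.

    Validation is on by default; skip_verify=True is the explicit
    override that makes p8s accept any certificate from the endpoint.
    """
    tls_config = {"insecure_skip_verify": skip_verify}
    if ca_file is not None:
        # CA used to validate the endpoint's certificate.
        tls_config["ca_file"] = ca_file
    return {
        "job_name": job_name,
        "scheme": "https",
        "tls_config": tls_config,
        "static_configs": [{"targets": list(targets)}],
    }


if __name__ == "__main__":
    # 10.1.2.3:9779 is a made-up pod ip and port.
    job = scrape_job("eap-pod", ["10.1.2.3:9779"], skip_verify=True)
    print(json.dumps({"scrape_configs": [job]}, indent=2))
```

The JSON output is also valid YAML, so it can be written out as a p8s
configuration file as is.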
I'm looking at how domain mode monitoring is going to work now that we are moving toward the Prometheus / jmx-exporter way of doing things.
Turns out, due to the way WildFly domain mode exposes things via JMX (rather, I should say, *doesn't* expose things) compared to DMR, I think we are going to have to require the hawkular agent to be deployed to all slave servers, in addition to the master host controller.
In domain mode, you can access all the child slave servers (the individual servers in all server groups) via the host controller's DMR tree - so we only needed one agent on the host controller to monitor the domain. For example, to see a metric for a deployment on "server-one" that is managed by host controller "master", I can simply ask the host controller for it:
"outcome" => "success",
"result" => 0
The nice thing about WildFly's management DMR interface is that the domain mode tree is identical to the standalone tree with the exception that on the host controller, you simply prepend the host/server identifiers to the DMR path. So, for a standalone WildFly, a DMR path for the active-sessions metric for my test-simple application would be:
If I want to ask the host controller to give me the exact same metric from that server, I simply prepend the host/server name in the DMR path:
The agent has knowledge of this pattern; knowing this pattern we can cleverly configure the metadata so we can share types across domain and standalone for inventory discovery.
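The prepend pattern itself is simple enough to sketch in a few lines. This is only an illustration of the pattern, not the agent's actual code, and the deployment path used in the example is a hypothetical one rather than the path from the commands above.

```python
def to_domain_path(standalone_path, host, server):
    """Turn a standalone DMR path into the host controller's view of it.

    The domain path is the standalone path with the host and server
    identifiers prepended, e.g. /host=master/server=server-one/...
    """
    if not standalone_path.startswith("/"):
        raise ValueError("DMR path must start with '/': %r" % standalone_path)
    return "/host=%s/server=%s%s" % (host, server, standalone_path)


if __name__ == "__main__":
    # Hypothetical standalone path, for illustration only.
    path = "/deployment=example.war/subsystem=undertow"
    print(to_domain_path(path, "master", "server-one"))
```

Because the mapping is a pure prefix, one set of type definitions written against standalone paths can be reused for every server in the domain.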
The problem: WildFly does not expose its JMX MBeans in an equally clever way. I do not see MBeans that provide metrics for the slave servers.
For example, in JMX, I see WildFly host controller has this MBean:
Looks very analogous to the DMR resource named /host=master/server=server-one ... right??
Well, there are no deployments associated with this MBean. You would think (following the DMR pattern) that there would be an MBean named:
But there is not. Nor is there:
which is where you would expect that server's web subsystem metrics to be (if it were to follow the same DMR pattern). But, again, this doesn't exist.
I can't find JMX MBeans for anything related to the individual slave servers (not just deployments).
In short, I do not believe the Prometheus JMX Exporter can be used to collect metrics for all slave servers in a domain if that JMX Exporter is simply installed on a host controller. This is because the host controller's JMX MBeans do not expose metric data for individual slave servers.
We would have to have our agent in each slave server (which are just standalone servers - so the agent would be as if it is monitoring a standalone server). We could have an agent in the host controller, too, but it would only be responsible for monitoring/managing the host controller itself.
[this message was sent on Tuesday, November 21, 2017 at 10:11 PM EST]