On 6 Jul 2017, at 15:00, David Lloyd <dlloyd@redhat.com> wrote:
>> To better integrate WildFly with OpenShift, we should provide a way to let OpenShift
>> check the healthiness of WildFly. The MPHC spec is a good candidate to provide such a
>> feature.
>> It is worth exploring how we could leverage it for user deployments and also for
>> WildFly internals (when that makes sense).
>> Swarm provides an implementation of the MPHC; we also need to see how WildFly and Swarm
>> can collaborate to avoid duplicating code and effort in providing the same feature to
>> our users.
> I like the idea of having a WildFly health API that can bridge to MPHC
> via a subsystem; this is consistent with what we've done in other
> areas. I'm not so sure about having (more?) APIs which drive
> services. It might be better to use cap/req to have a health
> capability to which other systems can be registered. This might allow
> multiple independent health check resources to be defined, for systems
> which perform more than one function; downstream health providers
> could reference the resource(s) to register with by capability name.
You are right.
If we provide our own health API, it will rely on req/cap to bind everything.
My idea was to provide an API that hides the req/cap plumbing but is built on top of it.
It’d be similar to what I’m doing in the messaging-activemq subsystem, where I almost
always hide the installation of services in static install() methods such as [1] that
require only some parameters and hide all the dependency/capability service names and
injection.
[1] https://github.com/wildfly/wildfly/blob/master/messaging-activemq/src/m...
> Is this a polling-only service, or is there a "push"
> mechanism?
Polling only.
The container (OpenShift) will call the HTTP endpoint regularly to check the
application's healthiness.
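To illustrate the polling model, here is a minimal, self-contained sketch using the JDK's built-in com.sun.net.httpserver (not WildFly's actual HTTP stack); the /health path, the port handling, and the always-healthy check are placeholders:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HealthEndpoint {
    // Placeholder for a real aggregated health check.
    static boolean isHealthy() {
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Bind to an ephemeral port; a real server would use a fixed one.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/health", exchange -> {
            boolean up = isHealthy();
            // Convention: HTTP 200 means healthy, 503 means unhealthy.
            byte[] body = (up ? "UP" : "DOWN").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(up ? 200 : 503, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();

        // Simulate one polling cycle, as the container would do.
        int port = server.getAddress().getPort();
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:" + port + "/health").openConnection();
        System.out.println("status=" + conn.getResponseCode());
        server.stop(0);
    }
}
```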
> Just brainstorming, I can think of a few more potentially useful
> health checks beyond what you've listed:
> • EJB failure rate (if an EJB starts failing more than some percentage
> of the last, say 50 or 100 invocations, it could report an "unhealthy"
> condition)
> • Database failure rate (something with JDBC exceptions maybe)
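The failure-rate idea above could be sketched as a sliding window over the last N invocation outcomes; everything below (class and method names, the 25% threshold) is illustrative, not an existing WildFly API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FailureRateCheck {
    private final int windowSize;
    private final double threshold;
    private final Deque<Boolean> outcomes = new ArrayDeque<>();
    private int failures;

    public FailureRateCheck(int windowSize, double threshold) {
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    // Record the outcome of one invocation.
    public synchronized void record(boolean success) {
        outcomes.addLast(success);
        if (!success) {
            failures++;
        }
        if (outcomes.size() > windowSize) {
            // Drop the oldest outcome so the window stays at N entries.
            if (!outcomes.removeFirst()) {
                failures--;
            }
        }
    }

    // Unhealthy once the failure ratio over a full window crosses the threshold.
    public synchronized boolean isUnhealthy() {
        if (outcomes.size() < windowSize) {
            return false; // not enough data yet
        }
        return (double) failures / outcomes.size() > threshold;
    }

    public static void main(String[] args) {
        FailureRateCheck check = new FailureRateCheck(100, 0.25);
        for (int i = 0; i < 100; i++) {
            check.record(i % 3 != 0); // every third invocation fails (~33%)
        }
        System.out.println("unhealthy=" + check.isUnhealthy());
    }
}
```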
That one is interesting.
When we talked about the API, I proposed to Heiko a health check that pings a JDBC
connection, and he told me it might be a bad idea after all.
If the database fails, the application will not function as expected, but restarting the
application will not make the problem go away (it's likely the DB that has the problem).
Having health checks that cross service boundaries (such as "my app" <—>
“DB”) may have a snowballing effect where one unhealthy service (the DB) would propagate
its unhealthiness to other services (“my app”).
In that case, the DB should be probed and restarted asap but there is nothing that should
be done in the app server.
We would need guidelines to determine which health checks actually make sense for WildFly
extensions.
Caucho has an interesting list of health checks[1] that could make sense for WildFly.
There are the usual suspects (memory, CPU) and some more interesting ones:
* JVM deadlock check
* transaction failure rate
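A JVM deadlock check maps directly onto the standard ThreadMXBean API; here is a minimal sketch (the wrapper class name is ours):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static boolean hasDeadlock() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Returns the ids of threads deadlocked on monitors or ownable
        // synchronizers, or null when none are found.
        long[] ids = threads.findDeadlockedThreads();
        return ids != null && ids.length > 0;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked=" + hasDeadlock());
    }
}
```

Note that findDeadlockedThreads() is documented as a relatively expensive operation, which is one reason to be careful about running it on every probe.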
We’d have to be careful implementing a JVM deadlock health check though.
These health checks should not impact the app server runtime too much and should be fast
(by default Kubernetes has a 1 second timeout for its liveness probe).
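For reference, a liveness probe in a Kubernetes pod spec might look like the fragment below; the /health path and port are placeholders, while timeoutSeconds: 1 and periodSeconds: 10 are the documented Kubernetes defaults:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 1   # default probe timeout; the check must answer within it
```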
jeff
[1]
http://www.caucho.com/resin-4.0/admin/health-checking.xtp#Defaulthealthco...
--
Jeff Mesnil
JBoss, a division of Red Hat
http://jmesnil.net/