Hi,
I had a look at the Eclipse MicroProfile Healthcheck spec[1] and wanted to share some
thoughts and experiments about it, how it relates to WildFly and its use in containers
(such as OpenShift).
# Eclipse MicroProfile Healthcheck
The Eclipse MicroProfile Healthcheck (MPHC for short) is a specification to determine the
healthiness of an application.
It defines a Health Check Procedure (HCP for short) interface that an application can
implement to report its healthiness. It’s a single method that returns a Health
Status: either UP or DOWN (+ some metadata).
Typically, an application would provide one or more HCPs to check the healthiness of its
parts.
The overall healthiness of the application is determined by aggregating all the HCPs
provided by the application. If any HCP is DOWN, the overall outcome is DOWN. Otherwise,
the application is considered UP.
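To make the aggregation rule concrete, here is a minimal sketch. The class and enum names are hypothetical and are not the types defined by the MPHC spec; it only illustrates "any DOWN means DOWN".

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical names for illustration only; these are not the MPHC spec types.
final class HealthAggregationSketch {

    enum Outcome { UP, DOWN }

    /**
     * Overall outcome is DOWN if any procedure reports DOWN, otherwise UP.
     * Every procedure is called (no short-circuit), since an implementation
     * would presumably also want to report each individual result.
     */
    static Outcome aggregate(List<Supplier<Outcome>> procedures) {
        Outcome overall = Outcome.UP;
        for (Supplier<Outcome> procedure : procedures) {
            if (procedure.get() == Outcome.DOWN) {
                overall = Outcome.DOWN;
            }
        }
        return overall;
    }

    public static void main(String[] args) {
        List<Supplier<Outcome>> procedures =
                List.of(() -> Outcome.UP, () -> Outcome.DOWN);
        System.out.println(aggregate(procedures));
    }
}
```

Note that in this sketch an application with no procedures at all is reported as UP; whether that is the right default is a design choice.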
The MPHC spec has a companion document[2] that specifies an HTTP format to check the
healthiness of an application.
Heiko is leading the spec and Swarm is the sample implementation for it (MicroProfile
does not have the notion of reference implementation).
The spec is still in flux and we have a good opportunity to contribute to it to ensure
that it meets our requirements and use cases.
# Use case
Using the HTTP endpoint, a container can ask an application whether it is healthy. If it
is not healthy, the container could stop the application and spin up a new instance.
For example, OpenShift/Kubernetes can configure liveness probes[3][4].
Supporting MPHC in WildFly would allow a better integration with containers and ensure
that any unhealthy WildFly process is restarted promptly.
# Prototype
I’ve written a prototype of a WildFly extension to support MPHC for applications deployed
in WildFly *and* add health check procedures inside WildFly:
https://github.com/jmesnil/wildfly-microprofile-health
and it passes the MPHC TCK :)
The microprofile-health subsystem supports an operation to check the health of the app
server:
    [standalone@localhost:9990 /] /subsystem=microprofile-health:check
    {
        "outcome" => "success",
        "result" => {
            "checks" => [{
                "id" => "heap-memory",
                "result" => "UP",
                "data" => {
                    "max" => "477626368",
                    "used" => "156216336"
                }
            }],
            "outcome" => "UP"
        }
    }
It also exposes an (unauthenticated) HTTP endpoint:
    $ curl http://localhost:8080/health/
    {
        "checks":[
            {
                "id":"heap-memory",
                "result":"UP",
                "data":{
                    "max":"477626368",
                    "used":"160137128"
                }
            }
        ],
        "outcome":"UP"
    }
This HTTP endpoint can be used by OpenShift for its liveness probe.
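As a rough illustration of how a client could consume this endpoint, here is a sketch that fetches the payload and reads its overall "outcome" field. The class and method names are hypothetical, the regex-based parsing is a shortcut that only assumes the payload shape shown above, and a real probe may well rely on the HTTP status code instead of the body.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical client-side helper; not part of the prototype.
final class HealthOutcomeSketch {

    // Matches the overall "outcome" field of the payload shown above.
    private static final Pattern OUTCOME =
            Pattern.compile("\"outcome\"\\s*:\\s*\"([A-Z]+)\"");

    /** Returns true only if the payload carries an overall outcome of UP. */
    static boolean isUp(String body) {
        Matcher matcher = OUTCOME.matcher(body);
        // A missing or unknown outcome is treated as not healthy.
        return matcher.find() && "UP".equals(matcher.group(1));
    }

    /** Fetches the health payload from the given URL and evaluates it. */
    static boolean probe(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return isUp(response.body());
    }
}
```

With the prototype running locally, `HealthOutcomeSketch.probe("http://localhost:8080/health/")` would be the call to try.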
Any deployment that defines Health Check Procedures will have them registered to
determine the overall healthiness of the process.
# WildFly health check procedures
The MPHC specification mainly targets user applications that can apply application logic
to determine their healthiness.
However I wonder if we could reuse the concepts *inside* WildFly. There are things that
we could check to determine if the App server runtime is healthy, e.g.:
* The amount of used heap memory is close to the max
* Some deployments have failed
* Excessive GC
* Running out of disk space
Subsystems inside WildFly could provide Health check procedures that would be queried to
check the overall healthiness.
We could, for example, provide a health check that verifies that the used heap memory is
less than 90% of the max:
    HealthCheck.install(context, "heap-memory", () -> {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        long memUsed = memoryBean.getHeapMemoryUsage().getUsed();
        long memMax = memoryBean.getHeapMemoryUsage().getMax();
        HealthResponse response = HealthResponse.named("heap-memory")
                .withAttribute("used", memUsed)
                .withAttribute("max", memMax);
        // Status is DOWN if used memory reaches 90% of max memory or more.
        // Note: getMax() returns -1 when the max is undefined, which would report DOWN here.
        HealthStatus status = (memUsed < memMax * 0.9) ? response.up() : response.down();
        return status;
    });
HealthCheck.install creates an MSC service and makes sure that it is registered with the
health monitor that queries all the procedures.
A subsystem would just have to call HealthCheck.install/uninstall with a health check
procedure to help determine the healthiness of the app server.
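In the same spirit, the "running out of disk space" item from the list above could look roughly like the following sketch. It is deliberately independent of the prototype API: the Result class, the "disk-space" id and the free-space ratio parameter are all hypothetical; only the java.io.File calls are real JDK methods.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; Result is a stand-in, not the MPHC HealthStatus type.
final class DiskSpaceCheckSketch {

    static final class Result {
        final String id;
        final boolean up;
        final Map<String, String> data;

        Result(String id, boolean up, Map<String, String> data) {
            this.id = id;
            this.up = up;
            this.data = data;
        }
    }

    /** UP if at least minFreeRatio of the partition is still usable. */
    static Result evaluate(long usableBytes, long totalBytes, double minFreeRatio) {
        Map<String, String> data = new LinkedHashMap<>();
        data.put("usable", Long.toString(usableBytes));
        data.put("total", Long.toString(totalBytes));
        // A total of 0 means the size is unknown (File reports 0 when the
        // path does not name a partition); this sketch treats that as DOWN.
        boolean up = totalBytes > 0 && usableBytes >= totalBytes * minFreeRatio;
        return new Result("disk-space", up, data);
    }

    /** Reads the partition figures for the given directory and evaluates them. */
    static Result check(File dir, double minFreeRatio) {
        return evaluate(dir.getUsableSpace(), dir.getTotalSpace(), minFreeRatio);
    }

    public static void main(String[] args) {
        Result result = check(new File("."), 0.10);
        System.out.println(result.id + " " + (result.up ? "UP" : "DOWN") + " " + result.data);
    }
}
```

Keeping the threshold logic in evaluate(), separate from the File calls, makes the rule easy to test without touching a real disk.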
What do you think about this use case?
I even wonder if this is something that should be instead provided by our core-management
subsystem with a private API (1 interface and some data structures).
The microprofile-health extension would then map our private API to the MPHC spec and
handle health check procedures coming from deployments.
# Summary
To better integrate WildFly with OpenShift, we should provide a way to let OpenShift
check the healthiness of WildFly. The MPHC spec is a good candidate to provide such a
feature.
It is worth exploring how we could leverage it for user deployments and also for WildFly
internals (when that makes sense).
Swarm already provides an implementation of the MPHC spec, so we also need to see how
WildFly and Swarm can collaborate to avoid duplicating code and effort in providing
the same feature to our users.
I like the idea of having a WildFly health API that can bridge to MPHC
via a subsystem; this is consistent with what we've done in other
areas. I'm not so sure about having (more?) APIs which drive
services. It might be better to use cap/req to have a health
capability to which other systems can be registered. This might allow
multiple independent health check resources to be defined, for systems
which perform more than one function; downstream health providers
could reference the resource(s) to register with by capability name.
Is this a polling-only service, or is there a "push" mechanism?
Just brainstorming, I can think of a few more potentially useful
health checks beyond what you've listed:
• EJB failure rate (if an EJB starts failing more than some percentage
of, say, the last 50 or 100 invocations, it could report an "unhealthy"
condition)
• Database failure rate (something with JDBC exceptions maybe)
• Authentication realm failure rate (Elytron's RealmUnavailableException)
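A failure-rate check of this kind could be backed by a sliding window over the most recent invocations. The sketch below uses hypothetical names and is not tied to any EJB, JDBC or Elytron API; it also assumes that a component is considered healthy until a full window of invocations has been recorded.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a "failure rate over the last N invocations" tracker.
final class FailureRateWindowSketch {

    private final int windowSize;
    private final double maxFailureRatio;
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed
    private int failures;

    FailureRateWindowSketch(int windowSize, double maxFailureRatio) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
        this.maxFailureRatio = maxFailureRatio;
    }

    /** Records one invocation and evicts the oldest once the window is full. */
    synchronized void record(boolean failed) {
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
        if (outcomes.size() > windowSize && outcomes.removeFirst()) {
            failures--;
        }
    }

    /** Healthy until the window is full, then healthy while the ratio is within bounds. */
    synchronized boolean isHealthy() {
        if (outcomes.size() < windowSize) {
            return true; // assumption: not enough data to judge yet
        }
        return (double) failures / outcomes.size() <= maxFailureRatio;
    }
}
```

The caller would invoke record(true) on each failed invocation and record(false) on each success, and a health check procedure would simply report isHealthy().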
--
- DML