[wildfly-dev] Design Proposal: Server suspend/resume (AKA Graceful Shutdown)

Mon Jun 9 18:38:55 EDT 2014

Server suspend and resume is a feature that allows a running server to 
gracefully finish all running requests. The most common use case for 
this is graceful shutdown, where you would like a server to complete all 
running requests, reject any new ones, and then shut down. However, there 
are also plenty of other valid use cases (e.g. suspend the server, 
modify a data source or some other config, then resume).

User View:

From the user's point of view, two new operations will be added to the server:

suspend(timeout)
resume()

A runtime-only attribute suspend-state (is this a good name?) will also 
be added; it can take one of three possible values: RUNNING, 
SUSPENDING, or SUSPENDED.

A timeout attribute will also be added to the shutdown operation. If 
this is present then the server will first be suspended, and the server 
will not shut down until either the suspend is successful or the timeout 
occurs. If no timeout parameter is passed to the operation then a normal 
non-graceful shutdown will take place.
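
To make the intended shutdown semantics concrete, here is a rough sketch of how the timeout could be handled. It is only an illustration: the SuspendableServer interface, the ShutdownOperation class, and the choice of seconds as the timeout unit are all assumptions of mine and not part of the proposal.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical view of the server, just enough for this sketch. */
interface SuspendableServer {
    /** Starts suspending; runs onSuspended once all requests have drained. */
    void suspend(Runnable onSuspended);

    /** Stops the server immediately. */
    void stop();
}

final class ShutdownOperation {
    private ShutdownOperation() {
    }

    /**
     * Shuts the server down. If timeoutSeconds is null this is a normal,
     * non-graceful shutdown. Otherwise the server is suspended first and we
     * wait until the suspend completes or the timeout expires.
     *
     * @return true if the suspend completed before the server was stopped
     */
    static boolean shutdown(SuspendableServer server, Long timeoutSeconds) {
        boolean drained = false;
        if (timeoutSeconds != null) {
            CountDownLatch suspended = new CountDownLatch(1);
            server.suspend(suspended::countDown);
            try {
                drained = suspended.await(timeoutSeconds, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                // Keep the interrupt flag and fall through to stopping the server.
                Thread.currentThread().interrupt();
            }
        }
        server.stop();
        return drained;
    }
}
```

The return value is only there so a caller can tell whether the shutdown was actually graceful or whether the timeout cut it short.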

In domain mode these operations will be added to both individual servers 
and complete server groups.

Implementation Details:

Suspend/resume operates on entry points to the server. Any request that 
is currently running must not be affected by the suspend state, but 
any new request should be rejected. In general, subsystems will track the 
number of outstanding requests, and when this hits zero they are 
considered suspended.
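
As a minimal sketch of the request tracking described above, an entry point could be guarded by something like the following. The EntryPointGate class and its method names are hypothetical, and a real subsystem would probably want something cheaper than a lock on the request path.

```java
/** Hypothetical gate placed in front of an entry point. */
final class EntryPointGate {
    private int activeRequests;
    private boolean rejecting;
    private Runnable drainedCallback; // set while a suspend is waiting for requests to finish

    /** Returns true if the request may proceed, false if it must be rejected. */
    synchronized boolean tryBegin() {
        if (rejecting) {
            return false;
        }
        activeRequests++;
        return true;
    }

    /** Must be called exactly once for every request that tryBegin() accepted. */
    void end() {
        Runnable callback = null;
        synchronized (this) {
            activeRequests--;
            if (rejecting && activeRequests == 0) {
                callback = drainedCallback;
                drainedCallback = null;
            }
        }
        if (callback != null) {
            callback.run(); // run outside the lock
        }
    }

    /** Starts rejecting new requests; onDrained runs once no requests remain. */
    void suspend(Runnable onDrained) {
        boolean alreadyDrained;
        synchronized (this) {
            rejecting = true;
            alreadyDrained = activeRequests == 0;
            if (!alreadyDrained) {
                drainedCallback = onDrained;
            }
        }
        if (alreadyDrained) {
            onDrained.run();
        }
    }

    /** Accepts requests again and forgets any pending suspend. */
    synchronized void resume() {
        rejecting = false;
        drainedCallback = null;
    }
}
```

Requests that were accepted before suspend() are left alone; they only decrement the counter when they finish, and the last one to finish triggers the callback.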

We will introduce the notion of a global SuspendController, that manages 
the server's suspend state. All subsystems that wish to do a graceful 
shutdown register callback handlers with this controller.

When the suspend() operation is invoked the controller will invoke all 
these callbacks, letting the subsystem know that the server is suspending, 
and providing the subsystem with a SuspendContext object that the 
subsystem can then use to notify the controller that the suspend is 
complete.
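
To illustrate the controller/callback interaction, here is one possible shape for it. SuspendController and SuspendContext are the names used above, but every method name here, the SuspendCallback interface, and the SuspendState enum are assumptions for the sake of the example. Timeout handling is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

enum SuspendState { RUNNING, SUSPENDING, SUSPENDED }

/** Given to each callback so the subsystem can report that it has suspended. */
interface SuspendContext {
    void complete();
}

/** Hypothetical callback a subsystem registers with the controller. */
interface SuspendCallback {
    void suspendStarted(SuspendContext context);

    void resumed();
}

final class SuspendController {
    private final List<SuspendCallback> callbacks = new CopyOnWriteArrayList<>();
    private SuspendState state = SuspendState.RUNNING;
    private int pending; // callbacks that have not reported completion yet
    private int epoch;   // bumped on suspend and resume so stale completions are ignored

    void register(SuspendCallback callback) {
        callbacks.add(callback);
    }

    synchronized SuspendState getState() {
        return state;
    }

    void suspend() {
        final List<SuspendCallback> snapshot;
        final int myEpoch;
        synchronized (this) {
            if (state != SuspendState.RUNNING) {
                return;
            }
            snapshot = new ArrayList<>(callbacks);
            myEpoch = ++epoch;
            pending = snapshot.size();
            state = pending == 0 ? SuspendState.SUSPENDED : SuspendState.SUSPENDING;
        }
        // Invoke callbacks outside the lock; a callback may complete synchronously.
        for (SuspendCallback callback : snapshot) {
            AtomicBoolean done = new AtomicBoolean();
            callback.suspendStarted(() -> {
                if (done.compareAndSet(false, true)) {
                    completed(myEpoch);
                }
            });
        }
    }

    private synchronized void completed(int forEpoch) {
        if (forEpoch != epoch || state != SuspendState.SUSPENDING) {
            return; // completion from an earlier suspend that was since resumed
        }
        if (--pending == 0) {
            state = SuspendState.SUSPENDED;
        }
    }

    void resume() {
        synchronized (this) {
            if (state == SuspendState.RUNNING) {
                return;
            }
            epoch++;
            state = SuspendState.RUNNING;
        }
        for (SuspendCallback callback : callbacks) {
            callback.resumed();
        }
    }
}
```

The server only reaches SUSPENDED once every registered callback has called complete() on its context; calling complete() twice on the same context has no further effect.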

What the subsystem does when it receives a suspend command, and when it 
considers itself suspended, will vary, but in the common case it will 
immediately start rejecting external requests (e.g. Undertow will start 
responding with a 503 to all new requests). The subsystem will also 
track the number of outstanding requests, and when this hits zero 
the subsystem will notify the controller that it has successfully 
suspended.

Some subsystems will obviously want to do other actions on suspend, e.g. 
clustering will likely want to fail over, mod_cluster will notify the 
load balancer that the node is no longer available, etc. In some cases we 
may want to make this configurable to an extent (e.g. Undertow could be 
configured to allow requests with an existing session, and not consider 
itself suspended until all sessions have either timed out or been 
invalidated, although this will obviously take a while).

If anyone has any feedback let me know. In terms of implementation my 
basic plan is to get the core functionality and the Undertow 
implementation into WildFly, and then work with subsystem authors to 
implement subsystem-specific functionality once the core is in place.

Stuart
