[wildfly-dev] Design Proposal: Server suspend/resume (AKA Graceful Shutdown)

Brian Stansberry brian.stansberry at redhat.com
Wed Jun 11 12:21:33 EDT 2014


I do think these are orthogonal and should not be combined.

The existing attribute is fundamentally about how the state of the 
runtime services relates to the persistent configuration.

STARTING == out of sync due to still getting in sync during start
RUNNING == in sync
RELOAD_REQUIRED == out of sync, needs a reload to get in sync
RESTART_REQUIRED == out of sync, needs a full process restart to get in sync

There are two problems though with the existing attribute that exposes this:

1) It's named "server-state" on a server and "host-state" on a Host 
Controller. Really crappy name; way too broad.

That's fixable by creating a new attribute and making the old one an 
alias for compatibility purposes.

2) The RUNNING state is really poorly named.

That could perhaps be fixed by coming up with a new name and translating 
it back to "RUNNING" in the handlers for the legacy "server-state" and 
"host-state" attributes.

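As a minimal sketch of that translation idea (not actual WildFly code): the enum name ConfigSyncState and the replacement name IN_SYNC are hypothetical placeholders, and only the four state names above come from the existing attribute.

```java
// Hypothetical sketch: a renamed "in sync" state that the legacy
// "server-state" / "host-state" handlers still report as RUNNING.
class LegacyStateDemo {
    public static void main(String[] args) {
        for (ConfigSyncState s : ConfigSyncState.values()) {
            System.out.println(s + " -> legacy value " + s.legacyValue());
        }
    }
}

enum ConfigSyncState {
    STARTING,
    IN_SYNC,            // hypothetical new name for what is RUNNING today
    RELOAD_REQUIRED,
    RESTART_REQUIRED;

    /** Value a legacy attribute handler would return for compatibility. */
    String legacyValue() {
        return this == IN_SYNC ? "RUNNING" : name();
    }
}
```

Old clients reading the legacy attributes keep seeing RUNNING, while the new attribute is free to use a clearer name.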

On 6/10/14, 11:21 AM, Dimitris Andreadis wrote:
> Sure. Which justifies trying to avoid those issues in the first place ;)
>
> On 10/06/2014 17:50, Stuart Douglas wrote:
>> We can't really change that now, as it is part of our existing API.
>>
>> Stuart
>>
>> Dimitris Andreadis wrote:
>>> It seems to me RESTART_REQUIRED (or RELOAD_REQUIRED) should be a boolean
>>> on its own to simplify the state diagram.
>>>
>>> On 10/06/2014 17:40, Stuart Douglas wrote:
>>>> I don't think so, I think RESTART_REQUIRED means running, but I need
>>>> to restart to apply
>>>> management changes (I think that attribute can also be
>>>> RELOAD_REQUIRED, I think the
>>>> description may be a bit out of date).
>>>>
>>>> To accurately reflect all the possible states you would need something
>>>> like:
>>>>
>>>> RUNNING
>>>> PAUSING,
>>>> PAUSED,
>>>> RESTART_REQUIRED
>>>> PAUSING_RESTART_REQUIRED
>>>> PAUSED_RESTART_REQUIRED
>>>> RELOAD_REQUIRED
>>>> PAUSING_RELOAD_REQUIRED
>>>> PAUSED_RELOAD_REQUIRED
>>>>
>>>> Which does not seem great, and may introduce compatibility problems
>>>> for clients that are not
>>>> expecting these new values.
>>>>
>>>> Stuart
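
To illustrate why one combined attribute scales poorly, here is a small sketch that models the two concerns as separate enums and counts the combinations. The type names SuspendPhase and ConfigStatus, and the value IN_SYNC, are hypothetical; the suspend values follow the proposal's RUNNING/SUSPENDING/SUSPENDED naming.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: two orthogonal attributes versus one combined enum.
class OrthogonalStates {
    /** Every value a single combined attribute would have to expose. */
    static List<String> combinedValues() {
        List<String> out = new ArrayList<>();
        for (ConfigStatus c : ConfigStatus.values()) {
            for (SuspendPhase s : SuspendPhase.values()) {
                out.add(s + "+" + c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int separate = SuspendPhase.values().length + ConfigStatus.values().length;
        System.out.println(combinedValues().size() + " combined values, "
                + separate + " values across two attributes");
    }
}

enum SuspendPhase { RUNNING, SUSPENDING, SUSPENDED }

enum ConfigStatus { IN_SYNC, RELOAD_REQUIRED, RESTART_REQUIRED }
```

The combined list grows multiplicatively with each new state on either axis, while two attributes grow additively and leave existing clients of each attribute untouched.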
>>>>
>>>>
>>>>
>>>> Dimitris Andreadis wrote:
>>>>> Isn't RESTART_REQUIRED also orthogonal to RUNNING?
>>>>>
>>>>> On 10/06/2014 17:17, Stuart Douglas wrote:
>>>>>> They are actually orthogonal, a server can be in both RESTART_REQUIRED
>>>>>> and any one of the
>>>>>> suspend states.
>>>>>>
>>>>>> RESTART_REQUIRED is very much tied to services and the management
>>>>>> model, while
>>>>>> suspend/resume is a runtime only thing that should not touch the state
>>>>>> of services.
>>>>>>
>>>>>>
>>>>>> Stuart
>>>>>>
>>>>>> Dimitris Andreadis wrote:
>>>>>>> Why not extend the states of the existing 'server-state' attribute to:
>>>>>>>
>>>>>>> (STARTING, RUNNING, SUSPENDING, SUSPENDED, RESTART_REQUIRED RUNNING)
>>>>>>>
>>>>>>> http://wildscribe.github.io/Wildfly/8.0.0.Final/index.html
>>>>>>>
>>>>>>> On 10/06/2014 04:40, Stuart Douglas wrote:
>>>>>>>>
>>>>>>>> Scott Marlow wrote:
>>>>>>>>> On 06/09/2014 06:38 PM, Stuart Douglas wrote:
>>>>>>>>>> Server suspend and resume is a feature that allows a running
>>>>>>>>>> server to
>>>>>>>>>> gracefully finish off all running requests. The most common use
>>>>>>>>>> case for
>>>>>>>>>> this is graceful shutdown, where you would like a server to
>>>>>>>>>> complete all
>>>>>>>>>> running requests, reject any new ones, and then shut down, however
>>>>>>>>>> there
>>>>>>>>>> are also plenty of other valid use cases (e.g. suspend the server,
>>>>>>>>>> modify a data source or some other config, then resume).
>>>>>>>>>>
>>>>>>>>>> User View:
>>>>>>>>>>
>>>>>>>>>> From the user's point of view, two new operations will be added to
>>>>>>>>>> the server:
>>>>>>>>>>
>>>>>>>>>> suspend(timeout)
>>>>>>>>>> resume()
>>>>>>>>>>
>>>>>>>>>> A runtime only attribute suspend-state (is this a good name?) will
>>>>>>>>>> also
>>>>>>>>>> be added, that can take one of three possible values, RUNNING,
>>>>>>>>>> SUSPENDING, SUSPENDED.
>>>>>>>>> The SuspendController "state" might be a shorter attribute name and
>>>>>>>>> just
>>>>>>>>> as meaningful.
>>>>>>>> This will be in the global server namespace (i.e. from the CLI
>>>>>>>> :read-attribute(name="suspend-state")).
>>>>>>>>
>>>>>>>> I think the name 'state' is just too generic, which kind of state
>>>>>>>> are we
>>>>>>>> talking about?
>>>>>>>>
>>>>>>>>> When are we in the RUNNING state? Is that simply the pre-state for
>>>>>>>>> SUSPENDING?
>>>>>>>> 99.99% of the time. Basically servers are always running unless they
>>>>>>>> have been explicitly suspended, and then they go from suspending to
>>>>>>>> suspended. Note that if resume is called at any time the server
>>>>>>>> goes to
>>>>>>>> RUNNING again immediately, as when subsystems are notified they
>>>>>>>> should
>>>>>>>> be able to begin accepting requests again straight away.
>>>>>>>>
>>>>>>>> We also have admin only mode, which is a kinda similar concept, so we
>>>>>>>> need to make sure we document the differences.
>>>>>>>>
>>>>>>>>>> A timeout attribute will also be added to the shutdown
>>>>>>>>>> operation. If
>>>>>>>>>> this is present then the server will first be suspended, and the
>>>>>>>>>> server
>>>>>>>>>> will not shut down until either the suspend is successful or the
>>>>>>>>>> timeout
>>>>>>>>>> occurs. If no timeout parameter is passed to the operation then a
>>>>>>>>>> normal
>>>>>>>>>> non-graceful shutdown will take place.
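
A minimal sketch of the shutdown-with-timeout behaviour described above, using only the JDK. The class ShutdownSketch, the Outcome values, the use of a CountDownLatch to signal suspend completion, and the millisecond unit are all assumptions made for illustration; none of them come from the proposal.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: shutdown that optionally waits for suspend to finish.
class ShutdownSketch {
    enum Outcome { NON_GRACEFUL, SUSPENDED_THEN_STOPPED, TIMED_OUT_THEN_STOPPED }

    /**
     * suspendComplete reaches zero once the server has suspended.
     * timeoutMillis == null models "no timeout parameter passed".
     */
    static Outcome shutdown(CountDownLatch suspendComplete, Long timeoutMillis) {
        if (timeoutMillis == null) {
            return Outcome.NON_GRACEFUL;          // today's behaviour, no suspend
        }
        boolean suspended = false;
        try {
            // Wait for suspend to succeed, but never longer than the timeout.
            suspended = suspendComplete.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // treat as not suspended
        }
        return suspended ? Outcome.SUSPENDED_THEN_STOPPED : Outcome.TIMED_OUT_THEN_STOPPED;
    }

    public static void main(String[] args) {
        System.out.println(shutdown(new CountDownLatch(0), 1000L)); // already drained
        System.out.println(shutdown(new CountDownLatch(1), 50L));   // never drains
        System.out.println(shutdown(new CountDownLatch(1), null));  // no timeout given
    }
}
```

In every branch the server still shuts down; the timeout only bounds how long it waits for in-flight requests first.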
>>>>>>>>> Will non-graceful shutdown wait for non-daemon threads or terminate
>>>>>>>>> immediately (call System.exit())?
>>>>>>>> It will execute the same way it does today (all services will shut
>>>>>>>> down
>>>>>>>> and then the server will exit).
>>>>>>>>
>>>>>>>> Stuart
>>>>>>>>
>>>>>>>>>> In domain mode these operations will be added to both individual
>>>>>>>>>> servers
>>>>>>>>>> and complete server groups.
>>>>>>>>>>
>>>>>>>>>> Implementation Details
>>>>>>>>>>
>>>>>>>>>> Suspend/resume operates on entry points to the server. Any request
>>>>>>>>>> that
>>>>>>>>>> is currently running must not be affected by the suspend state,
>>>>>>>>>> however
>>>>>>>>>> any new request should be rejected. In general subsystems will
>>>>>>>>>> track the
>>>>>>>>>> number of outstanding requests, and when this hits zero they are
>>>>>>>>>> considered suspended.
>>>>>>>>>>
>>>>>>>>>> We will introduce the notion of a global SuspendController, that
>>>>>>>>>> manages
>>>>>>>>>> the server's suspend state. All subsystems that wish to do a
>>>>>>>>>> graceful
>>>>>>>>>> shutdown register callback handlers with this controller.
>>>>>>>>>>
>>>>>>>>>> When the suspend() operation is invoked the controller will invoke
>>>>>>>>>> all
>>>>>>>>>> these callbacks, letting the subsystem know that the server is
>>>>>>>>>> suspending,
>>>>>>>>>> and providing the subsystem with a SuspendContext object that the
>>>>>>>>>> subsystem can then use to notify the controller that the suspend is
>>>>>>>>>> complete.
>>>>>>>>>>
>>>>>>>>>> What the subsystem does when it receives a suspend command, and
>>>>>>>>>> when it
>>>>>>>>>> considers itself suspended will vary, but in the common case it
>>>>>>>>>> will
>>>>>>>>>> immediately start rejecting external requests (e.g. Undertow will
>>>>>>>>>> start
>>>>>>>>>> responding with a 503 to all new requests). The subsystem will also
>>>>>>>>>> track the number of outstanding requests, and when this hits zero
>>>>>>>>>> then
>>>>>>>>>> the subsystem will notify the controller that it has successfully
>>>>>>>>>> suspended.
>>>>>>>>>> Some subsystems will obviously want to do other actions on
>>>>>>>>>> suspend, e.g.
>>>>>>>>>> clustering will likely want to fail over, mod_cluster will
>>>>>>>>>> notify the
>>>>>>>>>> load balancer that the node is no longer available etc. In some
>>>>>>>>>> cases we
>>>>>>>>>> may want to make this configurable to an extent (e.g. Undertow
>>>>>>>>>> could be
>>>>>>>>>> configured to allow requests with an existing session, and not
>>>>>>>>>> consider
>>>>>>>>>> itself suspended until all sessions have either timed out or been
>>>>>>>>>> invalidated, although this will obviously take a while).
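
The controller/callback/context interaction described above could look roughly like the following. This is a sketch under stated assumptions, not the proposed implementation: only the names SuspendController and SuspendContext appear in the proposal, so the method names (suspendComplete, beginRequest, endRequest), the SuspendCallback interface, and the RequestTrackingSubsystem class are hypothetical. It also ignores timeouts and races between suspend() and resume().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a suspend controller with one request-counting subsystem.
class SuspendDemo {
    public static void main(String[] args) {
        SuspendControllerSketch controller = new SuspendControllerSketch();
        RequestTrackingSubsystem web = new RequestTrackingSubsystem();
        controller.register(web);

        System.out.println("accepted: " + web.beginRequest()); // one request in flight
        controller.suspend();
        System.out.println(controller.state());                 // still draining
        System.out.println("accepted: " + web.beginRequest()); // new request rejected
        web.endRequest();                                       // last request finishes
        System.out.println(controller.state());
        controller.resume();
        System.out.println(controller.state());
    }
}

/** Handed to a subsystem so it can report that it has finished suspending. */
interface SuspendContext { void suspendComplete(); }

interface SuspendCallback {
    void suspend(SuspendContext context);
    void resume();
}

class SuspendControllerSketch {
    enum State { RUNNING, SUSPENDING, SUSPENDED }

    private final List<SuspendCallback> callbacks = new ArrayList<>();
    private State state = State.RUNNING;
    private int pending;   // subsystems that have not yet reported completion

    synchronized void register(SuspendCallback callback) { callbacks.add(callback); }
    synchronized State state() { return state; }

    void suspend() {
        List<SuspendCallback> snapshot;
        synchronized (this) {
            if (state != State.RUNNING) return;
            snapshot = new ArrayList<>(callbacks);
            pending = snapshot.size();
            state = pending == 0 ? State.SUSPENDED : State.SUSPENDING;
        }
        // Callbacks run outside the controller lock to avoid lock-order deadlocks.
        for (SuspendCallback cb : snapshot) cb.suspend(this::callbackDone);
    }

    void resume() {
        List<SuspendCallback> snapshot;
        synchronized (this) {
            if (state == State.RUNNING) return;
            state = State.RUNNING;   // goes back to RUNNING immediately
            snapshot = new ArrayList<>(callbacks);
        }
        for (SuspendCallback cb : snapshot) cb.resume();
    }

    private synchronized void callbackDone() {
        if (state == State.SUSPENDING && --pending == 0) state = State.SUSPENDED;
    }
}

/** Rejects new requests once suspended and reports completion when drained. */
class RequestTrackingSubsystem implements SuspendCallback {
    private int active;
    private boolean suspended;
    private SuspendContext context;

    synchronized boolean beginRequest() {
        if (suspended) return false;   // e.g. an HTTP entry point would answer 503
        active++;
        return true;
    }

    synchronized void endRequest() { active--; notifyIfDrained(); }

    @Override public synchronized void suspend(SuspendContext ctx) {
        suspended = true;
        context = ctx;
        notifyIfDrained();
    }

    @Override public synchronized void resume() { suspended = false; context = null; }

    private void notifyIfDrained() {
        if (context != null && active == 0) {
            SuspendContext c = context;
            context = null;            // report completion at most once
            c.suspendComplete();
        }
    }
}
```

The in-flight request is allowed to finish untouched, and the controller only reaches SUSPENDED once every registered subsystem has reported in.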
>>>>>>>>>>
>>>>>>>>>> If anyone has any feedback let me know. In terms of
>>>>>>>>>> implementation my
>>>>>>>>>> basic plan is to get the core functionality and the Undertow
>>>>>>>>>> implementation into WildFly, and then work with subsystem
>>>>>>>>>> authors to
>>>>>>>>>> implement subsystem specific functionality once the core is in
>>>>>>>>>> place.
>>>>>>>>>>
>>>>>>>>>> Stuart
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> wildfly-dev mailing list
>>>>>>>>>> wildfly-dev at lists.jboss.org
>>>>>>>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev


-- 
Brian Stansberry
Senior Principal Software Engineer
JBoss by Red Hat

