[wildfly-dev] Design Proposal: Server suspend/resume (AKA Graceful Shutdown)

Tue Jun 10 11:50:08 EDT 2014

We can't really change that now, as it is part of our existing API.

Stuart

Dimitris Andreadis wrote:
> It seems to me RESTART_REQUIRED (or RELOAD_REQUIRED) should be a boolean
> on its own to simplify the state diagram.
>
> On 10/06/2014 17:40, Stuart Douglas wrote:
>> I don't think so, I think RESTART_REQUIRED means running, but I need
>> to restart to apply
>> management changes (I think that attribute can also be
>> RELOAD_REQUIRED, I think the
>> description may be a bit out of date).
>>
>> To accurately reflect all the possible states you would need something
>> like:
>>
>> RUNNING
>> PAUSING
>> PAUSED
>> RESTART_REQUIRED
>> PAUSING_RESTART_REQUIRED
>> PAUSED_RESTART_REQUIRED
>> RELOAD_REQUIRED
>> PAUSING_RELOAD_REQUIRED
>> PAUSED_RELOAD_REQUIRED
>>
>> Which does not seem great, and may introduce compatibility problems
>> for clients that are not
>> expecting these new values.
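To make the size of that flattened list concrete, here is a minimal sketch that models the two concerns as separate values and enumerates their combinations. The type names (SuspendState, PendingAction, ServerStatus, StateSpace) are hypothetical, chosen only for illustration, and are not part of the WildFly management API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names for illustration only; not the WildFly management API.
enum SuspendState { RUNNING, SUSPENDING, SUSPENDED }

enum PendingAction { NONE, RELOAD_REQUIRED, RESTART_REQUIRED }

// Two independent values: a client needs to understand 3 + 3 constants.
final class ServerStatus {
    final SuspendState suspendState;
    final PendingAction pendingAction;

    ServerStatus(SuspendState suspendState, PendingAction pendingAction) {
        this.suspendState = suspendState;
        this.pendingAction = pendingAction;
    }
}

final class StateSpace {
    // Enumerates every combination that a single flattened attribute
    // would have to represent as its own constant.
    static List<ServerStatus> allCombinations() {
        List<ServerStatus> all = new ArrayList<>();
        for (SuspendState s : SuspendState.values()) {
            for (PendingAction p : PendingAction.values()) {
                all.add(new ServerStatus(s, p));
            }
        }
        return all;
    }
}
```

Keeping the two values separate means existing clients that only read the restart/reload information never see an unfamiliar constant.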
>>
>> Stuart
>>
>>
>>
>> Dimitris Andreadis wrote:
>>> Isn't RESTART_REQUIRED also orthogonal to RUNNING?
>>>
>>> On 10/06/2014 17:17, Stuart Douglas wrote:
>>>> They are actually orthogonal, a server can be in both RESTART_REQUIRED
>>>> and any one of the
>>>> suspend states.
>>>>
>>>> RESTART_REQUIRED is very much tied to services and the management
>>>> model, while
>>>> suspend/resume is a runtime only thing that should not touch the state
>>>> of services.
>>>>
>>>>
>>>> Stuart
>>>>
>>>> Dimitris Andreadis wrote:
>>>>> Why not extend the states of the existing 'server-state' attribute to:
>>>>>
>>>>> (STARTING, RUNNING, SUSPENDING, SUSPENDED, RESTART_REQUIRED RUNNING)
>>>>>
>>>>> http://wildscribe.github.io/Wildfly/8.0.0.Final/index.html
>>>>>
>>>>> On 10/06/2014 04:40, Stuart Douglas wrote:
>>>>>>
>>>>>> Scott Marlow wrote:
>>>>>>> On 06/09/2014 06:38 PM, Stuart Douglas wrote:
>>>>>>>> Server suspend and resume is a feature that allows a running
>>>>>>>> server to
>>>>>>>> gracefully finish off all running requests. The most common use
>>>>>>>> case for
>>>>>>>> this is graceful shutdown, where you would like a server to
>>>>>>>> complete all
>>>>>>>> running requests, reject any new ones, and then shut down; however
>>>>>>>> there
>>>>>>>> are also plenty of other valid use cases (e.g. suspend the server,
>>>>>>>> modify a data source or some other config, then resume).
>>>>>>>>
>>>>>>>> User View:
>>>>>>>>
>>>>>>>> From the user's point of view, two new operations will be added to
>>>>>>>> the server:
>>>>>>>>
>>>>>>>> suspend(timeout)
>>>>>>>> resume()
>>>>>>>>
>>>>>>>> A runtime only attribute suspend-state (is this a good name?) will
>>>>>>>> also
>>>>>>>> be added, that can take one of three possible values, RUNNING,
>>>>>>>> SUSPENDING, SUSPENDED.
>>>>>>> The SuspendController "state" might be a shorter attribute name and
>>>>>>> just
>>>>>>> as meaningful.
>>>>>> This will be in the global server namespace (i.e. from the CLI
>>>>>> :read-attribute(name="suspend-state")).
>>>>>>
>>>>>> I think the name 'state' is just too generic; which kind of state
>>>>>> are we
>>>>>> talking about?
>>>>>>
>>>>>>> When are we in the RUNNING state? Is that simply the pre-state for
>>>>>>> SUSPENDING?
>>>>>> 99.99% of the time. Basically servers are always running unless they
>>>>>> have been explicitly suspended, and then they go from suspending to
>>>>>> suspended. Note that if resume is called at any time the server
>>>>>> goes to
>>>>>> RUNNING again immediately, as when subsystems are notified they
>>>>>> should
>>>>>> be able to begin accepting requests again straight away.
>>>>>>
>>>>>> We also have admin only mode, which is a kinda similar concept, so we
>>>>>> need to make sure we document the differences.
>>>>>>
>>>>>>>> A timeout attribute will also be added to the shutdown
>>>>>>>> operation. If
>>>>>>>> this is present then the server will first be suspended, and the
>>>>>>>> server
>>>>>>>> will not shut down until either the suspend is successful or the
>>>>>>>> timeout
>>>>>>>> occurs. If no timeout parameter is passed to the operation then a
>>>>>>>> normal
>>>>>>>> non-graceful shutdown will take place.
>>>>>>> Will non-graceful shutdown wait for non-daemon threads or terminate
>>>>>>> immediately (call System.exit())?
>>>>>> It will execute the same way it does today (all services will shut
>>>>>> down
>>>>>> and then the server will exit).
>>>>>>
>>>>>> Stuart
>>>>>>
>>>>>>>> In domain mode these operations will be added to both individual
>>>>>>>> servers and complete server groups.
>>>>>>>>
>>>>>>>> Implementation Details
>>>>>>>>
>>>>>>>> Suspend/resume operates on entry points to the server. Any request
>>>>>>>> that
>>>>>>>> is currently running must not be affected by the suspend state,
>>>>>>>> however
>>>>>>>> any new request should be rejected. In general subsystems will
>>>>>>>> track the
>>>>>>>> number of outstanding requests, and when this hits zero they are
>>>>>>>> considered suspended.
>>>>>>>>
>>>>>>>> We will introduce the notion of a global SuspendController, that
>>>>>>>> manages
>>>>>>>> the server's suspend state. All subsystems that wish to do a
>>>>>>>> graceful
>>>>>>>> shutdown register callback handlers with this controller.
>>>>>>>>
>>>>>>>> When the suspend() operation is invoked the controller will invoke
>>>>>>>> all
>>>>>>>> these callbacks, letting the subsystem know that the server is
>>>>>>>> suspending,
>>>>>>>> and providing the subsystem with a SuspendContext object that the
>>>>>>>> subsystem can then use to notify the controller that the suspend is
>>>>>>>> complete.
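As a rough illustration of the callback flow described above, the following is a minimal sketch only. The proposal names SuspendController and SuspendContext but gives no signatures, so every interface, method name, and class here (SuspendCallback, suspendStarted, suspendComplete, SketchSuspendController) is an assumption. The timeout handling is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names and signatures are assumptions, not WildFly's API.
interface SuspendContext {
    // Called by a subsystem once it considers itself suspended.
    void suspendComplete();
}

interface SuspendCallback {
    void suspendStarted(SuspendContext context);
    void resumed();
}

class SketchSuspendController {
    enum State { RUNNING, SUSPENDING, SUSPENDED }

    private final List<SuspendCallback> callbacks = new ArrayList<>();
    private State state = State.RUNNING;
    private int pending;    // callbacks that have not yet reported completion
    private int generation; // bumped on suspend/resume so stale contexts are ignored

    synchronized void register(SuspendCallback callback) {
        callbacks.add(callback);
    }

    synchronized State getState() {
        return state;
    }

    synchronized void suspend() {
        if (state != State.RUNNING) {
            return;
        }
        state = State.SUSPENDING;
        pending = callbacks.size();
        final int gen = ++generation;
        if (pending == 0) {
            state = State.SUSPENDED;
            return;
        }
        for (SuspendCallback callback : callbacks) {
            // Each context may report completion at most once.
            final AtomicBoolean reported = new AtomicBoolean();
            callback.suspendStarted(() -> {
                if (reported.compareAndSet(false, true)) {
                    complete(gen);
                }
            });
        }
    }

    private synchronized void complete(int gen) {
        // Ignore reports that belong to a suspend that was since resumed.
        if (gen != generation || state != State.SUSPENDING) {
            return;
        }
        if (--pending == 0) {
            state = State.SUSPENDED;
        }
    }

    synchronized void resume() {
        if (state == State.RUNNING) {
            return;
        }
        state = State.RUNNING; // resume takes effect immediately
        generation++;
        for (SuspendCallback callback : callbacks) {
            callback.resumed();
        }
    }
}
```

In this sketch the controller only reaches SUSPENDED once every registered callback has reported back, and a resume() in the middle of a suspend discards any late completion reports.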
>>>>>>>>
>>>>>>>> What the subsystem does when it receives a suspend command, and
>>>>>>>> when it
>>>>>>>> considers itself suspended will vary, but in the common case it
>>>>>>>> will
>>>>>>>> immediately start rejecting external requests (e.g. Undertow will
>>>>>>>> start
>>>>>>>> responding with a 503 to all new requests). The subsystem will also
>>>>>>>> track the number of outstanding requests, and when this hits zero
>>>>>>>> then
>>>>>>>> the subsystem will notify the controller that it has successfully
>>>>>>>> suspended.
>>>>>>>> Some subsystems will obviously want to do other actions on
>>>>>>>> suspend, e.g.
>>>>>>>> clustering will likely want to fail over, mod_cluster will
>>>>>>>> notify the
>>>>>>>> load balancer that the node is no longer available etc. In some
>>>>>>>> cases we
>>>>>>>> may want to make this configurable to an extent (e.g. Undertow
>>>>>>>> could be
>>>>>>>> configured to allow requests with an existing session, and not
>>>>>>>> consider
>>>>>>>> itself suspended until all sessions have either timed out or been
>>>>>>>> invalidated, although this will obviously take a while).
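The common case described above (reject new requests, count outstanding ones, report when the count hits zero) could be sketched roughly as follows. RequestGate and its methods are hypothetical names, not Undertow or WildFly classes. The Runnable stands in for whatever the subsystem uses to notify the controller.

```java
// Hypothetical helper for illustration; not an Undertow or WildFly class.
class RequestGate {
    private int active;         // requests currently in flight
    private boolean suspended;  // true while new requests must be rejected
    private Runnable onDrained; // pending "suspend complete" notification

    // Returns false when the request must be rejected
    // (a web subsystem would answer with a 503 in that case).
    synchronized boolean tryBegin() {
        if (suspended) {
            return false;
        }
        active++;
        return true;
    }

    // Must be called exactly once for every successful tryBegin().
    void end() {
        Runnable notify = null;
        synchronized (this) {
            active--;
            if (suspended && active == 0) {
                notify = onDrained;
                onDrained = null;
            }
        }
        if (notify != null) {
            notify.run(); // run outside the lock
        }
    }

    // Starts rejecting new requests; drained runs once nothing is in flight.
    void suspend(Runnable drained) {
        boolean alreadyIdle;
        synchronized (this) {
            suspended = true;
            alreadyIdle = (active == 0);
            if (!alreadyIdle) {
                onDrained = drained;
            }
        }
        if (alreadyIdle) {
            drained.run();
        }
    }

    // Accept requests again straight away and drop any pending notification.
    synchronized void resume() {
        suspended = false;
        onDrained = null;
    }
}
```

Requests that were already running when suspend() is called are unaffected; they simply call end() when they finish, and the last one to do so triggers the notification.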
>>>>>>>>
>>>>>>>> If anyone has any feedback let me know. In terms of
>>>>>>>> implementation my
>>>>>>>> basic plan is to get the core functionality and the Undertow
>>>>>>>> implementation into WildFly, and then work with subsystem
>>>>>>>> authors to
>>>>>>>> implement subsystem specific functionality once the core is in
>>>>>>>> place.
>>>>>>>>
>>>>>>>> Stuart
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> wildfly-dev mailing list
>>>>>>>> wildfly-dev at lists.jboss.org
>>>>>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>>>>>>