[wildfly-dev] Design Proposal: Server suspend/resume (AKA Graceful Shutdown)

Wed Jun 11 12:47:02 EDT 2014

The STARTING state in the existing attribute makes me think an 
equivalent thing is needed for this concept.

STARTING in the existing attribute means the runtime services are possibly 
out of sync due to boot.

Doesn't a similar problem exist with RUNNING, SUSPENDING, SUSPENDED? 
It's about how the server is reacting to external requests. There's some 
state during boot/reload when the server is not reacting normally to 
external requests.

Perhaps that's just another condition where the server is SUSPENDED.

This leads to whether this whole mechanism can be used to provide 
"Graceful Startup". We have problems with this now; endpoints accepting 
requests before everything is fully ready, leading to things like 404s 
because a deployment isn't installed yet.

On 6/11/14, 11:21 AM, Brian Stansberry wrote:
> I do think these are orthogonal and should not be combined.
>
> The existing attribute is fundamentally about how the state of the
> runtime services relates to the persistent configuration.
>
> STARTING == out of sync due to still getting in sync during start
> RUNNING == in sync
> RELOAD_REQUIRED == out of sync, needs a reload to get in sync
> RESTART_REQUIRED == out of sync, needs a full process restart to get in sync
>
> There are two problems though with the existing attribute that exposes this:
>
> 1) It's named "server-state" on a server and "host-state" on a Host
> Controller. Really crappy name; way too broad.
>
> That's fixable by creating a new attribute and making the old one an
> alias for compatibility purposes.
>
> 2) The RUNNING state is really poorly named.
>
> That could perhaps be fixed by coming up with a new name and translating
> it back to "RUNNING" in the handlers for the legacy "server-state" and
> "host-state" attributes.
>
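A minimal sketch of the translation idea above, assuming a hypothetical replacement name IN_SYNC (the thread does not settle on a new name) and using the state names as written in this thread rather than the exact strings the management API returns:

```java
/**
 * Hypothetical enum for the renamed attribute. IN_SYNC is an assumed
 * replacement for the poorly named RUNNING state; it is not an agreed name.
 */
enum ConfigSyncState {
    STARTING, IN_SYNC, RELOAD_REQUIRED, RESTART_REQUIRED;

    /** Value a handler for the legacy "server-state"/"host-state" attribute would report. */
    String toLegacyValue() {
        // Only the renamed state needs translating; the others pass through unchanged.
        return this == IN_SYNC ? "RUNNING" : name();
    }
}
```

Existing clients that read the legacy attribute would keep seeing RUNNING, while the new attribute could expose the clearer name.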
>
> On 6/10/14, 11:21 AM, Dimitris Andreadis wrote:
>> Sure. Which justifies trying to avoid those issues in the first place ;)
>>
>> On 10/06/2014 17:50, Stuart Douglas wrote:
>>> We can't really change that now, as it is part of our existing API.
>>>
>>> Stuart
>>>
>>> Dimitris Andreadis wrote:
>>>> It seems to me RESTART_REQUIRED (or RELOAD_REQUIRED) should be a boolean
>>>> on its own to simplify the state diagram.
>>>>
>>>> On 10/06/2014 17:40, Stuart Douglas wrote:
>>>>> I don't think so, I think RESTART_REQUIRED means running, but I need
>>>>> to restart to apply
>>>>> management changes (I think that attribute can also be
>>>>> RELOAD_REQUIRED, I think the
>>>>> description may be a bit out of date).
>>>>>
>>>>> To accurately reflect all the possible states you would need something
>>>>> like:
>>>>>
>>>>> RUNNING
>>>>> PAUSING,
>>>>> PAUSED,
>>>>> RESTART_REQUIRED
>>>>> PAUSING_RESTART_REQUIRED
>>>>> PAUSED_RESTART_REQUIRED
>>>>> RELOAD_REQUIRED
>>>>> PAUSING_RELOAD_REQUIRED
>>>>> PAUSED_RELOAD_REQUIRED
>>>>>
>>>>> Which does not seem great, and may introduce compatibility problems
>>>>> for clients that are not
>>>>> expecting these new values.
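To illustrate why folding both concepts into one attribute multiplies the values, here is a small sketch that models them as two independent enums and derives the combined names listed above. The type and method names (ConfigState, PauseState, CombinedStates) are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Two independent axes, using the value names from the list above.
enum ConfigState { RUNNING, RESTART_REQUIRED, RELOAD_REQUIRED }
enum PauseState { RUNNING, PAUSING, PAUSED }

final class CombinedStates {
    /** Builds the single-attribute name for one pair of values. */
    static String combinedName(ConfigState config, PauseState pause) {
        if (pause == PauseState.RUNNING) return config.name();
        if (config == ConfigState.RUNNING) return pause.name();
        return pause.name() + "_" + config.name();
    }

    /** Every value a single combined attribute would have to expose. */
    static Set<String> all() {
        Set<String> names = new LinkedHashSet<>();
        for (ConfigState c : ConfigState.values()) {
            for (PauseState p : PauseState.values()) {
                names.add(combinedName(c, p));
            }
        }
        return names;
    }
}
```

Three values on each axis give nine combined values, and every further state on either axis adds a whole row or column. Keeping two attributes avoids that growth.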
>>>>>
>>>>> Stuart
>>>>>
>>>>>
>>>>>
>>>>> Dimitris Andreadis wrote:
>>>>>> Isn't RESTART_REQUIRED also orthogonal to RUNNING?
>>>>>>
>>>>>> On 10/06/2014 17:17, Stuart Douglas wrote:
>>>>>>> They are actually orthogonal, a server can be in both RESTART_REQUIRED
>>>>>>> and any one of the
>>>>>>> suspend states.
>>>>>>>
>>>>>>> RESTART_REQUIRED is very much tied to services and the management
>>>>>>> model, while
>>>>>>> suspend/resume is a runtime only thing that should not touch the state
>>>>>>> of services.
>>>>>>>
>>>>>>>
>>>>>>> Stuart
>>>>>>>
>>>>>>> Dimitris Andreadis wrote:
>>>>>>>> Why not extend the states of the existing 'server-state' attribute to:
>>>>>>>>
>>>>>>>> (STARTING, RUNNING, SUSPENDING, SUSPENDED, RESTART_REQUIRED RUNNING)
>>>>>>>>
>>>>>>>> http://wildscribe.github.io/Wildfly/8.0.0.Final/index.html
>>>>>>>>
>>>>>>>> On 10/06/2014 04:40, Stuart Douglas wrote:
>>>>>>>>>
>>>>>>>>> Scott Marlow wrote:
>>>>>>>>>> On 06/09/2014 06:38 PM, Stuart Douglas wrote:
>>>>>>>>>>> Server suspend and resume is a feature that allows a running
>>>>>>>>>>> server to
>>>>>>>>>>> gracefully finish off all running requests. The most common use
>>>>>>>>>>> case for
>>>>>>>>>>> this is graceful shutdown, where you would like a server to
>>>>>>>>>>> complete all
>>>>>>>>>>> running requests, reject any new ones, and then shut down, however
>>>>>>>>>>> there
>>>>>>>>>>> are also plenty of other valid use cases (e.g. suspend the server,
>>>>>>>>>>> modify a data source or some other config, then resume).
>>>>>>>>>>>
>>>>>>>>>>> User View:
>>>>>>>>>>>
>>>>>>>>>>> From the user's point of view, two new operations will be added to
>>>>>>>>>>> the server:
>>>>>>>>>>>
>>>>>>>>>>> suspend(timeout)
>>>>>>>>>>> resume()
>>>>>>>>>>>
>>>>>>>>>>> A runtime only attribute suspend-state (is this a good name?) will
>>>>>>>>>>> also
>>>>>>>>>>> be added, that can take one of three possible values, RUNNING,
>>>>>>>>>>> SUSPENDING, SUSPENDED.
>>>>>>>>>> The SuspendController "state" might be a shorter attribute name and
>>>>>>>>>> just
>>>>>>>>>> as meaningful.
>>>>>>>>> This will be in the global server namespace (i.e. from the CLI
>>>>>>>>> :read-attribute(name="suspend-state")).
>>>>>>>>>
>>>>>>>>> I think the name 'state' is just too generic, which kind of state
>>>>>>>>> are we
>>>>>>>>> talking about?
>>>>>>>>>
>>>>>>>>>> When are we in the RUNNING state? Is that simply the pre-state for
>>>>>>>>>> SUSPENDING?
>>>>>>>>> 99.99% of the time. Basically servers are always running unless they
>>>>>>>>> have been explicitly suspended, and then they go from suspending to
>>>>>>>>> suspended. Note that if resume is called at any time the server
>>>>>>>>> goes to
>>>>>>>>> RUNNING again immediately, as when subsystems are notified they
>>>>>>>>> should
>>>>>>>>> be able to begin accepting requests again straight away.
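The transitions described above can be sketched as a tiny state machine. The state names come from the proposal. The class and method names (SuspendStateMachine, suspendComplete) are assumptions for illustration only.

```java
import java.util.concurrent.atomic.AtomicReference;

enum SuspendState { RUNNING, SUSPENDING, SUSPENDED }

final class SuspendStateMachine {
    private final AtomicReference<SuspendState> state =
            new AtomicReference<>(SuspendState.RUNNING);

    SuspendState current() {
        return state.get();
    }

    /** RUNNING -> SUSPENDING. Returns false if the server was not running. */
    boolean suspend() {
        return state.compareAndSet(SuspendState.RUNNING, SuspendState.SUSPENDING);
    }

    /** SUSPENDING -> SUSPENDED, once outstanding work has drained. */
    boolean suspendComplete() {
        return state.compareAndSet(SuspendState.SUSPENDING, SuspendState.SUSPENDED);
    }

    /** From any state straight back to RUNNING, as described above. */
    void resume() {
        state.set(SuspendState.RUNNING);
    }
}
```

Because resume() is unconditional, a late suspendComplete() after a resume simply fails its compare-and-set and leaves the server RUNNING.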
>>>>>>>>>
>>>>>>>>> We also have admin only mode, which is a kinda similar concept, so we
>>>>>>>>> need to make sure we document the differences.
>>>>>>>>>
>>>>>>>>>>> A timeout attribute will also be added to the shutdown
>>>>>>>>>>> operation. If
>>>>>>>>>>> this is present then the server will first be suspended, and the
>>>>>>>>>>> server
>>>>>>>>>>> will not shut down until either the suspend is successful or the
>>>>>>>>>>> timeout
>>>>>>>>>>> occurs. If no timeout parameter is passed to the operation then a
>>>>>>>>>>> normal
>>>>>>>>>>> non-graceful shutdown will take place.
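A sketch of the shutdown flow just described, under stated assumptions: the Server interface is a hypothetical stand-in for the real server, and the timeout unit is passed explicitly because the proposal does not say which unit the parameter uses.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class GracefulShutdown {
    /** Hypothetical stand-in for the server; not a WildFly interface. */
    interface Server {
        /** Starts suspending; counts the latch down once suspend has completed. */
        void suspend(CountDownLatch suspended);
        /** Stops all services, as a normal shutdown does today. */
        void stopServices();
    }

    /** No timeout given: normal, non-graceful shutdown. */
    static void shutdown(Server server) {
        server.stopServices();
    }

    /**
     * Timeout given: suspend first, then shut down when the suspend completes
     * or the timeout expires, whichever comes first.
     * Returns true if the suspend completed in time.
     */
    static boolean shutdown(Server server, long timeout, TimeUnit unit) {
        CountDownLatch suspended = new CountDownLatch(1);
        server.suspend(suspended);
        boolean drained = false;
        try {
            drained = suspended.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the flag, shut down anyway
        }
        server.stopServices();
        return drained;
    }
}
```

Either way the services are stopped. The return value only records whether in-flight requests had finished by then.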
>>>>>>>>>> Will non-graceful shutdown wait for non-daemon threads or terminate
>>>>>>>>>> immediately (call System.exit())?
>>>>>>>>> It will execute the same way it does today (all services will shut
>>>>>>>>> down
>>>>>>>>> and then the server will exit).
>>>>>>>>>
>>>>>>>>> Stuart
>>>>>>>>>
>>>>>>>>>>> In domain mode these operations will be added to both individual
>>>>>>>>>>> servers
>>>>>>>>>>> and complete server groups.
>>>>>>>>>>>
>>>>>>>>>>> Implementation Details
>>>>>>>>>>>
>>>>>>>>>>> Suspend/resume operates on entry points to the server. Any request
>>>>>>>>>>> that
>>>>>>>>>>> is currently running must not be affected by the suspend state,
>>>>>>>>>>> however
>>>>>>>>>>> any new request should be rejected. In general subsystems will
>>>>>>>>>>> track the
>>>>>>>>>>> number of outstanding requests, and when this hits zero they are
>>>>>>>>>>> considered suspended.
>>>>>>>>>>>
>>>>>>>>>>> We will introduce the notion of a global SuspendController, that
>>>>>>>>>>> manages
>>>>>>>>>>> the server's suspend state. All subsystems that wish to do a
>>>>>>>>>>> graceful
>>>>>>>>>>> shutdown register callback handlers with this controller.
>>>>>>>>>>>
>>>>>>>>>>> When the suspend() operation is invoked the controller will invoke
>>>>>>>>>>> all
>>>>>>>>>>> these callbacks, letting the subsystem know that the server is
>>>>>>>>>>> suspending,
>>>>>>>>>>> and providing the subsystem with a SuspendContext object that the
>>>>>>>>>>> subsystem can then use to notify the controller that the suspend is
>>>>>>>>>>> complete.
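One way the controller/callback handshake could look, as a rough sketch only. SuspendController and SuspendContext are the names used in the proposal, but every method name here (register, suspending, resumed, complete) and the generation counter are assumptions, not the actual design. The sketch also assumes each callback calls complete() at most once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Handed to each callback; the subsystem calls complete() once it has drained. */
interface SuspendContext {
    void complete();
}

/** Hypothetical callback a subsystem registers with the controller. */
interface SuspendCallback {
    void suspending(SuspendContext context);
    void resumed();
}

final class SuspendControllerSketch {
    enum State { RUNNING, SUSPENDING, SUSPENDED }

    private final List<SuspendCallback> callbacks = new ArrayList<>();
    private State state = State.RUNNING;
    private int generation; // bumped on suspend/resume so stale completions are ignored

    synchronized void register(SuspendCallback callback) {
        callbacks.add(callback);
    }

    synchronized State state() {
        return state;
    }

    synchronized void suspend() {
        if (state != State.RUNNING) return;
        state = State.SUSPENDING;
        final int gen = ++generation;
        if (callbacks.isEmpty()) {
            state = State.SUSPENDED;
            return;
        }
        final AtomicInteger remaining = new AtomicInteger(callbacks.size());
        for (SuspendCallback callback : callbacks) {
            callback.suspending(() -> completed(gen, remaining));
        }
    }

    /** Called through a SuspendContext; the last completion flips the state. */
    private synchronized void completed(int gen, AtomicInteger remaining) {
        if (remaining.decrementAndGet() == 0 && gen == generation) {
            state = State.SUSPENDED;
        }
    }

    synchronized void resume() {
        generation++;
        state = State.RUNNING;
        for (SuspendCallback callback : callbacks) {
            callback.resumed();
        }
    }
}
```

The controller only reports SUSPENDED after every registered subsystem has called back, and a completion that arrives after resume() is ignored.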
>>>>>>>>>>>
>>>>>>>>>>> What the subsystem does when it receives a suspend command, and
>>>>>>>>>>> when it
>>>>>>>>>>> considers itself suspended will vary, but in the common case it
>>>>>>>>>>> will
>>>>>>>>>>> immediately start rejecting external requests (e.g. Undertow will
>>>>>>>>>>> start
>>>>>>>>>>> responding with a 503 to all new requests). The subsystem will also
>>>>>>>>>>> track the number of outstanding requests, and when this hits zero
>>>>>>>>>>> then
>>>>>>>>>>> the subsystem will notify the controller that it has successfully
>>>>>>>>>>> suspended.
>>>>>>>>>>> Some subsystems will obviously want to do other actions on
>>>>>>>>>>> suspend, e.g.
>>>>>>>>>>> clustering will likely want to fail over, mod_cluster will
>>>>>>>>>>> notify the
>>>>>>>>>>> load balancer that the node is no longer available etc. In some
>>>>>>>>>>> cases we
>>>>>>>>>>> may want to make this configurable to an extent (e.g. Undertow
>>>>>>>>>>> could be
>>>>>>>>>>> configured to allow requests with an existing session, and not
>>>>>>>>>>> consider
>>>>>>>>>>> itself suspended until all sessions have either timed out or been
>>>>>>>>>>> invalidated, although this will obviously take a while).
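The common case described above (reject new requests, count outstanding ones, report back at zero) could be sketched with a small gate like the one below. RequestGate and its methods are hypothetical and not Undertow API. A web entry point would answer 503 whenever tryBegin() returns false.

```java
/** Hypothetical per-subsystem gate that tracks in-flight requests. */
final class RequestGate {
    private final Object lock = new Object();
    private int active;
    private boolean suspended;
    private Runnable onDrained;

    /** Returns false if the request must be rejected (e.g. with an HTTP 503). */
    boolean tryBegin() {
        synchronized (lock) {
            if (suspended) return false;
            active++;
            return true;
        }
    }

    /** Must be called once for every request that tryBegin() admitted. */
    void end() {
        Runnable notify = null;
        synchronized (lock) {
            active--;
            if (suspended && active == 0) {
                notify = onDrained;
                onDrained = null;
            }
        }
        if (notify != null) notify.run(); // run outside the lock
    }

    /** Stops admitting requests; runs drained once nothing is outstanding. */
    void suspend(Runnable drained) {
        boolean alreadyIdle;
        synchronized (lock) {
            suspended = true;
            alreadyIdle = (active == 0);
            if (!alreadyIdle) onDrained = drained;
        }
        if (alreadyIdle) drained.run();
    }

    /** Starts admitting requests again straight away. */
    void resume() {
        synchronized (lock) {
            suspended = false;
            onDrained = null;
        }
    }
}
```

Requests that were already admitted are never interrupted. The drained callback is where the subsystem would notify the controller that its suspend is complete.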
>>>>>>>>>>>
>>>>>>>>>>> If anyone has any feedback let me know. In terms of
>>>>>>>>>>> implementation my
>>>>>>>>>>> basic plan is to get the core functionality and the Undertow
>>>>>>>>>>> implementation into Wildfly, and then work with subsystem
>>>>>>>>>>> authors to
>>>>>>>>>>> implement subsystem specific functionality once the core is in
>>>>>>>>>>> place.
>>>>>>>>>>>
>>>>>>>>>>> Stuart
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> wildfly-dev mailing list
>>>>>>>>>>> wildfly-dev at lists.jboss.org
>>>>>>>>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>>>>>>>>>

-- 
Brian Stansberry
Senior Principal Software Engineer
JBoss by Red Hat