[wildfly-dev] Design Proposal: Server suspend/resume (AKA Graceful Shutdown)

Stuart Douglas stuart.w.douglas at gmail.com
Thu Jun 12 10:45:00 EDT 2014



Brian Stansberry wrote:
> The STARTING state in the existing attribute makes me think an
> equivalent thing is needed for this concept.

This is a good idea. We could do this by having a serverStarted()
notification that gets sent to all subsystems. This would allow them to
start in a paused state and only allow access after the server is up.
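
A minimal sketch of what "start in a paused state" could look like for a
single entry point. Only the serverStarted() name comes from the idea
above; the StartupGate class and its tryAccept() method are hypothetical
and purely illustrative, not existing WildFly API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: an entry point gate that starts closed (paused)
// and opens only once the server reports that boot has finished.
final class StartupGate {
    // false until the serverStarted() notification arrives
    private final AtomicBoolean open = new AtomicBoolean(false);

    /** Assumed to be invoked once by the server when boot has completed. */
    void serverStarted() {
        open.set(true);
    }

    /** Entry points would call this before dispatching an external request. */
    boolean tryAccept() {
        return open.get();
    }
}
```

A subsystem would reject (or queue) any request for which tryAccept()
returns false, which avoids serving requests before deployments are ready.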

Stuart

>
> STARTING in the existing means the runtime services are possibly out of
> sync due to boot.
>
> Doesn't a similar problem exist with RUNNING, SUSPENDING, SUSPENDED?
> It's about how the server is reacting to external requests. There's some
> state during boot/reload when the server is not reacting normally to
> external requests.
>
> Perhaps that's just another condition where the server is SUSPENDED.
>
> This leads to whether this whole mechanism can be used to provide
> "Graceful Startup". We have problems with this now; endpoints accepting
> requests before everything is fully ready, leading to things like 404s
> because a deployment isn't installed yet.
>
> On 6/11/14, 11:21 AM, Brian Stansberry wrote:
>> I do think these are orthogonal and should not be combined.
>>
>> The existing attribute is fundamentally about how the state of the
>> runtime services relates to the persistent configuration.
>>
>> STARTING == out of sync due to still getting in sync during start
>> RUNNING == in sync
>> RELOAD_REQUIRED == out of sync, needs a reload to get in sync
>> RESTART_REQUIRED == out of sync, needs a full process restart to get in sync
>>
>> There are two problems though with the existing attribute that exposes this:
>>
>> 1) It's named "server-state" on a server and "host-state" on a Host
>> Controller. Really crappy name; way too broad.
>>
>> That's fixable by creating a new attribute and making the old one an
>> alias for compatibility purposes.
>>
>> 2) The RUNNING state is really poorly named.
>>
>> This could perhaps be fixed by coming up with a new name and translating
>> it back to "RUNNING" in the handlers for the legacy "server-state" and
>> "host-state" attributes.
>>
>>
>> On 6/10/14, 11:21 AM, Dimitris Andreadis wrote:
>>> Sure. Which justifies trying to avoid those issues in the first place ;)
>>>
>>> On 10/06/2014 17:50, Stuart Douglas wrote:
>>>> We can't really change that now, as it is part of our existing API.
>>>>
>>>> Stuart
>>>>
>>>> Dimitris Andreadis wrote:
>>>>> It seems to me RESTART_REQUIRED (or RELOAD_REQUIRED) should be a boolean
>>>>> on its own to simplify the state diagram.
>>>>>
>>>>> On 10/06/2014 17:40, Stuart Douglas wrote:
>>>>>> I don't think so, I think RESTART_REQUIRED means running, but I need
>>>>>> to restart to apply
>>>>>> management changes (I think that attribute can also be
>>>>>> RELOAD_REQUIRED, I think the
>>>>>> description may be a bit out of date).
>>>>>>
>>>>>> To accurately reflect all the possible states you would need something
>>>>>> like:
>>>>>>
>>>>>> RUNNING
>>>>>> PAUSING
>>>>>> PAUSED
>>>>>> RESTART_REQUIRED
>>>>>> PAUSING_RESTART_REQUIRED
>>>>>> PAUSED_RESTART_REQUIRED
>>>>>> RELOAD_REQUIRED
>>>>>> PAUSING_RELOAD_REQUIRED
>>>>>> PAUSED_RELOAD_REQUIRED
>>>>>>
>>>>>> Which does not seem great, and may introduce compatibility problems
>>>>>> for clients that are not
>>>>>> expecting these new values.
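
To illustrate why folding both concerns into one attribute multiplies the
values, here is a rough sketch that models them as two independent enums
and derives the combined names listed above. The enum and class names
(ConfigSync, Activity, CombinedState) are hypothetical, chosen only for
this example.

```java
// Hypothetical model: two orthogonal dimensions instead of one flat list.
enum ConfigSync { IN_SYNC, RELOAD_REQUIRED, RESTART_REQUIRED }

enum Activity { RUNNING, PAUSING, PAUSED }

final class CombinedState {
    /** Builds the single-attribute name that a flat state list would need. */
    static String combinedName(Activity activity, ConfigSync sync) {
        if (sync == ConfigSync.IN_SYNC) {
            return activity.name();            // RUNNING, PAUSING, PAUSED
        }
        if (activity == Activity.RUNNING) {
            return sync.name();                // RELOAD_REQUIRED, RESTART_REQUIRED
        }
        return activity.name() + "_" + sync.name();
    }

    /** Number of values a flat attribute must expose: 3 x 3 = 9. */
    static int count() {
        return Activity.values().length * ConfigSync.values().length;
    }
}
```

Keeping the two dimensions as separate attributes needs only 3 + 3 values
and leaves the existing attribute's value set unchanged for current clients.
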
>>>>>>
>>>>>> Stuart
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dimitris Andreadis wrote:
>>>>>>> Isn't RESTART_REQUIRED also orthogonal to RUNNING?
>>>>>>>
>>>>>>> On 10/06/2014 17:17, Stuart Douglas wrote:
>>>>>>>> They are actually orthogonal, a server can be in both RESTART_REQUIRED
>>>>>>>> and any one of the
>>>>>>>> suspend states.
>>>>>>>>
>>>>>>>> RESTART_REQUIRED is very much tied to services and the management
>>>>>>>> model, while
>>>>>>>> suspend/resume is a runtime only thing that should not touch the state
>>>>>>>> of services.
>>>>>>>>
>>>>>>>>
>>>>>>>> Stuart
>>>>>>>>
>>>>>>>> Dimitris Andreadis wrote:
>>>>>>>>> Why not extend the states of the existing 'server-state' attribute to:
>>>>>>>>>
>>>>>>>>> (STARTING, RUNNING, SUSPENDING, SUSPENDED, RESTART_REQUIRED RUNNING)
>>>>>>>>>
>>>>>>>>> http://wildscribe.github.io/Wildfly/8.0.0.Final/index.html
>>>>>>>>>
>>>>>>>>> On 10/06/2014 04:40, Stuart Douglas wrote:
>>>>>>>>>> Scott Marlow wrote:
>>>>>>>>>>> On 06/09/2014 06:38 PM, Stuart Douglas wrote:
>>>>>>>>>>>> Server suspend and resume is a feature that allows a running server to
>>>>>>>>>>>> gracefully finish off all running requests. The most common use case for
>>>>>>>>>>>> this is graceful shutdown, where you would like a server to complete all
>>>>>>>>>>>> running requests, reject any new ones, and then shut down. However, there
>>>>>>>>>>>> are also plenty of other valid use cases (e.g. suspend the server,
>>>>>>>>>>>> modify a data source or some other config, then resume).
>>>>>>>>>>>>
>>>>>>>>>>>> User View:
>>>>>>>>>>>>
>>>>>>>>>>>> From the user's point of view, two new operations will be added to
>>>>>>>>>>>> the server:
>>>>>>>>>>>>
>>>>>>>>>>>> suspend(timeout)
>>>>>>>>>>>> resume()
>>>>>>>>>>>>
>>>>>>>>>>>> A runtime only attribute suspend-state (is this a good name?) will
>>>>>>>>>>>> also
>>>>>>>>>>>> be added, that can take one of three possible values, RUNNING,
>>>>>>>>>>>> SUSPENDING, SUSPENDED.
>>>>>>>>>>> The SuspendController "state" might be a shorter attribute name and
>>>>>>>>>>> just
>>>>>>>>>>> as meaningful.
>>>>>>>>>> This will be in the global server namespace (i.e. from the CLI,
>>>>>>>>>> :read-attribute(name="suspend-state")).
>>>>>>>>>>
>>>>>>>>>> I think the name 'state' is just too generic; which kind of state
>>>>>>>>>> are we talking about?
>>>>>>>>>>
>>>>>>>>>>> When are we in the RUNNING state? Is that simply the pre-state for
>>>>>>>>>>> SUSPENDING?
>>>>>>>>>> 99.99% of the time. Basically servers are always running unless they
>>>>>>>>>> have been explicitly suspended, and then they go from suspending to
>>>>>>>>>> suspended. Note that if resume is called at any time the server
>>>>>>>>>> goes to
>>>>>>>>>> RUNNING again immediately, as when subsystems are notified they
>>>>>>>>>> should
>>>>>>>>>> be able to begin accepting requests again straight away.
>>>>>>>>>>
>>>>>>>>>> We also have admin only mode, which is a kinda similar concept, so we
>>>>>>>>>> need to make sure we document the differences.
>>>>>>>>>>
>>>>>>>>>>>> A timeout attribute will also be added to the shutdown
>>>>>>>>>>>> operation. If
>>>>>>>>>>>> this is present then the server will first be suspended, and the
>>>>>>>>>>>> server
>>>>>>>>>>>> will not shut down until either the suspend is successful or the
>>>>>>>>>>>> timeout
>>>>>>>>>>>> occurs. If no timeout parameter is passed to the operation then a
>>>>>>>>>>>> normal
>>>>>>>>>>>> non-graceful shutdown will take place.
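
The shutdown behaviour described above could be sketched roughly as
follows. Everything here is a hypothetical illustration: ShutdownSketch,
the Outcome enum, and the use of a CountDownLatch to signal "suspend
complete" are assumptions, not the actual implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of shutdown with an optional timeout.
final class ShutdownSketch {
    enum Outcome { NON_GRACEFUL, SUSPENDED_THEN_STOPPED, TIMED_OUT_THEN_STOPPED }

    /**
     * @param timeoutMillis null means "no timeout passed": normal non-graceful shutdown
     * @param suspend       starts the suspend process
     * @param suspended     counted down once the suspend has completed
     * @param stopServices  the existing shutdown path (stop services, then exit)
     */
    static Outcome shutdown(Long timeoutMillis, Runnable suspend,
                            CountDownLatch suspended, Runnable stopServices) {
        if (timeoutMillis == null) {
            stopServices.run();
            return Outcome.NON_GRACEFUL;
        }
        suspend.run();
        boolean done = false;
        try {
            // Wait until either the suspend succeeds or the timeout occurs.
            done = suspended.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // treat interruption like a timeout
        }
        stopServices.run();
        return done ? Outcome.SUSPENDED_THEN_STOPPED : Outcome.TIMED_OUT_THEN_STOPPED;
    }
}
```

In both timeout branches the server still shuts down; the timeout only
bounds how long in-flight requests are given to finish.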
>>>>>>>>>>> Will non-graceful shutdown wait for non-daemon threads or terminate
>>>>>>>>>>> immediately (call System.exit())?
>>>>>>>>>> It will execute the same way it does today (all services will shut
>>>>>>>>>> down
>>>>>>>>>> and then the server will exit).
>>>>>>>>>>
>>>>>>>>>> Stuart
>>>>>>>>>>
>>>>>>>>>>>> In domain mode these operations will be added to both individual
>>>>>>>>>>>> servers and complete server groups.
>>>>>>>>>>>>
>>>>>>>>>>>> Implementation Details
>>>>>>>>>>>>
>>>>>>>>>>>> Suspend/resume operates on entry points to the server. Any request
>>>>>>>>>>>> that
>>>>>>>>>>>> is currently running must not be affected by the suspend state,
>>>>>>>>>>>> however
>>>>>>>>>>>> any new request should be rejected. In general subsystems will
>>>>>>>>>>>> track the
>>>>>>>>>>>> number of outstanding requests, and when this hits zero they are
>>>>>>>>>>>> considered suspended.
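
As a rough illustration of the request tracking described above, an entry
point might count outstanding requests like this. RequestTracker and its
method names are hypothetical; this is a sketch under that assumption, not
the code any subsystem actually uses.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-subsystem tracker for requests entering the server.
final class RequestTracker {
    private final AtomicInteger active = new AtomicInteger();
    private volatile boolean suspending;

    /** Returns true if the request may proceed, false if it must be rejected. */
    boolean begin() {
        // Increment first, then check the flag, so a concurrent suspend()
        // can never observe zero while a request is slipping in.
        active.incrementAndGet();
        if (suspending) {
            active.decrementAndGet();
            return false;
        }
        return true;
    }

    /** Must be called exactly once for every request that begin() admitted. */
    void end() {
        active.decrementAndGet();
    }

    void suspend() { suspending = true; }

    void resume() { suspending = false; }

    /** Suspended means: rejecting new requests and none still outstanding. */
    boolean isSuspended() {
        return suspending && active.get() == 0;
    }
}
```

Requests already admitted are unaffected by suspend(); only new calls to
begin() are turned away.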
>>>>>>>>>>>>
>>>>>>>>>>>> We will introduce the notion of a global SuspendController that manages
>>>>>>>>>>>> the server's suspend state. All subsystems that wish to do a graceful
>>>>>>>>>>>> shutdown register callback handlers with this controller.
>>>>>>>>>>>>
>>>>>>>>>>>> When the suspend() operation is invoked the controller will invoke all
>>>>>>>>>>>> these callbacks, letting the subsystem know that the server is suspending,
>>>>>>>>>>>> and providing the subsystem with a SuspendContext object that the
>>>>>>>>>>>> subsystem can then use to notify the controller that the suspend is
>>>>>>>>>>>> complete.
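
One possible shape for the controller and callback contract, as a sketch
only. The names SuspendController and SuspendContext come from the
proposal; the SuspendCallback interface, all method names, and the
omission of the suspend timeout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Handed to each subsystem so it can report that its suspend is complete. */
interface SuspendContext {
    void complete();
}

/** Hypothetical callback a subsystem registers with the controller. */
interface SuspendCallback {
    void suspendStarted(SuspendContext context);
    void resumed();
}

final class SuspendController {
    enum State { RUNNING, SUSPENDING, SUSPENDED }

    private final List<SuspendCallback> callbacks = new ArrayList<>();
    private State state = State.RUNNING;
    private AtomicInteger currentRound; // pending completions for the active suspend

    synchronized void register(SuspendCallback callback) { callbacks.add(callback); }

    synchronized State state() { return state; }

    synchronized void suspend() {
        if (state != State.RUNNING) return;
        state = State.SUSPENDING;
        AtomicInteger pending = new AtomicInteger(callbacks.size());
        currentRound = pending;
        if (pending.get() == 0) { state = State.SUSPENDED; return; }
        for (SuspendCallback callback : callbacks) {
            AtomicBoolean done = new AtomicBoolean(); // each context completes once
            callback.suspendStarted(() -> {
                if (done.compareAndSet(false, true) && pending.decrementAndGet() == 0) {
                    finish(pending);
                }
            });
        }
    }

    private synchronized void finish(AtomicInteger round) {
        // Ignore completions that belong to a suspend cancelled by resume().
        if (currentRound == round && state == State.SUSPENDING) state = State.SUSPENDED;
    }

    synchronized void resume() {
        currentRound = null;
        state = State.RUNNING; // RUNNING again immediately
        for (SuspendCallback callback : callbacks) callback.resumed();
    }
}
```

A subsystem may call complete() synchronously from suspendStarted() if it
has nothing outstanding, or later from another thread once its request
count reaches zero.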
>>>>>>>>>>>>
>>>>>>>>>>>> What the subsystem does when it receives a suspend command, and when it
>>>>>>>>>>>> considers itself suspended, will vary, but in the common case it will
>>>>>>>>>>>> immediately start rejecting external requests (e.g. Undertow will start
>>>>>>>>>>>> responding with a 503 to all new requests). The subsystem will also
>>>>>>>>>>>> track the number of outstanding requests, and when this hits zero
>>>>>>>>>>>> the subsystem will notify the controller that it has successfully
>>>>>>>>>>>> suspended.
>>>>>>>>>>>> Some subsystems will obviously want to do other actions on suspend, e.g.
>>>>>>>>>>>> clustering will likely want to fail over, mod_cluster will notify the
>>>>>>>>>>>> load balancer that the node is no longer available, etc. In some cases we
>>>>>>>>>>>> may want to make this configurable to an extent (e.g. Undertow could be
>>>>>>>>>>>> configured to allow requests with an existing session, and not consider
>>>>>>>>>>>> itself suspended until all sessions have either timed out or been
>>>>>>>>>>>> invalidated, although this will obviously take a while).
>>>>>>>>>>>>
>>>>>>>>>>>> If anyone has any feedback let me know. In terms of implementation my
>>>>>>>>>>>> basic plan is to get the core functionality and the Undertow
>>>>>>>>>>>> implementation into WildFly, and then work with subsystem authors to
>>>>>>>>>>>> implement subsystem specific functionality once the core is in place.
>>>>>>>>>>>>
>>>>>>>>>>>> Stuart
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> wildfly-dev mailing list
>>>>>>>>>>>> wildfly-dev at lists.jboss.org
>>>>>>>>>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>>>>>>>>>>
>>>>>>>>>>
>>>
>>
>
>

