Re: [wildfly-dev] Design Proposal: Server suspend/resume (AKA Graceful Shutdown)

Thursday, 12 June 2014

Brian Stansberry wrote:
...
 The STARTING state in the existing attribute makes me think an
 equivalent thing is needed for this concept. 
This is a good idea. We could do this by having a serverStarted()
notification that gets sent to all subsystems. This would allow them to
start in a paused state and only allow access once the
server is up.
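A minimal Java sketch of the serverStarted() idea. ServerLifecycleListener, PausedAtBootEndpoint and BootNotifier are hypothetical names, not existing WildFly API. Each subsystem boots paused and answers external requests with a 503. It only starts serving once the notification arrives at the end of boot.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical callback, not an existing WildFly interface.
interface ServerLifecycleListener {
    void serverStarted();
}

// Hypothetical subsystem entry point that boots in a paused state.
class PausedAtBootEndpoint implements ServerLifecycleListener {
    private volatile boolean accepting = false; // paused until boot completes

    @Override
    public void serverStarted() {
        accepting = true; // the whole server is up, start serving
    }

    // HTTP-style status for an external request: 503 while paused, 200 after.
    int handle() {
        return accepting ? 200 : 503;
    }
}

// Hypothetical broadcaster the server would call once, at the end of boot.
class BootNotifier {
    private final List<ServerLifecycleListener> listeners = new CopyOnWriteArrayList<>();

    void register(ServerLifecycleListener listener) {
        listeners.add(listener);
    }

    void bootComplete() {
        for (ServerLifecycleListener listener : listeners) {
            listener.serverStarted();
        }
    }
}
```

Answering 503 until boot completes is one way to avoid the boot-time 404s raised in this thread. The alternative is routing requests to a deployment that is not installed yet.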

Stuart

...

 STARTING in the existing attribute means the runtime services are possibly
 out of sync due to boot.

 Doesn't a similar problem exist with RUNNING, SUSPENDING, SUSPENDED?
 It's about how the server is reacting to external requests. There's a
 period during boot/reload when the server is not reacting normally to
 external requests.

 Perhaps that's just another condition where the server is SUSPENDED.

 This raises the question of whether this whole mechanism can be used to provide
 "Graceful Startup". We have problems with this now: endpoints accept
 requests before everything is fully ready, leading to things like 404s
 because a deployment isn't installed yet.

 On 6/11/14, 11:21 AM, Brian Stansberry wrote:
> I do think these are orthogonal and should not be combined.
>
> The existing attribute is fundamentally about how the state of the
> runtime services relates to the persistent configuration.
>
> STARTING == out of sync due to still getting in sync during start
> RUNNING == in sync
> RELOAD_REQUIRED == out of sync, needs a reload to get in sync
> RESTART_REQUIRED == out of sync, needs a full process restart to get in sync
>
> There are two problems, though, with the existing attribute that exposes this:
>
> 1) It's named "server-state" on a server and "host-state" on a Host
> Controller. Really crappy name; way too broad.
>
> That's fixable by creating a new attribute and making the old one an
> alias for compatibility purposes.
>
> 2) The RUNNING state is really poorly named.
>
> That could perhaps be fixed by coming up with a new name and translating
> it back to "RUNNING" in the handlers for the legacy "server-state" and
> "host-state" attributes.
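One way to picture the rename-plus-alias approach, as a sketch only. IN_SYNC is a placeholder, since the thread does not pick a new name for RUNNING. legacyValue() is a hypothetical helper standing in for what the legacy "server-state" / "host-state" handlers would report.

```java
// Hypothetical values for a new, narrowly named attribute describing how the
// runtime services relate to the persistent configuration.
enum ConfigSyncState {
    STARTING,          // out of sync: still getting in sync during start
    IN_SYNC,           // placeholder name for today's RUNNING
    RELOAD_REQUIRED,   // out of sync: needs a reload
    RESTART_REQUIRED;  // out of sync: needs a full process restart

    // What the legacy attribute handlers would return, for compatibility.
    String legacyValue() {
        return this == IN_SYNC ? "RUNNING" : name();
    }

    boolean inSync() {
        return this == IN_SYNC;
    }
}
```

Old clients keep seeing the four values they already know. The new attribute is free to use clearer names.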
>
>
> On 6/10/14, 11:21 AM, Dimitris Andreadis wrote:
>> Sure. Which justifies trying to avoid those issues in the first place ;)
>>
>> On 10/06/2014 17:50, Stuart Douglas wrote:
>>> We can't really change that now, as it is part of our existing API.
>>>
>>> Stuart
>>>
>>> Dimitris Andreadis wrote:
>>>> It seems to me RESTART_REQUIRED (or RELOAD_REQUIRED) should be a boolean
>>>> on its own to simplify the state diagram.
>>>>
>>>> On 10/06/2014 17:40, Stuart Douglas wrote:
>>>>> I don't think so. I think RESTART_REQUIRED means running, but I need
>>>>> to restart to apply management changes (I think that attribute can also
>>>>> be RELOAD_REQUIRED; the description may be a bit out of date).
>>>>>
>>>>> To accurately reflect all the possible states you would need something
>>>>> like:
>>>>>
>>>>> RUNNING
>>>>> PAUSING
>>>>> PAUSED
>>>>> RESTART_REQUIRED
>>>>> PAUSING_RESTART_REQUIRED
>>>>> PAUSED_RESTART_REQUIRED
>>>>> RELOAD_REQUIRED
>>>>> PAUSING_RELOAD_REQUIRED
>>>>> PAUSED_RELOAD_REQUIRED
>>>>>
>>>>> Which does not seem great, and may introduce compatibility problems
>>>>> for clients that are not expecting these new values.
>>>>>
>>>>> Stuart
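A small sketch of the orthogonality argument. It assumes two hypothetical enums (PauseState, ConfigState) that would back two separate attributes. Flattening them into one attribute needs one value per pair, which reproduces the nine-value list above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical enums backing two separate attributes; names are illustrative.
enum PauseState { RUNNING, PAUSING, PAUSED }
enum ConfigState { IN_SYNC, RESTART_REQUIRED, RELOAD_REQUIRED }

class CombinedStates {
    // Flattening both dimensions into one attribute needs a value per pair.
    static List<String> flatten() {
        List<String> values = new ArrayList<>();
        for (ConfigState config : ConfigState.values()) {
            for (PauseState pause : PauseState.values()) {
                if (config == ConfigState.IN_SYNC) {
                    values.add(pause.name());                       // RUNNING, PAUSING, PAUSED
                } else if (pause == PauseState.RUNNING) {
                    values.add(config.name());                      // e.g. RESTART_REQUIRED
                } else {
                    values.add(pause.name() + "_" + config.name()); // e.g. PAUSED_RELOAD_REQUIRED
                }
            }
        }
        return values;
    }
}
```

Keeping the two attributes separate means a future suspend state adds one value, not three.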
>>>>>
>>>>>
>>>>>
>>>>> Dimitris Andreadis wrote:
>>>>>> Isn't RESTART_REQUIRED also orthogonal to RUNNING?
>>>>>>
>>>>>> On 10/06/2014 17:17, Stuart Douglas wrote:
>>>>>>> They are actually orthogonal; a server can be in both RESTART_REQUIRED
>>>>>>> and any one of the suspend states.
>>>>>>>
>>>>>>> RESTART_REQUIRED is very much tied to services and the management
>>>>>>> model, while suspend/resume is a runtime-only thing that should not
>>>>>>> touch the state of services.
>>>>>>>
>>>>>>>
>>>>>>> Stuart
>>>>>>>
>>>>>>> Dimitris Andreadis wrote:
>>>>>>>> Why not extend the states of the existing 'server-state' attribute to:
>>>>>>>>
>>>>>>>> (STARTING, RUNNING, SUSPENDING, SUSPENDED, RESTART_REQUIRED RUNNING)
>>>>>>>>
>>>>>>>>
>>>>>>>> http://wildscribe.github.io/Wildfly/8.0.0.Final/index.html
>>>>>>>>
>>>>>>>> On 10/06/2014 04:40, Stuart Douglas wrote:
>>>>>>>>> Scott Marlow wrote:
>>>>>>>>>> On 06/09/2014 06:38 PM, Stuart Douglas wrote:
>>>>>>>>>>> Server suspend and resume is a feature that allows a running server
>>>>>>>>>>> to gracefully finish all running requests. The most common use case
>>>>>>>>>>> for this is graceful shutdown, where you would like a server to
>>>>>>>>>>> complete all running requests, reject any new ones, and then shut
>>>>>>>>>>> down. However, there are also plenty of other valid use cases (e.g.
>>>>>>>>>>> suspend the server, modify a data source or some other config, then
>>>>>>>>>>> resume).
>>>>>>>>>>>
>>>>>>>>>>> User View:
>>>>>>>>>>>
>>>>>>>>>>> From the user's point of view, two new operations will be added to
>>>>>>>>>>> the server:
>>>>>>>>>>>
>>>>>>>>>>> suspend(timeout)
>>>>>>>>>>> resume()
>>>>>>>>>>>
>>>>>>>>>>> A runtime-only attribute suspend-state (is this a good name?) will
>>>>>>>>>>> also be added, which can take one of three possible values: RUNNING,
>>>>>>>>>>> SUSPENDING, SUSPENDED.
>>>>>>>>>> The SuspendController "state" might be a shorter attribute name and
>>>>>>>>>> just as meaningful.
>>>>>>>>> This will be in the global server namespace (i.e. from the CLI
>>>>>>>>> :read-attribute(name="suspend-state")).
>>>>>>>>>
>>>>>>>>> I think the name 'state' is just too generic: which kind of state
>>>>>>>>> are we talking about?
>>>>>>>>>
>>>>>>>>>> When are we in the RUNNING state? Is that simply the pre-state for
>>>>>>>>>> SUSPENDING?
>>>>>>>>> 99.99% of the time. Basically servers are always running unless they
>>>>>>>>> have been explicitly suspended, and then they go from SUSPENDING to
>>>>>>>>> SUSPENDED. Note that if resume is called at any time the server goes
>>>>>>>>> to RUNNING again immediately, as subsystems should be able to begin
>>>>>>>>> accepting requests again as soon as they are notified.
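A sketch of the suspend-state transitions described here, assuming a hypothetical SuspendStateHolder backed by an AtomicReference. The holder applies three transitions:

- beginSuspend() moves RUNNING to SUSPENDING.
- suspendComplete() moves SUSPENDING to SUSPENDED.
- resume() returns to RUNNING from any state.

```java
import java.util.concurrent.atomic.AtomicReference;

// The three values proposed for the runtime-only suspend-state attribute.
enum SuspendState { RUNNING, SUSPENDING, SUSPENDED }

// Hypothetical holder enforcing the transitions; not WildFly's implementation.
class SuspendStateHolder {
    private final AtomicReference<SuspendState> state =
            new AtomicReference<>(SuspendState.RUNNING);

    SuspendState get() {
        return state.get();
    }

    // Invoked by suspend(); only valid from RUNNING.
    boolean beginSuspend() {
        return state.compareAndSet(SuspendState.RUNNING, SuspendState.SUSPENDING);
    }

    // Invoked once every subsystem has reported that it is suspended.
    boolean suspendComplete() {
        return state.compareAndSet(SuspendState.SUSPENDING, SuspendState.SUSPENDED);
    }

    // resume() goes straight back to RUNNING, whatever the current state.
    void resume() {
        state.set(SuspendState.RUNNING);
    }
}
```

Using compareAndSet means a completion that arrives after resume() has already put the server back to RUNNING is ignored. It does not flip the server to SUSPENDED.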
>>>>>>>>>
>>>>>>>>> We also have admin-only mode, which is a kinda similar concept, so
>>>>>>>>> we need to make sure we document the differences.
>>>>>>>>>
>>>>>>>>>>> A timeout attribute will also be added to the shutdown operation. If
>>>>>>>>>>> this is present then the server will first be suspended, and the
>>>>>>>>>>> server will not shut down until either the suspend is successful or
>>>>>>>>>>> the timeout occurs. If no timeout parameter is passed to the
>>>>>>>>>>> operation then a normal non-graceful shutdown will take place.
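A sketch of the proposed shutdown behaviour using a hypothetical GracefulShutdown class. A CountDownLatch stands in for "every subsystem has reported suspended". With a timeout, it suspends first and waits at most that long. With no timeout, it goes straight to the normal shutdown path.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical shutdown handler; not the actual WildFly shutdown operation.
class GracefulShutdown {
    // Counted down once every subsystem has reported that it is suspended.
    private final CountDownLatch suspended = new CountDownLatch(1);

    void onSuspendComplete() {
        suspended.countDown();
    }

    // timeoutMillis == null means a normal, non-graceful shutdown.
    // startSuspend stands in for invoking suspend() on the controller.
    // Returns true if the suspend finished before services were stopped.
    boolean shutdown(Long timeoutMillis, Runnable startSuspend) {
        boolean clean = false;
        if (timeoutMillis != null) {
            startSuspend.run();
            try {
                clean = suspended.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the flag, shut down anyway
            }
        }
        stopServices();
        return clean;
    }

    private void stopServices() {
        // Placeholder for today's behaviour: stop all services, then exit.
    }
}
```

Either way the services are stopped. The return value only records whether the suspend finished before the timeout.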
>>>>>>>>>> Will non-graceful shutdown wait for non-daemon threads or terminate
>>>>>>>>>> immediately (call System.exit())?
>>>>>>>>> It will execute the same way it does today (all services will shut
>>>>>>>>> down and then the server will exit).
>>>>>>>>>
>>>>>>>>> Stuart
>>>>>>>>>
>>>>>>>>>>> In domain mode these operations will be added to both individual
>>>>>>>>>>> servers and complete server groups.
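For the server-group case, here is a hypothetical sketch that fans the operation out to every member. It completes when all members have suspended. ManagedServer and ServerGroup are assumptions standing in for whatever per-server handle the host controller would actually use.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical per-server handle; in a real domain this would be a remote call.
interface ManagedServer {
    CompletableFuture<Void> suspend(long timeoutMillis);
    void resume();
}

// Hypothetical group-level operation built from the per-server ones.
class ServerGroup {
    private final List<ManagedServer> servers;

    ServerGroup(List<ManagedServer> servers) {
        this.servers = List.copyOf(servers);
    }

    // Suspend every member in parallel; completes when all have suspended.
    CompletableFuture<Void> suspend(long timeoutMillis) {
        CompletableFuture<?>[] all = servers.stream()
                .map(server -> server.suspend(timeoutMillis))
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(all);
    }

    void resume() {
        servers.forEach(ManagedServer::resume);
    }
}
```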
>>>>>>>>>>>
>>>>>>>>>>> Implementation Details
>>>>>>>>>>>
>>>>>>>>>>> Suspend/resume operates on entry points to the server. Any request
>>>>>>>>>>> that is currently running must not be affected by the suspend state;
>>>>>>>>>>> however, any new request should be rejected. In general, subsystems
>>>>>>>>>>> will track the number of outstanding requests, and when this hits
>>>>>>>>>>> zero they are considered suspended.
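A minimal sketch of the per-subsystem request tracking described above. It assumes a hypothetical EntryPointTracker with an AtomicInteger in-flight count. In-flight requests run to completion, and new ones are rejected once suspend() is called. The drained callback fires once per suspend() call, when the count reaches zero.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical entry-point guard for one subsystem.
class EntryPointTracker {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicBoolean drainedSignalled = new AtomicBoolean();
    private volatile boolean suspended = false;
    private volatile Runnable onDrained = () -> { };

    // Call before handling a request; false means reject it (e.g. with a 503).
    boolean tryEnter() {
        if (suspended) {
            return false;
        }
        inFlight.incrementAndGet();
        if (suspended) { // suspend() ran concurrently: back out and reject
            exit();
            return false;
        }
        return true;
    }

    // Call when an admitted request finishes.
    void exit() {
        inFlight.decrementAndGet();
        maybeSignalDrained();
    }

    void suspend(Runnable whenDrained) {
        onDrained = whenDrained;
        drainedSignalled.set(false);
        suspended = true;
        maybeSignalDrained(); // nothing in flight: suspended immediately
    }

    void resume() {
        suspended = false;
    }

    private void maybeSignalDrained() {
        if (suspended && inFlight.get() == 0 && drainedSignalled.compareAndSet(false, true)) {
            onDrained.run();
        }
    }
}
```

The re-check after incrementing closes the race where a request slips in while suspend() is running. Such a request backs out and is rejected instead of being counted as in flight.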
>>>>>>>>>>>
>>>>>>>>>>> We will introduce the notion of a global SuspendController that
>>>>>>>>>>> manages the server's suspend state. All subsystems that wish to do a
>>>>>>>>>>> graceful shutdown register callback handlers with this controller.
>>>>>>>>>>>
>>>>>>>>>>> When the suspend() operation is invoked, the controller will invoke
>>>>>>>>>>> all these callbacks, letting each subsystem know that the server is
>>>>>>>>>>> suspending, and providing the subsystem with a SuspendContext object
>>>>>>>>>>> that the subsystem can then use to notify the controller that the
>>>>>>>>>>> suspend is complete.
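A sketch of the SuspendController / SuspendContext interaction. The two names come from the proposal, but every signature here is an assumption for illustration, and so is the SuspendCallback interface. The controller invokes each registered callback with its own context. It moves to SUSPENDED when the last context reports completion.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical: handed to each subsystem, which calls complete() when done.
interface SuspendContext {
    void complete();
}

// Hypothetical: registered by subsystems that want a graceful shutdown.
interface SuspendCallback {
    void suspend(SuspendContext context);
    void resume();
}

class SuspendController {
    enum State { RUNNING, SUSPENDING, SUSPENDED }

    private final List<SuspendCallback> callbacks = new CopyOnWriteArrayList<>();
    private State state = State.RUNNING;
    private long generation = 0; // distinguishes one suspend() call from the next

    void register(SuspendCallback callback) {
        callbacks.add(callback);
    }

    synchronized State state() {
        return state;
    }

    synchronized void suspend() {
        if (state != State.RUNNING) {
            return;
        }
        state = State.SUSPENDING;
        long gen = ++generation;
        List<SuspendCallback> snapshot = List.copyOf(callbacks);
        if (snapshot.isEmpty()) {
            state = State.SUSPENDED;
            return;
        }
        AtomicInteger remaining = new AtomicInteger(snapshot.size());
        for (SuspendCallback callback : snapshot) {
            AtomicBoolean done = new AtomicBoolean();
            callback.suspend(() -> {
                // Ignore repeated complete() calls from the same subsystem.
                if (done.compareAndSet(false, true) && remaining.decrementAndGet() == 0) {
                    markSuspended(gen);
                }
            });
        }
    }

    private synchronized void markSuspended(long gen) {
        // A completion from an earlier, already-resumed suspend is ignored.
        if (gen == generation && state == State.SUSPENDING) {
            state = State.SUSPENDED;
        }
    }

    synchronized void resume() {
        state = State.RUNNING;
        callbacks.forEach(SuspendCallback::resume);
    }
}
```

The per-suspend generation number stops a late completion from an earlier suspend, one that has already been resumed, from marking the server SUSPENDED.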
>>>>>>>>>>>
>>>>>>>>>>> What the subsystem does when it receives a suspend command, and when
>>>>>>>>>>> it considers itself suspended, will vary, but in the common case it
>>>>>>>>>>> will immediately start rejecting external requests (e.g. Undertow
>>>>>>>>>>> will start responding with a 503 to all new requests). The subsystem
>>>>>>>>>>> will also track the number of outstanding requests, and when this
>>>>>>>>>>> hits zero the subsystem will notify the controller that it has
>>>>>>>>>>> successfully suspended.
>>>>>>>>>>> Some subsystems will obviously want to do other actions on suspend,
>>>>>>>>>>> e.g. clustering will likely want to fail over, mod_cluster will
>>>>>>>>>>> notify the load balancer that the node is no longer available, etc.
>>>>>>>>>>> In some cases we may want to make this configurable to an extent
>>>>>>>>>>> (e.g. Undertow could be configured to allow requests with an
>>>>>>>>>>> existing session, and not consider itself suspended until all
>>>>>>>>>>> sessions have either timed out or been invalidated, although this
>>>>>>>>>>> will obviously take a while).
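A sketch of the configurable Undertow behaviour mentioned above. SuspendAwareHttpPolicy is hypothetical and is not Undertow's API; status codes are plain ints. With allowExistingSessions enabled, requests carrying a live session are still admitted. The subsystem treats itself as suspended only once those sessions have ended.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical HTTP entry-point policy; not Undertow's actual API.
class SuspendAwareHttpPolicy {
    private final boolean allowExistingSessions; // the configurable option
    private final Set<String> liveSessions = ConcurrentHashMap.newKeySet();
    private volatile boolean suspended = false;

    SuspendAwareHttpPolicy(boolean allowExistingSessions) {
        this.allowExistingSessions = allowExistingSessions;
    }

    void sessionCreated(String id) {
        liveSessions.add(id);
    }

    // Called when a session times out or is invalidated.
    void sessionEnded(String id) {
        liveSessions.remove(id);
    }

    void suspend() {
        suspended = true;
    }

    void resume() {
        suspended = false;
    }

    // sessionId is null for requests without a session.
    int statusFor(String sessionId) {
        if (!suspended) {
            return 200;
        }
        if (allowExistingSessions && sessionId != null && liveSessions.contains(sessionId)) {
            return 200; // existing users may finish their work
        }
        return 503; // Service Unavailable for everyone else
    }

    // Suspended only once nothing is in flight and no request can still be admitted.
    boolean fullySuspended(int inFlightRequests) {
        return suspended && inFlightRequests == 0
                && (!allowExistingSessions || liveSessions.isEmpty());
    }
}
```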
>>>>>>>>>>>
>>>>>>>>>>> If anyone has any feedback, let me know. In terms of implementation,
>>>>>>>>>>> my basic plan is to get the core functionality and the Undertow
>>>>>>>>>>> implementation into WildFly, and then work with subsystem authors to
>>>>>>>>>>> implement subsystem-specific functionality once the core is in
>>>>>>>>>>> place.
>>>>>>>>>>>
>>>>>>>>>>> Stuart
>>>>>>>>>>>
_______________________________________________
>>>>>>>>>>> wildfly-dev mailing list
>>>>>>>>>>> wildfly-dev(a)lists.jboss.org
>>>>>>>>>>>
https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>>>>>>>>>
