[wildfly-dev] Design Proposal: Server suspend/resume (AKA Graceful Shutdown)
Michael Musgrove
mmusgrov at redhat.com
Tue Jun 10 05:51:02 EDT 2014
I agree with Stuart, it should wait for the transaction to finish before
shutting down.
And yes (with caveats), Jason: when the timeout is reached our transaction reaper will abort the transaction. However, a transaction that was started with a timeout value of 0 will never be aborted. Also, if the suspend happens while there are prepared transactions, it is too late to cancel them, and they will be recovered when the system is resumed.
Note also that suspending before an in-flight transaction has prepared is probably safe, since the resource will either:
- roll back the branch if all connections to the db are closed (when the system suspends); or
- roll back the branch if the XAResource timeout (set via XAResource.setTransactionTimeout()) is reached.
[And since it was never prepared we have no log record for it, so we would not do anything on resume.]
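To make the cases above concrete, here is a minimal Java sketch that classifies what happens to a transaction that is in flight when the suspend arrives. The type and method names (TxPhase, SuspendOutcome, SuspendTxPolicy.outcomeFor) are hypothetical and exist only for this illustration; they are not part of the transaction manager's API.

```java
import java.util.Objects;

/** Hypothetical model: where the transaction is when suspend is requested. */
enum TxPhase { ACTIVE, PREPARED }

/** Hypothetical outcome names, invented for this sketch. */
enum SuspendOutcome {
    REAPER_ABORTS_AT_TIMEOUT, // active, timeout > 0: the reaper aborts it once the timeout is reached
    NEVER_ABORTED,            // active, timeout == 0: the reaper never aborts it
    RECOVERED_ON_RESUME       // prepared: too late to cancel, recovered when the system resumes
}

final class SuspendTxPolicy {
    private SuspendTxPolicy() {}

    /** Classifies an in-flight transaction according to the three cases described above. */
    static SuspendOutcome outcomeFor(TxPhase phase, int timeoutSeconds) {
        Objects.requireNonNull(phase, "phase");
        if (timeoutSeconds < 0) {
            throw new IllegalArgumentException("timeoutSeconds must be >= 0");
        }
        if (phase == TxPhase.PREPARED) {
            return SuspendOutcome.RECOVERED_ON_RESUME;
        }
        return timeoutSeconds == 0
                ? SuspendOutcome.NEVER_ABORTED
                : SuspendOutcome.REAPER_ABORTS_AT_TIMEOUT;
    }
}
```

For example, SuspendTxPolicy.outcomeFor(TxPhase.ACTIVE, 0) yields NEVER_ABORTED, which is the case a suspend timeout would need to handle explicitly.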
Mike
> IIRC the behavior for a tx timeout is a rollback, but we should check that.
>
>> On Jun 9, 2014, at 6:50 PM, Stuart Douglas <stuart.w.douglas at gmail.com> wrote:
>>
>> Something I forgot to mention is that we will need a switch to turn this off, as there is a small but noticeable cost to tracking in-flight requests.
>>
>>
>>> On 9 Jun 2014, at 18:04, Andrig Miller <anmiller at redhat.com> wrote:
>>>
>>> What I am bringing up is more subsystem specific, but it might be valuable to think about. If the graceful shutdown times out, what behavior would we consider correct for an in-flight transaction?
>> It waits for the transaction to finish before shutting down.
>>
>> Stuart
>>
>>> Should it be a forced rollback, so that when the server is started back up, the transaction manager will not find in the log a transaction to be recovered?
>>>
>>> Or should it be considered the same as a crashed state, where transactions should be recoverable, and the recovery manager would try to recover the transaction?
>>>
>>> I would lean towards the first, as this would be considered graceful by the administrator; having a transaction left in a state where it would be recovered on a restart doesn't seem graceful to me.
>>>
>>> Andy
>>>
>>> ----- Original Message -----
>>>> From: "Stuart Douglas" <sdouglas at redhat.com>
>>>> To: "Wildfly Dev mailing list" <wildfly-dev at lists.jboss.org>
>>>> Sent: Monday, June 9, 2014 4:38:55 PM
>>>> Subject: [wildfly-dev] Design Proposal: Server suspend/resume (AKA Graceful Shutdown)
>>>>
>>>> Server suspend and resume is a feature that allows a running server to
>>>> gracefully finish off all running requests. The most common use case for
>>>> this is graceful shutdown, where you would like a server to complete all
>>>> running requests, reject any new ones, and then shut down. However, there
>>>> are also plenty of other valid use cases (e.g. suspend the server,
>>>> modify a data source or some other config, then resume).
>>>>
>>>> User View:
>>>>
>>>> From the user's point of view, two new operations will be added to the
>>>> server:
>>>>
>>>> suspend(timeout)
>>>> resume()
>>>>
>>>> A runtime-only attribute suspend-state (is this a good name?) will also
>>>> be added, which can take one of three possible values: RUNNING,
>>>> SUSPENDING, SUSPENDED.
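As an illustration of how the two operations and the suspend-state attribute could fit together, here is a minimal Java sketch. The proposal only names the three values; the transitions shown (and the class and method names) are assumptions made for this example, and the timeout handling of suspend(timeout) is deliberately left out.

```java
/** The three values named in the proposal for the suspend-state attribute. */
enum SuspendState { RUNNING, SUSPENDING, SUSPENDED }

/** Hypothetical state holder; the transition rules are assumptions of this sketch. */
final class SuspendStateMachine {
    private SuspendState state = SuspendState.RUNNING;

    synchronized SuspendState state() {
        return state;
    }

    /** suspend(): RUNNING -> SUSPENDING; ignored in any other state. */
    synchronized void suspend() {
        if (state == SuspendState.RUNNING) {
            state = SuspendState.SUSPENDING;
        }
    }

    /** Called once outstanding work has drained: SUSPENDING -> SUSPENDED. */
    synchronized void suspendComplete() {
        if (state == SuspendState.SUSPENDING) {
            state = SuspendState.SUSPENDED;
        }
    }

    /** resume(): returns to RUNNING from SUSPENDING or SUSPENDED. */
    synchronized void resume() {
        state = SuspendState.RUNNING;
    }
}
```

A management client polling the attribute would see RUNNING, then SUSPENDING while requests drain, then SUSPENDED, and RUNNING again after resume().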
>>>>
>>>> A timeout attribute will also be added to the shutdown operation. If
>>>> this is present then the server will first be suspended, and the server
>>>> will not shut down until either the suspend is successful or the timeout
>>>> occurs. If no timeout parameter is passed to the operation then a normal
>>>> non-graceful shutdown will take place.
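The shutdown behavior described above could be sketched in Java roughly as follows. The class GracefulShutdown and its methods are hypothetical; a null timeout stands in for "no timeout parameter passed", and a CountDownLatch stands in for whatever mechanism reports that the suspend has completed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of shutdown(timeout): suspend first, then stop. */
final class GracefulShutdown {
    private final CountDownLatch suspended = new CountDownLatch(1);
    private volatile boolean stopped;

    /** Called by the (assumed) suspend machinery once all requests have drained. */
    void suspendComplete() {
        suspended.countDown();
    }

    /**
     * @param timeoutMillis null means no timeout was passed, i.e. a normal non-graceful shutdown
     * @return true if the suspend completed before the server stopped
     */
    boolean shutdown(Long timeoutMillis) {
        boolean graceful = false;
        if (timeoutMillis != null) {
            try {
                // Wait until the suspend succeeds or the timeout occurs, whichever is first.
                graceful = suspended.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt, stop anyway
            }
        }
        stopped = true;
        return graceful;
    }

    boolean isStopped() {
        return stopped;
    }
}
```

Either way the server stops; the return value only records whether the wait ended because the suspend succeeded or because the timeout elapsed.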
>>>>
>>>> In domain mode these operations will be added to both individual servers
>>>> and complete server groups.
>>>>
>>>> Implementation Details
>>>>
>>>> Suspend/resume operates on entry points to the server. Any request that
>>>> is currently running must not be affected by the suspend state; however,
>>>> any new request should be rejected. In general, subsystems will track the
>>>> number of outstanding requests, and when this hits zero they are
>>>> considered suspended.
>>>>
>>>> We will introduce the notion of a global SuspendController that manages
>>>> the server's suspend state. All subsystems that wish to do a graceful
>>>> shutdown register callback handlers with this controller.
>>>>
>>>> When the suspend() operation is invoked the controller will invoke all
>>>> these callbacks, letting the subsystem know that the server is
>>>> suspending, and providing the subsystem with a SuspendContext object that
>>>> the subsystem can then use to notify the controller that the suspend is
>>>> complete.
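A minimal Java sketch of the controller/callback handshake described above follows. Only the names SuspendController and SuspendContext come from the proposal; the SuspendCallback interface, every method name, and the counting scheme are assumptions made for illustration, and concurrent suspend() calls are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Handed to each subsystem so it can report that its suspend is complete. */
interface SuspendContext {
    void suspendComplete();
}

/** Hypothetical callback a subsystem registers with the controller. */
interface SuspendCallback {
    void suspending(SuspendContext context);
}

final class SuspendController {
    private final List<SuspendCallback> callbacks = new CopyOnWriteArrayList<>();
    private final AtomicInteger pending = new AtomicInteger();
    private volatile boolean suspended;

    void register(SuspendCallback callback) {
        callbacks.add(callback);
    }

    /** Notifies every registered subsystem; the server is suspended once all have reported back. */
    void suspend() {
        suspended = false;
        List<SuspendCallback> snapshot = new ArrayList<>(callbacks);
        if (snapshot.isEmpty()) {
            suspended = true;
            return;
        }
        pending.set(snapshot.size());
        for (SuspendCallback callback : snapshot) {
            AtomicBoolean done = new AtomicBoolean(); // makes each context idempotent
            callback.suspending(() -> {
                if (done.compareAndSet(false, true) && pending.decrementAndGet() == 0) {
                    suspended = true;
                }
            });
        }
    }

    boolean isSuspended() {
        return suspended;
    }
}
```

A subsystem may call suspendComplete() synchronously inside the callback, or hold on to the context and call it later from another thread once its outstanding work has drained.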
>>>>
>>>> What the subsystem does when it receives a suspend command, and when it
>>>> considers itself suspended, will vary, but in the common case it will
>>>> immediately start rejecting external requests (e.g. Undertow will start
>>>> responding with a 503 to all new requests). The subsystem will also
>>>> track the number of outstanding requests, and when this hits zero the
>>>> subsystem will notify the controller that it has successfully suspended.
>>>> Some subsystems will obviously want to do other actions on suspend, e.g.
>>>> clustering will likely want to fail over, mod_cluster will notify the
>>>> load balancer that the node is no longer available, etc. In some cases we
>>>> may want to make this configurable to an extent (e.g. Undertow could be
>>>> configured to allow requests with an existing session, and not consider
>>>> itself suspended until all sessions have either timed out or been
>>>> invalidated, although this will obviously take a while).
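The common case described above (reject new requests, count outstanding ones, report when the count hits zero) might look like the following Java sketch. EntryPointGate and its methods are hypothetical names; the 503 and 200 values are plain integers standing in for HTTP responses rather than calls into Undertow.

```java
/** Hypothetical per-subsystem gate that tracks in-flight requests at an entry point. */
final class EntryPointGate {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    private int active;          // number of outstanding requests
    private boolean suspending;  // true once a suspend has been requested
    private Runnable onDrained;  // notifies the controller when active reaches zero

    /** Admits a request unless the gate is suspending. */
    synchronized boolean tryBegin() {
        if (suspending) {
            return false;
        }
        active++;
        return true;
    }

    /** Marks a request as finished and fires the drained notification if it was the last one. */
    void end() {
        Runnable notify = null;
        synchronized (this) {
            active--;
            if (suspending && active == 0) {
                notify = onDrained;
                onDrained = null;
            }
        }
        if (notify != null) {
            notify.run();
        }
    }

    /** Starts rejecting new requests; running requests are left alone. */
    void suspend(Runnable drained) {
        boolean alreadyIdle;
        synchronized (this) {
            suspending = true;
            alreadyIdle = (active == 0);
            if (!alreadyIdle) {
                onDrained = drained;
            }
        }
        if (alreadyIdle) {
            drained.run();
        }
    }

    synchronized void resume() {
        suspending = false;
        onDrained = null;
    }

    /** Runs the work if admitted, otherwise answers 503. */
    int handle(Runnable work) {
        if (!tryBegin()) {
            return SERVICE_UNAVAILABLE;
        }
        try {
            work.run();
            return OK;
        } finally {
            end();
        }
    }
}
```

The counter is touched on every request, which is the tracking cost mentioned elsewhere in this thread and the reason a switch to disable the feature may be wanted.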
>>>>
>>>> If anyone has any feedback, let me know. In terms of implementation, my
>>>> basic plan is to get the core functionality and the Undertow
>>>> implementation into WildFly, and then work with subsystem authors to
>>>> implement subsystem-specific functionality once the core is in place.
>>>>
>>>> Stuart
>>>>
>>>> _______________________________________________
>>>> wildfly-dev mailing list
>>>> wildfly-dev at lists.jboss.org
>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
--
Michael Musgrove
Transactions Team
e: mmusgrov at redhat.com
t: +44 191 243 0870