Apologies for the delay in replying; I forgot to check this folder (ping me directly
in future too) ;)
On 8 Jul 2014, at 05:34, Stuart Douglas <stuart.w.douglas(a)gmail.com> wrote:
Mark Little wrote:
> Are we talking about really shutting down the app server or just suspending it, with
> a view to resuming activity later? The answer is different for these two scenarios.
>
It could be either.
> If we just consider the former, then since the transaction system can cope with a
> crash failure and ensure consistency, technically we could just SIGKILL (or equivalent)
> and let it tidy up later. Of course we don't want to do that because we could end up
> with heuristics etc. So I assume what we're trying to achieve is as graceful a
> shutdown as possible, reducing the number of log recoveries and amount of tidying up we
> need to do.
In the case of a fully graceful shutdown the transaction subsystem should not have to do
anything: new requests will be rejected at the connector (or entry point) level, and
existing requests will continue to run as normal. The question is what to do if the
requests don't finish within some timeout; in that case, from a transactions point of
view, I think recovery is inevitable.
The transaction system should be able to treat this just as a crash failure. Though of
course a graceful shutdown should be attempted first.
>
> We should wait for the transaction to terminate and prevent any new transactions from
> starting. There's a mechanism in Narayana which allows it to be started/stopped (or
> suspended/resumed) so that no new transactions can be created once the switch is flicked.
> Those inflight transactions that haven't been prepared *could* have setRollbackOnly
> called on them to force them to end quicker (potentially add a JBoss specific option to
> change the timeout value as well - speed up time, but that would be risky). Those
> transactions which get past prepare need to be waited upon, but obviously there's a
> possibility they could block. We have some interrupt code in the transaction reaper to
> help there and that should be sufficient.
>
> Using a combination of the above would help to shorten the time to wait in the case
> there are a lot of inflight transactions.
The problem is that this should only happen after the initial timeout has been hit.
Before that the server has to work as normal for all currently running requests.
You could potentially do this after the initial timeout has been hit, but then you would
need a second timeout to see if it has worked. This makes me think it is not worth the
extra complexity, especially given that it is not going to be very graceful anyway.
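
For anyone who wants to see the two-timeout idea spelled out, here is a rough Java
sketch. It is not Narayana or WildFly code: TxShutdownSketch, Tx, begin(), finish() and
shutdown() are made-up names, and the rollbackOnly flag only stands in for calling
setRollbackOnly on a real transaction. It assumes requests run as normal until the first
timeout, after which unprepared transactions are marked rollback-only and a second
timeout decides whether that worked.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: none of these types are Narayana or WildFly APIs. */
final class TxShutdownSketch {

    enum State { ACTIVE, PREPARED }

    /** Stand-in for an in-flight transaction. */
    static final class Tx {
        volatile State state = State.ACTIVE;
        volatile boolean rollbackOnly; // stands in for setRollbackOnly()
    }

    private final Set<Tx> inFlight = ConcurrentHashMap.newKeySet();
    private boolean accepting = true; // guarded by "this"

    /** Starts a transaction, or refuses once the switch has been flicked. */
    synchronized Tx begin() {
        if (!accepting) {
            throw new IllegalStateException("transaction creation is suspended");
        }
        Tx tx = new Tx();
        inFlight.add(tx);
        return tx;
    }

    /** Called when a transaction has committed or rolled back. */
    void finish(Tx tx) {
        inFlight.remove(tx);
    }

    private synchronized void stopAccepting() {
        accepting = false;
    }

    /** Polls until nothing is in flight or the timeout elapses. */
    private boolean awaitDrained(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!inFlight.isEmpty()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /** Returns true if everything drained; false means recovery must tidy up. */
    boolean shutdown(long firstTimeoutMillis, long secondTimeoutMillis) {
        stopAccepting();                          // no new transactions
        if (awaitDrained(firstTimeoutMillis)) {   // run as normal until the first timeout
            return true;
        }
        for (Tx tx : inFlight) {                  // then hurry along the unprepared ones
            if (tx.state == State.ACTIVE) {
                tx.rollbackOnly = true;
            }
        }
        return awaitDrained(secondTimeoutMillis); // second timeout: did it work?
    }
}
```

A false return from shutdown() corresponds to the case where the process goes down anyway
and whatever is left is treated like a crash and handed to recovery.
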
In HP-AS (HP Application Server) "back in the day", we simply prevented the
creation of further transactions immediately the app server was going down, ran through
termination of other services such as CORBA etc. and then allowed the process to
terminate. Any inflight transactions were tidied up by recovery.
The other stuff I mentioned, such as setRollbackOnly, would only be applicable if we
wanted to provide a more graceful way for applications to terminate or to prevent
heuristics, which could happen if there's a failure after the prepare phase. The
former is a nice-to-have, whereas the latter is definitely something we should be worried
about.
Mark.
Stuart
>
> Mark.
>
>
> On 10 Jun 2014, at 14:13, Stuart Douglas<sdouglas(a)redhat.com> wrote:
>
>>> If the transaction will never abort, can we force a rollback? This could
>>> lead to a never ending "graceful" shutdown.
>>>
>> That is why we have a timeout: once the timeout expires the server
>> will shut down anyway. We should probably have some kind of
>> post-timeout callback that gets invoked, so the TX subsystem could
>> potentially take some action.
>>
>> I'm not sure what the best action would be; the subsystem specific
>> details will be up to the team that maintains the subsystem.
>>
>> Stuart
>>
>>> Andy
>>>
>>>> Note also that suspending before an in-flight transaction has
>>>> prepared is probably safe since the resource will either:
>>>>
>>>> - roll back the branch if all connections to the db are closed (when
>>>> the system suspends); or
>>>> - roll back the branch if the XAResource timeout (set via
>>>> XAResource.setTransactionTimeout()) is reached
>>>>
>>>> [And since it was never prepared we have no log record for it so we
>>>> would not do anything on resume]
>>>>
>>>> Mike
>>>>
>>>>> IIRC the behavior for a tx timeout is a rollback, but we should
>>>>> check that.
>>>>>
>>>>>> On Jun 9, 2014, at 6:50 PM, Stuart Douglas
>>>>>> <stuart.w.douglas(a)gmail.com> wrote:
>>>>>>
>>>>>> Something I forgot to mention is that we will need a switch to
>>>>>> turn this off, as there is a small but noticeable cost with
>>>>>> tracking in flight requests.
>>>>>>
>>>>>>
>>>>>>> On 9 Jun 2014, at 18:04, Andrig Miller <anmiller(a)redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> What I am bringing up is more subsystem specific, but it might
>>>>>>> be valuable to think about. In case of the time out of the
>>>>>>> graceful shutdown, what behavior would we consider correct in
>>>>>>> terms of an inflight transaction?
>>>>>> It waits for the transaction to finish before shutting down.
>>>>>>
>>>>>> Stuart
>>>>>>
>>>>>>> Should it be a forced rollback, so that when the server is
>>>>>>> started back up, the transaction manager will not find in the
>>>>>>> log a transaction to be recovered?
>>>>>>>
>>>>>>> Or, should it be considered the same as a crashed state, where
>>>>>>> transactions should be recoverable, and the recovery manager
>>>>>>> would try to recover the transaction?
>>>>>>>
>>>>>>> I would lean towards the first, as this would be considered
>>>>>>> graceful by the administrator, and having a transaction be in a
>>>>>>> state where it would be recovered on a restart doesn't seem
>>>>>>> graceful to me.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Stuart Douglas" <sdouglas(a)redhat.com>
>>>>>>>> To: "Wildfly Dev mailing list" <wildfly-dev(a)lists.jboss.org>
>>>>>>>> Sent: Monday, June 9, 2014 4:38:55 PM
>>>>>>>> Subject: [wildfly-dev] Design Proposal: Server suspend/resume
>>>>>>>> (AKA Graceful Shutdown)
>>>>>>>>
>>>>>>>> Server suspend and resume is a feature that allows a running
>>>>>>>> server to gracefully finish off all running requests. The most
>>>>>>>> common use case for this is graceful shutdown, where you would
>>>>>>>> like a server to complete all running requests, reject any new
>>>>>>>> ones, and then shut down; however, there are also plenty of
>>>>>>>> other valid use cases (e.g. suspend the server, modify a data
>>>>>>>> source or some other config, then resume).
>>>>>>>>
>>>>>>>> User View:
>>>>>>>>
>>>>>>>> From the user's point of view two new operations will be added
>>>>>>>> to the server:
>>>>>>>>
>>>>>>>> suspend(timeout)
>>>>>>>> resume()
>>>>>>>>
>>>>>>>> A runtime only attribute suspend-state (is this a good name?)
>>>>>>>> will also be added, that can take one of three possible values,
>>>>>>>> RUNNING, SUSPENDING, SUSPENDED.
>>>>>>>>
>>>>>>>> A timeout attribute will also be added to the shutdown
>>>>>>>> operation. If this is present then the server will first be
>>>>>>>> suspended, and the server will not shut down until either the
>>>>>>>> suspend is successful or the timeout occurs. If no timeout
>>>>>>>> parameter is passed to the operation then a normal non-graceful
>>>>>>>> shutdown will take place.
>>>>>>>>
>>>>>>>> In domain mode these operations will be added to both
>>>>>>>> individual servers and complete server groups.
>>>>>>>>
>>>>>>>> Implementation Details
>>>>>>>>
>>>>>>>> Suspend/resume operates on entry points to the server. Any
>>>>>>>> request that is currently running must not be affected by the
>>>>>>>> suspend state, however any new request should be rejected. In
>>>>>>>> general subsystems will track the number of outstanding
>>>>>>>> requests, and when this hits zero they are considered
>>>>>>>> suspended.
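
A minimal sketch of that request tracking, in Java. EntryPointGate and its methods are
hypothetical names, not anything in WildFly or Undertow; the assumption is simply that
an entry point calls tryEnter() before admitting a request and exit() when it completes.

```java
/** Hypothetical entry-point gate: counts in-flight requests, rejects new ones on suspend. */
final class EntryPointGate {
    private int outstanding;      // requests admitted and not yet finished
    private boolean suspending;   // true once suspend has been requested
    private Runnable onDrained;   // pending notification, if any

    /** Called at the entry point; false means reject (e.g. answer with a 503). */
    synchronized boolean tryEnter() {
        if (suspending) {
            return false;
        }
        outstanding++;
        return true;
    }

    /** Called when a previously admitted request completes. */
    void exit() {
        Runnable toRun = null;
        synchronized (this) {
            outstanding--;
            if (suspending && outstanding == 0) {
                toRun = onDrained;
                onDrained = null;
            }
        }
        if (toRun != null) {
            toRun.run();          // notify outside the lock
        }
    }

    /** Stops admitting requests; runs the callback once the count hits zero. */
    void suspend(Runnable drained) {
        boolean alreadyIdle;
        synchronized (this) {
            suspending = true;
            alreadyIdle = outstanding == 0;
            if (!alreadyIdle) {
                onDrained = drained;
            }
        }
        if (alreadyIdle) {
            drained.run();
        }
    }

    /** Starts admitting requests again and drops any pending notification. */
    synchronized void resume() {
        suspending = false;
        onDrained = null;
    }
}
```

The sketch takes a lock on every request to keep the counting obviously correct. That
is exactly the kind of per-request cost mentioned elsewhere in this thread, so a real
implementation would presumably use something cheaper.
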
>>>>>>>>
>>>>>>>> We will introduce the notion of a global SuspendController,
>>>>>>>> that manages the server's suspend state. All subsystems that
>>>>>>>> wish to do a graceful shutdown register callback handlers with
>>>>>>>> this controller.
>>>>>>>>
>>>>>>>> When the suspend() operation is invoked the controller will
>>>>>>>> invoke all these callbacks, letting the subsystem know that the
>>>>>>>> server is suspending, and providing the subsystem with a
>>>>>>>> SuspendContext object that the subsystem can then use to notify
>>>>>>>> the controller that the suspend is complete.
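
One way the controller and callback contract could look, as a hedged Java sketch. Only
the names SuspendController and SuspendContext and the three states come from the
proposal; the SuspendHandler interface, the method names and the bookkeeping are
assumptions made for illustration, and the timeout is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the proposed controller; shapes and method names are assumed. */
final class SuspendControllerSketch {

    enum State { RUNNING, SUSPENDING, SUSPENDED }

    /** Handed to each subsystem so it can report that it has finished suspending. */
    interface SuspendContext {
        void complete();
    }

    /** Hypothetical callback a subsystem registers with the controller. */
    interface SuspendHandler {
        void suspend(SuspendContext context);
        void resume();
    }

    private final List<SuspendHandler> handlers = new ArrayList<>();
    private State state = State.RUNNING;
    private int pending;      // handlers that have not yet completed
    private int generation;   // lets us ignore contexts from an earlier suspend

    synchronized void register(SuspendHandler handler) {
        handlers.add(handler);
    }

    synchronized State state() {
        return state;
    }

    synchronized void suspend() {
        if (state != State.RUNNING) {
            return;
        }
        state = State.SUSPENDING;
        final int round = ++generation;
        List<SuspendHandler> snapshot = new ArrayList<>(handlers);
        pending = snapshot.size();
        if (pending == 0) {
            state = State.SUSPENDED;
            return;
        }
        for (SuspendHandler handler : snapshot) {
            boolean[] done = {false};
            handler.suspend(() -> {
                synchronized (this) {
                    // Ignore repeated or stale completions.
                    if (done[0] || round != generation || state != State.SUSPENDING) {
                        return;
                    }
                    done[0] = true;
                    if (--pending == 0) {
                        state = State.SUSPENDED;
                    }
                }
            });
        }
    }

    synchronized void resume() {
        if (state == State.RUNNING) {
            return;
        }
        state = State.RUNNING;
        for (SuspendHandler handler : new ArrayList<>(handlers)) {
            handler.resume();
        }
    }
}
```

A subsystem may call complete() straight away from inside its callback or later from
another thread; the controller only reports SUSPENDED once every registered handler has
done so.
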
>>>>>>>>
>>>>>>>> What the subsystem does when it receives a suspend command, and
>>>>>>>> when it considers itself suspended, will vary, but in the common
>>>>>>>> case it will immediately start rejecting external requests
>>>>>>>> (e.g. Undertow will start responding with a 503 to all new
>>>>>>>> requests). The subsystem will also track the number of
>>>>>>>> outstanding requests, and when this hits zero then the
>>>>>>>> subsystem will notify the controller that it has successfully
>>>>>>>> suspended.
>>>>>>>>
>>>>>>>> Some subsystems will obviously want to do other actions on
>>>>>>>> suspend, e.g. clustering will likely want to fail over,
>>>>>>>> mod_cluster will notify the load balancer that the node is no
>>>>>>>> longer available etc. In some cases we may want to make this
>>>>>>>> configurable to an extent (e.g. Undertow could be configured to
>>>>>>>> allow requests with an existing session, and not consider
>>>>>>>> itself timed out until all sessions have either timed out or
>>>>>>>> been invalidated, although this will obviously take a while).
>>>>>>>>
>>>>>>>> If anyone has any feedback let me know. In terms of
>>>>>>>> implementation my basic plan is to get the core functionality
>>>>>>>> and the Undertow implementation into Wildfly, and then work
>>>>>>>> with subsystem authors to implement subsystem specific
>>>>>>>> functionality once the core is in place.
>>>>>>>>
>>>>>>>> Stuart
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> wildfly-dev mailing list
>>>>>>>> wildfly-dev(a)lists.jboss.org
>>>>>>>> https://lists.jboss.org/mailman/listinfo/wildfly-dev
>>>> --
>>>> Michael Musgrove
>>>> Transactions Team
>>>> e: mmusgrov(a)redhat.com
>>>> t: +44 191 243 0870
>>>>
>>>> Registered in England and Wales under Company Registration No.
>>>> 03798903
>>>> Directors: Michael Cunningham (US), Charles Peters (US), Matt Parson
>>>> (US), Michael O'Neill(Ireland)
>>>>
>>>>
>
> ---
> Mark Little
> mlittle(a)redhat.com
>
> JBoss, by Red Hat
> Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor,
> Berkshire, SI4 1TE, United Kingdom.
> Registered in UK and Wales under Company Registration No. 3798903 Directors: Michael
> Cunningham (USA), Charlie Peters (USA), Matt Parsons (USA) and Michael O'Neill
> (Ireland).
>