tl;dr question is how to disable 'graceful startup'. Skip the background if
you know what that means. :)
Back in 2016 when we added the feature to allow a server to be started in
'suspended' state, that work also included a fix for the longstanding
bug whereby during server start endpoints would be started and accepting
external requests before all the services (e.g. from deployments) would be
started. The result would be requests could reach the still-starting server
and would fail, e.g. HTTP requests might get a 404 or some variety of 500.
I refer to this bug fix as 'graceful startup'.
Since the fix was introduced we've gotten quite a number of requests to be
able to turn off that bug fix, e.g. WFCORE-4291. The scenario is users
deploy two apps, where app A during start makes an *external* request to
app B and won't complete start until that request is handled. And, the
users deploy both A and B in the same server. The server won't allow the
external request during boot, so A won't complete start and thus the
overall server start hangs until timeout.
I consider this kind of deployment pattern to be a bit of an anti-pattern,
but we've gotten enough request to allow it that I'm looking into how to
satisfy it. Also, at least for HTTP requests, mod_cluster can be used to
prevent external requests reaching a server before things are ready, so if
the 'internal' requests were not sent through the LB there's at least one
'error free' use case for this.
Question is whether to
a) have an overall config switch to disable graceful startup across the
board (e.g. a new value for the --start-mode cmd line param passed to
b) have a subsystem specific setting in the undertow subsystem that
configures undertow to allow requests in during boot.
Pros of a)
* Other request patterns are also handled. For example, if our app A was
making a remote EJB call to app B, then an undertow only setting won't
handle it. If we start adding multiple per-subsystem flags it gets ugly.
* Requests to web applications may still fail, as there are other aspects
of the server that are rejecting certain calls until 'graceful startup' is
complete. For example ee-concurrency rejects adding scheduled tasks
(although that is somewhat a bug), and the XTS integration looks to be
designed to reject certain requests. There may be others. If we have
make web requests an exceptional pattern, going forward we have to account
for that pattern in everything.
* The undertow subsystem itself has two different mechanisms for rejecting
requests, with three different call patterns, all of which would need to be
Pros of b)
* It limits the change to the HTTP use case, the one where we know
mod_cluster can be used to prevent external requests.
* I'm not sure about the batch subsystem; i.e. whether it is ok to have
batch jobs starting before server start is complete. If the relevant
services all have MSC dependencies on everthing they need it should be ok.
If not there needs to be some adaptation listen for when the server is
fully started, which seems doable.
* There may be code that is using this 'graceful startup' as a way not to
prevent end user activity, but to prevent premature internal server
activity. I think RecoverySuspendController may be an example of this; i.e.
preventing start of the tx recovery thread until the server is started. But
for this kind of thing there are other, better solutions.
Right now my preference is a), a global switch. If we're doing this I'm not
inclined to limit it to HTTP only as I expect we'll just have to revisit it
later. And I think I know how to deal with the more technical pros of the