tl;dr question is how to disable 'graceful startup'. Skip the background if you know what that means. :)

Background


Back in 2016, when we added the feature to allow a server to be started in 'suspended' state[1], that work also included a fix for the longstanding bug whereby, during server start, endpoints would be started and accepting external requests before all the services (e.g. from deployments) had started. As a result, requests could reach the still-starting server and fail, e.g. HTTP requests might get a 404 or some variety of 500.

I refer to this bug fix as 'graceful startup'.

Since the fix was introduced we've gotten quite a number of requests to be able to turn off that bug fix, e.g. WFCORE-4291.[2] The scenario is that users deploy two apps, where app A during start makes an *external* request to app B and won't complete start until that request is handled, and they deploy both A and B in the same server. The server won't allow the external request during boot, so A won't complete start and thus the overall server start hangs until timeout.
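
To make the scenario concrete, here is a toy model of it in Java. This is only a sketch, not WildFly code: the Gate class, the gracefulStartup flag, and the 503 status used for a rejected request are all assumptions made up for illustration. In the real server A would block until the boot timeout rather than fail immediately.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of the boot-time request gate; hypothetical, not WildFly code. */
public class StartupGateSketch {

    /** Hypothetical stand-in for the server's external endpoint gate. */
    static final class Gate {
        // Hypothetical global switch: true = 'graceful startup' is in effect.
        private final boolean gracefulStartup;
        private final AtomicBoolean serverStarted = new AtomicBoolean(false);

        Gate(boolean gracefulStartup) {
            this.gracefulStartup = gracefulStartup;
        }

        /** Called once every service in the server has started. */
        void markStarted() {
            serverStarted.set(true);
        }

        /** Returns an HTTP-like status code for an external request. */
        int handleExternalRequest(boolean targetDeploymentStarted) {
            if (gracefulStartup && !serverStarted.get()) {
                // Assumed status: the server as a whole is still starting.
                return 503;
            }
            // Without the gate, the outcome depends on the target deployment.
            return targetDeploymentStarted ? 200 : 404;
        }
    }

    /** App A can only finish starting if its external call to app B succeeds. */
    static boolean startAppA(Gate gate, boolean appBStarted) {
        return gate.handleExternalRequest(appBStarted) == 200;
    }

    public static void main(String[] args) {
        // App B is already up, but the server as a whole is still booting.
        System.out.println("graceful startup on:  A starts = "
                + startAppA(new Gate(true), true));
        System.out.println("graceful startup off: A starts = "
                + startAppA(new Gate(false), true));
    }
}
```

With the gate on, A's request is rejected even though B is already up, because the server as a whole hasn't finished starting, and it can't finish until A does. With the gate off the request gets through, but a request arriving before B is up would see the old 404-style failure again.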

I consider this kind of deployment pattern to be a bit of an anti-pattern, but we've gotten enough requests to allow it that I'm looking into how to satisfy it. Also, at least for HTTP requests, mod_cluster can be used to prevent external requests reaching a server before things are ready, so if the 'internal' requests were not sent through the LB there's at least one 'error free' use case for this.


The Question

Question is whether to 

a) have an overall config switch to disable graceful startup across the board (e.g. a new value for the --start-mode cmd line param passed to standalone.sh)

b) have a subsystem specific setting in the undertow subsystem that configures undertow to allow requests in during boot.

Pros of a)

* Other request patterns are also handled. For example, if our app A was making a remote EJB call to app B, then an undertow-only setting won't handle it. If we start adding multiple per-subsystem flags it gets ugly.
* Requests to web applications may still fail, as there are other aspects of the server that are rejecting certain calls until 'graceful startup' is complete. For example ee-concurrency rejects adding scheduled tasks (although that is somewhat a bug[3]), and the XTS integration looks to be designed to reject certain requests.[4] There may be others. If we make web requests an exceptional pattern, going forward we have to account for that pattern in everything.
* The undertow subsystem itself has two different mechanisms for rejecting requests, with three different call patterns, all of which would need to be adapted.

Pros of b)

* It limits the change to the HTTP use case, the one where we know mod_cluster can be used to prevent external requests.
* I'm not sure about the batch subsystem; i.e. whether it is ok to have batch jobs starting before server start is complete. If the relevant services all have MSC dependencies on everything they need it should be ok. If not, there needs to be some adaptation to listen for when the server is fully started, which seems doable.
* There may be code that is using this 'graceful startup' as a way not to prevent end user activity, but to prevent premature internal server activity. I think RecoverySuspendController may be an example of this; i.e. preventing start of the tx recovery thread until the server is started. But for this kind of thing there are other, better solutions.


Right now my preference is a), a global switch. If we're doing this I'm not inclined to limit it to HTTP only as I expect we'll just have to revisit it later. And I think I know how to deal with the more technical pros of the http-only approach.

WDYT?




Best regards,
Brian