[JBoss JIRA] (WFCORE-1225) Domain rollout operation timeouts
by Brian Stansberry (JIRA)
Brian Stansberry created WFCORE-1225:
----------------------------------------
Summary: Domain rollout operation timeouts
Key: WFCORE-1225
URL: https://issues.jboss.org/browse/WFCORE-1225
Project: WildFly Core
Issue Type: Feature Request
Components: Domain Management
Reporter: Brian Stansberry
Assignee: Brian Stansberry
Fix For: 3.0.0.Alpha1
Portion of the parent task related to rolling out changes to a domain.
The expectation is that single process timeouts (WFLY-2741) will handle most failure conditions related to domain rollouts (e.g. if a single server hangs, preventing completion of the rollout, eventually that server will time out, allowing the domain-wide rollout to continue). Timeouts in the domain rollout code serve as a second line of defense:
1) In case of protocol or other problems that prevent the calling process from learning about the timeout on the remote process
2) In case of bugs in the single process timeout handling on the remote process
3) In mixed domain cases where remote hosts are running previous versions and do not have the timeout function
Potential places to add timeouts:
DomainSlaveHandler -> HostControllerUpdateTask.ProxyOperationListener.retrievePreparedOperation()
-- where the master HC waits for responses from slaves
RollingServerGroupUpdateTask.run() -> ServerTaskExecutor.ServerOperationListener.retrievePreparedOperation()
-- a timeout here means one server didn't respond, but we need to move on to the next
ConcurrentServerGroupUpdateTask.run() -> ServerTaskExecutor.ServerOperationListener.retrievePreparedOperation()
-- a timeout here means none of the remaining servers have responded within the timeout
DomainRolloutStepHandler.finalizeOp() -> future.get()
-- the ServerGroupUpdateTask should fail in the normal phase, so any timeout here would indicate a problem committing the tx or a comms problem getting back the response
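The idea of a bounded wait at these call sites could be sketched roughly as follows. This is only an illustration, not WildFly Core code: the class BoundedWaitSketch, the Outcome enum and the awaitPrepared method are hypothetical names, and a plain java.util.concurrent.Future stands in for whatever the listeners actually block on.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: not WildFly Core code.
final class BoundedWaitSketch {

    /** Outcome of waiting for one remote participant. */
    enum Outcome { PREPARED, FAILED, TIMED_OUT }

    /** Waits at most timeoutMillis for the remote result instead of blocking forever. */
    static Outcome awaitPrepared(Future<String> remoteResult, long timeoutMillis) {
        try {
            remoteResult.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.PREPARED;
        } catch (TimeoutException e) {
            // The remote side never answered; stop waiting so the rollout can proceed.
            remoteResult.cancel(true);
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            Future<String> responsive = executor.submit(() -> "prepared");
            Future<String> hung = executor.submit(() -> {
                Thread.sleep(60_000); // simulates a server that never responds
                return "prepared";
            });
            System.out.println(awaitPrepared(responsive, 1_000));
            System.out.println(awaitPrepared(hung, 200));
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In the rolling case a TIMED_OUT outcome would mean moving on to the next server; in the concurrent case it would mean giving up on all servers that are still outstanding.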
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
10 years, 5 months
[JBoss JIRA] (WFCORE-378) Domain rollout operation timeouts
by Brian Stansberry (JIRA)
[ https://issues.jboss.org/browse/WFCORE-378?page=com.atlassian.jira.plugin... ]
Brian Stansberry updated WFCORE-378:
------------------------------------
Issue Type: Bug (was: Feature Request)
> Domain rollout operation timeouts
> ---------------------------------
>
> Key: WFCORE-378
> URL: https://issues.jboss.org/browse/WFCORE-378
> Project: WildFly Core
> Issue Type: Bug
> Components: Domain Management
> Reporter: Brian Stansberry
> Assignee: Brian Stansberry
> Fix For: 3.0.0.Alpha1
[JBoss JIRA] (WFCORE-1225) Domain rollout operation timeouts
by Brian Stansberry (JIRA)
[ https://issues.jboss.org/browse/WFCORE-1225?page=com.atlassian.jira.plugi... ]
Brian Stansberry updated WFCORE-1225:
-------------------------------------
Issue Type: Bug (was: Feature Request)
> Domain rollout operation timeouts
> ---------------------------------
>
> Key: WFCORE-1225
> URL: https://issues.jboss.org/browse/WFCORE-1225
> Project: WildFly Core
> Issue Type: Bug
> Components: Domain Management
> Reporter: Brian Stansberry
> Assignee: Brian Stansberry
> Fix For: 3.0.0.Alpha1
[JBoss JIRA] (WFCORE-1224) Try closing the channel if java.lang.Error prevents sending error responses to the client
by Brian Stansberry (JIRA)
[ https://issues.jboss.org/browse/WFCORE-1224?page=com.atlassian.jira.plugi... ]
Brian Stansberry deleted WFCORE-1224:
-------------------------------------
> Try closing the channel if java.lang.Error prevents sending error responses to the client
> -----------------------------------------------------------------------------------------
>
> Key: WFCORE-1224
> URL: https://issues.jboss.org/browse/WFCORE-1224
> Project: WildFly Core
> Issue Type: Sub-task
> Reporter: Brian Stansberry
> Assignee: Brian Stansberry
[JBoss JIRA] (WFCORE-1223) A java.lang.Error in Stage.DONE can lead to hangs
by Brian Stansberry (JIRA)
Brian Stansberry created WFCORE-1223:
----------------------------------------
Summary: A java.lang.Error in Stage.DONE can lead to hangs
Key: WFCORE-1223
URL: https://issues.jboss.org/browse/WFCORE-1223
Project: WildFly Core
Issue Type: Bug
Components: Domain Management
Affects Versions: 2.0.1.Final
Reporter: Brian Stansberry
Assignee: Brian Stansberry
Fix For: 3.0.0.Alpha1
As part of investigating https://bugzilla.redhat.com/show_bug.cgi?id=1259767 I've looked into general handling of situations when java.lang.Error is thrown during management operation execution.
Handling prior to Stage.DONE looks ok, with the error caught, logged and the failure-description on the response set. But if the error happens after commit or during rollback, it isn't always properly triggering a response. In the intra-domain case the remote node that has the problem tries to send a failure response to the calling HC, but uses the wrong request id, resulting in that response being dropped and the calling HC continuing to wait.
A post-commit Error seems pretty unlikely. A rollback error seems more possible, as a rollback may involve installing significant numbers of services.
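Why a wrong request id leaves the caller waiting can be illustrated with a small model. This is a sketch under assumptions, not the management protocol code: RequestCorrelationSketch, sendRequest and onResponse are hypothetical names, and a map of pending futures stands in for the real request tracking.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of request/response correlation; not the WildFly management protocol code.
final class RequestCorrelationSketch {

    private final Map<Integer, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    /** Registers a request and returns the future the caller will wait on. */
    CompletableFuture<String> sendRequest(int requestId) {
        CompletableFuture<String> response = new CompletableFuture<>();
        pending.put(requestId, response);
        return response;
    }

    /** Delivers a response; returns false if no request with that id is pending. */
    boolean onResponse(int requestId, String payload) {
        CompletableFuture<String> waiter = pending.remove(requestId);
        if (waiter == null) {
            return false; // unknown id: the response is dropped
        }
        waiter.complete(payload);
        return true;
    }

    public static void main(String[] args) {
        RequestCorrelationSketch caller = new RequestCorrelationSketch();
        CompletableFuture<String> waiting = caller.sendRequest(7);

        // The remote node replies with the wrong id, so the reply is dropped.
        System.out.println("delivered: " + caller.onResponse(8, "failed"));
        System.out.println("caller done: " + waiting.isDone());

        // With the correct id the caller is released.
        System.out.println("delivered: " + caller.onResponse(7, "failed"));
        System.out.println("caller done: " + waiting.isDone());
    }
}
```

In this model the failure response sent with id 8 is discarded and the future for request 7 stays incomplete, which corresponds to the calling HC continuing to wait.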
[JBoss JIRA] (WFCORE-1224) Try closing the channel if java.lang.Error prevents sending error responses to the client
by Brian Stansberry (JIRA)
Brian Stansberry created WFCORE-1224:
----------------------------------------
Summary: Try closing the channel if java.lang.Error prevents sending error responses to the client
Key: WFCORE-1224
URL: https://issues.jboss.org/browse/WFCORE-1224
Project: WildFly Core
Issue Type: Sub-task
Components: Domain Management
Reporter: Brian Stansberry
Assignee: Brian Stansberry
Fix For: 3.0.0.Alpha1
Beyond the basic work on WFCORE-1134, look into further reaction when Errors occur and the server or HC cannot even send an error response to the caller. If we get to this point, the task has already failed to handle a problem and now we can't notify the remote side either. Most likely this is an OOME situation, although any Error here means a major issue.
Options:
1) Try to close the channel to disconnect this process from the remote end so it doesn't disrupt the remote end. If this is an intra-HC or HC-server connection, a successful close will remove this process from normal domain control. If this is a server, the HC still has some control over the server via the ProcessController, e.g. to handle a 'kill' or 'destroy' management op. If this is a slave HC, the slave is disconnected from the domain. Either a server or a slave HC may try to reconnect, although it's likely better if that fails and the user just restarts the process.
If the remote side is an end user client (e.g. CLI), then a successful close will be noticed by the client. The user can reconnect or take action to terminate this process.
2) Commit suicide via SystemExiter.exit. But I'm not certain complete termination of the process is how we want to deal with problems with management requests. Probably some sort of configurable policy would be better.
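Option 1 might look roughly like the following. It is a minimal sketch only: the Channel interface, the Reaction enum and reportFailure are invented for illustration and are not the actual remoting or management API.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch of option 1; the Channel interface here is invented for illustration.
final class CloseOnErrorSketch {

    interface Channel extends Closeable {
        void sendFailure(String description) throws IOException;
    }

    enum Reaction { RESPONSE_SENT, CHANNEL_CLOSED, NOTHING_WORKED }

    /** Tries to send the failure response; if even that fails, tries to close the channel. */
    static Reaction reportFailure(Channel channel, String description) {
        try {
            channel.sendFailure(description);
            return Reaction.RESPONSE_SENT;
        } catch (IOException | Error e) {
            // Could not notify the remote side; fall back to disconnecting from it.
            try {
                channel.close();
                return Reaction.CHANNEL_CLOSED;
            } catch (IOException | Error closeFailure) {
                return Reaction.NOTHING_WORKED;
            }
        }
    }

    public static void main(String[] args) {
        final boolean[] closed = {false};
        Channel broken = new Channel() {
            @Override public void sendFailure(String description) {
                throw new OutOfMemoryError("simulated");
            }
            @Override public void close() {
                closed[0] = true;
            }
        };
        System.out.println(reportFailure(broken, "operation failed") + ", closed=" + closed[0]);
    }
}
```

The NOTHING_WORKED outcome is where a configurable policy, such as option 2, would have to take over.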
[JBoss JIRA] (WFCORE-1222) Graceful shutdown is interfering with the 'kill' and 'destroy' operations
by Brian Stansberry (JIRA)
Brian Stansberry created WFCORE-1222:
----------------------------------------
Summary: Graceful shutdown is interfering with the 'kill' and 'destroy' operations
Key: WFCORE-1222
URL: https://issues.jboss.org/browse/WFCORE-1222
Project: WildFly Core
Issue Type: Bug
Components: Domain Management
Affects Versions: 2.0.4.Final
Reporter: Brian Stansberry
Assignee: Brian Stansberry
Fix For: 2.0.5.Final
The 'kill' and 'destroy' operations on the HC are meant to force shutdown of misbehaving servers. But the graceful shutdown work ([1]) has introduced a management op into the mix. I believe that should be removed, as it's not the intent of these operations to try and be graceful; the regular 'stop' ops are for that.
When experimenting with how domains react to OOME servers as part of my WFCORE-378 work I'm seeing 'kill' and 'destroy' no longer function because the OOME on the server means the graceful shutdown management op hangs.
[1] https://github.com/wildfly/wildfly-core/commit/6e95b5#diff-ecdfa997cd57af...
[JBoss JIRA] (WFCORE-1221) Operation headers not propagated to domain servers when 'composite' op is used
by Brian Stansberry (JIRA)
Brian Stansberry created WFCORE-1221:
----------------------------------------
Summary: Operation headers not propagated to domain servers when 'composite' op is used
Key: WFCORE-1221
URL: https://issues.jboss.org/browse/WFCORE-1221
Project: WildFly Core
Issue Type: Bug
Components: Domain Management
Affects Versions: 2.0.4.Final
Reporter: Brian Stansberry
Assignee: Brian Stansberry
Priority: Critical
Fix For: 2.0.5.Final
When the user adds request headers to an op, they are not propagated to the servers during domain rollout if the 'composite' op is involved.
For example, if I add some stdout printing of what the headers are on the various processes and invoke this:
{code}
[domain@localhost:9990 /] deploy ~/tmp/helloworld.war --headers={blocking-timeout=5;rollback-on-runtime-failure=false} --all-server-groups
{code}
Then on a HC with two servers, this is logged:
{code}
[Host Controller] 10:53:40,697 INFO [stdout] (management-handler-thread - 3) "composite" headers: {
[Host Controller] 10:53:40,697 INFO [stdout] (management-handler-thread - 3) "blocking-timeout" => "5",
[Host Controller] 10:53:40,698 INFO [stdout] (management-handler-thread - 3) "rollback-on-runtime-failure" => "false",
[Host Controller] 10:53:40,698 INFO [stdout] (management-handler-thread - 3) "caller-type" => "user",
[Host Controller] 10:53:40,698 INFO [stdout] (management-handler-thread - 3) "access-mechanism" => "NATIVE"
[Host Controller] 10:53:40,698 INFO [stdout] (management-handler-thread - 3) }
[Host Controller] 10:53:40,727 INFO [org.jboss.as.repository] (management-handler-thread - 3) WFLYDR0001: Content added at location /Users/bstansberry/dev/wildfly/wildfly-core/dist/target/wildfly-core-2.0.5.Final-SNAPSHOT/domain/data/content/6f/cd9eae343ed6d5aa9fffa83012d155b1ef911c/content
[Server:server-one] 10:53:40,772 INFO [stdout] (ServerService Thread Pool -- 11) "composite" headers: null
[Server:server-two] 10:53:40,772 INFO [stdout] (ServerService Thread Pool -- 11) "composite" headers: null
{code}
The HC logs the user-specified headers, but both servers report null headers: the user-specified headers were not propagated.
Invoke the same op without the batch and this is logged:
{code}
[Host Controller] 10:43:50,400 INFO [stdout] (management-handler-thread - 4) "composite" headers: {
[Host Controller] 10:43:50,401 INFO [stdout] (management-handler-thread - 4) "blocking-timeout" => "5",
[Host Controller] 10:43:50,401 INFO [stdout] (management-handler-thread - 4) "rollback-on-runtime-failure" => "false",
[Host Controller] 10:43:50,401 INFO [stdout] (management-handler-thread - 4) "caller-type" => "user",
[Host Controller] 10:43:50,401 INFO [stdout] (management-handler-thread - 4) "access-mechanism" => "NATIVE"
[Host Controller] 10:43:50,401 INFO [stdout] (management-handler-thread - 4) }
[Host Controller] 10:43:50,425 INFO [org.jboss.as.repository] (management-handler-thread - 4) WFLYDR0001: Content added at location /Users/bstansberry/dev/wildfly/wildfly-core/dist/target/wildfly-core-2.0.5.Final-SNAPSHOT/domain/data/content/6f/cd9eae343ed6d5aa9fffa83012d155b1ef911c/content
[Server:server-two] 10:43:50,464 INFO [stdout] (ServerService Thread Pool -- 11) "composite" headers: {
[Server:server-two] 10:43:50,464 INFO [stdout] (ServerService Thread Pool -- 11) "blocking-timeout" => "5",
[Server:server-two] 10:43:50,464 INFO [stdout] (ServerService Thread Pool -- 11) "rollback-on-runtime-failure" => "false",
[Server:server-one] 10:43:50,464 INFO [stdout] (ServerService Thread Pool -- 11) "composite" headers: {
[Server:server-two] 10:43:50,464 INFO [stdout] (ServerService Thread Pool -- 11) "access-mechanism" => "NATIVE",
[Server:server-one] 10:43:50,465 INFO [stdout] (ServerService Thread Pool -- 11) "blocking-timeout" => "5",
[Server:server-two] 10:43:50,465 INFO [stdout] (ServerService Thread Pool -- 11) "domain-uuid" => "216d2e99-dba5-4c89-8020-b0c16bd553c5"
[Server:server-one] 10:43:50,465 INFO [stdout] (ServerService Thread Pool -- 11) "rollback-on-runtime-failure" => "false",
[Server:server-two] 10:43:50,465 INFO [stdout] (ServerService Thread Pool -- 11) }
[Server:server-one] 10:43:50,465 INFO [stdout] (ServerService Thread Pool -- 11) "access-mechanism" => "NATIVE",
[Server:server-one] 10:43:50,465 INFO [stdout] (ServerService Thread Pool -- 11) "domain-uuid" => "216d2e99-dba5-4c89-8020-b0c16bd553c5"
[Server:server-one] 10:43:50,465 INFO [stdout] (ServerService Thread Pool -- 11) }
{code}
Expected headers are present.
Note the CLI 'deploy' is far from the only time the 'composite' op is used. Among other places, the high level CLI 'batch' command in a domain involves use of 'composite'.
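The expected behavior, where the operation built for each server carries over the caller's headers, can be sketched as below. This is an illustration under assumptions and not the actual fix: the Op class and toServerOp are hypothetical, and plain collections stand in for the real operation representation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model; plain collections stand in for the real operation representation.
final class HeaderPropagationSketch {

    /** Minimal stand-in for a management operation. */
    static final class Op {
        final String name;
        final Map<String, String> headers = new LinkedHashMap<>();
        final List<Op> steps = new ArrayList<>();

        Op(String name) {
            this.name = name;
        }
    }

    /** Builds the operation sent to a server, copying the caller's headers onto it. */
    static Op toServerOp(Op domainOp) {
        Op serverOp = new Op(domainOp.name);
        serverOp.steps.addAll(domainOp.steps);
        // The propagation step: without this copy the server would see no headers.
        serverOp.headers.putAll(domainOp.headers);
        return serverOp;
    }

    public static void main(String[] args) {
        Op composite = new Op("composite");
        composite.headers.put("blocking-timeout", "5");
        composite.headers.put("rollback-on-runtime-failure", "false");
        composite.steps.add(new Op("add"));

        Op serverOp = toServerOp(composite);
        System.out.println(serverOp.name + " headers: " + serverOp.headers);
    }
}
```

The headers are copied rather than shared so that later changes to the operation on the HC side do not affect what was sent to the server.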