[JBoss JIRA] (WFCORE-1616) ServerSuspendHandler tries to do step completion and rollback handling in a different thread
by Stuart Douglas (JIRA)
[ https://issues.jboss.org/browse/WFCORE-1616?page=com.atlassian.jira.plugi... ]
Stuart Douglas reassigned WFCORE-1616:
--------------------------------------
Assignee: Stuart Douglas
> ServerSuspendHandler tries to do step completion and rollback handling in a different thread
> --------------------------------------------------------------------------------------------
>
> Key: WFCORE-1616
> URL: https://issues.jboss.org/browse/WFCORE-1616
> Project: WildFly Core
> Issue Type: Bug
> Components: Domain Management
> Reporter: Brian Stansberry
> Assignee: Stuart Douglas
>
> ServerSuspendHandler is calling OperationContext.completeStep() and passing in a rollback handler from an org.jboss.as.server.suspend.OperationListener that may be invoked by a different thread. The OperationContext is not intended to be invoked from multiple threads in this way.
> Three things can happen with this setup:
> 1) There is no activity preventing suspend, so the suspend controller immediately calls back to the OperationListener on the thread that's handling the operation. In that case things work fine.
> 2) Something prevents synchronous suspend (e.g. user activity), so the OperationListener gets invoked later by another thread. The completeStep call then registers a rollback handler that will never get called because the op is already done. No harm, no foul unless the operation rolled back.
> 3) Same as 2), but the OperationListener gets invoked by another thread while the mgmt op is still executing, and something may go wrong, since AbstractOperationContext.Step is not written for concurrent modification.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)
[JBoss JIRA] (WFCORE-1616) ServerSuspendHandler tries to do step completion and rollback handling in a different thread
by Stuart Douglas (JIRA)
[ https://issues.jboss.org/browse/WFCORE-1616?page=com.atlassian.jira.plugi... ]
Stuart Douglas commented on WFCORE-1616:
----------------------------------------
Blocking was definitely the intention. Unlike shutdown, the timeout param in this operation basically just means 'block until the server is suspended or this much time has elapsed', as there is no shutdown op that will be invoked once the operation is complete.
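A minimal sketch of that blocking behavior, assuming a latch is used so that the listener callback never touches the operation state from another thread. The `SuspendListener` and `SuspendController` interfaces and the `suspendAndWait` method are hypothetical stand-ins for illustration, not the WildFly Core API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class BlockingSuspendSketch {

    /** Hypothetical listener; stands in for org.jboss.as.server.suspend.OperationListener. */
    interface SuspendListener {
        void suspendComplete();
    }

    /** Hypothetical controller: starts a suspend and notifies the listener from any thread. */
    interface SuspendController {
        void suspend(SuspendListener listener);
    }

    /**
     * Blocks the calling (operation) thread until suspend completes or the timeout elapses.
     * The listener only counts down a latch, so step completion stays on the calling thread.
     * Returns true if the suspend completed within the timeout.
     */
    static boolean suspendAndWait(SuspendController controller, long timeout, TimeUnit unit)
            throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        controller.suspend(done::countDown);
        return done.await(timeout, unit);
    }

    public static void main(String[] args) throws InterruptedException {
        // Nothing prevents suspend: the callback runs synchronously on this thread.
        boolean immediate = suspendAndWait(l -> l.suspendComplete(), 1, TimeUnit.SECONDS);

        // Activity delays suspend: the callback arrives later from another thread.
        boolean delayed = suspendAndWait(l -> new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            l.suspendComplete();
        }).start(), 1, TimeUnit.SECONDS);

        // Suspend never completes: the wait gives up once the timeout has elapsed.
        boolean timedOut = suspendAndWait(l -> { }, 100, TimeUnit.MILLISECONDS);

        System.out.println(immediate + " " + delayed + " " + timedOut);
    }
}
```

With this shape, whatever the handler does after the wait (completing the step, handling rollback) runs on the thread that is executing the operation.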
> ServerSuspendHandler tries to do step completion and rollback handling in a different thread
> --------------------------------------------------------------------------------------------
>
> Key: WFCORE-1616
> URL: https://issues.jboss.org/browse/WFCORE-1616
> Project: WildFly Core
> Issue Type: Bug
> Components: Domain Management
> Reporter: Brian Stansberry
>
> ServerSuspendHandler is calling OperationContext.completeStep() and passing in a rollback handler from an org.jboss.as.server.suspend.OperationListener that may be invoked by a different thread. The OperationContext is not intended to be invoked from multiple threads in this way.
> Three things can happen with this setup:
> 1) There is no activity preventing suspend, so the suspend controller immediately calls back to the OperationListener on the thread that's handling the operation. In that case things work fine.
> 2) Something prevents synchronous suspend (e.g. user activity), so the OperationListener gets invoked later by another thread. The completeStep call then registers a rollback handler that will never get called because the op is already done. No harm, no foul unless the operation rolled back.
> 3) Same as 2), but the OperationListener gets invoked by another thread while the mgmt op is still executing, and something may go wrong, since AbstractOperationContext.Step is not written for concurrent modification.
[JBoss JIRA] (WFCORE-1427) Add a timeout param to reload op and use it for "graceful reload"
by Brian Stansberry (JIRA)
[ https://issues.jboss.org/browse/WFCORE-1427?page=com.atlassian.jira.plugi... ]
Brian Stansberry commented on WFCORE-1427:
------------------------------------------
[~yersan] Re: ReloadHandler what you propose makes sense to me.
Perhaps a tweak on that would be in the "catch (IOException e)" block in doHandle. If cliClient.isConnected() == false (the missing else case of the existing code), then the server has closed the connection, which means the reload has proceeded and therefore the suspend is done. So there is no reason to wait for the suspend.
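A rough sketch of that else case, assuming a simplified handler shape. The `Client` interface, the `Outcome` enum and this `doHandle` signature are hypothetical illustrations; only `cliClient.isConnected()` and the `catch (IOException e)` block come from the discussion above, and what the handler does after a normal response is left out.

```java
import java.io.IOException;

class ReloadDisconnectSketch {

    /** Hypothetical minimal client view; only isConnected() is named in the discussion. */
    interface Client {
        void execute(String operation) throws IOException;
        boolean isConnected();
    }

    /** Hypothetical result type used only to make the two paths visible. */
    enum Outcome { RESPONSE_RECEIVED, SERVER_CLOSED_CONNECTION }

    static Outcome doHandle(Client cliClient, String operation) throws IOException {
        try {
            cliClient.execute(operation);
            return Outcome.RESPONSE_RECEIVED;
        } catch (IOException e) {
            if (cliClient.isConnected()) {
                // Still connected, so the failure is not explained by the reload: report it.
                throw e;
            }
            // The server closed the connection: the reload has proceeded, so the
            // suspend is already done and there is nothing left to wait for.
            return Outcome.SERVER_CLOSED_CONNECTION;
        }
    }

    public static void main(String[] args) throws IOException {
        Client dropped = new Client() {
            @Override
            public void execute(String operation) throws IOException {
                throw new IOException("connection closed");
            }

            @Override
            public boolean isConnected() {
                return false;
            }
        };
        System.out.println(doHandle(dropped, ":reload"));
    }
}
```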
> Add a timeout param to reload op and use it for "graceful reload"
> -----------------------------------------------------------------
>
> Key: WFCORE-1427
> URL: https://issues.jboss.org/browse/WFCORE-1427
> Project: WildFly Core
> Issue Type: Enhancement
> Components: CLI, Domain Management
> Reporter: Brian Stansberry
> Assignee: Yeray Santana Borges
>
> So instead of
> {code}
> :suspend(20)
> :reload
> {code}
> It's just
> {code}
> :reload(20)
> {code}
> The high level 'reload' command in the CLI should take a --timeout param as well.
> If doing the graceful suspend as part of server side ":reload" handling proves problematic (I haven't looked into it at all before filing this) then a simpler alternative is to only go with the --timeout param on the CLI reload command, and have the CLI implement the graceful behavior internally by first calling :suspend and then :reload. Web console could do the same thing.
[JBoss JIRA] (WFLY-6749) Cluster failover doesn't work on windows when network is disabled on a node
by Paul Ferraro (JIRA)
[ https://issues.jboss.org/browse/WFLY-6749?page=com.atlassian.jira.plugin.... ]
Paul Ferraro edited comment on WFLY-6749 at 6/22/16 5:07 PM:
-------------------------------------------------------------
The default stack contains the following failure detection protocols:
* FD_SOCK
* FD_ALL
These protocols are described here:
http://www.jgroups.org/manual/index.html#FailureDetection
I suspect that your method of simulating a failure, disabling the network of the host machine, is not being detected by FD_SOCK. It will, however, be detected by FD_ALL, but only after 1 minute. The heartbeat timeout used by FD_ALL can be changed via the timeout property (in milliseconds).
e.g.
<protocol type="FD_ALL" ><property name="timeout">60000</property></protocol>
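To illustrate why detection takes up to the full timeout, here is a small sketch of heartbeat-based failure detection in the style of FD_ALL: a member is only suspected once nothing has been heard from it for longer than the timeout. This is not JGroups code; the class, method names and node names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeartbeatTimeoutSketch {

    private final long timeoutMillis;
    // Timestamp (in milliseconds) of the last heartbeat received from each member.
    private final Map<String, Long> lastHeard = new HashMap<>();

    HeartbeatTimeoutSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a heartbeat from the given member at the given time. */
    void heartbeat(String member, long nowMillis) {
        lastHeard.put(member, nowMillis);
    }

    /** Returns the members whose last heartbeat is older than the timeout. */
    List<String> suspects(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeard.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        HeartbeatTimeoutSketch fd = new HeartbeatTimeoutSketch(60_000);
        fd.heartbeat("node-a", 0);
        fd.heartbeat("node-b", 0);
        // node-b's network is disabled now, so only node-a keeps sending heartbeats.
        fd.heartbeat("node-a", 30_000);
        System.out.println(fd.suspects(59_000)); // prints []
        System.out.println(fd.suspects(61_000)); // prints [node-b]
    }
}
```

Lowering the timeout shortens the detection window in the same way, at the cost of more false suspicions on a slow network.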
> Cluster failover doesn't work on windows when network is disabled on a node
> ---------------------------------------------------------------------------
>
> Key: WFLY-6749
> URL: https://issues.jboss.org/browse/WFLY-6749
> Project: WildFly
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 8.2.0.Final
> Reporter: Preeta Kuruvilla
> Assignee: Paul Ferraro
> Priority: Critical
>
> This concerns a two-VM WildFly cluster in a Windows environment. To test failover, the team disabled the network on one node. However, failover does not happen, and application functionality on the cluster is hampered as a result.
[JBoss JIRA] (WFLY-6749) Cluster failover doesn't work on windows when network is disabled on a node
by Paul Ferraro (JIRA)
[ https://issues.jboss.org/browse/WFLY-6749?page=com.atlassian.jira.plugin.... ]
Paul Ferraro commented on WFLY-6749:
------------------------------------
The default stack contains the following failure detection protocols:
* FD_SOCK
* FD_ALL
These protocols are described here:
http://www.jgroups.org/manual/index.html#FailureDetection
I suspect that your method of simulating a failure, disabling the network of the host machine, is not being detected by FD_SOCK. It will, however, be detected by FD_ALL, but only after 1 minute. The heartbeat timeout used by FD_ALL can be changed via the timeout property (in milliseconds).
e.g.
<protocol type="FD_ALL" ><property name="timeout">60000</property></protocol>
> Cluster failover doesn't work on windows when network is disabled on a node
> ---------------------------------------------------------------------------
>
> Key: WFLY-6749
> URL: https://issues.jboss.org/browse/WFLY-6749
> Project: WildFly
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 8.2.0.Final
> Reporter: Preeta Kuruvilla
> Assignee: Paul Ferraro
> Priority: Critical
>
> This concerns a two-VM WildFly cluster in a Windows environment. To test failover, the team disabled the network on one node. However, failover does not happen, and application functionality on the cluster is hampered as a result.
[JBoss JIRA] (WFCORE-1612) Allow applying the same configuration changes to multiple Standalone nodes
by Brian Stansberry (JIRA)
[ https://issues.jboss.org/browse/WFCORE-1612?page=com.atlassian.jira.plugi... ]
Brian Stansberry commented on WFCORE-1612:
------------------------------------------
[~sebastian.laskawiec] Please leave this open. I'm not sure why, but I have an instinct that having this will be helpful somehow, and if not I can just close it later.
> Allow applying the same configuration changes to multiple Standalone nodes
> --------------------------------------------------------------------------
>
> Key: WFCORE-1612
> URL: https://issues.jboss.org/browse/WFCORE-1612
> Project: WildFly Core
> Issue Type: Feature Request
> Reporter: Sebastian Łaskawiec
> Assignee: Brian Stansberry
>
> I would like to request support for applying configuration changes to multiple standalone nodes, the same way DMR currently works in domain mode.
> The main motivation is using WF-based projects (like Infinispan HotRod Server) in the Cloud. By using DMR we can change configuration on the fly, which is very important in our case.
> In the Infinispan project we use DMR to propagate configuration changes to all nodes in the cluster (e.g. adding a cache or changing it). For Cloud deployments we use Standalone mode. It would be nice to find a way to propagate configuration changes to all servers the same way as we do in Domain mode.