[
https://issues.jboss.org/browse/WFCORE-263?page=com.atlassian.jira.plugin...
]
Brian Stansberry commented on WFCORE-263:
-----------------------------------------
Looking at the list of active operations on the slave HC confirms my expectation as to
what was going on here:
{code}
[domain@localhost:9990 /]
/host=slave/core-service=management/service=management-operations:read-resource(recursive=true,include-runtime=true)
{
"outcome" => "success",
"result" => {"active-operation" => {
"24068422" => {
"access-mechanism" => undefined,
"address" => [
("host" => "slave"),
("server" => "server-one")
],
"caller-thread" => "Host Controller Service Threads - 11",
"cancelled" => false,
"exclusive-running-time" => -1L,
"execution-status" => "executing",
"operation" => "composite",
"running-time" => 78500204000L
},
"1273140711" => {
"access-mechanism" => "undefined",
"address" => [],
"caller-thread" => "Host Controller Service Threads - 10",
"cancelled" => false,
"exclusive-running-time" => 78542647000L,
"execution-status" => "completing",
"operation" => "composite",
"running-time" => 78544658000L
},
"542652524" => {
"access-mechanism" => "undefined",
"address" => [
("host" => "slave"),
("core-service" => "management"),
("service" => "management-operations")
],
"caller-thread" => "Host Controller Service Threads - 17",
"cancelled" => false,
"exclusive-running-time" => -1L,
"execution-status" => "executing",
"operation" => "read-resource",
"running-time" => 4639000L
}
}}
}
{code}
The third op is just the CLI read-resource request itself, so ignore that.
The others are normal for a domain op that is being rolled out to servers and hasn't
yet completed.
The 2nd one (1273140711) was actually invoked first. It was a request from the DC to the
slave telling it to update its own model. It has "execution-status" =>
"completing" because it has prepared the update to the model and is waiting for
an instruction from the DC telling it to commit the transaction. This request is actually
fine.
The 1st one (24068422) is the problematic one. The DC has received the prepared notification
from the 1273140711 op and has proceeded to roll out the change to the servers. It sends a
request to the slave, which then proxies it on to the servers. The slave is just acting as
a proxy. This is the request that is actually stuck, as the server is not responding.
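To make the reading of the listing above concrete, here is a small Python sketch that labels each active operation the way this comment does. It is only an illustration, not WildFly code: the describe() and summarize() helpers and their labels are hypothetical, only a subset of the attributes is copied from the listing, and it assumes that running-time is expressed in nanoseconds.

```python
# Subset of the attributes shown in the listing above, values copied as-is.
ACTIVE_OPS = {
    "24068422": {
        "address": [("host", "slave"), ("server", "server-one")],
        "exclusive-running-time": -1,
        "execution-status": "executing",
        "operation": "composite",
        "running-time": 78500204000,
    },
    "1273140711": {
        "address": [],
        "exclusive-running-time": 78542647000,
        "execution-status": "completing",
        "operation": "composite",
        "running-time": 78544658000,
    },
    "542652524": {
        "address": [("host", "slave"), ("core-service", "management"),
                    ("service", "management-operations")],
        "exclusive-running-time": -1,
        "execution-status": "executing",
        "operation": "read-resource",
        "running-time": 4639000,
    },
}


def describe(op):
    """Hypothetical labels that follow the reasoning in this comment."""
    targets_server = any(key == "server" for key, _ in op["address"])
    if op["execution-status"] == "completing":
        return "prepared, waiting for commit/rollback from the DC"
    if targets_server and op["execution-status"] == "executing":
        return "proxied to a server, waiting for the server to respond"
    return "other"


def summarize(ops):
    """Map op id -> (label, running time in seconds, assuming nanoseconds)."""
    return {op_id: (describe(op), op["running-time"] / 1e9)
            for op_id, op in ops.items()}


for op_id, (label, seconds) in summarize(ACTIVE_OPS).items():
    print(f"{op_id}: {label} ({seconds:.1f}s)")
```

Under these assumptions 24068422 is labelled as the proxied server request and 1273140711 as the prepared op waiting on the DC, while the CLI read-resource op falls into "other".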
The problem is that find-non-progressing-operation and cancel-non-progressing-operation are
identifying the 1273140711 op as the problematic one.
{code}
[domain@localhost:9990 /]
/host=slave/core-service=management/service=management-operations:find-non-progressing-operation
{
"outcome" => "success",
"result" => "1273140711"
}
{code}
Cancelling that one doesn't unstick anything, as the DC is not yet blocking waiting for
a response to its commit/rollback.
I'll need to give some thought as to how to get find-non-progressing-operation and
cancel-non-progressing-operation to identify the 24068422 op as the problematic one, or at
least to decide that they don't know which of the two is the problem, forcing the user
to investigate further.
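One possible shape for that selection logic, sketched in Python purely to illustrate the idea. Nothing here is an existing WildFly API or the eventual fix: pick_non_progressing(), the 15 second threshold, the nanosecond unit and the rule "prefer an executing op addressed to a server" are all assumptions made for the example.

```python
def is_server_request(op):
    """True if the op's address contains a "server" element."""
    return any(key == "server" for key, _ in op["address"])


def pick_non_progressing(ops, timeout_ns):
    """Return (chosen_id, candidate_ids); chosen_id is None if unsure.

    Hypothetical heuristic: among ops running longer than timeout_ns, prefer
    those still executing a request addressed to a server. If that does not
    leave exactly one candidate, report all candidates instead of guessing.
    """
    slow = {op_id: op for op_id, op in ops.items()
            if op["running-time"] > timeout_ns}
    waiting_on_server = sorted(
        op_id for op_id, op in slow.items()
        if op["execution-status"] == "executing" and is_server_request(op))
    candidates = waiting_on_server or sorted(slow)
    chosen = candidates[0] if len(candidates) == 1 else None
    return chosen, candidates


# The two long-running ops from the listing in this comment (attribute subset).
OPS = {
    "24068422": {
        "address": [("host", "slave"), ("server", "server-one")],
        "execution-status": "executing",
        "running-time": 78500204000,
    },
    "1273140711": {
        "address": [],
        "execution-status": "completing",
        "running-time": 78544658000,
    },
}

TIMEOUT_NS = 15 * 10**9  # illustrative threshold, assuming nanoseconds

print(pick_non_progressing(OPS, TIMEOUT_NS))
```

With the data from the listing this picks 24068422. If neither op were addressed to a server, the function would return None together with both ids, which corresponds to the "don't know which of the two" outcome that leaves further investigation to the user.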
Cancelling management op on slave HC tree is broken
---------------------------------------------------
Key: WFCORE-263
URL: https://issues.jboss.org/browse/WFCORE-263
Project: WildFly Core
Issue Type: Bug
Components: Domain Management
Affects Versions: 1.0.0.Alpha9
Reporter: James Livingston
Assignee: Brian Stansberry
Attachments: unundeployable.zip
If you have a DC with a slave HC, and perform a management operation which gets stuck,
non-progressing operations will be reported for both the DC and the slave HC via:
/host=master/core-service=management/service=management-operations:find-non-progressing-operation
/host=slave/core-service=management/service=management-operations:find-non-progressing-operation
Cancelling the operation under /host=master works as expected, pushing the cancellation
down to the slave and the controllers become responsive again.
If however you attempt to cancel the operation under /host=slave, it goes bad.
{ "outcome" => "success", "result" => undefined } is reported in the CLI, but the
controllers are still unresponsive.
Running :find-non-progressing-operation against the slave then reports
{outcome=success,result=undefined} rather than stating that no non-progressing operations
were found, and active-operation=*:read-resource() shows the operation as not cancelled.
Once you attempt to cancel it on a slave, attempting to cancel it under /host=master will
report success, but leave the slave op in a weird state, and things requiring the
controller lock (such as the web UI) will still not respond.