[jboss-jira] [JBoss JIRA] (WFCORE-263) Cancelling management op on slave HC tree is broken

Tue Dec 23 22:18:30 EST 2014

    [ https://issues.jboss.org/browse/WFCORE-263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029506#comment-13029506 ] 

Brian Stansberry commented on WFCORE-263:
-----------------------------------------

Looking at the list of active operations on the slave HC confirms my expectation as to what was going on here:

{code}
[domain at localhost:9990 /] /host=slave/core-service=management/service=management-operations:read-resource(recursive=true,include-runtime=true)
{
    "outcome" => "success",
    "result" => {"active-operation" => {
        "24068422" => {
            "access-mechanism" => undefined,
            "address" => [
                ("host" => "slave"),
                ("server" => "server-one")
            ],
            "caller-thread" => "Host Controller Service Threads - 11",
            "cancelled" => false,
            "exclusive-running-time" => -1L,
            "execution-status" => "executing",
            "operation" => "composite",
            "running-time" => 78500204000L
        },
        "1273140711" => {
            "access-mechanism" => "undefined",
            "address" => [],
            "caller-thread" => "Host Controller Service Threads - 10",
            "cancelled" => false,
            "exclusive-running-time" => 78542647000L,
            "execution-status" => "completing",
            "operation" => "composite",
            "running-time" => 78544658000L
        },
        "542652524" => {
            "access-mechanism" => "undefined",
            "address" => [
                ("host" => "slave"),
                ("core-service" => "management"),
                ("service" => "management-operations")
            ],
            "caller-thread" => "Host Controller Service Threads - 17",
            "cancelled" => false,
            "exclusive-running-time" => -1L,
            "execution-status" => "executing",
            "operation" => "read-resource",
            "running-time" => 4639000L
        }
    }}
}
{code}

The third op is just the CLI read-resource request itself, so ignore that.

The others are normal for a domain op that is being rolled out to servers and hasn't yet completed.

The 2nd one (1273140711) was actually invoked first. It was a request from the DC to the slave telling it to update its own model. It has "execution-status" => "completing" because it has prepared the update to the model and is waiting for an instruction from the DC telling it to commit the transaction. This request is actually fine.

The 1st one (24068422) is the problematic one. The DC has gotten the prepared notification from the 1273140711 op and has proceeded to roll out the change to the servers. It sends a request to the slave, which the slave then proxies on to the servers; the slave is acting purely as a proxy. This is the request that is actually stuck, because the server is not responding.
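To make the distinction between the three ops concrete, here is a minimal sketch that copies a few fields from the read-resource output above into plain Python dicts and classifies each entry. The helper names (targets_server, holds_exclusive_lock) are hypothetical, and the reading of "exclusive-running-time" => -1L as "does not hold the exclusive lock" is an assumption, not something taken from the WildFly code.

```python
# Fields copied from the read-resource output above; other attributes omitted.
active_ops = {
    "24068422": {
        "address": [("host", "slave"), ("server", "server-one")],
        "execution-status": "executing",
        "exclusive-running-time": -1,
    },
    "1273140711": {
        "address": [],
        "execution-status": "completing",
        "exclusive-running-time": 78542647000,
    },
    "542652524": {
        "address": [
            ("host", "slave"),
            ("core-service", "management"),
            ("service", "management-operations"),
        ],
        "execution-status": "executing",
        "exclusive-running-time": -1,
    },
}


def targets_server(op):
    """True if the op's address has a 'server' element, i.e. it is proxied on."""
    return any(key == "server" for key, _ in op["address"])


def holds_exclusive_lock(op):
    """Assumption: -1 means the op is not holding the exclusive lock."""
    return op["exclusive-running-time"] != -1


for op_id, op in active_ops.items():
    print(
        op_id,
        "proxied-to-server" if targets_server(op) else "local",
        "holds-lock" if holds_exclusive_lock(op) else "no-lock",
        op["execution-status"],
    )
```

Under these assumptions, 1273140711 is the only op holding the lock, while 24068422 is the only one addressed to a server.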

The problem is that find-non-progressing-operation and cancel-non-progressing-operation are identifying the 1273140711 op as the problematic one.

{code}
[domain at localhost:9990 /] /host=slave/core-service=management/service=management-operations:find-non-progressing-operation
{
    "outcome" => "success",
    "result" => "1273140711"
}
{code}

Cancelling that one doesn't unstick anything, because the DC is not yet blocked waiting for a response to its commit/rollback instruction.

I'll need to give some thought as to how to get find-non-progressing-operation and cancel-non-progressing-operation to identify the 24068422 op as the problematic one, or at least to decide that they don't know which of the two is the problem, forcing the user to investigate further.
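One possible shape for that decision, purely as a sketch: among long-running ops, prefer one that is still "executing" over one that is merely "completing" (prepared and awaiting commit), and refuse to pick when that does not leave exactly one candidate. Everything here is hypothetical, including the function name, the AmbiguousOperations exception, the 15-second threshold, and the assumption that "running-time" is in nanoseconds; it is not how WildFly implements either operation.

```python
class AmbiguousOperations(Exception):
    """Raised when the heuristic cannot single out one stuck operation."""


# Hypothetical threshold; assumes "running-time" is reported in nanoseconds.
TIMEOUT_NS = 15 * 1_000_000_000

# Values copied from the read-resource output above.
active_ops = {
    "24068422": {"execution-status": "executing", "running-time": 78500204000},
    "1273140711": {"execution-status": "completing", "running-time": 78544658000},
    "542652524": {"execution-status": "executing", "running-time": 4639000},
}


def find_non_progressing(ops, timeout_ns=TIMEOUT_NS):
    """Return the id of the single stuck op, None if there is none.

    Raises AmbiguousOperations when several ops are stale and the
    'still executing' rule does not narrow them down to one.
    """
    stale = {i: op for i, op in ops.items() if op["running-time"] > timeout_ns}
    if not stale:
        return None
    executing = [i for i, op in stale.items() if op["execution-status"] == "executing"]
    if len(executing) == 1:
        return executing[0]
    if len(stale) == 1:
        return next(iter(stale))
    raise AmbiguousOperations(sorted(stale))


print(find_non_progressing(active_ops))
```

With the data above, the short-lived CLI read-resource request falls under the threshold, and the "completing" op is passed over in favour of the proxied op that is still executing. If two stale ops were both executing, the sketch raises instead of guessing, which corresponds to forcing the user to investigate further.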

> Cancelling management op on slave HC tree is broken
> ---------------------------------------------------
>
>                 Key: WFCORE-263
>                 URL: https://issues.jboss.org/browse/WFCORE-263
>             Project: WildFly Core
>          Issue Type: Bug
>          Components: Domain Management
>    Affects Versions: 1.0.0.Alpha9
>            Reporter: James Livingston
>            Assignee: Brian Stansberry
>         Attachments: unundeployable.zip
>
>
> If you have a DC with a slave HC, and perform a management operation which gets stuck, non-progressing operations will be reported for both the DC and the slave HC via:
> /host=master/core-service=management/service=management-operations:find-non-progressing-operation
> /host=slave/core-service=management/service=management-operations:find-non-progressing-operation
> Cancelling the operation under /host=master works as expected, pushing the cancellation down to the slave and the controllers become responsive again.
> If however you attempt to cancel the operation under /host=slave, it goes bad. { "outcome" => "success", "result" => undefined } is reported in the CLI, but the controllers are still unresponsive.
> Running :find-non-progressing-operation against the slave will then report { "outcome" => "success", "result" => undefined } rather than saying that no non-progressing operations were found, and active-operation=*:read-resource() shows the operation as not cancelled.
> Once you attempt to cancel it on a slave, attempting to cancel it under /host=master will report success, but leave the slave op in a weird state, and things requiring the controller lock (such as the web UI) will still not respond.
