[jboss-jira] [JBoss JIRA] (WFWIP-241) scale down of client pod that has transaction in-doubt on it isn't successful when there is server pod that is part of transaction which isn't reachable

Thu Oct 10 04:40:00 EDT 2019

    [ https://issues.jboss.org/browse/WFWIP-241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797175#comment-13797175 ] 

Ondrej Chaloupka commented on WFWIP-241:
----------------------------------------

This issue is related to WFC-77, but only in that the test case is similar and that the folder {{/opt/eap/standalone/data/ejb-xa-recovery}} is not cleaned during recovery. Because the folder is not cleaned, the scale-down cannot proceed.

But this issue has nothing to do with an error in the WFTC component (which is most probably the case for WFTC-77). The trouble is that the {{tx-client}} has no idea how to finish the {{ejb-xa-recovery}} records. It does not know whether the {{tx-server}} was cleanly scaled down (in which case the record may be removed) or whether the {{tx-server}} is only temporarily down and will be resurrected with some work still to be done (in which case the record has to be kept and the {{tx-client}} can't be scaled down).

The scenario is as follows.
{{tx-client}} runs the 2PC {{PREPARE}} at the {{tx-server}}. {{tx-server}} calls prepare at the remote resource (e.g. a database) and crashes. The 2PC {{PREPARE}} phase was not finished at the {{tx-server}} and thus rollback is assumed.
After the restart of the {{tx-server}} the recovery process may roll back all data in the database, as rollback is assumed (the prepare was not finished, so there is no promise about finishing with a commit). The {{tx-server}} cleans all resources and is permitted to scale down (all data was rolled back, data consistency is fine).
But now that the {{tx-server}} has gone away, the {{tx-client}} is left with the record in {{ejb-xa-recovery}}. If the {{tx-server}} were restarted and available, the {{tx-client}} would call {{recover}}, find out that the {{tx-server}} has no in-doubt work, and delete the record.
But the {{tx-server}} is now shut down and the {{tx-client}} has no idea how to finish the record. Some third-party process, e.g. the WFLY operator, needs to say that the record no longer needs to exist. For this to work, some new {{jboss-cli.sh}} command would probably be needed. We need to find out which remote resources are unfinished from the perspective of WFTC, and then we need to say that those still in the repository may be ignored because the servers no longer exist.
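
To make the decision above concrete, here is a minimal sketch in Python of the check such a third-party process might perform. It is an illustration only, not operator or WFTC code: the names {{ServerState}}, {{RecoveryRecord}}, {{can_remove}} and {{can_scale_down}} are hypothetical, and it assumes that each {{ejb-xa-recovery}} record can be mapped to the remote server it refers to and that something outside the {{tx-client}} supplies the server state.

```python
from dataclasses import dataclass
from enum import Enum


class ServerState(Enum):
    """Hypothetical states of a remote server as seen by a third party."""
    REACHABLE = "reachable"                      # recover can be called on it
    TEMPORARILY_DOWN = "temporarily_down"        # may come back with work to do
    SCALED_DOWN_CLEANLY = "scaled_down_cleanly"  # confirmed gone, nothing left


@dataclass(frozen=True)
class RecoveryRecord:
    """Hypothetical view of one record in the ejb-xa-recovery folder."""
    record_id: str
    remote_server: str


def can_remove(state, server_reports_in_doubt=False):
    """Decide whether a single record may be dropped."""
    if state is ServerState.SCALED_DOWN_CLEANLY:
        # The server finished its own recovery before it went away.
        return True
    if state is ServerState.REACHABLE:
        # The normal path: recover says whether in-doubt work remains.
        return not server_reports_in_doubt
    # Temporarily down: the record must stay until the server is back.
    return False


def can_scale_down(records, states, in_doubt=frozenset()):
    """The client pod may scale down only if every record is removable."""
    return all(
        can_remove(states[r.remote_server], r.record_id in in_doubt)
        for r in records
    )


record = RecoveryRecord("xid-1", "tx-server-0")
print(can_scale_down([record], {"tx-server-0": ServerState.TEMPORARILY_DOWN}))
print(can_scale_down([record], {"tx-server-0": ServerState.SCALED_DOWN_CLEANLY}))
```

The point of the sketch is that the {{tx-client}} alone cannot tell {{TEMPORARILY_DOWN}} from {{SCALED_DOWN_CLEANLY}}. That piece of information has to come from outside, which is why an operator-driven command is suggested.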

> scale down of client pod that has transaction in-doubt on it isn't successful when there is server pod that is part of transaction which isn't reachable
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: WFWIP-241
>                 URL: https://issues.jboss.org/browse/WFWIP-241
>             Project: WildFly WIP
>          Issue Type: Bug
>          Components: OpenShift
>            Reporter: Martin Simka
>            Assignee: Ondrej Chaloupka
>            Priority: Major
>              Labels: operator
>
> While testing tx recovery in OpenShift I see that scale down of client pod that has transaction in-doubt on it isn't successful when there is server pod that is part of transaction which isn't reachable
> Scenario:
> *ejb client* (app tx-client, pod tx-client-0):
> * EJB business method
>   ** lookup remote EJB 
>   ** enlist XA resource 1 to transaction
>   ** enlist XA resource 2 to transaction
>   ** call remote EJB
> *ejb server* (app tx-server, pod tx-server-0):
> * EJB business method
>   ** enlist XA resource 1 to transaction
>   ** enlist XA resource 2 to transaction
> *testTxStatelessServerSecondPrepareJvmHaltScaleDownClient*
> Test workflow:
> - ejb server XA resource crashes JVM on tx-server pod
> - label "{{wildfly.org/operated-by-headless}}" of server pod is changed, which causes that server isn't reachable
> - tx-server pod is scaled down
> - tx-client pod is scaled down
> *scale down of client pod hangs*
> {noformat}
> {"level":"info","ts":1570636701.0314589,"logger":"wildflyserver_controller","msg":"Scaling down statefulset by verification if pods are clean by recovery","StatefulSet.Namespace":"msimka-namespace","StatefulSet.Name":"tx-client"}
> {"level":"info","ts":1570636701.0314867,"logger":"wildflyserver_controller","msg":"Statefulset was not scaled to the desired replica size 0 (current StatefulSet size: 1). Transaction recovery scaledown process has not cleaned all pods. Please, check status of the WildflyServer tx-client","StatefulSet.Namespace":"msimka-namespace","StatefulSet.Name":"tx-client"}
> {"level":"info","ts":1570636703.5918324,"logger":"wildflyserver_controller","msg":"Reconciling WildFlyServer","Request.Namespace":"msimka-namespace","Request.Name":"tx-client"}
> {"level":"info","ts":1570636703.5919785,"logger":"wildlfyserver_resources","msg":"Getting resource","WildFlyServer.Namespace":"msimka-namespace","WildFlyServer.Name":"tx-client","Resource.Name":"tx-client"}
> {"level":"info","ts":1570636703.5920458,"logger":"wildlfyserver_resources","msg":"Got resource","WildFlyServer.Namespace":"msimka-namespace","WildFlyServer.Name":"tx-client","Resource.Name":"tx-client"}
> {"level":"info","ts":1570636703.5921679,"logger":"wildflyserver_controller","msg":"Transaction recovery scaledown processing","Request.Namespace":"msimka-namespace","Request.Name":"tx-client","Pod Name":"tx-client-0","IP Address":"10.128.1.34","Pod State":"SCALING_DOWN_RECOVERY_DIRTY","Pod Phase":"Running"}
> {"level":"info","ts":1570636703.5922475,"logger":"wildflyserver_controller","msg":"Recovery properties at pod were already defined. Skipping server restart.","Request.Namespace":"msimka-namespace","Request.Name":"tx-client","Pod Name":"tx-client-0"}
> {"level":"info","ts":1570636703.5971646,"logger":"wildflyserver_controller","msg":"Executing recovery scan at tx-client-0","Request.Namespace":"msimka-namespace","Request.Name":"tx-client","Pod IP":"10.128.1.34","Recovery port":4712}
> {"level":"info","ts":1570636709.162107,"logger":"wildflyserver_controller","msg":"Executing recovery scan at tx-client-0","Request.Namespace":"msimka-namespace","Request.Name":"tx-client","Pod IP":"10.128.1.34","Recovery port":4712}
> {"level":"info","ts":1570636714.7187033,"logger":"wildflyserver_controller","msg":"In-doubt transactions in object store","Request.Namespace":"msimka-namespace","Request.Name":"tx-client","Pod Name":"tx-client-0","Message":"WildFly Transaction Client data dir is not empty and scaling down of the pod 'tx-client-0' will be retried.Wildfly Transacton Client data dir path '/opt/eap/standalone/data/ejb-xa-recovery', output listing: 20005_00000000000000000000ffff0a80012251f788695d9e026b0000001374782d636c69656e742d30_00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\n"}
> {noformat}
