[jboss-jira] [JBoss JIRA] (WFWIP-203) Transaction recovery may hit a wrong server when remote side works with multiple pods

Tue Sep 17 08:55:00 EDT 2019

    [ https://issues.jboss.org/browse/WFWIP-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785379#comment-13785379 ] 

Martin Simka edited comment on WFWIP-203 at 9/17/19 8:54 AM:
-------------------------------------------------------------

Unless I'm doing something wrong, I can still reproduce it with the operator built from https://github.com/ochaloup/wildfly-operator/tree/issue70-statefulset-headless-service (head 8925e7f64b6fc02b4694da63d93c0a8ce03a566d). Attaching logs. 

> Transaction recovery may hit a wrong server when remote side works with multiple pods
> -------------------------------------------------------------------------------------
>
>                 Key: WFWIP-203
>                 URL: https://issues.jboss.org/browse/WFWIP-203
>             Project: WildFly WIP
>          Issue Type: Bug
>          Components: OpenShift
>            Reporter: Ondrej Chaloupka
>            Assignee: Ondrej Chaloupka
>            Priority: Blocker
>         Attachments: tx-client-0.log, tx-server-0.log, tx-server-1.log
>
>
> When server-to-server remote EJB calls propagate the transaction context, the EJB call can be routed to one pod while the recovery call may be directed to a different pod.
> Such a situation causes a consistency issue.
> Consider this scenario: the first server (let's call it `tx-client`) makes a remote EJB call to a remote server which is one of the servers joined in a cluster, named `tx-server-0` and `tx-server-1`. The `tx-client` calls `tx-server-1`. The processing continues up to the start of the 2PC, and then `tx-server-1` crashes (or the host goes down, a network issue happens...).
> `tx-client` understands that the process was not successful and asks the recovery manager to retry and finish.
> The recovery manager starts to call the remote server based on data saved in the object store of `tx-client`.
> But unfortunately the recovery remote call goes *not* to `tx-server-1` but to `tx-server-0`. The `tx-client` gets error code `XAException.XAER_NOTA` (`-4`), removes the data from its object store (`/opt/eap/standalone/data/tx-object-store/`, `/opt/eap/standalone/data/ejb-xa-recovery`), and then never finishes the in-doubt transactions at `tx-server-1`.
> It is unclear whether this is an issue of the OpenShift configuration or a problem in the WFTC/EJB/remoting layer in WildFly.
> This is tested with WFLY Operator from 2019-09-26 `@90a2b3b`.
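
The failure mode described above can be sketched as a small, self-contained simulation. This is only an illustrative model, not WildFly or Narayana code: the `Pod` class, the `recover` method, the in-memory map standing in for the `tx-client` object store, and the `xid-1` identifier are all hypothetical. The only value taken from the report is the `XAER_NOTA` return code (`-4`), and the "forget the record on `XAER_NOTA`" step mirrors the behaviour the description attributes to `tx-client`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RecoveryRoutingSketch {
    // Same value as XAException.XAER_NOTA, as quoted in the report.
    static final int XAER_NOTA = -4;
    static final int XA_OK = 0;

    /** Hypothetical stand-in for one server pod and the in-doubt transactions it holds. */
    static final class Pod {
        final String name;
        final Set<String> inDoubt = new HashSet<>();

        Pod(String name) {
            this.name = name;
        }

        /** Finishes a transaction this pod knows about, or answers XAER_NOTA otherwise. */
        int commit(String xid) {
            return inDoubt.remove(xid) ? XA_OK : XAER_NOTA;
        }
    }

    /**
     * Hypothetical recovery pass on the client side. The store maps each xid to the
     * pod that originally did the work, but the call goes to whichever pod the
     * request happens to be routed to. A record is dropped on XA_OK and also on
     * XAER_NOTA, which is the behaviour described in the report.
     */
    static void recover(Map<String, String> clientObjectStore, Pod routedTo) {
        clientObjectStore.keySet().removeIf(xid -> {
            int rc = routedTo.commit(xid);
            return rc == XA_OK || rc == XAER_NOTA;
        });
    }

    public static void main(String[] args) {
        Pod server0 = new Pod("tx-server-0");
        Pod server1 = new Pod("tx-server-1");
        server1.inDoubt.add("xid-1"); // tx-server-1 crashed with this transaction prepared

        Map<String, String> clientObjectStore = new HashMap<>();
        clientObjectStore.put("xid-1", "tx-server-1");

        // The recovery call is routed to the wrong pod.
        recover(clientObjectStore, server0);

        System.out.println("client records left: " + clientObjectStore.size());
        System.out.println("in-doubt on tx-server-1: " + server1.inDoubt);
    }
}
```

In this model the client ends up with no record of `xid-1` while `tx-server-1` still holds it in doubt, which is the inconsistency the report describes. Routing the same `recover` call to `server1` instead would clear both sides.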

--
This message was sent by Atlassian Jira
(v7.13.5#713005)