[jboss-jira] [JBoss JIRA] (WFLY-88) Recovery not fully triggered when distributed transaction falls down at prepare phase of 2PC

Fri Nov 1 09:10:02 EDT 2013

    [ https://issues.jboss.org/browse/WFLY-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12827442#comment-12827442 ] 

RH Bugzilla Integration commented on WFLY-88:
---------------------------------------------

Ondrej Chaloupka <ochaloup at redhat.com> made a comment on [bug 952746|https://bugzilla.redhat.com/show_bug.cgi?id=952746]

Hi David,

I've checked the current state of the issue (it has been a while since I last looked at it), and there is still a problem with re-establishing the EJB remote connection when the remote server (the server that is called from the client server via an outbound connection) crashes and then comes back up. The client server (which started the tx) then has no knowledge that the remote server is up again and that recovery can be done.

This happens only for distributed JTA transactions. JTS transactions manage the distributed communication between nodes themselves, and recovery starts without problems.

The workaround is to call a remote method from the client server on the remote server after the remote server comes back to life. Crash recovery then starts.
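
A minimal sketch of how such a workaround call could be automated, assuming the application exposes some cheap no-op method on the remote bean. `RemotePing` and `RemotePinger` are hypothetical names introduced here for illustration; they are not part of WildFly or the EJB client API, and a real application would pass in a lambda that invokes its own remote EJB proxy.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class RemotePinger {
    /** Hypothetical stand-in for a no-op business method on the remote bean. */
    interface RemotePing {
        void ping() throws Exception;
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger successes = new AtomicInteger();

    /** Performs one invocation; returns false if the remote side is unreachable. */
    boolean pingOnce(RemotePing target) {
        try {
            target.ping();
            successes.incrementAndGet();
            return true;
        } catch (Exception e) {
            // The remote server is probably still down; the caller may retry later.
            return false;
        }
    }

    /** Repeats the invocation with a fixed delay until stop() is called. */
    void start(RemotePing target, long periodMillis) {
        scheduler.scheduleWithFixedDelay(
                () -> pingOnce(target), 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    int successfulPings() {
        return successes.get();
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

The period would have to be shorter than the transaction timeout for this to help. It only papers over the missing reconnect and does not fix it.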

The test scenario in which this problem occurs looks like this:
 - a transaction is started on the client server
 - the client server calls the remote server via an outbound connection (the tx context is propagated to the remote server)
 - the remote server sends a message to a queue (a simulation of some action done during the transaction)
 - the remote call and the bean method finish
 - the transaction starts 2PC. The prepare phase completes and the commit phase starts. The remote server crashes on entry to the commit method
 - the client server is still alive
 - the remote server comes back to life
 - crash recovery should complete the commit, as all the participants agreed on it

Here is the explanation from Jaikiran:
When a connection breaks down between the server and the client, specifically when the client goes down and comes back up again, then the server and the client will not auto communicate with each other. 
In other words, the server will have no knowledge (in EJB resource sense) that the client has come back up again. That effectively means that the EJB tx recovery process will have no clue of the EJB nodes to communicate with.
To deal with that, there should be some communication from the client (which is now up) to the server to reestablish that connection. 
In a real application, it would be the first invocation from the client to the server. 

I've verified that a call from the client server to the remote one really does re-establish the connection, and recovery then starts.
But the next call from the client to the server could take some time, and meanwhile the transaction could be rolled back because of the timeout.
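
The behaviour described above can be illustrated with a toy model. This is only a sketch of the observed behaviour under simplifying assumptions, not WildFly's actual recovery code: `RecoveryModel` and all of its methods are hypothetical, and the model simply assumes that a recovery pass can reach only those nodes for which the client server currently holds a connection.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RecoveryModel {
    // Nodes the client server currently has a live outbound connection to.
    private final Set<String> connectedNodes = new HashSet<>();
    // Prepared but not yet committed branches: transaction id -> node name.
    private final Map<String, String> pendingBranches = new HashMap<>();

    /** Any invocation from the client server (re)establishes the connection. */
    void invoke(String node) {
        connectedNodes.add(node);
    }

    /** A crash drops the connection; nothing restores it automatically. */
    void nodeCrashed(String node) {
        connectedNodes.remove(node);
    }

    void recordPrepared(String txId, String node) {
        pendingBranches.put(txId, node);
    }

    /** One recovery pass: commits only branches on known nodes, returns the count. */
    int recoveryPass() {
        int before = pendingBranches.size();
        pendingBranches.values().removeIf(connectedNodes::contains);
        return before - pendingBranches.size();
    }

    int pendingCount() {
        return pendingBranches.size();
    }

    public static void main(String[] args) {
        RecoveryModel client = new RecoveryModel();
        client.invoke("remote");                   // business call inside the tx
        client.recordPrepared("tx1", "remote");    // prepare phase done
        client.nodeCrashed("remote");              // crash on entry to commit
        // The remote server restarts here, but the client server is not told.
        System.out.println(client.recoveryPass()); // 0: branch stays in the tx log
        client.invoke("remote");                   // first call after the restart
        System.out.println(client.recoveryPass()); // 1: recovery completes the commit
    }
}
```

In this model the first recovery pass finds nothing to do even though the remote server is already back, which corresponds to the window in which the transaction can time out.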

What do you think about this?
I think that the current behavior is not correct. Jaikiran and I agreed on that earlier as well, but he hasn't had time to fix it (https://bugzilla.redhat.com/show_bug.cgi?id=952746#c15).

Thanks
Ondra

> Recovery not fully triggered when distributed transaction falls down at prepare phase of 2PC
> --------------------------------------------------------------------------------------------
>
>                 Key: WFLY-88
>                 URL: https://issues.jboss.org/browse/WFLY-88
>             Project: WildFly
>          Issue Type: Bug
>      Security Level: Public(Everyone can see) 
>          Components: EJB, Remoting
>            Reporter: Ivo Studensky
>            Assignee: jaikiran pai
>             Fix For: 8.0.0.Alpha1
>
>         Attachments: logs_prepareHaltClient.tgz
>
>
> It looks like the recovery process is not fully triggered for a distributed transaction when the transaction fails in the prepare phase of 2PC. In the new crash recovery tests over propagated transactions, only one of the two servers recovers from the crash, while the other keeps an unfinished tx in its tx log.
> This corresponds to the prepareHaltClient and prepareHaltServer test methods of org.jboss.as.test.jbossts.crashrec.txpropagation.TxPropagationCrashRecoveryTestCase; see JBQA-2604 for a general description of the new tests. The prepareHaltClient test crashes the server that initiated the transaction, whereas the prepareHaltServer test crashes the second server.
> The tests are written against the EAP6.x branch, so reproducing this requires a server built from the 7.1 branch of AS7.
> Steps to reproduce:
> 1. git clone -b as7 git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-transactions.git
> 2. cd eap-tests-transactions
> 3. git checkout tx_propag_crashrec_tests
> 4a. mvn clean verify -Dtest=TxPropagationCrashRecoveryTestCase#prepareHaltClient -Djboss.dist=<path to jboss-as-7.1.3.Final-SNAPSHOT>
> or
> 4b. mvn clean verify -Dtest=TxPropagationCrashRecoveryTestCase#prepareHaltServer -Djboss.dist=<path to jboss-as-7.1.3.Final-SNAPSHOT>
> The logs of the prepareHaltClient run are attached to this jira.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira