[jbossts-issues] [JBoss JIRA] (JBTM-3015) Tomcat JTA: Race condition in tx recovery makes PeriodicRecovery run forever

Michal Karm Babacek (JIRA) issues at jboss.org
Tue Apr 24 11:00:00 EDT 2018


    [ https://issues.jboss.org/browse/JBTM-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566261#comment-13566261 ] 

Michal Karm Babacek commented on JBTM-3015:
-------------------------------------------

[~tomjenkinson], I have it looping in debugger right now and it looks like [this call has stopped returning|https://github.com/apache/tomcat/blob/TOMCAT_9_0_7/java/org/apache/tomcat/dbcp/dbcp2/managed/ManagedDataSource.java#L81] and it hangs. Pertinent threads with relevant sources:
{quote}
"http-nio-8080-exec-1 at 2786" daemon prio=5 tid=0x19 nid=NA waiting
  java.lang.Thread.State: WAITING
        at sun.misc.Unsafe.park(Unsafe.java:-1)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at org.apache.tomcat.dbcp.pool2.impl.LinkedBlockingDeque.takeFirst([LinkedBlockingDeque.java:590|https://github.com/apache/tomcat/blob/TOMCAT_9_0_7/java/org/apache/tomcat/dbcp/pool2/impl/LinkedBlockingDeque.java#L590])
        at org.apache.tomcat.dbcp.pool2.impl.GenericObjectPool.borrowObject([GenericObjectPool.java:440|https://github.com/apache/tomcat/blob/TOMCAT_9_0_7/java/org/apache/tomcat/dbcp/pool2/impl/GenericObjectPool.java#L440])
        at org.apache.tomcat.dbcp.pool2.impl.GenericObjectPool.borrowObject([GenericObjectPool.java:361|https://github.com/apache/tomcat/blob/TOMCAT_9_0_7/java/org/apache/tomcat/dbcp/pool2/impl/GenericObjectPool.java#L361])
        at org.apache.tomcat.dbcp.dbcp2.managed.ManagedConnection.updateTransactionStatus([ManagedConnection.java:127|https://github.com/apache/tomcat/blob/TOMCAT_9_0_7/java/org/apache/tomcat/dbcp/dbcp2/managed/ManagedConnection.java#L127])
        at org.apache.tomcat.dbcp.dbcp2.managed.ManagedConnection.<init>([ManagedConnection.java:59|https://github.com/apache/tomcat/blob/TOMCAT_9_0_7/java/org/apache/tomcat/dbcp/dbcp2/managed/ManagedConnection.java#L59])
        at org.apache.tomcat.dbcp.dbcp2.managed.ManagedDataSource.getConnection([ManagedDataSource.java:81|https://github.com/apache/tomcat/blob/TOMCAT_9_0_7/java/org/apache/tomcat/dbcp/dbcp2/managed/ManagedDataSource.java#L81])
        at org.jboss.narayana.tomcat.jta.integration.app.TestExecutor.doesEntryExist([TestExecutor.java:147|https://github.com/jbosstm/narayana/blob/master/tomcat/tomcat-jta/src/test/java/org/jboss/narayana/tomcat/jta/integration/app/TestExecutor.java#L147])
        at org.jboss.narayana.tomcat.jta.integration.app.TestExecutor.didRecoveryHappen([TestExecutor.java:139|https://github.com/jbosstm/narayana/blob/master/tomcat/tomcat-jta/src/test/java/org/jboss/narayana/tomcat/jta/integration/app/TestExecutor.java#L139])
        at org.jboss.narayana.tomcat.jta.integration.app.TestExecutor.getRecoveryTestResponse([TestExecutor.java:127|https://github.com/jbosstm/narayana/blob/master/tomcat/tomcat-jta/src/test/java/org/jboss/narayana/tomcat/jta/integration/app/TestExecutor.java#L127])
        at org.jboss.narayana.tomcat.jta.integration.app.TestExecutor.verifyRecovery([TestExecutor.java:115|https://github.com/jbosstm/narayana/blob/master/tomcat/tomcat-jta/src/test/java/org/jboss/narayana/tomcat/jta/integration/app/TestExecutor.java#L115])
        <SNIP>

"Transaction Reaper Worker 0 at 3278" daemon prio=5 tid=0x3b nid=NA waiting
  java.lang.Thread.State: WAITING
        at java.lang.Object.wait(Object.java:-1)
        at java.lang.Object.wait(Object.java:502)
        at com.arjuna.ats.arjuna.coordinator.TransactionReaper.waitForCancellations([TransactionReaper.java:328|https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/arjuna/coordinator/TransactionReaper.java#L328])
        at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run([ReaperWorkerThread.java:65|https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/coordinator/ReaperWorkerThread.java#L65])

"Periodic Recovery at 3254" daemon prio=5 tid=0x36 nid=NA waiting
  java.lang.Thread.State: WAITING
        at java.lang.Object.wait(Object.java:-1)
        at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doBackoffWait([PeriodicRecovery.java:663|https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java#L633])
        at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal([PeriodicRecovery.java:781|https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java#L781])
        at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run([PeriodicRecovery.java:377|https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java#L377])

"Transaction Reaper at 3277" daemon prio=5 tid=0x3a nid=NA waiting
  java.lang.Thread.State: WAITING
        at java.lang.Object.wait(Object.java:-1)
        at com.arjuna.ats.internal.arjuna.coordinator.ReaperThread.run([ReaperThread.java:90|https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/coordinator/ReaperThread.java#L90])

"Transaction Expired Entry Monitor at 3256" daemon prio=5 tid=0x37 nid=NA waiting
  java.lang.Thread.State: WAITING
        at java.lang.Object.wait(Object.java:-1)
        at com.arjuna.ats.internal.arjuna.recovery.ExpiredEntryMonitor.run([ExpiredEntryMonitor.java:190|https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/ExpiredEntryMonitor.java#L190])
{quote}

The maximum number of connections is set:
 * [here for the DBCP pool|https://github.com/jbosstm/narayana/blob/master/tomcat/tomcat-jta/src/test/java/org/jboss/narayana/tomcat/jta/integration/AbstractCase.java#L171]
 * and [here for the database itself|https://github.com/jbosstm/narayana/blob/master/tools/src/main/java/io/narayana/db/PostgreContainerAllocator.java#L166] (a sufficiently bigger number I would say...)

I am definitely fine with editing the test case, but I would like to understand whether the connection not being returned is really the case - i.e. if it is really an expected behaviour given the byteman script and the nature of the test.

Is it so...?

> Tomcat JTA: Race condition in tx recovery makes PeriodicRecovery run forever
> ----------------------------------------------------------------------------
>
>                 Key: JBTM-3015
>                 URL: https://issues.jboss.org/browse/JBTM-3015
>             Project: JBoss Transaction Manager
>          Issue Type: Bug
>          Components: Tomcat
>    Affects Versions: 5.next
>         Environment: Oracle JDK 8, Linux, 8 CPU system
>            Reporter: Michal Karm Babacek
>            Assignee: Amos Feng
>            Priority: Critical
>         Attachments: good_threaddumps.zip, postgresql_trace_log.zip, test.war, thread_dumps_in_sequence_over_time.zip, tomcat_logs.zip
>
>
> There is a simple [recovery test in the test suite|https://github.com/jbosstm/narayana/blob/master/tomcat/tomcat-jta/src/test/java/org/jboss/narayana/tomcat/jta/integration/BaseITCase.java#L141] that runs 3 times in a row. If you change that number to 4, there is a good change you will get the TS stuck forever. When I attached debugger and used breakpoints on the iteration, e.g. {{i==3}} or {{i==10}}, the error was not shown, the TS run and completed, so it is clearly a time related race condition.
> *EDIT:* When I use 30 iterations and a conditional breakpoint on {{i -> test(EXECUTOR_URL + RECOVERY_TEST)}} set to {{i == 28}} the execution loops forever even with debugger attached.
> In this JIRA's example, to reproduce, simply:
> {noformat}
> -        IntStream.range(0, 3).forEach(i -> test(EXECUTOR_URL + RECOVERY_TEST));
> +        IntStream.range(0, 10).forEach(i -> test(EXECUTOR_URL + RECOVERY_TEST));
> {noformat}
> When the Tomcat gets into this unfortunate state and TS hangs and Tomcat runs forever, this is what keeps showing in the database trace (and nothing else):
> {noformat}
> ...
> 2018-04-24 12:37:50.406 UTC transaction_id: 0 LOG:  execute <unnamed>: SELECT gid FROM pg_prepared_xacts where database = current_database()
> 2018-04-24 12:37:54.410 UTC transaction_id: 0 LOG:  execute <unnamed>: SELECT gid FROM pg_prepared_xacts where database = current_database()
> 2018-04-24 12:37:58.415 UTC transaction_id: 0 LOG:  execute <unnamed>: SELECT gid FROM pg_prepared_xacts where database = current_database()
> 2018-04-24 12:38:02.419 UTC transaction_id: 0 LOG:  execute <unnamed>: SELECT gid FROM pg_prepared_xacts where database = current_database()
> 2018-04-24 12:38:06.421 UTC transaction_id: 0 LOG:  execute <unnamed>: SELECT gid FROM pg_prepared_xacts where database = current_database()
> 2018-04-24 12:38:10.423 UTC transaction_id: 0 LOG:  execute <unnamed>: SELECT gid FROM pg_prepared_xacts where database = current_database()
> 2018-04-24 12:38:14.426 UTC transaction_id: 0 LOG:  execute <unnamed>: SELECT gid FROM pg_prepared_xacts where database = current_database()
> 2018-04-24 12:38:18.429 UTC transaction_id: 0 LOG:  execute <unnamed>: SELECT gid FROM pg_prepared_xacts where database = current_database()
> ...
> {noformat}
> I found this Atomikos related reference [online|https://fogbugz.atomikos.com/default13dd.html?community.6.3576.0].
> Tomcat log keeps showing again and again:
> {noformat}
> 24-Apr-2018 14:41:02.537 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run PeriodicRecovery: background thread Status <== SCANNING
> 24-Apr-2018 14:41:02.537 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run PeriodicRecovery: background thread scanning
> 24-Apr-2018 14:41:02.537 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal Periodic recovery first pass at Tue, 24 Apr 2018 14:41:02
> 24-Apr-2018 14:41:02.537 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.AtomicActionRecoveryModule.periodicWorkFirstPass AtomicActionRecoveryModule first pass
> 24-Apr-2018 14:41:02.537 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.AtomicActionRecoveryModule.processTransactions processing /StateManager/BasicAction/TwoPhaseCoordinator/AtomicAction transactions
> 24-Apr-2018 14:41:02.537 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal  
> 24-Apr-2018 14:41:02.537 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.setScanState XARecoveryModule state change IDLE->FIRST_PASS
> 24-Apr-2018 14:41:02.538 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkFirstPass Local XARecoveryModule - first pass
> 24-Apr-2018 14:41:02.538 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoveryFirstPass xarecovery of org.postgresql.xa.PGXAConnection at 2c8cb5bd
> 24-Apr-2018 14:41:02.538 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoveryFirstPass Found 0 xids in doubt
> 24-Apr-2018 14:41:02.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoveryFirstPass xarecovery of org.jboss.narayana.tomcat.jta.integration.app.TestXAResource at 4094781e
> 24-Apr-2018 14:41:02.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoveryFirstPass Found 0 xids in doubt
> 24-Apr-2018 14:41:02.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.setScanState XARecoveryModule state change FIRST_PASS->BETWEEN_PASSES
> 24-Apr-2018 14:41:02.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal  
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal Periodic recovery second pass at Tue, 24 Apr 2018 14:41:04
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.AtomicActionRecoveryModule.periodicWorkSecondPass AtomicActionRecoveryModule second pass
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal  
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.setScanState XARecoveryModule state change BETWEEN_PASSES->SECOND_PASS
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkSecondPass Local XARecoveryModule - second pass
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkSecondPass Local XARecoveryModule.transactionInitiatedRecovery completed
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoverySecondPass xarecovery second pass of org.postgresql.xa.PGXAConnection at 2c8cb5bd
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoverySecondPass Have 0 Xids to recover on this pass.
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoverySecondPass xarecovery second pass of org.jboss.narayana.tomcat.jta.integration.app.TestXAResource at 4094781e
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoverySecondPass Have 0 Xids to recover on this pass.
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkSecondPass Local XARecoveryModule.resourceInitiatedRecovery completed
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.setScanState XARecoveryModule state change SECOND_PASS->IDLE
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal  
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run PeriodicRecovery: background thread Status <== INACTIVE
> 24-Apr-2018 14:41:04.539 DEBUG [Periodic Recovery] com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run PeriodicRecovery: background thread backing off
> {noformat} 
> h3. Logs and thread dumps
> * [^postgresql_trace_log.zip] -- PostgreSQL trace, note the originally expected course of recoveries and then the repeated stuck on the aforementioned message.
> * [^test.war] -- test app used; the TS builds it for you though and stores it in {{tomcat/tomcat-deployment/_DEFAULT__Basic-app_test.war}}
> * [^thread_dumps_in_sequence_over_time.zip] -- thread dumps taken over time with minutes in between snaps.
> * [^tomcat_logs.zip] -- Tomcat logs
> * [^good_threaddumps.zip] --  a dozen of snapshots couple of seconds from each other as an example of a good TS run that terminates as expected, i.e. with {{IntStream.range(0, 3).forEach(i -> test(EXECUTOR_URL + RECOVERY_TEST));}}.
> * This is what good, i.e. 3 iterations long DB trace looks like: [good db trace|https://paste.fedoraproject.org/paste/drANRRl4DlckwLhSIwCOdA/raw].
> * This is what good, i.e. 3 iterations long Tomcat catalina log looks like:  [catalina log|https://paste.fedoraproject.org/paste/CkaDCZCtoVhDtyTglEOFqg/raw]
> Might interest also [~ochaloup].   WDYT guys? What should I try out?



--
This message was sent by Atlassian JIRA
(v7.5.0#75005)


More information about the jbossts-issues mailing list