[JBoss JIRA] (JBTM-3297) Move LRA failure records to another part of the store
by Michael Musgrove (Jira)
[ https://issues.redhat.com/browse/JBTM-3297?page=com.atlassian.jira.plugin... ]
Michael Musgrove edited comment on JBTM-3297 at 7/15/20 7:42 AM:
-----------------------------------------------------------------
Here is some more detail about what this task entails:
After a transaction has prepared, we create a transaction record so that we can recover it if a failure occurs. Processing failures is the responsibility of the recovery system [1]. We have a number of recovery modules [2], each responsible for a different transaction log type. The different log types are stored in specific locations in the object store (for example, LRAs are stored at [7]). A recovery module manages records of a particular type. The recovery manager periodically asks each recovery module to check for any logs in need of recovery [3].
The recovery module responsible for completing Long Running Actions (LRAs) is [4]. When it runs, it looks for records in the store of a specific type/location [5, 7]. But an LRA can finish in a failure state, in which case it can never be recovered. However, logs in the failed state are still processed on every recovery pass (every few minutes) because they are of the same type as LRA records that do need recovery. This is inefficient (there will never be a recovery attempt on a failed LRA) and, if there are many failures, will eventually degrade performance significantly.
However, we still need to keep the logs for failed LRAs for reporting purposes so that an external management system can inspect them, for example via [6].
This task is to move these failed records to a different location [8] so that the LRA recovery module will ignore them. The LRA recovery coordinator [6] will need to look in this new location when reporting or deleting failed logs. An example of where we already perform this kind of operation is ExpiredTransactionScanner#moveEntry [9].
[1] ArjunaCore/arjuna/classes/com/arjuna/ats/arjuna/recovery/RecoveryManager.java
[2] ArjunaCore/arjuna/classes/com/arjuna/ats/arjuna/recovery/RecoveryModule.java
[3] ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java
[4] rts/lra/lra-coordinator-jar/src/main/java/io/narayana/lra/coordinator/internal/LRARecoveryModule.java
[5] look for LRARecoveryModule#_transactionType (= io.narayana.lra.coordinator.domain.model.Transaction.getType())
[6] rts/lra/lra-coordinator-jar/src/main/java/io/narayana/lra/coordinator/api/RecoveryCoordinator.java (see method getFailedLRAs())
[7] rts/lra/lra-coordinator-jar/src/main/java/io/narayana/lra/coordinator/domain/model/Transaction.java (LRA_TYPE = "/StateManager/BasicAction/TwoPhaseCoordinator/LRA")
[8] something similar to LRA_FAILED_TYPE = LRA_TYPE + "/Failed" (i.e. "/StateManager/BasicAction/TwoPhaseCoordinator/LRA/Failed")
[9] ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/ExpiredTransactionScanner.java
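
To illustrate the idea, here is a minimal sketch of moving a failed record to a sub-type so that a scan of the LRA type no longer sees it. It uses a hypothetical in-memory stand-in for the object store, not the Narayana RecoveryStore API; the class name FailedLraStoreSketch and its methods write, idsOfType and moveEntry are illustrative assumptions. Only the type strings come from [7] and [8]. The sketch also assumes that scanning one type does not descend into its sub-types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the object store; NOT the Narayana API. */
class FailedLraStoreSketch {
    // Type names as given in references [7] and [8].
    static final String LRA_TYPE = "/StateManager/BasicAction/TwoPhaseCoordinator/LRA";
    static final String LRA_FAILED_TYPE = LRA_TYPE + "/Failed";

    // type name -> (record id -> serialized record state)
    private final Map<String, Map<String, byte[]>> recordsByType = new LinkedHashMap<>();

    /** Stores (or overwrites) a record under the given type. */
    void write(String type, String id, byte[] state) {
        recordsByType.computeIfAbsent(type, t -> new LinkedHashMap<>()).put(id, state);
    }

    /** Ids stored under exactly this type; a recovery pass would iterate over these. */
    List<String> idsOfType(String type) {
        Map<String, byte[]> records =
                recordsByType.getOrDefault(type, Collections.<String, byte[]>emptyMap());
        return new ArrayList<>(records.keySet());
    }

    /**
     * Moves a record to another type. The copy is written first and the old
     * entry removed afterwards, so an interruption between the two steps
     * leaves a duplicate rather than a lost record.
     */
    boolean moveEntry(String id, String fromType, String toType) {
        Map<String, byte[]> from = recordsByType.get(fromType);
        if (from == null || !from.containsKey(id)) {
            return false; // nothing to move
        }
        write(toType, id, from.get(id));
        from.remove(id);
        return true;
    }
}
```

With this shape, a recovery pass that calls idsOfType(LRA_TYPE) skips anything moved to LRA_FAILED_TYPE, while reporting code can still list and delete the failed records by querying the new type.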
was (Author: mmusgrov):
Here is some more detail about what this task entails:
After a transaction has prepared, we create a transaction record so that we can recover it if a failure occurs. Processing failures is the responsibility of the recovery system [1]. We have a number of recovery modules [2], each responsible for a different transaction log type. The different log types are stored in specific locations in the object store (for example, LRAs are stored at [7]). A recovery module manages records of a particular type. The recovery manager periodically asks each recovery module to check for any logs in need of recovery [3].
The recovery module responsible for completing Long Running Actions (LRAs) is [4]. When it runs, it looks for records in the store of a specific type/location [5, 7]. But an LRA can finish in a failure state, in which case it can never be recovered. However, logs in the failed state are still processed on every recovery pass (every few minutes) because they are of the same type as LRA records that do need recovery. This is inefficient (there will never be a recovery attempt on a failed LRA) and, if there are many failures, will eventually degrade performance significantly.
However, we still need to keep the logs for failed LRAs for reporting purposes so that an external management system can inspect them, for example via [6].
This task is to move these failed records to a different location [8] so that the LRA recovery module will ignore them. The LRA recovery coordinator [6] will need to look in this new location when reporting or deleting failed logs. An example of where we already perform this kind of operation is ExpiredTransactionScanner#moveEntry [9].
[1] ArjunaCore/arjuna/classes/com/arjuna/ats/arjuna/recovery/RecoveryManager.java
[2] ArjunaCore/arjuna/classes/com/arjuna/ats/arjuna/recovery/RecoveryModule.java
[3] ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java
[4] rts/lra/lra-coordinator-jar/src/main/java/io/narayana/lra/coordinator/internal/LRARecoveryModule.java
[5] look for LRARecoveryModule#_transactionType (= io.narayana.lra.coordinator.domain.model.Transaction.getType();)
[6] rts/lra/lra-coordinator-jar/src/main/java/io/narayana/lra/coordinator/api/RecoveryCoordinator.java (see method getFailedLRAs())
[7] rts/lra/lra-coordinator-jar/src/main/java/io/narayana/lra/coordinator/domain/model/Transaction.java (LRA_TYPE = "/StateManager/BasicAction/TwoPhaseCoordinator/LRA")
[8] something similar to LRA_FAILED_TYPE = LRA_TYPE + "/Expired" (i.e. "/StateManager/BasicAction/TwoPhaseCoordinator/LRA/Expired")
[9] ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/ExpiredTransactionScanner.java
> Move LRA failure records to another part of the store
> -----------------------------------------------------
>
> Key: JBTM-3297
> URL: https://issues.redhat.com/browse/JBTM-3297
> Project: JBoss Transaction Manager
> Issue Type: Enhancement
> Components: LRA
> Affects Versions: 5.10.4.Final
> Reporter: Michael Musgrove
> Assignee: Mayank Kunwar
> Priority: Optional
>
> Prior to JBTM-3247 we deleted LRA failure records (after reporting them). With the fix for JBTM-3247 we now retain failure records which can impact processing of the transaction logs. These records should be moved to another part of the store (note that they can still be queried and deleted by the user).
> I have marked the priority as optional even though it is highly desirable.