[jboss-jira] [JBoss JIRA] (WFLY-8929) Race condition if timers overlap due to long running execution and short schedules if database persistence is used

Tue Jun 13 18:09:00 EDT 2017

    [ https://issues.jboss.org/browse/WFLY-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13420813#comment-13420813 ] 

Stuart Douglas commented on WFLY-8929:
--------------------------------------

Looking at server1.log, this is what is happening (starting at around 2017-06-13 20:01:42,000; the relevant threads are 4 and 5: 4 is about to finish running a task, and 5 is about to attempt to run the new one):

[5] calls Timer.setState(ACTIVE) - the internal timer state is now ACTIVE, but the DB state is still IN_TIMEOUT because the timer has not been persisted yet
[4] attempts to run the timer, sees that the state is ACTIVE, and so calls shouldRun(). shouldRun() fails because the DB state is still IN_TIMEOUT
[4] calculates a new timeout, but does not update the DB
[5] changes the DB state to ACTIVE

This is how the internal timer state and the DB state get out of sync. The solution looks to be fairly simple, and basically involves moving the lock up to the beginning of the method, after the debug statements.
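
To make the interleaving concrete, here is a minimal Java sketch. It is not the WildFly code: the class TimerRaceSketch, its fields, the timeout values 100/200/300 and the methods finishTimeout() and tryRun() are all hypothetical stand-ins, and it assumes the finishing thread persists the next timeout it computed before the other thread advanced the in-memory value. The first method replays the four steps above sequentially; the second runs two threads that each take the lock before they look at the state, which is the idea behind moving the lock up.

```java
import java.util.concurrent.locks.ReentrantLock;

public class TimerRaceSketch {
    public enum State { ACTIVE, IN_TIMEOUT }

    // Hypothetical in-memory view of the timer.
    State memState = State.IN_TIMEOUT;
    long memNextTimeout = 100;

    // Hypothetical stand-in for the persisted row.
    State dbState = State.IN_TIMEOUT;
    long dbNextTimeout = 100;

    final ReentrantLock lock = new ReentrantLock();

    // Stand-in for the DB check: succeeds only if the row is ACTIVE,
    // and marks the row IN_TIMEOUT when it does.
    boolean shouldRun() {
        if (dbState != State.ACTIVE) {
            return false;
        }
        dbState = State.IN_TIMEOUT;
        return true;
    }

    public boolean inSync() {
        return memState == dbState && memNextTimeout == dbNextTimeout;
    }

    // Replays the problematic interleaving step by step, with no locking.
    public static TimerRaceSketch replayUnlockedInterleaving() {
        TimerRaceSketch t = new TimerRaceSketch();
        long computedByFinisher = 200;
        // Finishing thread: in-memory state becomes ACTIVE, DB row not written yet.
        t.memState = State.ACTIVE;
        t.memNextTimeout = computedByFinisher;
        // Running thread: sees ACTIVE in memory, but the DB check fails.
        if (t.memState == State.ACTIVE && !t.shouldRun()) {
            t.memNextTimeout = 300; // new timeout calculated, DB not updated
        }
        // Finishing thread: persists ACTIVE with the timeout it computed earlier.
        t.dbState = State.ACTIVE;
        t.dbNextTimeout = computedByFinisher;
        return t;
    }

    // Finishing side: memory and DB are updated under one lock.
    void finishTimeout(long next) {
        lock.lock();
        try {
            memState = State.ACTIVE;
            memNextTimeout = next;
            dbState = State.ACTIVE;
            dbNextTimeout = next;
        } finally {
            lock.unlock();
        }
    }

    // Running side: the lock is taken before the state is inspected.
    void tryRun() {
        lock.lock();
        try {
            if (memState != State.ACTIVE) {
                return; // previous timeout not finished yet
            }
            if (shouldRun()) {
                memState = State.IN_TIMEOUT; // the timeout callback would run here
            } else {
                memNextTimeout += 100; // skip branch, not reached in this sketch
            }
        } finally {
            lock.unlock();
        }
    }

    // Runs both sides concurrently with the lock in place.
    public static TimerRaceSketch runLocked() {
        TimerRaceSketch t = new TimerRaceSketch();
        Thread finisher = new Thread(() -> t.finishTimeout(200));
        Thread runner = new Thread(t::tryRun);
        finisher.start();
        runner.start();
        try {
            finisher.join();
            runner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.println("unlocked replay in sync: " + replayUnlockedInterleaving().inSync());
        System.out.println("locked run in sync: " + runLocked().inSync());
    }
}
```

In the unlocked replay the in-memory next timeout ends at 300 while the stand-in DB row holds 200, which matches the divergence described in the issue. With the lock held around both sides, whichever thread goes first, the other only ever sees a state where memory and the row agree.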

Can you try https://github.com/wildfly/wildfly/compare/master...stuartwdouglas:WFLY-8929?expand=1 against your test case? I think this should fix it.

> Race condition if timers overlap due to long running execution and short schedules if database persistence is used
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: WFLY-8929
>                 URL: https://issues.jboss.org/browse/WFLY-8929
>             Project: WildFly
>          Issue Type: Bug
>          Components: EJB
>         Environment: Configure DB persistence for timers as file-persistence will not have a persistence check for shouldRun to lock the timer execution.
>            Reporter: Wolf-Dieter Fink
>            Assignee: Stuart Douglas
>         Attachments: server-extract.log, server1.log
>
>
> If timers (here a calendar timer) run longer than their schedule interval, or the scheduling/processing gets stuck due to a thread or CPU bottleneck, it is possible that the updates for persistence overlap.
> The issue seems to be that task (1) is trying to finish the timer while task (2) is about to start but sees the concurrent execution.
> The DB is updated with the 'old' next timeout, but the internal Timer instance is updated with the next possible schedule, due to a race condition between the two threads updating the object.
