Hello Michael and Greg,
I have pulled the drools head, made the patch that Greg suggested (thanks Greg!) and
deployed the drools-core jar with my app. Prior to this change, I was able to reproduce
the deadlock - verified in the debugger and in exactly the same place as my earlier post -
roughly 50% of the time. I have tried the same test scenario, now, 10 times with no
failures.
From what I can tell, the deadlock is easy to trigger whenever the
'delayed execution' code created by the rule with the 'duration()'
qualifier is already running and, at the same time, I get a new 'fact'
and attempt to insert it. Knowing this, I can probably come up with a
simpler test case. I'll give it a shot.
Thanks, again, to you both for the quick responses.
Edward
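
The interleaving described above matches the two stacks quoted further down in this thread: one thread holds the actionQueue monitor and wants the session lock, while the timer thread holds the session lock and wants the actionQueue monitor. A minimal sketch of that lock-order inversion follows. It is not Drools code: the class name, field names and the wouldDeadlock helper are hypothetical stand-ins, and it assumes Java 8 or later for the lambda. The application side uses a timed tryLock so the sketch reports the inversion instead of hanging.

```java
import java.util.LinkedList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderInversionSketch {
    // Hypothetical stand-ins for the two resources named in the report.
    private final LinkedList<Runnable> actionQueue = new LinkedList<>();
    private final ReentrantLock sessionLock = new ReentrantLock();

    /** Returns true if the application side could not get the session lock in time. */
    public boolean wouldDeadlock(long timeoutMillis) {
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Plays the timer thread: takes the session lock, then wants the queue monitor.
        Thread timerThread = new Thread(() -> {
            sessionLock.lock();
            try {
                bothHoldFirstLock.countDown();
                await(bothHoldFirstLock);
                synchronized (actionQueue) { // blocks while the caller holds the monitor
                    actionQueue.clear();
                }
            } finally {
                sessionLock.unlock();
            }
        }, "timer-thread");
        timerThread.setDaemon(true);
        timerThread.start();

        boolean acquired = false;
        try {
            // Plays the application thread: takes the queue monitor, then wants the session lock.
            synchronized (actionQueue) {
                bothHoldFirstLock.countDown();
                await(bothHoldFirstLock);
                // A plain lock() here would wait forever; the timeout lets the sketch finish.
                acquired = sessionLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
                if (acquired) {
                    sessionLock.unlock();
                }
            }
            timerThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the sketch", e);
        }
        return !acquired;
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Calling new LockOrderInversionSketch().wouldDeadlock(200) forces both threads to hold their first lock before either asks for the second, so the inversion shows up on every run rather than roughly half the time.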
-----Original Message-----
From: rules-dev-bounces(a)lists.jboss.org [mailto:rules-dev-bounces@lists.jboss.org] On
Behalf Of Michael Neale
Sent: Wednesday, November 04, 2009 2:26 AM
To: Rules Dev List
Subject: Re: [rules-dev] Deadlock in the Drools core - Drools 5.0 - any suggestions for
resolution?
ha - was just musing with someone the other day about who uses "duration"
anymore ;) I guess it's still useful to people!
I would say that the "duration" codebase is probably fairly "old" - in
the sense that it probably pre-dates the availability of
j.u.concurrent (which was java 5 I think?) - so please try out that
patch; if it works, we can probably pull it in (hoping Edson can take
a look).
On Wed, Nov 4, 2009 at 4:43 PM, Greg Barton <greg_barton(a)yahoo.com> wrote:
Well, I'm not sure how to avoid the deadlock without changing the
drools codebase. I was, however, able to change the type of
AbstractWorkingMemory.actionQueue to java.util.concurrent.ConcurrentLinkedQueue and remove
the synchronization over the queue with no apparent ill effects. (Two tests failed for
drools-core, but they failed whether the change was made or not.) Also I don't like
the fact that the current code synchronizes on actionQueue, but then exposes it outside
the class through the getActionQueue() method, where access can be unsynchronized.
Changing it to ConcurrentLinkedQueue makes it safe to expose externally. (Not to mention
that the lock can be stolen externally with the current code.)
Diff attached. If you can run drools compiled from trunk, apply the diff and see if it
resolves the deadlock. If it does, it's up to the drools devs as to whether the change
should be made. I'm just hacking about. :P
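
For readers without the attachment, the idea in the message above (a ConcurrentLinkedQueue in place of a LinkedList guarded by synchronized) can be sketched roughly as follows. This is only an illustration under assumptions, not the attached diff and not the actual AbstractWorkingMemory code: the class ActionQueueSketch and its use of Runnable as the action type are hypothetical, and only the names actionQueue, executeQueuedActions and getActionQueue are taken from the thread.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ActionQueueSketch {
    // Thread-safe without a monitor, so there is no queue lock to hold
    // while waiting for some other lock.
    private final Queue<Runnable> actionQueue = new ConcurrentLinkedQueue<>();

    /** Enqueue an action; safe to call from any thread. */
    public void queueAction(Runnable action) {
        actionQueue.add(action);
    }

    /** Drain the queue, including actions enqueued while draining. Returns the count run. */
    public int executeQueuedActions() {
        int executed = 0;
        Runnable action;
        // poll() removes and returns the head atomically, or null when empty.
        while ((action = actionQueue.poll()) != null) {
            action.run();
            executed++;
        }
        return executed;
    }

    /** Exposing the queue no longer hands callers a lock they can hold. */
    public Queue<Runnable> getActionQueue() {
        return actionQueue;
    }
}
```

One design point to keep in mind: without the monitor, two threads may drain the queue at the same time, so each action still runs once but the relative order across threads is not guaranteed. Whether that matters for the real action queue is for the drools devs to judge.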
--- On Tue, 11/3/09, Edward Archibald <edward.archibald(a)continuent.com> wrote:
> From: Edward Archibald <edward.archibald(a)continuent.com>
> Subject: [rules-dev] Deadlock in the Drools core - Drools 5.0 - any suggestions for
> resolution?
> To: "rules-dev(a)lists.jboss.org" <rules-dev(a)lists.jboss.org>
> Date: Tuesday, November 3, 2009, 9:41 PM
>
> I found the following deadlock which is, apparently, due to
> the concurrent execution
> of a task for a 'delayed' rule with a concurrently
> executing application thread attempting to get access to a
> 'global'. Any recommendations for avoiding this type
> of deadlock besides not using rules with 'duration()' etc.
> which cause asynchronous execution with respect to my main
> application thread?
>
> This problem is somewhat difficult to reproduce on demand
> but it does come up frequently when the 'delayed' rule
> "DETECT MONITORING HAS STOPPED" is activated as a result of
> the trigger conditions.
>
> ===================================================================================
>
> This thread, my application's EnterprisePolicyManager
> thread, is attempting to get access to a global, policyMgr,
> and is waiting for
> the 'lock.lock' on ReteooStatefulSession
>
> It owns the 'ReteooStatefulSession.actionQueue'
> and is waiting for the ReteooStatefulSession.lock.lock
>
> owns: java.util.LinkedList<E> (id=207)
> waited by: Thread [pool-3-thread-1] (Suspended)
> owns: com.continuent.tungsten.cluster.manager.rules.engine.RulesEngine (id=208)
> sun.misc.Unsafe.park(boolean, long) line: not available [native method] [local variables unavailable]
> java.util.concurrent.locks.LockSupport.park() line: 118 [local variables unavailable]
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() line: 681 [local variables unavailable]
> java.util.concurrent.locks.ReentrantLock$NonfairSync(java.util.concurrent.locks.AbstractQueuedSynchronizer).acquireQueued(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node, int) line: 711
> java.util.concurrent.locks.ReentrantLock$NonfairSync(java.util.concurrent.locks.AbstractQueuedSynchronizer).acquire(int) line: 1041
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock() line: 184 [local variables unavailable]
> java.util.concurrent.locks.ReentrantLock.lock() line: 256 [local variables unavailable]
> org.drools.reteoo.ReteooStatefulSession(org.drools.common.AbstractWorkingMemory).getGlobal(java.lang.String) line: 587
> com.continuent.tungsten.cluster.manager.policy.Rule_IF_IN_MAINTENANCE_MODE__CONSUME_ALL_NOTIFICATIONS_0Eval0Invoker.evaluate(org.drools.spi.Tuple, org.drools.rule.Declaration[], org.drools.WorkingMemory, java.lang.Object) line: not available
> org.drools.rule.EvalCondition.isAllowed(org.drools.spi.Tuple, org.drools.WorkingMemory, java.lang.Object) line: 117
> org.drools.reteoo.EvalConditionNode.assertLeftTuple(org.drools.reteoo.LeftTuple, org.drools.spi.PropagationContext, org.drools.common.InternalWorkingMemory) line: 180
> org.drools.reteoo.SingleLeftTupleSinkAdapter.doPropagateAssertLeftTuple(org.drools.spi.PropagationContext, org.drools.common.InternalWorkingMemory, org.drools.reteoo.LeftTuple) line: 117
> org.drools.reteoo.SingleLeftTupleSinkAdapter.propagateAssertLeftTuple(org.drools.reteoo.LeftTuple, org.drools.reteoo.RightTuple, org.drools.spi.PropagationContext, org.drools.common.InternalWorkingMemory, boolean) line: 28
> org.drools.reteoo.JoinNode.assertObject(org.drools.common.InternalFactHandle, org.drools.spi.PropagationContext, org.drools.common.InternalWorkingMemory) line: 175
> org.drools.reteoo.SingleObjectSinkAdapter.propagateAssertObject(org.drools.common.InternalFactHandle, org.drools.spi.PropagationContext, org.drools.common.InternalWorkingMemory) line: 42
> org.drools.reteoo.PropagationQueuingNode$AssertAction.execute(org.drools.reteoo.ObjectSinkPropagator, org.drools.common.InternalWorkingMemory) line: 326
> org.drools.reteoo.PropagationQueuingNode.propagateActions(org.drools.common.InternalWorkingMemory) line: 221
> org.drools.reteoo.PropagationQueuingNode$PropagateAction.execute(org.drools.common.InternalWorkingMemory) line: 394
> org.drools.reteoo.ReteooStatefulSession(org.drools.common.AbstractWorkingMemory).executeQueuedActions() line: 1486
> org.drools.common.NamedEntryPoint.insert(org.drools.common.InternalFactHandle, java.lang.Object, org.drools.rule.Rule, org.drools.spi.Activation) line: 158
> org.drools.common.NamedEntryPoint.insert(java.lang.Object, boolean, boolean, org.drools.rule.Rule, org.drools.spi.Activation) line: 122
> org.drools.common.NamedEntryPoint.insert(java.lang.Object) line: 80
> com.continuent.tungsten.cluster.manager.rules.engine.RulesEngine.insertFact(com.continuent.tungsten.commons.cluster.resource.notification.NotificationStreamID, java.lang.Object, boolean) line: 162
> com.continuent.tungsten.cluster.manager.policy.EnterprisePolicyManager.run() line: 249
> java.lang.Thread.run() line: 595
>
> The rule implicated in the above thread is:
>
> rule "IF IN MAINTENANCE MODE, CONSUME ALL NOTIFICATIONS"
>     salience 999
>     when
>         notification : ClusterResourceNotification()
>             from entry-point "MONITORING"
>         eval(policyMgr.getMode() == ClusterPolicyManagerMode.MAINTENANCE)
>     then
>         statistics.increment("IF IN MAINTENANCE MODE, CONSUME ALL NOTIFICATIONS");
>         retract(notification);
> end
>
>
>
> This other thread, apparently a scheduled thread for a rule
> with a 10 second duration,
> is attempting to insert a fact and owns the 'lock.lock' on
> ReteooStatefulSession and
> is waiting for the 'ReteooStatefulSession.actionQueue'.
>
> owns: org.drools.common.DefaultAgenda (id=4046)
> waiting for: java.util.LinkedList<E> (id=207)
> org.drools.reteoo.ReteooStatefulSession(org.drools.common.AbstractWorkingMemory).executeQueuedActions() line: 1480
> org.drools.reteoo.ReteooStatefulSession(org.drools.common.AbstractWorkingMemory).insert(org.drools.common.InternalFactHandle, java.lang.Object, org.drools.rule.Rule, org.drools.spi.Activation, org.drools.reteoo.ObjectTypeConf) line: 1051
> org.drools.reteoo.ReteooStatefulSession(org.drools.common.AbstractWorkingMemory).insert(java.lang.Object, boolean, boolean, org.drools.rule.Rule, org.drools.spi.Activation) line: 1001
> org.drools.base.DefaultKnowledgeHelper.insert(java.lang.Object, boolean) line: 114
> org.drools.base.DefaultKnowledgeHelper.insert(java.lang.Object) line: 108
> com.continuent.tungsten.cluster.manager.policy.Rule_DETECT_MONITORING_HAS_STOPPED_0.consequence(org.drools.spi.KnowledgeHelper, com.continuent.tungsten.commons.cluster.resource.notification.DataServerNotification, org.drools.FactHandle, java.lang.String, org.drools.FactHandle, com.continuent.tungsten.cluster.manager.policy.EnterprisePolicyManager, org.apache.log4j.Logger) line: not available
> com.continuent.tungsten.cluster.manager.policy.Rule_DETECT_MONITORING_HAS_STOPPED_0ConsequenceInvoker.evaluate(org.drools.spi.KnowledgeHelper, org.drools.WorkingMemory) line: not available
> org.drools.common.DefaultAgenda.fireActivation(org.drools.spi.Activation) line: 934
> org.drools.common.Scheduler$DuractionJob.execute(org.drools.time.JobContext) line: 70
> org.drools.time.impl.JDKTimerService$JDKCallableJob.call() line: 132
> org.drools.time.impl.JDKTimerService$JDKCallableJob.call() line: 110
> java.util.concurrent.FutureTask$Sync.innerRun() line: 269
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask<V>(java.util.concurrent.FutureTask<V>).run() line: 123
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask<V>.access$301(java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask) line: 65
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask<V>.run() line: 168
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(java.lang.Runnable) line: 650
> java.util.concurrent.ThreadPoolExecutor$Worker.run() line: 675
> java.lang.Thread.run() line: 595
>
> The rule for this task looks like:
> rule "DETECT MONITORING HAS STOPPED"
>     duration(10s)
>     salience 1000
>     when
>         lastNotification : DataServerNotification($resourceName : resourceName)
>             from entry-point "MONITORING"
>         not (DataServerNotification(resourceName == $resourceName,
>                                     this after [10s] lastNotification)
>             from entry-point "MONITORING")
>         not (ManagerFailedAlarm(expired == false,
>                                 resourceName == $resourceName))
>         not (DataSource(name == $resourceName,
>                         state == ResourceState.SHUNNED ||
>                         state == ResourceState.FAILED))
>     then
>         Object[] params = {$resourceName};
>         if (policyMgr.getMode() != ClusterPolicyManagerMode.MAINTENANCE)
>         {
>             lastNotification.setResourceState(ResourceState.UNKNOWN);
>             ManagerFailedAlarm alarm =
>                 new ManagerFailedAlarm(lastNotification, "rule detected monitor stop",
>                                        6, AlarmSeverity.FAULT);
>             logger.info(alarm.toString());
>             insert(alarm);
>             update(lastNotification);
>         }
> end
>
> _______________________________________________
> rules-dev mailing list
> rules-dev(a)lists.jboss.org
>
https://lists.jboss.org/mailman/listinfo/rules-dev
>