[jboss-svn-commits] JBL Code SVN: r8338 - labs/jbossrules/trunk/documentation/manual/en/Chapter-Performance_Tuning
jboss-svn-commits at lists.jboss.org
Thu Dec 14 22:03:31 EST 2006
Author: woolfel
Date: 2006-12-14 22:03:31 -0500 (Thu, 14 Dec 2006)
New Revision: 8338
Modified:
labs/jbossrules/trunk/documentation/manual/en/Chapter-Performance_Tuning/Section-Performance.xml
Log:
I've added a section on large rulesets and some strategies for addressing the challenge.
peter
Modified: labs/jbossrules/trunk/documentation/manual/en/Chapter-Performance_Tuning/Section-Performance.xml
===================================================================
--- labs/jbossrules/trunk/documentation/manual/en/Chapter-Performance_Tuning/Section-Performance.xml 2006-12-15 02:44:35 UTC (rev 8337)
+++ labs/jbossrules/trunk/documentation/manual/en/Chapter-Performance_Tuning/Section-Performance.xml 2006-12-15 03:03:31 UTC (rev 8338)
@@ -1,4 +1,4 @@
-<?xml version="1.0" encoding="UTF-8"?>
+<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE section PUBLIC "-//OASIS//DTD Simplified DocBook XML V1.0//EN"
"http://www.oasis-open.org/docbook/xml/simple/1.0/sdocbook.dtd">
<section>
@@ -159,4 +159,145 @@
<para>Some other improvements are being developed for Drools in this area
and will be documented as they become available in future versions.</para>
</section>
+
+ <section>
+ <title>Large Ruleset</title>
+ <para>For this section, ruleset sizes are defined as follows:</para>
+ <itemizedlist>
+ <listitem>1-500 - small ruleset</listitem>
+ <listitem>500-2,000 - medium ruleset</listitem>
+ <listitem>2,000-10,000 - large ruleset</listitem>
+ <listitem>10,000+ - extremely large ruleset</listitem>
+ </itemizedlist>
+ <para>In some cases a rule engine has to handle 500,000 or even 1 million rules.
+ Those are primarily machine learning and AI systems, where a rule engine produces new
+ rules, terms and facts at execution time. Those topics are beyond the scope of this
+ documentation. The techniques described here focus on business rules.</para>
+ <para>The first thing to do is identify why there are so many rules and whether
+ rewriting the rules can solve the problem. There are a few things to look for.</para>
+ <itemizedlist>
+ <listitem>Do the rules have a lot of constant values hard coded in the conditions?</listitem>
+ <listitem>Is the domain model a huge flat spreadsheet with 100+ columns?</listitem>
+ <listitem>Do most of the rules share the same conditions?</listitem>
+ <listitem>Can the logic be divided into stages?</listitem>
+ </itemizedlist>
+ <para>If you answer yes to any of these four questions, chances are you can solve the issue by
+ changing the rules. Managing 100,000 rules or even 1,000,000 rules is a huge headache, so
+ try to avoid it. Examine the rules and see if they match any of the following scenarios.</para>
+ <programlisting>
+If
+ customer.account == "abcd"
+ customer.type == "basic"
+ .....
+Then
+ // do something
+ </programlisting>
+ <para>The basic problem with the sample rule above is that most of its values are hard
+ coded. If the average customer has 50 rules and there are 40 million customers, the system has
+ 2 billion rules. Let's use a more concrete example to flesh this out.</para>
+ <programlisting>
+If
+ customer.accountId == "peter"
+ customer.type == "level2"
+ customer.favoriteActor == "jackie chan"
+Then
+ recommend movies with jackie chan
+
+If
+ customer.accountId == "peter"
+ customer.type == "level2"
+ customer.favoriteActor == "jet li"
+Then
+ recommend movies with jet li
+ </programlisting>
+ <para>Looking at the example, the first question to ask is "do these kinds of rules apply
+ to all customers?" If they do, the first condition in the rule, "customer.accountId", is
+ pointless. It's pointless because all rules of this type will have that condition.
+ Although the accountId changes, the rule can effectively ignore it. If we rewrite the rules
+ this way, they apply to any customer that likes jackie chan or jet li.</para>
+ <programlisting>
+If
+ customer.type == "level2"
+ customer.favoriteActor == "jackie chan"
+Then
+ recommend movies with jackie chan
+
+If
+ customer.type == "level2"
+ customer.favoriteActor == "jet li"
+Then
+ recommend movies with jet li
+ </programlisting>
+ <para>The reason we do this is straightforward. The rules reason over data. Having a
+ ton of rules with the customer's accountId hard coded doesn't do any good, because we
+ want the rule engine to evaluate only the active sessions. We don't want to load all
+ the customers into the rule engine. We can take it a step further and make the rule more
+ general.</para>
+ <programlisting>
+If
+ customer.type == "level2"
+ customer.accountId ?id // bind the account id to a variable
+ favorites.accountId ?id // find the list of favorites by the account id
+Then
+ recommend all items in the favorites
+ </programlisting>
+ <para>This change can reduce the number of rules significantly. This is one
+ reason the RETE approach is often called a "data driven approach". Let's take this example
+ a bit further and define 10 types of customers from level1 to level10. Say we run a mega
+ online store and customers can define their favorites in each of the categories (books,
+ videos, music, toys, electronics, clothing). What happens if a customer has a different
+ level for each category? Using the hard coded approach, one might have to add more rules.
+ If we change the rule and make it more generalized, the same rule can handle multiple
+ categories.</para>
+ <programlisting>
+If
+ recommendation.level ?lvl // bind the recommendation level to a variable
+ recommendation.category ?rcat // bind the recommendation category
+ customer.accountId ?id // bind the account id to a variable
+ favorites.accountId ?id // find the list of favorites by the account id
+ favorites.category ?rcat // match favorite to recommendation category
+ favorites.level ?lvl // match the favorite level to recommendation level
+Then
+ recommend all items in the favorites
+ </programlisting>
+ <para>So what is the cost of making the rule dynamic and data driven? For a small
+ ruleset, hard coding the constants is generally a little faster than the generalized form,
+ but the performance delta should be small. As the ruleset grows, that slight lead
+ disappears. Why is that? Let's look at 2 different types of rule engines:
+ procedural and RETE.</para>
+ <para>In a procedural engine, one can build a decision tree and end the evaluation once
+ the data fails to satisfy the conditions at a given level. As the rule count increases,
+ there are more rules the engine has to evaluate. In a procedural approach, the rules have
+ to be sequenced in the optimal order to get the best results. The limitation of sorting
+ the rules in optimal sequence is that in many cases it's not possible to pre-sort. If we use
+ a RETE rule engine, the hard coded rules result in fewer joins for a small number of rules.
+ As the rule count grows, the single generalized rule will perform better. The inequality
+ below estimates the threshold where the generalized form is faster than hard coding the constants.</para>
+ <para>bn = join nodes, lf = left facts, rf = right facts, ae = average number of
+ evaluations descending from the object type node for a random sample, f = facts,
+ hd = hard coded constants in the rules, general = generalized form using joins</para>
+ <para>general( sum( bn(lf * rf) ) + sum(ae * f) ) &lt; hd( sum( bn(lf * rf) ) + sum(ae * f) )</para>
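+ <para>As a rough illustration of how the two sides of the inequality compare, the
+ figures below plug in made-up numbers. All of the values are assumptions chosen for
+ the arithmetic only and are not measurements from Drools: a generalized ruleset with a
+ single join node, and a hard coded ruleset with 10,000 rules, no joins and no sharing of
+ the constant tests.</para>
+ <programlisting>
+generalized form (assumed values)
+ join cost: 1 join node, lf = 100, rf = 500 = 100 * 500 = 50,000
+ alpha cost: ae = 2, f = 600 = 2 * 600 = 1,200
+ total = 51,200
+
+hard coded form (assumed values)
+ join cost: no join nodes = 0
+ alpha cost: ae = 10,000, f = 100 = 10,000 * 100 = 1,000,000
+ total = 1,000,000
+
+51,200 &lt; 1,000,000, so under these assumptions the generalized form is cheaper
+ </programlisting>
+ <para>With only a handful of hard coded rules the same arithmetic tips the other way,
+ which is why the threshold has to be estimated for each ruleset.</para>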
+ <para>The best way to quantify the threshold is to write rules in both formats and run a
+ series of tests. Given that most projects are under tight schedules, developers don't
+ always have time to do this. The other common problem is using really large flat objects.
+ In a nutshell, using large flat objects leads to the same problem as hard coding the
+ constants in the rules. The solution to the problem is to change the domain objects
+ so that they model the business concepts in a concise manner. That isn't always an
+ option.</para>
+ <para>When most of the rules share the same conditions, there are two solutions. The best
+ solution is to rewrite the rules to use chaining. Identify the common conditions and extract
+ them into a generalized rule. The generalized rule then triggers subsequent rules by asserting
+ a new fact. Often this can reduce the rules by an order of magnitude or more. The second
+ option is to put the common conditions at the beginning of each rule. This
+ allows RETE rule engines to share those nodes. When the nodes are shared, the
+ cost in both memory and performance is reduced.</para>
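+ <para>A minimal sketch of the chaining approach, using the same pseudocode as the earlier
+ examples. The "eligible" fact and the two follow-on actions are hypothetical names
+ introduced only for this illustration. The first rule holds the shared conditions and
+ asserts a new fact; the later rules match on that fact instead of repeating the
+ conditions.</para>
+ <programlisting>
+If
+ customer.type == "level2"
+ customer.accountId ?id // bind the account id to a variable
+Then
+ assert eligible with accountId ?id // new fact that triggers the rules below
+
+If
+ eligible.accountId ?id // match the fact asserted by the first rule
+ favorites.accountId ?id // find the list of favorites by the account id
+Then
+ recommend all items in the favorites
+
+If
+ eligible.accountId ?id // same shared condition, different action
+Then
+ apply the level2 discount
+ </programlisting>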
+ <para>If the ruleset can be divided into smaller chunks, it's a good idea to divide it into
+ discrete stages and load each ruleset on a different JVM or server. Depending on the
+ situation, this may not be an option. So what can you do when the ruleset is large and
+ rewriting the rules isn't an option?</para>
+ <para>The only viable option is to scale the hardware and use a different JVM. This means
+ using a 64-bit JVM from Sun, IBM or BEA (JRockit) on a machine with at least 8 GB of RAM. Depending
+ on the ruleset, the system may need more RAM.</para>
+
+ </section>
</section>
\ No newline at end of file