Greetings:

Well, I really wish more people would read what Steve and I have been blogging about benchmarks for the past few years. While one of us proposes improving Miss Manners, the other proposes dropping it altogether. My personal feeling "used" to be that we could improve Miss Manners so that it could still be used. Now, I feel that any "improvement" will come only through massive amounts of data, because the rules are WAY too simplistic - and then all you will be testing is the performance of the system itself and NOT the rulebase.

Also, if you trace the firing of the rules, you will find that once started, for all practical purposes, only one rule keeps firing over and over. Miss Manners was designed for one purpose: to stress-test the agenda table by recursively putting all of the guests on the table over and over and over again. Greg Barton - as stated earlier somewhere - repeated this test in straight-up Java and did it just as quickly.
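
For what it's worth, the procedural version is easy to picture. The sketch below is only my own guess at the shape of such a program (the Guest class and the constraint check are assumptions, not Greg's actual code): a plain depth-first search that seats guests so neighbors alternate sex and share a hobby, with no agenda and no engine in sight.

// Hypothetical plain-Java take on the Manners seating problem.
// Illustrative only - not Greg Barton's code.
import java.util.*;

class Guest {
    final String name;
    final String sex;
    final Set<String> hobbies;
    Guest(String name, String sex, Set<String> hobbies) {
        this.name = name; this.sex = sex; this.hobbies = hobbies;
    }
}

public class PlainJavaManners {
    // Returns a valid seating order, or null if none exists.
    static List<Guest> seat(List<Guest> guests) {
        return extend(new ArrayList<>(), new HashSet<>(guests));
    }

    private static List<Guest> extend(List<Guest> seated, Set<Guest> remaining) {
        if (remaining.isEmpty()) return seated;              // everyone is seated
        Guest last = seated.isEmpty() ? null : seated.get(seated.size() - 1);
        for (Guest g : new ArrayList<>(remaining)) {
            if (last != null) {
                boolean oppositeSex = !g.sex.equals(last.sex);
                boolean sharedHobby = !Collections.disjoint(g.hobbies, last.hobbies);
                if (!oppositeSex || !sharedHobby) continue;  // constraint violated
            }
            seated.add(g);
            remaining.remove(g);
            List<Guest> result = extend(seated, remaining);  // recurse
            if (result != null) return result;
            remaining.add(g);                                // backtrack
            seated.remove(seated.size() - 1);
        }
        return null;
    }
}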

Regardless,  a rulebase benchmark should be composed of several tests:

Forward Chaining
Backward Chaining
Non-Monotonicity (The hallmark of a rulebase)
Complex Rules
Rules with a high level of Specificity
Lots of (maybe 100 or more) "simple" rules that chain between themselves (see the sketch after this list)
Stress the conflict resolution strategy
Stress pattern matching
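
To make the chaining item concrete, here is a toy illustration (entirely mine, not taken from any existing benchmark): each rule's conclusion is the next rule's condition, so a single starting fact has to ripple through the whole rule set, exercising the match-fire cycle rather than one giant join. The naive loop below only sketches the semantics; a real benchmark would, of course, hand the same rules to the engine under test.

// Toy "rules that chain between themselves" - illustrative names only.
import java.util.*;
import java.util.function.Predicate;

public class ChainingSketch {
    static class Rule {
        final Predicate<Set<String>> condition;  // LHS: when does this rule match?
        final String conclusion;                 // RHS: fact asserted when it fires
        Rule(Predicate<Set<String>> condition, String conclusion) {
            this.condition = condition;
            this.conclusion = conclusion;
        }
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        // fact-0 => fact-1 => fact-2 => ... (100 trivially simple rules)
        for (int i = 0; i < 100; i++) {
            final String pre = "fact-" + i, post = "fact-" + (i + 1);
            rules.add(new Rule(wm -> wm.contains(pre), post));
        }

        Set<String> facts = new HashSet<>(Collections.singleton("fact-0"));
        boolean fired = true;
        while (fired) {                          // naive match-fire cycle
            fired = false;
            for (Rule r : rules) {
                if (r.condition.test(facts) && facts.add(r.conclusion)) {
                    fired = true;                // a new fact was derived; keep cycling
                }
            }
        }
        System.out.println(facts.size() + " facts derived");  // prints "101 facts derived"
    }
}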

Just having an overwhelming amount of data is not sufficient for a rulebase benchmark - that would be more in line with a test of database efficiency and/or available memory. Further, it has been "proven" over time that compiling rules into Java or C++ code (something that vendors call "sequential rules") is much faster than using the inference engine. True, and it should be. After all, most inference engines are written in Java or C++ and the rules are merely an extension, another layer of abstraction, if you will. But sequential rules do not have the flexibility of the engine and, in most cases, have to be "manually" arranged so that they fire in the correct order. An inference engine, being non-monotonic, does not have that restriction. Simply put, most rulebased systems cannot pass muster on the simple WaltzDB-16 benchmark. We now have a WaltzDB-200 test should anyone want to try something more massive.
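
To see why, compare the two styles. The fragment below is a deliberately crude illustration of my own (the Order fields and the two rules are made up): written as "sequential rules," the ordering is baked in by whoever wrote the method, whereas a forward-chaining engine would work out from the data that the second rule only becomes relevant after the first one fires.

// Made-up "sequential rules": fast, but the firing order is hand-arranged.
public class SequentialRulesSketch {
    static class Order {
        double amount;
        double discount;
        boolean priority;
        Order(double amount) { this.amount = amount; }
    }

    static void applyRules(Order o) {
        // Rule 1 must be written before Rule 2, because Rule 2 reads the
        // discount that Rule 1 may set. Swap them and the result changes.
        if (o.amount > 1000) {       // Rule 1: large orders get a discount
            o.discount = 0.05;
        }
        if (o.discount > 0) {        // Rule 2: discounted orders get priority handling
            o.priority = true;
        }
    }

    public static void main(String[] args) {
        Order o = new Order(1500);
        applyRules(o);
        System.out.println("discount=" + o.discount + ", priority=" + o.priority);
    }
}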

New Benchmarks: Perhaps we should try some of the NP-hard problems - that would eliminate most of the "also-ran" tools. Also, perhaps we should be checking on the "flexibility" of a rulebase by running on multiple platforms (not just Windows), as well as checking performance and scalability on multiple processors; perhaps 4-, 8- or 16-CPU (or more) machines. An 8-core/16-thread Mac is now available at a reasonable price, as is the Intel i7 (basically 4 cores/8 threads). But these are 64-bit CPUs and some rule engines are not supported on 64-bit platforms. Sad, but true. Some won't even run on Unix but only on Linux. Again, sad, but true.

So, any ideas? I'm thinking that someone, somewhere has a better suggestion than a massive decision table, 64-Queens, Sudoku or a revision of the Minnesota database benchmark. Hopefully...  For now, I strongly suggest that we resolve to use the WaltzDB-200 benchmark (which should satisfy all parties for this year) and develop something much better for 2010.

jco
"NOW do you believe?" 
(Morpheus to Trinity in "The Matrix")
Buy USA FIRST !  

http://www.kbsc.com [Home base for AI connections]
http://www.OctoberRulesFest.org [Home for AI conferences]
http://JavaRules.blogspot.com [Java-Oriented Rulebased Systems]
http://ORF2009.blogspot.com [October Rules Fest]
http://exscg.blogspot.com/ [Expert Systems Consulting Group]






On Mar 28, 2009, at 1:21 AM, Wolfgang Laun wrote:

The organizers of RuleML 2009 have announced that one of the topics
for this year's Challenge (http://www.defeasible.org/ruleml2009/challenge)
is going to be "benchmark for evaluation of rule engines".

Folks interested in benchmarks are cordially invited to provide
requirements, outlines, ideas, etc., for benchmarks to be sent in for
the Challenge, and, of course, to submit actual benchmarks to the
RuleML Symposium.

Spearhead RBS implementations such as Drools provide an
excellent arena for real-world applications of rules. Personally, I think
that benchmarks should not only address purely FOL-lish
pattern combinations but also assess how well the interaction with the
embedding environment (such as predicate evaluation,
the execution of RHS consequences with agenda updates, etc.)
is handled.

Regards
Wolfgang Laun
RuleML 2009 Program Committee


On Fri, Mar 27, 2009 at 11:09 PM, Steve Núñez <brms@illation.com.au> wrote:
Mark,

Agreed that Manners needs improvement. In its current form, it's nearly
useless as a comparative benchmark. You might want to check with Charles
Young, if he's not on the list, who did a very thorough analysis of Manners a
while back and may have some ideas.

Whilst on the topic, I am interested in any other benchmarking ideas that
folks may have. We're in the process of putting together a (hopefully)
comprehensive set of benchmarks for performance testing.

Cheers,
   - Steve

On 28/03/09 5:06 AM, "Mark Proctor" <mproctor@codehaus.org> wrote:

> I was wondering if anyone fancied having a go at improving Miss Manners
> to make it harder and less easy to cheat. The problem with manners at
> the moment is that it computes a large cross product, of which only one
> activation fires and the others are cancelled. What many engines do
> now is abuse the test by not calculating the full cross product and thus
> not doing all the work.
>
> Manners is explained here:
> https://hudson.jboss.org/hudson/job/drools/lastSuccessfulBuild/artifact/trunk/
> target/docs/drools-expert/html/ch09.html#d0e7455
>
> So I was thinking that first the amount of data needs to be increased
> from say 128 guests to 512 guests. Then the problem needs to be made
> harder, and the full conflict set needs to be forced to be evaluated. So
> maybe the first assign_seating rule is as normal where it just finds M/F
> pairs with same hobbies, but additionally we should have a scoring
> process so that those matched in the first phase then each must have
> some compatibility score calculated against them and then the one with
> the best score is picked. Maybe people have other ways to improve the
> complexity of the test, both in adding more rules and more complex rules
> and more data.
>
> Mark
>

--

Level 40
140 William Street
Melbourne, VIC 3000
Australia

Phone:  +61 3 9607 8287
Mobile: +61 4 0096 4240
Fax:    +61 3 9607 8282
http://illation.com.au


_______________________________________________
rules-dev mailing list
rules-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/rules-dev