Greetings:
Well, I really wish more people would read what Steve and I have been
blogging about benchmarks for the past few years. While one proposes
improving Miss Manners, the other proposes dropping it altogether. My
personal feeling "used" to be that we could improve Miss Manners so
that it could be used. Now, I feel that any "improvement" will have to
come through massive amounts of data, because the rules are WAY too
simplistic - and then all you will be testing is the performance of
the system itself and NOT the rulebase.
Also, if you trace the firing of the rules, you will find that once
started, for all practical purposes, only one rule keeps firing over
and over. Miss Manners was designed for one purpose: to stress-test
the Agenda Table by recursively putting all of the guests on the table
over and over and over again. Greg Barton - as stated earlier
somewhere - repeated this test using straight-up Java and did it just
as quickly.
Regardless, a rulebase benchmark should be composed of several kinds of
tests (a toy sketch of a couple of these follows the list):
Forward Chaining
Backward Chaining
Non-Monotonicity (The hallmark of a rulebase)
Complex Rules
Rules with a high level of Specificity
Lots of (maybe 100 or more) "simple" rules that chain between themselves
Stress the conflict resolution strategy
Stress pattern matching
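
Just to make a couple of those bullets concrete, here's a throwaway
sketch in plain Java - nothing vendor-specific, every name is made up -
of the sort of workload I mean: about 100 trivially "simple" rules that
chain into each other, with each firing retracting the fact that
triggered it (the non-monotonic part). A real benchmark would, of
course, run inside the engine under test, not this hand-rolled loop:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy sketch only -- not any vendor's API. Illustrates ~100 "simple"
// rules that chain into one another, plus non-monotonicity (each firing
// retracts the fact that triggered it).
public class ChainBenchSketch {

    // A "rule" that matches a single integer fact and produces the next one.
    interface Rule {
        boolean matches(int fact);
        int fire(int fact);
    }

    public static void main(String[] args) {
        final int chainLength = 100;
        List<Rule> rules = new ArrayList<Rule>();
        for (int i = 0; i < chainLength; i++) {
            final int step = i;
            rules.add(new Rule() {
                public boolean matches(int fact) { return fact == step; }
                public int fire(int fact)        { return fact + 1; }
            });
        }

        Deque<Integer> workingMemory = new ArrayDeque<Integer>();
        workingMemory.add(0);          // the seed fact
        long firings = 0;
        long t0 = System.nanoTime();

        // Naive match-resolve-act loop: keep going until no rule can fire.
        boolean fired = true;
        while (fired) {
            fired = false;
            for (Rule r : rules) {
                for (Integer fact : new ArrayList<Integer>(workingMemory)) {
                    if (r.matches(fact)) {
                        workingMemory.remove(fact);       // retract (non-monotonic step)
                        workingMemory.add(r.fire(fact));  // assert the chained fact
                        firings++;
                        fired = true;
                    }
                }
            }
        }
        System.out.printf("%d firings in %.3f ms%n",
                firings, (System.nanoTime() - t0) / 1e6);
    }
}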
Just having an overwhelming amount of data is not sufficient for a
rulebase benchmark - that would be more in line with a test of the
database efficiency and/or the available memory. Further, it has been
"proven" over time that compiling rules into Java code or into C++
code (something that vendors call "sequential rules") is much faster
than using the inference engine. True, and it should be. After all,
most inference engines are written in Java or C++ and the rules are
merely an extension, another layer of abstraction, if you will. But
sequential rules do not have the flexibility of the engine and, in
most cases, have to be "manually" arranged so that they fire in the
correct order. An inference engine, being non-monotonic, does not have
that restriction. Simply put, most rule-based systems cannot pass
muster on the simple WaltzDB-16 benchmark. We now have a WaltzDB-200
test should vendors want to try something more massive.
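
To be clear about the "sequential rules" point, here is a toy
illustration (hypothetical names, no real engine API) of the trade-off:
the sequential form runs the rules once, in an order someone arranged
by hand, while the engine keeps re-evaluating the rule set after every
firing until nothing more can fire.

// Sketch only -- hypothetical names, no real engine API.
public class SequentialVsInference {

    static class Order {                 // toy fact
        double total;
        boolean vip;
        double discount;
        boolean shipped;
    }

    // Sequential style: the order of the if-blocks is fixed by the author.
    static void sequentialPass(Order o) {
        if (o.vip)          { o.discount = 0.10; }                          // rule 1
        if (o.total > 1000) { o.discount = Math.max(o.discount, 0.05); }    // rule 2
        if (o.discount > 0) { o.shipped = true; }                           // rule 3
        // If rule 3 had been written first, it would never see the discount:
        // correctness depends entirely on the manual ordering.
    }

    // Engine-style: keep re-applying rules until nothing changes.
    static void fireUntilQuiescent(Order o) {
        boolean changed = true;
        while (changed) {
            changed = false;
            if (o.vip && o.discount < 0.10)          { o.discount = 0.10; changed = true; }
            if (o.total > 1000 && o.discount < 0.05) { o.discount = 0.05; changed = true; }
            if (o.discount > 0 && !o.shipped)        { o.shipped = true; changed = true; }
        }
        // Ordering no longer matters, but every cycle re-checks conditions,
        // which is part of why the compiled sequential form benchmarks faster.
    }

    public static void main(String[] args) {
        Order a = new Order(); a.vip = true; a.total = 500;
        sequentialPass(a);
        Order b = new Order(); b.vip = true; b.total = 500;
        fireUntilQuiescent(b);
        System.out.println("sequential: discount=" + a.discount + ", shipped=" + a.shipped);
        System.out.println("inference:  discount=" + b.discount + ", shipped=" + b.shipped);
    }
}

The second form is slower per cycle, but it does not break when someone
re-orders the rules - which is exactly the flexibility the compiled
form gives up.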
New Benchmarks: Perhaps we should try some of the NP-hard problems -
that would eliminate most of the "also ran" tools. Also, perhaps we
should be checking on the "flexibility" of a rulebase by processing on
multiple platforms (not just Windows) as well as checking performance
and scalability on multiple processors; perhaps 4, 8 or 16 (or more)
CPU machines. An 8-core/16-thread Mac is now available at a reasonable
price, as is Intel's i7 (basically 4 cores/8 threads). But these are
64-bit CPUs, and some rule engines are not supported on 64-bit
platforms. Sad, but true. Some won't even run on Unix but only on
Linux. Again,
sad, but true.
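
As a starting point for the multi-core question, something as simple as
the following probe would at least tell us whether throughput grows
with the core count. All names are hypothetical; runOneSession() is
just a stand-in for loading the rulebase, inserting the benchmark facts
and firing the rules in one independent session.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Rough, engine-agnostic scalability probe: run the same workload on
// 1, 2, 4, 8, ... worker threads (one independent session per thread)
// and report sessions per second.
public class ScalabilityProbe {

    // Placeholder for "one complete benchmark run in one engine session".
    static long runOneSession() {
        long acc = 0;                        // CPU-bound stand-in work
        for (int i = 0; i < 5_000_000; i++) acc += i % 7;
        return acc;
    }

    public static void main(String[] args) throws Exception {
        int maxThreads = Runtime.getRuntime().availableProcessors();
        for (int threads = 1; threads <= maxThreads; threads *= 2) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<Long>> results = new ArrayList<Future<Long>>();
            long t0 = System.nanoTime();
            for (int i = 0; i < threads; i++) {
                results.add(pool.submit(new Callable<Long>() {
                    public Long call() { return runOneSession(); }
                }));
            }
            for (Future<Long> f : results) f.get();   // wait for all sessions
            pool.shutdown();
            double seconds = (System.nanoTime() - t0) / 1e9;
            System.out.printf("%2d threads: %d sessions in %.2f s (%.2f sessions/s)%n",
                    threads, threads, seconds, threads / seconds);
        }
    }
}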
So, any ideas? I'm thinking that someone, somewhere has a better
suggestion than a massive decision table, 64-Queens, Sudoku or a
revision of the Minnesota database benchmark. Hopefully... For now,
I strongly suggest that we resolve to use the WaltzDB-200 benchmark
(which should satisfy all parties for this year) and develop something
much better for 2010.
jco
"NOW do you believe?"
(Morpheus to Trinity in "The Matrix")
Buy USA FIRST !
http://www.kbsc.com [Home base for AI connections]
http://www.OctoberRulesFest.org [Home for AI conferences]
http://JavaRules.blogspot.com [Java-Oriented Rulebased Systems]
http://ORF2009.blogspot.com [October Rules Fest]
http://exscg.blogspot.com/ [Expert Systems Consulting Group]
On Mar 28, 2009, at 1:21 AM, Wolfgang Laun wrote:
The organizers of RuleML 2009 have announced that one of their
topics for the Challenge
http://www.defeasible.org/ruleml2009/challenge
of this year's event is going to be "benchmark for evaluation of
rule engines".
Folks interested in benchmarks are cordially invited to provide
requirements, outlines, ideas, etc., for benchmarks to be sent in for
the Challenge, and, of course, to submit actual benchmarks to the
RuleML Symposium.
Leading RBS implementations such as Drools provide an
excellent arena for real-world applications of rules. Personally, I
think
that benchmarks should not only address purely FOL-lish
pattern combinations but also assess how well the interaction with the
embedding environment (such as predicate evaluation,
the execution of RHS consequences with agenda updates, etc.)
is handled.
Regards
Wolfgang Laun
RuleML 2009 Program Committee
On Fri, Mar 27, 2009 at 11:09 PM, Steve Núñez <brms(a)illation.com.au>
wrote:
Mark,
Agreed that Manners needs improvement. In its current form, it's nearly
useless as a comparative benchmark. You might want to check with Charles
Young, if he's not on the list, who did a very thorough analysis of
Manners a while back and may have some ideas.
Whilst on the topic, I am interested in any other benchmarking ideas
that
folks may have. We're in the process of putting together a (hopefully)
comprehensive set of benchmarks for performance testing.
Cheers,
- Steve
On 28/03/09 5:06 AM, "Mark Proctor" <mproctor(a)codehaus.org> wrote:
> I was wondering if anyone fancied having a go at improving Miss Manners
> to make it harder and less easy to cheat. The problem with Manners at
> the moment is that it computes a large cross product, of which only one
> rule fires and the other activations are cancelled. What many engines do
> now is abuse the test by not calculating the full cross product and thus
> not doing all the work.
>
> Manners is explained here:
> https://hudson.jboss.org/hudson/job/drools/lastSuccessfulBuild/artifact/t...
> target/docs/drools-expert/html/ch09.html#d0e7455
>
> So I was thinking that first the amount of data needs to be increased
> from, say, 128 guests to 512 guests. Then the problem needs to be made
> harder, and the full conflict set needs to be forced to be evaluated. So
> maybe the first assign_seating rule is as normal, where it just finds M/F
> pairs with the same hobbies, but additionally we should have a scoring
> process, so that each of those matched in the first phase must have a
> compatibility score calculated against them, and then the one with
> the best score is picked. Maybe people have other ways to improve the
> complexity of the test, both in adding more rules and more complex rules
> and more data.
>
> Mark
--
Level 40
140 William Street
Melbourne, VIC 3000
Australia
Phone: +61 3 9607 8287
Mobile: +61 4 0096 4240
Fax: +61 3 9607 8282
http://illation.com.au
_______________________________________________
rules-dev mailing list
rules-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/rules-dev