[rules-users] Drools as Lexer / Parser (sequential data processing)

Wolfgang Laun wolfgang.laun at gmail.com
Fri Sep 4 02:16:28 EDT 2009


The rules for a lexer are all bound to follow the same standard pattern:

rule FirstDigit
    salience 0
    when
        $i : Input( $ch : character matches "\\d" )
        $l : Lexer( state == State.IDLE, $buf : buffer )
    then
        $buf.setLength( 0 ); $buf.append( $ch );
        modify( $l ){ setState( State.NUMBER ) }
        modify( $i ){ next() }
end

Rather boring.

If you prefer, you can write a single rule exercising the FSM mechanism
(match input, execute action, change state) that is driven by a set of
(static and immutable) facts defining the individual
state/symbol/action/transition quadruples.

rule Lex
    salience 0
    when
        $k : CharClass( $pattern : pattern )    // such as "\d" or "\w"
        $i : Input( $ch : character matches $pattern )
        $l : Lexer( $state : state )            // also containing a StringBuilder buffer
        $t : StateTransition( state == $state, charClass == $k,
                              $action : action, $succ : succ )
    then
        $action.execute( $i );
        modify( $l ){ setState( $succ ) }
        modify( $i ){ next() }
end
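
For reference, the fact classes behind this rule could look roughly like the
following. This is only a minimal sketch, not the actual classes: the property
names are inferred from the bindings in the rule (pattern, character, state,
charClass, action, succ), the non-generic Action interface is a simplification,
and both the SaveAction shown here and the reading of the three-argument
StateTransition constructor as "stay in the same state" are assumptions.

```java
// Hypothetical fact classes for the Lex rule; names follow the rule's bindings.
class LexerFacts {
    enum State { S0, IDENT, NUMBER, STRING, ESCAPE }

    /** A regular expression describing one class of characters. */
    static final class CharClass {
        private final String pattern;
        CharClass(String pattern) { this.pattern = pattern; }
        public String getPattern() { return pattern; }
    }

    /** Cursor over the input text; yields a NUL sentinel once the text is exhausted. */
    static final class Input {
        private final String text;
        private int pos;
        Input(String text) { this.text = text; }
        // A String rather than a char, so that a regex match can be applied to it.
        public String getCharacter() {
            return pos < text.length() ? String.valueOf(text.charAt(pos)) : "\u0000";
        }
        public void next() { pos++; }
    }

    /** Current state plus the buffer collecting the characters of the current token. */
    static final class Lexer {
        private State state = State.S0;
        private final StringBuilder buffer = new StringBuilder();
        public State getState() { return state; }
        public void setState(State state) { this.state = state; }
        public StringBuilder getBuffer() { return buffer; }
    }

    /** What to do with the current character (simplified; assumed signature). */
    interface Action {
        void execute(Input input);
    }

    /** Assumed behaviour of "save": append the current character to the buffer. */
    static final class SaveAction implements Action {
        private final Lexer lexer;
        SaveAction(Lexer lexer) { this.lexer = lexer; }
        public void execute(Input input) {
            lexer.getBuffer().append(input.getCharacter());
        }
    }

    /** Immutable quadruple: state, character class, action, successor state. */
    static final class StateTransition {
        private final State state;
        private final CharClass charClass;
        private final Action action;
        private final State succ;
        StateTransition(State state, CharClass charClass, Action action, State succ) {
            this.state = state;
            this.charClass = charClass;
            this.action = action;
            this.succ = succ;
        }
        // Three-argument form: assumed to mean "remain in the same state".
        StateTransition(State state, CharClass charClass, Action action) {
            this(state, charClass, action, state);
        }
        public State getState() { return state; }
        public CharClass getCharClass() { return charClass; }
        public Action getAction() { return action; }
        public State getSucc() { return succ; }
    }
}
```

To be usable as Drools facts, each of these would normally be a public
top-level class; they are nested here only to keep the sketch in one file.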

An additional rule catches errors.

An advantage of this approach is that any lexer can now be configured by a
set of fact definitions, e.g., this one for a sequence of identifiers,
numbers and quoted strings:

        CharClass letter = new CharClass( "[A-Za-z_]" );
        CharClass digit  = new CharClass( "\\d" );
        CharClass escape = new CharClass( "\\\\" );
        CharClass space  = new CharClass( "\\s" );
        CharClass quote  = new CharClass( "\"" );
        CharClass empty  = new CharClass( "\u0000" );
        ins( letter, digit, escape, space, quote, empty );

        Action<State,TokenType> save = new SaveAction( this );
        Action<State,TokenType> skip = new SkipAction( this );
        Action<State,TokenType> halt = new HaltAction( this );
        Action<State,TokenType> emit = new EmitAction( this );

        StateTransition<State,TokenType> t1 =
            new StateTransition<State,TokenType>( State.S0, letter, save, State.IDENT );
        StateTransition<State,TokenType> t2 =
            new StateTransition<State,TokenType>( State.S0, digit,  save, State.NUMBER );
        StateTransition<State,TokenType> t3 =
            new StateTransition<State,TokenType>( State.S0, quote,  skip, State.STRING );
        StateTransition<State,TokenType> t4 =
            new StateTransition<State,TokenType>( State.S0, space,  skip );
        StateTransition<State,TokenType> t5 =
            new StateTransition<State,TokenType>( State.S0, empty,  halt );
        ins( t1, t2, t3, t4, t5 );

        StateTransition<State,TokenType> u1 =
            new StateTransition<State,TokenType>( State.IDENT, letter, save );
        StateTransition<State,TokenType> u2 =
            new StateTransition<State,TokenType>( State.IDENT, digit,  save );
        StateTransition<State,TokenType> u3 =
            new StateTransition<State,TokenType>( State.IDENT, space,  emit, State.S0 );
        StateTransition<State,TokenType> u4 =
            new StateTransition<State,TokenType>( State.IDENT, empty,  halt );
        ins( u1, u2, u3, u4 );

        StateTransition<State,TokenType> v1 =
            new StateTransition<State,TokenType>( State.NUMBER, digit,  save );
        StateTransition<State,TokenType> v2 =
            new StateTransition<State,TokenType>( State.NUMBER, space,  emit, State.S0 );
        StateTransition<State,TokenType> v3 =
            new StateTransition<State,TokenType>( State.NUMBER, empty,  halt );
        ins( v1, v2, v3 );

        StateTransition<State,TokenType> w1 =
            new StateTransition<State,TokenType>( State.STRING, letter, save );
        StateTransition<State,TokenType> w2 =
            new StateTransition<State,TokenType>( State.STRING, digit,  save );
        StateTransition<State,TokenType> w3 =
            new StateTransition<State,TokenType>( State.STRING, space,  save );
        StateTransition<State,TokenType> w4 =
            new StateTransition<State,TokenType>( State.STRING, quote,  emit, State.S0 );
        StateTransition<State,TokenType> w5 =
            new StateTransition<State,TokenType>( State.STRING, escape, save, State.ESCAPE );
        StateTransition<State,TokenType> w6 =
            new StateTransition<State,TokenType>( State.STRING, empty,  halt );
        ins( w1, w2, w3, w4, w5, w6 );

        StateTransition<State,TokenType> x1 =
            new StateTransition<State,TokenType>( State.ESCAPE, letter, save, State.STRING );
        StateTransition<State,TokenType> x2 =
            new StateTransition<State,TokenType>( State.ESCAPE, digit,  save, State.STRING );
        StateTransition<State,TokenType> x3 =
            new StateTransition<State,TokenType>( State.ESCAPE, space,  save, State.STRING );
        // an escaped quote is part of the string, so it is saved rather than emitted
        StateTransition<State,TokenType> x4 =
            new StateTransition<State,TokenType>( State.ESCAPE, quote,  save, State.STRING );
        StateTransition<State,TokenType> x5 =
            new StateTransition<State,TokenType>( State.ESCAPE, escape, save, State.STRING );
        StateTransition<State,TokenType> x6 =
            new StateTransition<State,TokenType>( State.ESCAPE, empty,  halt );
        ins( x1, x2, x3, x4, x5, x6 );
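
To see how such a table drives the lexer, here is a plain Java emulation of
what the single Lex rule does on each firing, without Drools. It is a sketch
under assumptions: only the S0, IDENT and NUMBER rows are included, the
meaning of the four actions (save appends, skip ignores, emit flushes the
buffer as a token, halt flushes and stops) is guessed from their names, and
the exception thrown when no row matches merely stands in for the error rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TableLexerDemo {
    enum State { S0, IDENT, NUMBER }
    enum Act { SAVE, SKIP, EMIT, HALT }

    /** One row of the table: state, character class (regex), action, successor. */
    static final class Transition {
        final State state; final String regex; final Act action; final State succ;
        Transition(State state, String regex, Act action, State succ) {
            this.state = state; this.regex = regex; this.action = action; this.succ = succ;
        }
    }

    static final String LETTER = "[A-Za-z_]", DIGIT = "\\d", SPACE = "\\s", EMPTY = "\u0000";

    static final List<Transition> TABLE = Arrays.asList(
        new Transition(State.S0,     LETTER, Act.SAVE, State.IDENT),
        new Transition(State.S0,     DIGIT,  Act.SAVE, State.NUMBER),
        new Transition(State.S0,     SPACE,  Act.SKIP, State.S0),
        new Transition(State.S0,     EMPTY,  Act.HALT, State.S0),
        new Transition(State.IDENT,  LETTER, Act.SAVE, State.IDENT),
        new Transition(State.IDENT,  DIGIT,  Act.SAVE, State.IDENT),
        new Transition(State.IDENT,  SPACE,  Act.EMIT, State.S0),
        new Transition(State.IDENT,  EMPTY,  Act.HALT, State.IDENT),
        new Transition(State.NUMBER, DIGIT,  Act.SAVE, State.NUMBER),
        new Transition(State.NUMBER, SPACE,  Act.EMIT, State.S0),
        new Transition(State.NUMBER, EMPTY,  Act.HALT, State.NUMBER));

    /** Finds the row for (state, character), or null if there is none. */
    static Transition find(State state, String ch) {
        for (Transition t : TABLE) {
            if (t.state == state && ch.matches(t.regex)) return t;
        }
        return null;
    }

    static List<String> lex(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        State state = State.S0;
        for (int pos = 0; ; pos++) {
            // A NUL character marks the end of the input, as with the "empty" class.
            String ch = pos < text.length() ? String.valueOf(text.charAt(pos)) : "\u0000";
            Transition t = find(state, ch);
            if (t == null) {
                // Stands in for the separate error rule.
                throw new IllegalStateException("no transition from " + state + " at position " + pos);
            }
            if (t.action == Act.SAVE) {
                buffer.append(ch);
            } else if (t.action == Act.EMIT || t.action == Act.HALT) {
                if (buffer.length() > 0) {
                    tokens.add(buffer.toString());
                    buffer.setLength(0);
                }
            }
            if (t.action == Act.HALT) return tokens;
            state = t.succ;
        }
    }

    public static void main(String[] args) {
        System.out.println(lex("foo 42 bar_7"));   // prints [foo, 42, bar_7]
    }
}
```

An input such as "4x" has no row for a letter in state NUMBER, so the sketch
throws at that point instead of silently stopping.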

-W

On Wed, Sep 2, 2009 at 11:54 PM, André Thieme <
address.good.until.2009.dec.14 at justmail.de> wrote:

> Hello group, I recently had the idea:
> "A rule system (like Drools) is ideal for making programs with complex
>  rules simpler. Writing a lexer or parser can be non-trivial. So, is it
>  possible and also meaningful to express such a task with rules?"
>
> Anyone here who maybe tried that already?
> The two big questions for me are:
> 1) how easy is it to express a lexer with rules?
> 2) how bad (good?) will it perform?
>
> If you happen to have a good idea of how to do it, could you please give
> me an example for a simple lexer?
> Let's say it will get natural language (a string, such as this email) as
> input and should return a sequence (say, ArrayList) of Tokens, which may
> look like this:
>
> public class Token {
>   public String value;
>   public String category;
>
>   Token(String value, String category) {
>     this.value = value;
>     this.category = category;
>   }
> }
>
> We could have three categories:
> "word",  "numeric"  and  "whitespace".
>
> An input String could be:
> "We can   see 500 cars"
> And it should produce an ArrayList with the contents:
> [
>  Token("We", "word"),
>  Token(" ", "whitespace"),
>  Token("can", "word"),
>  Token("   ", "whitespace"),
>  Token("see", "word"),
>  Token(" ", "whitespace"),
>  Token("500", "numeric"),
>  Token(" ", "whitespace"),
>  Token("cars", "word")
> ]
>
> At the moment I have difficulty seeing if/how this could be achieved.
> If you find this easy, please post a solution.
> I am aware that JavaCC is really good for such tasks and will also
> perform extremely well.
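
As a reference for the expected output only, the token list above can be
produced by a few lines of plain Java. This is not a Drools solution; it is a
sketch that assumes a token ends whenever the character category changes, and
the class and method names (CategoryLexer, categoryOf, lex) are made up for
the illustration. The Token class is the one from the question, with a
toString() added.

```java
import java.util.ArrayList;
import java.util.List;

class CategoryLexer {
    static final class Token {
        public final String value;
        public final String category;
        Token(String value, String category) {
            this.value = value;
            this.category = category;
        }
        @Override public String toString() {
            return "Token(\"" + value + "\", \"" + category + "\")";
        }
    }

    // Category name and the regex for a single character of that category.
    static final String[][] CLASSES = {
        { "word", "[A-Za-z]" },
        { "numeric", "\\d" },
        { "whitespace", "\\s" }
    };

    static String categoryOf(char c) {
        String s = String.valueOf(c);
        for (String[] cc : CLASSES) {
            if (s.matches(cc[1])) return cc[0];
        }
        throw new IllegalArgumentException("unexpected character '" + c + "'");
    }

    static List<Token> lex(String text) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        String current = null;               // category of the token being collected
        for (char c : text.toCharArray()) {
            String cat = categoryOf(c);
            if (current != null && !cat.equals(current)) {
                // Category changed: the collected characters form a complete token.
                tokens.add(new Token(buffer.toString(), current));
                buffer.setLength(0);
            }
            buffer.append(c);
            current = cat;
        }
        if (current != null) {
            tokens.add(new Token(buffer.toString(), current));   // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : lex("We can   see 500 cars")) {
            System.out.println(t);
        }
    }
}
```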
>
>
> Greetings,
> André