The rules for a lexer are all bound to follow the same standard pattern:<br><br>rule FirstDigit<br> salience 0<br> when<br> $i : Input( $ch : character matches "\d" )<br>
$l : Lexer( state == State.IDLE, $buf : buffer ) <br> then<br>
$buf.setLength( 0 ); $buf.append( $ch );<br> modify( $l ){ setState( State.NUMBER ) }<br> modify( $i ){ next() }<br>end<br><br>Rather boring.<br><br>If you prefer, you can write a single rule exercising the FSM mechanism (match input, execute action, change state) that is driven by a set of (static and immutable) facts defining the individual state/symbol/action/transition quadruples. <br>
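As a rough sketch, the facts behind those quadruples could be modelled as small immutable Java classes. Everything here is an assumption for illustration: the simplified Action interface (the configuration in this post uses generic Action<State,TokenType> types), the accepts() helper, and the reading of the three-argument constructor as "stay in the same state". The getters correspond to the fields the rule constrains (state, charClass, action, succ).<br>

```java
import java.util.regex.Pattern;

// Hypothetical states; the names follow the configuration shown in this post.
enum State { S0, IDENT, NUMBER, STRING, ESCAPE }

// A character class is a regular expression that matches a single character.
final class CharClass {
    private final String pattern;

    CharClass(String pattern) { this.pattern = pattern; }

    String getPattern() { return pattern; }

    // Convenience check for use outside the rule engine (an assumption,
    // not part of the original post).
    boolean accepts(char ch) {
        return Pattern.matches(pattern, String.valueOf(ch));
    }
}

// Hypothetical, simplified action: what to do with the current character.
interface Action {
    void execute(char ch, StringBuilder buffer);
}

// One immutable state/symbol/action/successor quadruple.
final class StateTransition {
    private final State state;
    private final CharClass charClass;
    private final Action action;
    private final State succ;

    StateTransition(State state, CharClass charClass, Action action, State succ) {
        this.state = state;
        this.charClass = charClass;
        this.action = action;
        this.succ = succ;
    }

    // Assumed meaning of the three-argument form: remain in the same state.
    StateTransition(State state, CharClass charClass, Action action) {
        this(state, charClass, action, state);
    }

    State getState() { return state; }
    CharClass getCharClass() { return charClass; }
    Action getAction() { return action; }
    State getSucc() { return succ; }
}
```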
<br>rule Lex<br> salience 0<br> when<br> $k : CharClass( $pattern : pattern ) // such as "\d" or "\w"<br> $i : Input( $ch : character matches $pattern )<br>
$l : Lexer( $state : state ) // also containing a StringBuilder buffer <br> $t : StateTransition( state == $state, charClass == $k, $action : action, $succ : succ )<br> then<br>
$action.execute( $i );<br> modify( $l ){ setState( $succ ) }<br> modify( $i ){ next() }<br>end<br><br>An additional rule catches errors. An advantage of this approach is that any lexer can now be configured by a set of fact definitions, e.g., this one for a sequence of identifiers, numbers and quoted strings:<br>
<br> CharClass letter = new CharClass( "[A-Za-z_]" );<br> CharClass digit = new CharClass( "\\d" );<br> CharClass escape = new CharClass( "\\\\" );<br> CharClass space = new CharClass( "\\s" );<br>
CharClass quote = new CharClass( "\"" );<br> CharClass empty = new CharClass( "\u0000" );<br> ins( letter, digit, escape, space, quote, empty );<br> <br> Action<State,TokenType> save = new SaveAction( this );<br>
Action<State,TokenType> skip = new SkipAction( this );<br> Action<State,TokenType> halt = new HaltAction( this );<br> Action<State,TokenType> emit = new EmitAction( this );<br> <br>
StateTransition<State,TokenType> t1 = new StateTransition<State,TokenType>( State.S0, letter, save, State.IDENT );<br> StateTransition<State,TokenType> t2 = new StateTransition<State,TokenType>( State.S0, digit, save, State.NUMBER );<br>
StateTransition<State,TokenType> t3 = new StateTransition<State,TokenType>( State.S0, quote, skip, State.STRING );<br> StateTransition<State,TokenType> t4 = new StateTransition<State,TokenType>( State.S0, space, skip );<br>
StateTransition<State,TokenType> t5 = new StateTransition<State,TokenType>( State.S0, empty, halt );<br> ins( t1, t2, t3, t4, t5 );<br> <br> StateTransition<State,TokenType> u1 = new StateTransition<State,TokenType>( State.IDENT, letter, save );<br>
StateTransition<State,TokenType> u2 = new StateTransition<State,TokenType>( State.IDENT, digit, save );<br> StateTransition<State,TokenType> u3 = new StateTransition<State,TokenType>( State.IDENT, space, emit, State.S0 );<br>
StateTransition<State,TokenType> u4 = new StateTransition<State,TokenType>( State.IDENT, empty, halt );<br> ins( u1, u2, u3, u4 );<br><br> StateTransition<State,TokenType> v1 = new StateTransition<State,TokenType>( State.NUMBER, digit, save );<br>
StateTransition<State,TokenType> v2 = new StateTransition<State,TokenType>( State.NUMBER, space, emit, State.S0 );<br> StateTransition<State,TokenType> v3 = new StateTransition<State,TokenType>( State.NUMBER, empty, halt );<br>
ins( v1, v2, v3 );<br><br> StateTransition<State,TokenType> w1 = new StateTransition<State,TokenType>( State.STRING, letter, save );<br> StateTransition<State,TokenType> w2 = new StateTransition<State,TokenType>( State.STRING, digit, save );<br>
StateTransition<State,TokenType> w3 = new StateTransition<State,TokenType>( State.STRING, space, save );<br> StateTransition<State,TokenType> w4 = new StateTransition<State,TokenType>( State.STRING, quote, emit, State.S0 );<br>
StateTransition<State,TokenType> w5 = new StateTransition<State,TokenType>( State.STRING, escape, save, State.ESCAPE );<br> StateTransition<State,TokenType> w6 = new StateTransition<State,TokenType>( State.STRING, empty, halt );<br>
ins( w1, w2, w3, w4, w5, w6 );<br><br> StateTransition<State,TokenType> x1 = new StateTransition<State,TokenType>( State.ESCAPE, letter, save, State.STRING );<br> StateTransition<State,TokenType> x2 = new StateTransition<State,TokenType>( State.ESCAPE, digit, save, State.STRING );<br>
StateTransition<State,TokenType> x3 = new StateTransition<State,TokenType>( State.ESCAPE, space, save, State.STRING );<br> StateTransition<State,TokenType> x4 = new StateTransition<State,TokenType>( State.ESCAPE, quote, save, State.STRING );<br>
StateTransition<State,TokenType> x5 = new StateTransition<State,TokenType>( State.ESCAPE, escape, save, State.STRING );<br> StateTransition<State,TokenType> x6 = new StateTransition<State,TokenType>( State.ESCAPE, empty, halt );<br>
ins( x1, x2, x3, x4, x5, x6 );<br><br>-W<br><br><div class="gmail_quote">On Wed, Sep 2, 2009 at 11:54 PM, André Thieme <span dir="ltr"><<a href="mailto:address.good.until.2009.dec.14@justmail.de" target="_blank">address.good.until.2009.dec.14@justmail.de</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hello group, I recently had the idea:<br>
"A rule system (like Drools) is ideal for making programs with complex<br>
rules simpler. Writing a lexer or parser can be non-trivial. So, is it<br>
possible and also meaningful to express such a task with rules?"<br>
<br>
Anyone here who maybe tried that already?<br>
The two big questions for me are:<br>
1) how easy is it to express a lexer with rules?<br>
2) how bad (good?) will it perform?<br>
<br>
If you happen to have a good idea of how to do it, could you please give<br>
me an example for a simple lexer?<br>
Let's say it will get natural language (a string, such as this email) as<br>
input and should return a sequence (say, ArrayList) of Tokens, which may<br>
look like this:<br>
<br>
public class Token {<br>
public String value;<br>
public String category;<br>
<br>
Token(String value, String category) {<br>
this.value = value;<br>
this.category = category;<br>
}<br>
}<br>
<br>
We could have three categories:<br>
"word", "numeric" and "whitespace".<br>
<br>
An input String could be:<br>
"We can see 500 cars"<br>
And it should produce an ArrayList with the contents:<br>
[<br>
Token("We", "word"),<br>
Token(" ", "whitespace"),<br>
Token("can", "word"),<br>
Token(" ", "whitespace"),<br>
Token("see", "word"),<br>
Token(" ", "whitespace"),<br>
Token("500", "numeric"),<br>
Token(" ", "whitespace"),<br>
Token("cars", "word")<br>
]<br>
<br>
At the moment I have difficulty seeing if/how this could be achieved.<br>
If you find this easy, please post a solution.<br>
I am aware that JavaCC is really good for such tasks and will also<br>
perform extremely well.<br>
<br>
<br>
Greetings,<br>
André<br>
</blockquote></div><br>
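For the word/numeric/whitespace example quoted above, the same table-driven mechanism can be sketched in plain Java without the rule engine. This is only an illustration under assumptions: TableLexer, its State enum and the Transition rows are made-up names, Token follows the class from the question (with a toString added), and a character that matches no row is reported as an error.<br>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Token as in the question, plus a toString for printing.
final class Token {
    final String value;
    final String category;

    Token(String value, String category) {
        this.value = value;
        this.category = category;
    }

    @Override
    public String toString() {
        return "Token(\"" + value + "\", \"" + category + "\")";
    }
}

final class TableLexer {
    // Hypothetical states; each non-START state doubles as a token category.
    enum State { START, WORD, NUMERIC, WHITESPACE }

    // One table row: in state 'from', a character matching 'pattern' leads to 'to'.
    private static final class Transition {
        final State from;
        final Pattern pattern;
        final State to;

        Transition(State from, String regex, State to) {
            this.from = from;
            this.pattern = Pattern.compile(regex);
            this.to = to;
        }
    }

    private static final List<Transition> TABLE = new ArrayList<>();
    static {
        // In this tiny grammar every state reacts to a character class in the
        // same way, so the rows are generated in a loop; a richer lexer would
        // list them individually.
        for (State from : State.values()) {
            TABLE.add(new Transition(from, "[A-Za-z]", State.WORD));
            TABLE.add(new Transition(from, "\\d", State.NUMERIC));
            TABLE.add(new Transition(from, "\\s", State.WHITESPACE));
        }
    }

    // Find the successor state, or fail if no row matches (the "error rule").
    private static State lookup(State state, char ch) {
        String s = String.valueOf(ch);
        for (Transition t : TABLE) {
            if (t.from == state && t.pattern.matcher(s).matches()) {
                return t.to;
            }
        }
        throw new IllegalArgumentException("No transition from " + state + " on '" + ch + "'");
    }

    private static String categoryOf(State state) {
        return state.name().toLowerCase(Locale.ROOT);
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        State state = State.START;
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            State next = lookup(state, ch);
            // A change of state ends the current token.
            if (next != state && buffer.length() > 0) {
                tokens.add(new Token(buffer.toString(), categoryOf(state)));
                buffer.setLength(0);
            }
            buffer.append(ch);
            state = next;
        }
        // Flush whatever is left at end of input.
        if (buffer.length() > 0) {
            tokens.add(new Token(buffer.toString(), categoryOf(state)));
        }
        return tokens;
    }
}
```

Calling TableLexer.tokenize("We can see 500 cars") yields the nine tokens listed in the question.<br>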