Essays/Recursive Descent Parser
Parser
A parser applies the grammar rules of a formal language to an input stream of tokens. If parsing succeeds, the input is a valid sentence of the language; the output can be an AST (abstract syntax tree) reflecting the grammatical structure of the language elements, or the result of the structured evaluation that the sequence describes.
Recursive Descent
Recursive descent is a top-down algorithm built from mutually recursive procedures, typically one per grammar rule; the same pattern applies to other tasks, such as traversal of a tree structure. Download script: File:Recdesc.ijs
NB. recdesc - recursive descent parser
NB. 10/30/06 Oleg Kobchenko
NB. 04/05/07 Oleg Kobchenko literate
NB. 10/09/07 Oleg Kobchenko consistent with recdesc2

«grammar»

«lexer utilities»

NB. =========================================================
NB. common single class

coclass 'pcalc'

NB. tokens
'S_SPACE S_NUM S_ADD S_SUB S_MUL S_DIV S_LPAR S_RPAR S_ERR'=: i. 9
S_EOF=: _1

destroy=: codestroy

NB. =========================================================
NB. Regex Lexer, Token Pump and Accessors
«lexer»

«parser class»

«token pump»

«grammar rules»

«runner»

«public interface»

«test»
Simple Calculator
As a simple example, we will describe the language of an arithmetic calculator with traditional precedence rules («grammar»):
NB. The following grammar rules are used:
NB.
NB. factor ::= ('+'|'-')* ( num | '(' expr ')' )
NB. term   ::= factor ( ('*'|'%') factor )*
NB. expr   ::= term ( ('+'|'-') term )*
NB. line   ::= expr EOF
so that it can evaluate the following expressions («accept test»)
NB. accept
calc '11+22'
calc '5+2*10'
calc '(5+2)*10'
calc '(11+22)/-(3.0*2/2)'
calc '(11+22)*+(-1-2)'
and raise errors on expressions such as these («reject test»)
NB. reject
calc '22+3/'
calc '22+3/(1+)'
calc '1+abc/2'
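For readers less familiar with J, here is a rough Python transcription of the same grammar: one function per rule, using the same accept/expect pattern as the J script. All names below are mine, not from Recdesc.ijs, and this is a sketch rather than a faithful port (for instance, errors surface as SyntaxError rather than via assert).

```python
import re

# number or any single non-space character; "EOF" is the implicit end marker
TOKENS = re.compile(r"\d+(?:\.\d*)?|[-+*/()]|\S")

def tokenize(s):
    return re.findall(TOKENS, s) + ["EOF"]

class Parser:
    def __init__(self, s):
        self.toks = tokenize(s)
        self.i = 0

    def accept(self, t):
        """Consume the current token if it equals t; report success."""
        if self.toks[self.i] == t:
            self.i += 1
            return True
        return False

    def expect(self, t):
        if not self.accept(t):
            raise SyntaxError(f"expected {t!r}, got {self.toks[self.i]!r}")

    # factor ::= ('+'|'-')* ( num | '(' expr ')' )
    def factor(self):
        sign = 1
        while True:
            if self.accept("-"):
                sign = -sign
            elif not self.accept("+"):
                break
        tok = self.toks[self.i]
        if re.fullmatch(r"\d+(?:\.\d*)?", tok):
            self.i += 1
            return sign * float(tok)
        if self.accept("("):
            r = self.expr()
            self.expect(")")
            return sign * r
        raise SyntaxError(f"unexpected factor {tok!r}")

    # term ::= factor ( ('*'|'/') factor )*
    def term(self):
        r = self.factor()
        while True:
            if self.accept("*"):
                r *= self.factor()
            elif self.accept("/"):
                r /= self.factor()
            else:
                return r

    # expr ::= term ( ('+'|'-') term )*
    def expr(self):
        r = self.term()
        while True:
            if self.accept("+"):
                r += self.term()
            elif self.accept("-"):
                r -= self.term()
            else:
                return r

    # line ::= expr EOF
    def line(self):
        r = self.expr()
        self.expect("EOF")
        return r

def calc(s):
    return Parser(s).line()
```

Each grammar rule maps to one method, and the recursion in the grammar (expr inside parenthesized factor) becomes ordinary mutual recursion between the methods.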
Lexer
The scanner is built with the Regex Lexer technique («lexer utilities»):
NB. =========================================================
NB. regex lexer utilities

require 'regex'
coclass 'z'

lxdefine=: 1 : '''(?x)'' , m : 0'
lxresult=: ((1 i.~[:*@,(1 1,:_ 1)&(];.0)) , {.)"2
lxmatches=: lxresult@rxmatches
lxfrom=: >@(([:' ['&,,&']')&.>)@rxfrom
lxview=: lxmatches (":@[ ,. }."1@[ lxfrom ]) ]
The lexer will reside in the same class as the parser, pcalc. We define the following tokens («lexer»):
NB. =========================================================
NB. lexer part

LEX1=: noun lxdefine
( \s+     )|#  0 space
( [0-9.]+ )|#  1 number
( \+      )|#  2 add
( -       )|#  3 sub
( \*      )|#  4 mul
( /       )|#  5 div
( \(      )|#  6 lpar
( \)      )|#  7 rpar
( .+      ) #  8 error
)
The following test cases («lexer test»)
LEX1_pcalc_ lxview '5+2*10'
LEX1_pcalc_ lxview '(3.0*2/2)'
will produce the following scanning results
   LEX1_pcalc_ lxview '5+2*10'
1 0 1 [5]
2 1 1 [+]
1 2 1 [2]
4 3 1 [*]
1 4 2 [10]
   LEX1_pcalc_ lxview '(3.0*2/2)'
6 0 1 [(]
1 1 3 [3.0]
4 4 1 [*]
1 5 1 [2]
5 6 1 [/]
1 7 1 [2]
7 8 1 [)]
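The token-table idea translates almost mechanically to Python's re module: one capturing group per row, with the index of the matching alternative playing the role of the token code. This is a sketch for comparison only; lxview below is my name for a rough analogue of the J display, returning rows of (code, position, length, text).

```python
import re

# Each alternative mirrors one row of the J LEX1 table.
LEX1 = re.compile(r"""
    (\s+)        # 0 space
  | ([0-9.]+)    # 1 number
  | (\+)         # 2 add
  | (-)          # 3 sub
  | (\*)         # 4 mul
  | (/)          # 5 div
  | (\()         # 6 lpar
  | (\))         # 7 rpar
  | (.+)         # 8 error
""", re.VERBOSE)

def lxview(s):
    """Scan s and return (code, start, length, text) per token."""
    rows = []
    for m in LEX1.finditer(s):
        code = m.lastindex - 1      # 1-based group index -> 0-based token code
        rows.append((code, m.start(), len(m.group(0)), m.group(0)))
    return rows
```

As in the J version, the catch-all (.+) row swallows the rest of the input as a single error token, which is what makes the "Unexpected characters" message possible later.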
Parser Class
The class will also contain creation, destruction and maintenance routines. Each instance will accept an input string, stored in STR. The list of tokens produced by the scanner will be in TOK («parser class»):
NB. =========================================================
NB. creation

create=: 3 : 0
  STR=: y
  TOK=: LEX1 lxmatches STR
  I=: _1
  if. (#TOK)>ie=. S_ERR i.~{."1 TOK do.
    'characters' unexpect ie
  end.
  next''
)
The static cover verb run calls the top grammar rule, line («runner»):
NB. =========================================================
NB. runner

run=: 3 : 0
  p=. ''
  try.
    p=. y conew 'pcalc'
    r=. line__p''
  catch.
    smoutput 13!:12 ''
    r=. i.0 0
  end.
  if. #p do. destroy__p'' end.
  r
)
The verb calc is the public interface to the parser. It accepts a string and returns the result («public interface»):
NB. =========================================================
NB. public interface

calc_z_=: run_pcalc_
The token pump and accessor verbs are defined as follows. Note that an implicit EOF token is appended at the end of the token stream («token pump»):
NB. =========================================================
NB. token pump and accessors

next=: 3 : 0
  I=: I+1
  if. I<#TOK do. SYM=: {.I{TOK
  else. SYM=: S_EOF end.
  1
)

str=: 3 : 0
  if. y>:#TOK do. 'EOF' return. end.
  's b n'=. y{TOK
  (b,:n)];.0 STR
)

pos=: 3 : 0
  if. y>:#TOK do. _1 return. end.
  1{y{TOK
)

val=: 3 : '0".str y'

unexpect=: 'symbol'&$: : (4 : 0)
  y=. {.y,I
  error=. 'Unexpected ',x,' "',(str y),'" at ',":pos y
  error assert 0
)
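The key behavior of next, advancing the cursor I and answering with the implicit S_EOF once past the end, can be mirrored in a short Python sketch (the Pump class and its names are illustrative, not part of the J script):

```python
class Pump:
    """Token pump: advances a cursor and reports an implicit EOF past the end."""
    EOF = -1

    def __init__(self, toks):
        self.toks = toks    # rows of (code, begin, length), as from the lexer
        self.i = -1
        self.next()         # prime sym with the first token, as create does

    def next(self):
        self.i += 1
        # past the last token, keep answering with the implicit EOF symbol
        self.sym = self.toks[self.i][0] if self.i < len(self.toks) else Pump.EOF
        return True
```

As in the J verb, next always reports success, so advancing the pump can double as the "matched" result inside accept.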
Grammar Rules
First, the pump interface verbs («grammar rules»):
NB. =========================================================
NB. parser part

accept=: 3 : '0:`next@.(SYM&=) y'
expect=: unexpect`1:@.accept
Finally, the grammar rules themselves («grammar rules»):
NB. =========================================================
NB. grammar rules

NB. factor ::= ('+'|'-')* ( num | '(' expr ')' )
factor=: 3 : 0
  s=. 1
  while. 1 do.
    if. accept S_SUB do. s=. s*_1
    elseif. accept S_ADD do.
    elseif. do. break. end.
  end.
  if. accept S_NUM do. r=. val I-1
  elseif. accept S_LPAR do.
    r=. expr ''
    expect S_RPAR
  elseif. do. 'factor' unexpect '' end.
  s*r
)

NB. term ::= factor ( ('*'|'%') factor )*
term=: 3 : 0
  r=. factor ''
  while. 1 do.
    if. accept S_MUL do. r=. r * factor''
    elseif. accept S_DIV do. r=. r % factor''
    elseif. do. break. end.
  end.
  r
)

NB. expr ::= term ( ('+'|'-') term )*
expr=: 3 : 0
  r=. term ''
  while. 1 do.
    if. accept S_ADD do. r=. r + term''
    elseif. accept S_SUB do. r=. r - term''
    elseif. do. break. end.
  end.
  r
)

NB. line ::= expr EOF
line=: 3 : 0
  r=. expr ''
  expect S_EOF
  r
)
Testing
The following are some test results («test»),
NB. =========================================================
Note 'Tests'

«lexer test»

«accept test»

«reject test»
)
which is what we expect.
   calc '11+22'
33
   calc '5+2*10'
25
   calc '(5+2)*10'
70
   calc '(11+22)/-(3.0*2/2)'
_11
   calc '(11+22)*+(-1-2)'
_99
   calc '22+3/'
Unexpected factor "EOF" at _1
   calc '22+3/(1+)'
Unexpected factor ")" at 8
   calc '1+abc/2'
Unexpected characters "abc/2" at 2
Further Enhancements
The resulting parser gets the job done, but it is not very flexible: it is hard to extend to different token sources and output targets. So the following enhancements can be considered (see File:Recdesc2.ijs and diff).
- provide a separate lexer interface that encapsulates the token pump logic, so that different sources, including unterminated streams, are possible
accept=: 3 : '0:`next__lex@.sym__lex y'
expect=: 3 : 'unexpect__lex`1:@.accept y'
- provide a separate consumer class family with an AST node creation interface that will be called by the parser rules
term=: 3 : 0
  r=. factor ''
  while. 1 do.
    if. accept S_MUL do. r=. r '*' binop__cons factor''
    elseif. accept S_DIV do. r=. r '%' binop__cons factor''
    elseif. do. break. end.
  end.
  r
)
- the implementation of each consumer decides how to treat the calls: calculate, build a tree, etc. Here is a complete consumer which builds a boxed tree (see results below)
NB. =========================================================
NB. Tree Consumer, static verbs

coclass 'pCalcConsTree'

binop =: 1 : (':';'<x,m;y')
unop  =: 1 : '<m;y'
val   =: <
- the cover public verbs will use the provider pattern
eval_z_=: 'pCalcConsEval'&run_pCalcParse_
tree_z_=: 'pCalcConsTree'&run_pCalcParse_
- better error handling using asserts -- releasing objects and reporting last error -- via the lexer, because it contains the token positions and values
- skip space tokens in the lexer transparently to the parser
Here are some results of the tree consumer and new error messages.
   tree '5 + 2*10'
+-+-+--------+
|5|+|+-+-+--+|
| | ||2|*|10||
| | |+-+-+--+|
+-+-+--------+
   tree '22+3/(1+)'
|Unexpected factor ")" at 8: assert
|       error assert 0
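The consumer idea can be illustrated with a deliberately stripped-down Python sketch (no parentheses or unary signs, and all names below are mine, not from the J scripts): the same grammar rules drive either evaluation or tree building, depending on which consumer object they are handed.

```python
import re

class EvalCons:
    """Consumer that computes the result directly."""
    def val(self, tok):
        return float(tok)
    def binop(self, x, op, y):
        return {'+': x + y, '-': x - y, '*': x * y, '%': x / y}[op]

class TreeCons:
    """Consumer that builds a nested-list AST instead of evaluating."""
    def val(self, tok):
        return tok
    def binop(self, x, op, y):
        return [x, op, y]

def parse(s, cons):
    toks = re.findall(r"[0-9.]+|\S", s) + ["EOF"]
    i = 0

    def peek():
        return toks[i]

    def advance():
        nonlocal i
        i += 1

    def factor():                      # simplified: a factor is just a number
        tok = peek()
        advance()
        return cons.val(tok)

    def term():                        # term ::= factor ( ('*'|'/') factor )*
        r = factor()
        while peek() in "*/":
            op = '*' if peek() == '*' else '%'   # '%' mirrors J's divide
            advance()
            r = cons.binop(r, op, factor())
        return r

    def expr():                        # expr ::= term ( ('+'|'-') term )*
        r = term()
        while peek() in "+-":
            op = peek()
            advance()
            r = cons.binop(r, op, term())
        return r

    return expr()

def eval_(s):
    return parse(s, EvalCons())

def tree(s):
    return parse(s, TreeCons())
```

Here tree('5+2*10') yields ['5', '+', ['2', '*', '10']], the Python analogue of the boxed tree above, while eval_ on the same string computes 25; the parser itself is unchanged between the two.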
JSON Parser Example
JSON (JavaScript Object Notation) is a lightweight data-interchange format based on the literal syntax of JavaScript data structures. Besides primitive string and numeric constants, it has lists [elm1, elm2, ...] and hash tables {key1 : val1, key2 : val2, ...}, whose elements are recursively defined.
For the lexer we use the J ;: (Words) primitive, which suits our needs almost perfectly, with a few limitations: strings must be enclosed in apostrophes '', a colon must be preceded by a space, etc.
Both examples use the following grammar. A notable feature is that each bracketed structure is split between two rules: the opening bracket is consumed by Val, and the rest by List or Hash.
NB. Val  ::= '[' List
NB.        | '{' Hash
NB.        | token
NB. List ::= ']'
NB.        | Val (',' Val)* ']'
NB. Hash ::= '}'
NB.        | Pair (',' Pair)* '}'
NB. Pair ::= Val ':' Val
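A Python sketch of this grammar shows the split: Val consumes the opening bracket and dispatches, while List and Hash consume the rest, including the trailing comma-or-closer token. The tokenizer below is only a stand-in for J's ;: (and does not share its limitations), and all names are mine; hashes become Python dicts where the J scripts use two-row boxed tables.

```python
import re

def tokenize(s):
    # strings in apostrophes, structural punctuation, or bare words
    return re.findall(r"'[^']*'|[\[\]{},:]|[^\s\[\]{},:]+", s)

def parse(s):
    toks = tokenize(s) + ["EOF"]
    i = 0

    def token():
        nonlocal i
        t = toks[i]
        i += 1
        return t

    def val():                      # Val ::= '[' List | '{' Hash | token
        t = token()
        if t == '[':
            return lst()
        if t == '{':
            return hsh()
        return t                    # primitive token, kept as its text

    def lst():                      # List ::= ']' | Val (',' Val)* ']'
        nonlocal i
        if toks[i] == ']':
            i += 1
            return []
        r = [val()]
        while (t := token()) == ',':
            r.append(val())
        if t != ']':
            raise SyntaxError('expected ]')
        return r

    def hsh():                      # Hash ::= '}' | Pair (',' Pair)* '}'
        nonlocal i
        if toks[i] == '}':
            i += 1
            return {}
        r = dict([pair()])
        while (t := token()) == ',':
            r.update([pair()])
        if t != '}':
            raise SyntaxError('expected }')
        return r

    def pair():                     # Pair ::= Val ':' Val
        k = val()
        if token() != ':':
            raise SyntaxError('expected :')
        return k, val()

    return val()
```

The separator and closer are read by the same token() call: the while condition keeps looping on ',' and the check after the loop verifies the closer, which is exactly how the two-entry rules in the grammar avoid lookahead.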
File:Json.ijs uses a simpler token pump based only on the verb token.
File:Json1.ijs uses the accept/expect pattern as in pcalc above.
Lists are stored as boxed lists and hash tables as boxed two-row tables of keys and values. For example, for input
T1=: 0 : 0
[123, 'ab c', {qq : zz, asb :[12,34]}, [3, 5, [], {a :[]}, zz]]
)
the result is
   p=. conew'ppar1'
   >parse__p T1
+---+------+------------+-------------+
|123|'ab c'|+--+-------+|+-+-++---+--+|
|   |      ||qq|asb    |||3|5||+-+|zz||
|   |      |+--+-------+|| | |||a||  ||
|   |      ||zz|+--+--+||| | ||+-+|  ||
|   |      ||  ||12|34|||| | ||| ||  ||
|   |      ||  |+--+--+||| | ||+-+|  ||
|   |      |+--+-------+|+-+-++---+--+|
+---+------+------------+-------------+
   codestroy__p''
See Also
- Regex Lexer, a scanner built using the regex library
- WikiPedia:Recursive_descent, Wikipedia
- WikiPedia:Parsing, Wikipedia