Essays/Regex Lexer
A lexer (or scanner) is a program that performs lexical analysis: breaking a stream of characters into a sequence of consecutive groups, called tokens. Download File:Rxlex.ijs.
Lexer
Lexical analysis is typically performed by a sequential machine, such as the dyad ;: Sequential Machine. However, it requires a state table, which is very hard to construct by hand for anything but very simple rules.
To simplify lexer construction, tools such as UNIX lex are used, which are based on regular expressions. Implementations of regular expression matching themselves employ state machines.
So, given a regular expression facility such as the regex library in J, it is already possible to build a decent lexer in a fairly straightforward way.
Principles
We will start with a simple case. Tokens are alphabetical words separated with spaces:
TEST1a=: 'one two three'              NB. normal test
TEST1b=: 'one two1 three'             NB. error
TEST1c=: ($~ 1000*#)'one two three '  NB. 16000 chars
To represent a token, we will use a regex group, i.e. a subpattern within (...). An instance of match will be the disjunction of such groups, (...)|(...)|...|(...). Then the token ID will be the index of the group in the match array.
Thus we obtain for our case,
LEX1=: '(\s+)|([a-z]+)|(.+)' NB. space | word | error
We are ready to roll. But first we apply some notational sugar, defined in the Utilities section at the end of this text.
   LEX1 lxview TEST1a
1  0 3 [one]
0  3 1 [ ]
1  4 3 [two]
0  7 3 [ ]
1 10 5 [three]
   LEX1 lxview TEST1b
1 0 3 [one]
0 3 1 [ ]
1 4 3 [two]
2 7 7 [1 three]
lxview displays one character row per token, in the format (index,start,size,content).
It builds on the numeric matrix returned by lxmatches, whose rows are (index,start,size).
Performance is quite reasonable, considering the amount of effort to define the lexer.
   ts 'LEX1 lxmatches TEST1c'
0.903204 396032
Details
As with other lexer builders, there are certain factors that need to be considered to obtain a successful lexer.
Defining Groups
Groups in the regex correspond to token types, so it is important not to introduce any extra capturing groups, whether within or outside token patterns. Where grouping is needed inside a token, use non-capturing subpatterns (?:...).
Groups are matched on a first-found basis, so care should be taken with the order of groups: earlier alternatives take precedence at each position.
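The ordering pitfall can be illustrated in Python terms (the keyword/identifier split below is an assumed example, not one of the J lexers in this essay):

```python
import re

# Alternatives are tried left to right at each position, not by longest
# match. With the keyword group first, the identifier 'iffy' is wrongly
# split into 'if' + 'fy'; listing the more general group first (or
# anchoring the keyword with \b) avoids this.
KEYWORD_FIRST = re.compile(r'(if)|([a-z]+)')
WORD_FIRST    = re.compile(r'([a-z]+)|(if)')

a = [m.group() for m in KEYWORD_FIRST.finditer('iffy')]
b = [m.group() for m in WORD_FIRST.finditer('iffy')]
# a == ['if', 'fy'], b == ['iffy']
```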
Error Handling
An error in lexical analysis can be thought of as a non-match, or in other words, a gap between two matches.
While it would be possible to handle a non-match in a loop external to the regex, it is simpler to delegate it to the regex itself by providing a last, all-absorbing group, e.g. (.+), whose index, being the maximum, signifies that all previous groups failed.
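The catch-all idea can be sketched in the same hypothetical Python terms: the highest-numbered group only matches when every preceding token pattern has failed, so its index marks a lexing error.

```python
import re

# The last group (.+) absorbs anything the real token patterns reject.
LEX = re.compile(r'(\s+)|([a-z]+)|(.+)')
ERROR_ID = LEX.groups - 1          # 0-based ID of the last (error) group

errors = [(m.start(), m.group())
          for m in LEX.finditer('one two1 three')
          if m.lastindex - 1 == ERROR_ID]
# errors == [(7, '1 three')]
```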
Extended Definition
For more complex lexers a single line of regex is not convenient, so we will use the extended syntax, which allows us to format the definition with one line per token and convenient comments.
Here is how the same case looks:
LEX2=: noun lxdefine
( \s+    )|# 0 space
( [a-z]+ )|# 1 words
( .+     )  # 2 error
)
With identical results,
   (LEX1 lxmatches TEST1a) -: LEX2 lxmatches TEST1a
1
   (LEX1 lxmatches TEST1b) -: LEX2 lxmatches TEST1b
1
Note that the time is the same, since both definitions compile into the same form,
   ts 'LEX2 lxmatches TEST1c'
0.882967 396032
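Python's re.VERBOSE flag corresponds to the (?x) prefix that lxdefine prepends; a hypothetical sketch of LEX2 in those terms, with the same one-token-per-line layout and # comments:

```python
import re

# In verbose mode, unescaped whitespace is ignored and '#' starts a
# comment, so the definition can be laid out one token per line.
LEX2 = re.compile(r'''
    ( \s+    ) |    # 0 space
    ( [a-z]+ ) |    # 1 words
    ( .+     )      # 2 error
''', re.VERBOSE)

tokens = [(m.lastindex - 1, m.group()) for m in LEX2.finditer('one two three')]
# tokens == [(1, 'one'), (0, ' '), (1, 'two'), (0, ' '), (1, 'three')]
```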
More Complex Example
Here we will create a prototype XML lexer to handle input such as
TEST3a=: '<a>qq</b>'
TEST3b=: ' <? zz ?><a>qq</b>'
TEST3c=: '<a> <!-- cc --> </b>'
TEST3d=: ' <? zz ?><a>qq</b><error'
TEST3e=: ' <? zz ?><a>qq</b>< error>'
TEST3f=: ($~ 1000*#)' <? zz ?><a>qq</b>'  NB. 18000 chars
We will only care about discerning the various tags and text, without parsing the internals of each tag.
LEX3=: noun lxdefine
( <[a-z][^>]*>                   )|# 0 beg tag
( </[a-z][^>]*>                  )|# 1 end tag
( <\?(?:[^?]|\?[^>])*\?>         )|# 2 pi
( <!--(?:[^-]|-[^-]|--[^-]|)*--> )|# 3 comment
( [^<>]+                         )|# 4 text
( .+                             )  # 5 error
)
Here's what we've got
   LEX3 lxview TEST3a
0 0 3 [<a>]
4 3 2 [qq]
1 5 4 [</b>]
   LEX3 lxview TEST3b
4  0 1 [ ]
2  1 8 [<? zz ?>]
0  9 3 [<a>]
4 12 2 [qq]
1 14 4 [</b>]
   LEX3 lxview TEST3c
0  0  3 [<a>]
4  3  1 [ ]
3  4 11 [<!-- cc -->]
4 15  1 [ ]
1 16  4 [</b>]
   LEX3 lxview TEST3d
4  0 1 [ ]
2  1 8 [<? zz ?>]
0  9 3 [<a>]
4 12 2 [qq]
1 14 4 [</b>]
5 18 6 [<error]
   LEX3 lxview TEST3e
4  0 1 [ ]
2  1 8 [<? zz ?>]
0  9 3 [<a>]
4 12 2 [qq]
1 14 4 [</b>]
5 18 8 [< error>]
A pleasant surprise is that this more complex construct, with more token types, takes less time:
   ts 'LEX3 lxmatches TEST3f'
0.784982 789248
Case Study: Word Formation on Lines
We build a lexer roughly following Word Formation on Lines. Thanks to the intuitive nature of regexes, it takes only about 30 minutes to build the lexer incrementally, observing the results at each step.
LEX4=: noun lxdefine
( [0-9_][0-9a-z_\t .]*[.:]* )|# 0 numeric
( NB\..*                    )|# 1 comment
( [a-z][0-9a-z_]*[.:]*      )|# 2 alpha
( '(''|[^'])*'              )|# 3 string
( \r?\n|\r                  )|# 4 line break
( [ \t]+                    )|# 5 space
( .[.:]*                    )|# 6 symbol
( .+                        )  # 7 error
)
The following test cases were used
...
TEST4g=: 'ab_c: -./ i.2 3 4'
TEST4h=: 'ab_c: -./ i.2 3 NB. 3 4'
TEST4i=: 'ab ; ''cd''''ef'' ,": 2 3 4'
TEST4j=: '1 2+3 2',LF,'4 2',CR,'5 2',CRLF,'6 2'
An example of output
   LEX4 lxview TEST4i
2  0 2 [ab]
6  2 1 [ ]
7  3 1 [;]
6  4 1 [ ]
3  5 8 ['cd''ef']
6 13 1 [ ]
7 14 1 [,]
7 15 2 [":]
6 17 1 [ ]
0 18 5 [2 3 4]
However, performance somewhat degrades on large input. Whether the cost lies in the J code calling the regex library or in the intrinsic nature of regexes needs further investigation. Here is the Performance Monitor output:
   z=: 1e5 $ 'abcolumb+boustrophedonic-chthonic*'
   load 'jpm'
   start_jpm_ ''
357142
   $LEX4 lxmatches z
17647 3
   showtotal_jpm_''
 Time (seconds)
+--------+------+--------+--------+-----+----+----+
|name    |locale|all     |here    |here%|cum%|rep |
+--------+------+--------+--------+-----+----+----+
|rxmatch |jregex|1.750778|1.040360| 57.6| 58 |7763|
|jregexec|jregex|0.688747|0.688747| 38.1| 96 |7763|
|[rest]  |      |        |0.078568|  4.3|100 |    |
|[total] |      |        |1.807675|100.0|100 |    |
+--------+------+--------+--------+-----+----+----+
   0 1 showdetail_jpm_'rxmatch_jregex_'
 Time (seconds)
+--------+--------+----+-------------------------------------------------------------------+
|all     |here    |rep |rxmatch_jregex_                                                    |
+--------+--------+----+-------------------------------------------------------------------+
|0.054887|0.054887|7763|dyad                                                               |
|0.082529|0.082529|7763|[0] if. lb=.32=3!:0 x do. ph=.>0{x else. ph=.x end.                |
|0.048568|0.048568|7763|[1] if. cx=.2=3!:0 ph do. hx=.rxcomp ph                            |
|0.075010|0.075010|   0|[2] else. rxlastxrp=:>1{((hx=.ph)-1)({"1)rxpatterns end.           |
|0.060418|0.038747|7763|[3] nsub=.rxnsub rxlastxrp                                         |
|0.912908|0.224161|7763|[4] rxlastrc=:>0{rv=.jregexec rxlastxrp;(,y);rxms;((2*rxms)$_1 0);0|
|0.043624|0.043624|7763|[5] if. cx do. rxfree hx end.                                      |
|0.045776|0.045776|7763|[6] m=.(nsub,2)$>4{rv                                              |
|0.039382|0.039382|7763|[7] t=.(0{"1 m)                                                    |
|0.159160|0.159160|7763|[8] m=.t,.-~/"1 m                                                  |
|0.085159|0.085159|7763|[9] m=._1 0((t=_1)#i.#t)}m                                         |
|0.143358|0.143358|7763|[10] if. lb do. (>1{x){m else. m end.                              |
|1.750778|1.040360|7763|total dyad                                                         |
+--------+--------+----+-------------------------------------------------------------------+
 Space (bytes)
+-----------+-----------+----+-------------------------------------------------------------------+
|all        |here       |rep |rxmatch_jregex_                                                    |
+-----------+-----------+----+-------------------------------------------------------------------+
|259,553,664|259,553,664|7763|dyad                                                               |
| 10,433,472| 10,433,472|7763|[0] if. lb=.32=3!:0 x do. ph=.>0{x else. ph=.x end.                |
|  2,980,608|  2,980,608|7763|[1] if. cx=.2=3!:0 ph do. hx=.rxcomp ph                            |
|  4,471,488|  4,471,488|   0|[2] else. rxlastxrp=:>1{((hx=.ph)-1)({"1)rxpatterns end.           |
|  1,490,496|  1,490,496|7763|[3] nsub=.rxnsub rxlastxrp                                         |
|270,483,968|  1,490,496|7763|[4] rxlastrc=:>0{rv=.jregexec rxlastxrp;(,y);rxms;((2*rxms)$_1 0);0|
|    496,832|    496,832|7763|[5] if. cx do. rxfree hx end.                                      |
|          0|          0|7763|[6] m=.(nsub,2)$>4{rv                                              |
|  3,974,656|  3,974,656|7763|[7] t=.(0{"1 m)                                                    |
|  1,490,496|  1,490,496|7763|[8] m=.t,.-~/"1 m                                                  |
|  7,949,312|  7,949,312|7763|[9] m=._1 0((t=_1)#i.#t)}m                                         |
|  1,490,496|  1,490,496|7763|[10] if. lb do. (>1{x){m else. m end.                              |
|564,815,488|295,822,016|7763|total dyad                                                         |
+-----------+-----------+----+-------------------------------------------------------------------+
Further Enhancements
A few simple enhancements would provide more powerful functionality while preserving the simplicity of the design.
- Our lexer so far uses rxmatches, which contains a loop. If we wrote our own loop, we would have more control over processing tokens, such as handling non-matches.
- If we own the loop, we could add a contexts feature, i.e. several sets of groups switched on the occurrence of particular tokens; in effect, a simple high-level state machine.
- The comment space in the definition could host J expressions that process the tokens on the fly, forming an evaluator stage.
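The first two enhancements can be sketched together in hypothetical Python terms: a hand-rolled loop over several compiled group sets, switched when a particular token fires, with a non-match reported directly (this is an illustration of the idea, not part of the J implementation):

```python
import re

# Two contexts, each an alternation of capturing groups; a token can
# switch the active context, giving a simple high-level state machine.
CONTEXTS = {
    'text': re.compile(r'(<)|([^<]+)'),   # 0 tag open, 1 text
    'tag':  re.compile(r'(>)|([^>]+)'),   # 0 tag close, 1 tag body
}
SWITCH = {('text', 0): 'tag', ('tag', 0): 'text'}

def lex(s):
    ctx, pos, out = 'text', 0, []
    while pos < len(s):
        m = CONTEXTS[ctx].match(s, pos)
        if m is None:                        # non-match: a lexing gap
            out.append(('error', ctx, s[pos:]))
            break
        tid = m.lastindex - 1
        out.append((ctx, tid, m.group()))
        ctx = SWITCH.get((ctx, tid), ctx)    # context switch on token
        pos = m.end()
    return out

result = lex('<a>qq')
# result == [('text', 0, '<'), ('tag', 1, 'a'), ('tag', 0, '>'), ('text', 1, 'qq')]
```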
Utilities
These definitions do not add any lexing functionality of their own; they are just facilities to present input and format results.
require 'regex'

lxdefine=: 1 : '''(?x)'' , m : 0'
lxresult=: ((1 i.~[:*@,(1 1,:_ 1)&(];.0)) , {.)"2
lxmatches=: lxresult@rxmatches
lxfrom=: >@(([:' ['&,,&']')&.>)@rxfrom
lxview=: lxmatches (":@[ ,. }."1@[ lxfrom ]) ]
See Also
- Regular Expressions Lab, Guide to regex library
- Recursive Descent Parser, Top-down parser built using Regex Lexer
- WikiPedia:Lexical_analysis, Wikipedia
- WikiPedia:Lex_programming_tool, Wikipedia
Contributed by Oleg Kobchenko