Alex Schroeder/Sequential State Machine Introduction
I wanted to read integers from files, no matter whether they were separated by newlines, commas, or anything else. I needed a parser that extracted numbers from a string and I didn't understand the Sequential Machine example.
First, create an array mapping every input byte to a code. In this case, we create an array with 256 zeroes, and then we set the code to 1 for every digit.
m=: 256$0 m=: 1 (a.i.'0123456789')}m
The result looks something like this: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... – as you can see there are a bunch of ones in there. :)
Next we need a state transition table. It has 3 dimensions.
- the first axis is the current state we're in: 0 means we're skipping stuff, 1 means we're reading a number – we don't need any more states
- the second axis is the current code we're looking at: 0 means we're looking at some byte that's not a digit, 1 means we're looking at a digit (this is where m is used)
- the third axis is simply two integers: the first one is the new state (we only have 0 or 1 to choose from), and the second one is what to do (the output code: 0 means doing nothing, 1 means start a word, 3 means emit a word and reset)
OK, so here's how I built it:
NB. state 0 state 1 s=: 2 2 2$ 0 0 1 1 0 3 1 0
Or graphically:
<"1 s ┌───┬───┐ │0 0│1 1│ ├───┼───┤ │0 3│1 0│ └───┴───┘
Rows are the current state, columns are the input code we're looking at.
Can you see it? I started thinking about it like this:
- given a string such as '42 16', starting at index 0, in state 0, with j being the beginning of a word pointing at -1 (in other words, no word is started) ...
- we look at '4' and get the new state from our m: 1 (as we're looking at a digit)
- given this information, we need to look at the cell [1 1] (current state 0, input code 1)
- the new state is set to 1, and output code 1 means we begin a new word at the current index 0
- we look at '2' and get the new state from our m: still 1
- now we're looking at the cell [1 0] (current state 1, input code 1)
- the new state continues set to 1, and output code 0 means nothing happens
- we look at the space character and get the new state from our m: 0
- now we're looking at the cell [0 3] (current state 1, input code 0)
- the new state changes to 0, and output code 3 means we emit a word, starting from index 0 (that's when we started it) up to the current position ("42") and then we set j to -1 again
And so on.
The output code 3 means we emit a word and we do *not* begin a new word. This is important at the end: we could have used the output code 2 but that means we emit a word and begin a new one (j remains set), and the result is that any trailing garbage at the end is turned into a word.
And now the parser works for everything:
(0;s;m) ;: '42x16xx1' ┌──┬──┬─┐ │42│16│1│ └──┴──┴─┘
You might be wondering about the 0 at the beginning of that parser definition. The answer is that this tells the parser what to do with the words it finds. 0 means that the words end up in boxes. 2 means the word index and length, for example:
(2;s;m) ;: '42x16xx1' 0 2 3 2 7 1