Talk:Essays/RegularExpressions/NondeterministicFiniteAutomata
In this essay, I attempted to use the line numbering feature of the syntaxhighlight tag to distinguish between "script window" content and "console window" content.
It works rather well, but not as well as I would like.
Ideally, I'd like a supporting tool to deal with issues like tracking line numbers, so that I could quickly add content and also easily go back and edit earlier content without having to manually renumber everything. (Consequently, I've abandoned this approach in the current draft for "chapter 2".)
Also, I should be representing a blank line between each quoted section of code, but I have not thought of a good way of doing that. Maybe just skipping a line number would be the right approach, once people got used to it? --Raul Miller (talk) 15:39, 20 January 2024 (UTC)
Tokenization
I gave some thought to tokenization, which I'll summarize here. I'll call it "group capture" since that brings more focus on the actual process. The idea is to take some part of the regular expression and find out how much of a successful parse covered that section. So if you have a regular expression like '(((ab)+)|((ac)+))d' you might be interested in knowing how much of the string being processed is ab's.
To discuss this more effectively, it's helpful to take a step back. The algorithm you've provided for parsing regular expressions is an NFA. So we have a set of starting states and a state machine which we walk through as we process input. Each character of the input takes the current set of states and uses the state machine to find the next set of states. A successful parse has the "end state" as an element of the final set of states after all of the input has been processed.
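For concreteness, here is a minimal sketch of a single such step, assuming the representation that the match verb below works with: a table of boxed next-state lists indexed by (state, character code), plus a current set of states. The two-state toy machine here is made up purely for illustration.
   t=. (2, #a.) $ a:                     NB. toy machine: two states, no transitions yet
   t=. (<,1) (<0, a.i.'a') } t           NB. state 0 consumes 'a' and moves to state 1
   t=. (<,0) (<1, a.i.'b') } t           NB. state 1 consumes 'b' and moves back to state 0
   s=. ,0                                NB. current set of states
   ~. ; (a.i.'a') {"1 t #~ (i.#t) e. s   NB. one step: the set of states after consuming 'a'
1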
But more specifically, we build up state machines from chf constructions, which take raw text as input. All other constructions take the output of other constructions as input. So these are effectively the "leaves" of the construction. Moreover, each of the constructions creates a state machine with as many rows as the sum of those of its arguments. I.e. they manipulate either the content of the input constructions or the initial state, but never add new rows. Long story short, our state machine ends up having as many rows as there are "character classes" in our regular expression. (Here I extend character classes to include single character matches.) So for the regular expression above we end up with a state machine with five rows, corresponding to the characters abacd. Each row corresponds to a state we pass through. So seeing whether a parse goes through a section of the regular expression corresponds to finding the associated states in the string of states that we passed through during parsing.
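To make that concrete (a sketch; classes is a name introduced here purely for illustration, listing the character classes of '(((ab)+)|((ac)+))d' in order of appearance):
   classes=. 'abacd'
   #classes          NB. one row (state) per character class; the accept state gets index 5
5
   classes {~ 2 3    NB. a parse that passes through states 2 3 went through the (ac) branch
ac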
To that end, we'll want to adjust match to return not just whether the match succeeded or not, but also the list of states run into during processing. (Setting aside for the moment what a clean interface would look like.)
match=: {{
  i=. i.#t ['s t'=. y        NB. y is start states ; transition table
  ss =. 0 {:: y              NB. running record of the state sets, newest first
  for_c. a.i.x do.           NB. one step per character code in x
    s=. ~.;c{"1 t#~ i e. s   NB. union of the transitions out of the current states on c
    ss =. s;ss
  end.
  (ss);(#t)e.s               NB. the state sets, and whether the accept state was reached
}}
This would yield something like the following, with the final set of states coming first in the list.
   disp rx
┌───┬────────────┐
│0 2│┌─┬─┬┬─┬─┬─┐│
│   ││ │a││ │ │ ││
│   │├─┼─┼┼─┼─┼─┤│
│   ││b│ ││ │b│ ││
│   │├─┼─┼┼─┼─┼─┤│
│   ││ │ ││a│ │ ││
│   │├─┼─┼┼─┼─┼─┤│
│   ││ │ ││ │c│ ││
│   │├─┼─┼┼─┼─┼─┤│
│   ││ │ ││ │ │d││
│   │└─┴─┴┴─┴─┴─┘│
└───┴────────────┘
   'abababd' match rx
┌───────────────────────────┬─┐
│┌─┬───┬─┬───┬─┬───┬───┬───┐│1│
││5│0 4│1│0 4│1│0 4│1 3│0 2││ │
│└─┴───┴─┴───┴─┴───┴───┴───┘│ │
└───────────────────────────┴─┘
So, our start states are 0 2, which lead to 1 3 when we consume the first 'a'; but the 3 goes nowhere when we consume 'b', whereas the 1 leads us to states 0 4, etc.
So reading backwards, we hit the states 0 1 0 1 0 1 4 5, which correspond to abababd[exit]. So finding the ab's in the input amounts to looking for the 0 1's in the output stream. The hitch is that there are "dead ends" to be weeded out. We seem to have a 2 3 in the output as well, but this doesn't mean we successfully matched ac, because when we were in state 3 we didn't actually match a c.
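To see that dead end concretely, we can probe the transition table directly (a sketch; as in the session below, sm is just the transition-table half of rx, whose cells are boxed next-state lists indexed by state and character code):
   sm=. 1{::rx
   #;(<3, a.i.'b') { sm    NB. state 3 (the c of ac) has nowhere to go on 'b'
0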
Thus what we want is to keep only states which eventually lead to the "accept state". We can do this by walking through this list and keeping only those states in a given item which actually lead to one of the kept states in the previous item, given the input character consumed at that time. We do this recursively, so that we remove not only states which lead nowhere, but also states which lead only to states which lead nowhere, etc.
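Spelled out by hand for the last step of the example (a sketch; sm is again the transition-table half of rx, and trans is just a throwaway name): just before the final 'd' the live states were 0 4, and only 4 reaches the kept set (the accept state 5), so only 4 survives.
   sm=. 1{::rx
   trans=. (a.i.'d') {"1 (0 4 { sm)   NB. where states 0 and 4 lead on 'd'
   0 4 #~ +./@(e.&5)&> trans          NB. keep only the states that reach the kept set
4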
In the end we are left with a single state at each step, corresponding to the input which was processed.
   ii=.'abababd'
   ]rr=.0{::ii match rx
┌─┬───┬─┬───┬─┬───┬───┬───┐
│5│0 4│1│0 4│1│0 4│1 3│0 2│
└─┴───┴─┴───┴─┴───┴───┴───┘
   sm =. 1{::rx
   prn =. {{<(1{::y)#~ x >:/"1@:e.&> m {~ <@,~"0&>/y}}
   ({. rr) ]F:: (sm prn~) |. (<"0 a.i.|.ii),"0 }.rr
┌─┬─┬─┬─┬─┬─┬─┐
│4│1│0│1│0│1│0│
└─┴─┴─┴─┴─┴─┴─┘
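Reading that result back in input order and looking up each state's character class recovers the text that was matched, and the 0 1 pairs locate the ab's (a sketch, relying on the row-to-character-class correspondence 'abacd' described above):
   'abacd' {~ |. 4 1 0 1 0 1 0
abababd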
Et voila! This is the meat of the solution. Still to be worked out is a good API for describing what you want captured. I haven't worked this out but I suspect that using a parse tree for the regular expression would be of some use.
Doug Mennella (talk) 02:19, 19 September 2024 (UTC)
- Yes... the accept state is a mandatory part of the matched group, and using it does help discard some "potentially relevant match states". However, there's an additional difficulty which is worth considering:
- When there are multiple match groups, and there are multiple "accept states" for each match group, we have to pick one of those "accept states" for each group. But if we're too greedy (or insufficiently greedy) in our early groups, this may prevent some later (mandatory) parts of the regular expression from matching.
- For a relatively simple example, consider matching
^((ac)+c)((ac)+c)((ac)+c)$
against the string "acaccacacacaccacaccacacaccacacc".
- So... we start at the last accept state and work backwards. But I think, in the general case, this involves something similar to matching a "meta regular expression" against our internal state representation. --Raul Miller (talk) 20:43, 19 September 2024 (UTC)
- I’ve read up a bit recently on TNFA’s (tagged NFA’s) which can handle these cases of ambiguous matches, but to be honest they don’t seem like things which would come up in real world examples.
- I.e. I’d probably rather have half a loaf than no loaf at all.
- Doug Mennella (talk) 02:02, 20 September 2024 (UTC)
- Also, I think the example you gave is not ambiguous. You need an “or” somewhere to have alternate paths. There would be nine states in this state machine, not three. It also doesn’t match that string. To be clear, you still run the state machine forwards, you just test the paths backwards. You could also employ something like a lookahead to do it in a single pass but that’s beside the point.
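- One quick way to see that it can't match (a sketch of a counting argument, separate from the NFA code): each ((ac)+c) group contains exactly one "cc", and no new "cc" can form where one group ends and the next begins, so a match would need exactly three occurrences of "cc", but the string has five:
   +/ 'cc' E. 'acaccacacacaccacaccacacaccacacc'
5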
- Doug Mennella (talk) 05:06, 20 September 2024 (UTC)
- Pathological cases (where a regex match takes minutes or hours) tend to be rare, but they do occur in real life (I've encountered them in web publishing contexts, and I've known other people who have encountered them in similar contexts). When they occur in a large body of code, debugging is a slow realization process. That said, the example I remember involved a production pcre implementation, and I don't remember whether match groups were involved.
- That said, it's probably also worth noting that regex implementations get used in DNA analysis. A human DNA molecule might be roughly represented as something like
'acgt'{~?3e9#4
(though it looks like these days 'acgu' gets used rather than 'acgt').
- Anyways... building a functioning implementation model would be a good start, but I think it would be worth revisiting this issue once the initial hurdles have been overcome.
- Though, granted, it's a hard problem. (And the pathological behavior of professional regex implementations is what motivated my interest in the subject.) --Raul Miller (talk) 12:15, 20 September 2024 (UTC)
- Oh, okay. I'll leave it to you to finish this off then.
- Doug Mennella (talk) 12:20, 20 September 2024 (UTC)