User:Tom McGuire/AITextProcessing
Introduction
AI as it is practiced in 2024 involves setting up large neural networks in various configurations and simulating them on a computer. At their base, the simulation methods revolve around matrix multiplication. This is obvious to anyone who has played with a three-layer backpropagation network. Inputs are fed into a layer of neuron functions, the outputs of each layer are fed into the next layer, and the final layer produces the output. Every input goes to every neuron in the input layer, and every output goes to every neuron in the next layer. A weight is applied to each input of a neuron, so a layer can be represented by a matrix of weights. The neurons are usually chosen to all implement the same output function to simplify coding.
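To make that concrete, here is a minimal J sketch of a two-layer forward pass as nothing but matrix products and an activation function. The names (sigmoid, fwd, w1, w2, inp) and the random weights are illustrative only, not taken from any library:

NB. a dense layer is a weight matrix applied with the matrix product +/ .*
sigmoid =: [: % 1 + ^@-              NB. 1 % (1 + e^-y), applied elementwise
fwd =: 4 : 'sigmoid x +/ .* y'       NB. x: weight matrix, y: input vector
w1 =: ? 4 3 $ 0                      NB. hidden layer: 4 neurons, 3 inputs each
w2 =: ? 2 4 $ 0                      NB. output layer: 2 neurons, 4 inputs each
inp =: 0.5 0.1 0.9
w2 fwd w1 fwd inp                    NB. forward pass: two matrix multiplies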
There is frequent mention of tensor operations in libraries such as PyTorch, TensorFlow, llama.cpp, etc. But so far in my investigation, ordinary matrix manipulation and multiplication is all that is needed. The major issue becomes the scale of these operations for use in AI. Words are turned into vectors, vectors are collected in matrices, and neural networks are matrices that must align on the dimension of the word vectors. The sheer size of the data matrices being pushed around in memory grows quite large quite fast. To speed up processing of these large matrices, the AI industry has come to rely on graphics cards with high numbers of GPU cores. The library layers that sit on the graphics cards are essentially C-like kernel code that gets compiled into a set of standing operations performed in parallel, thereby speeding up the task of calculating for AI.
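The "words are turned into vectors" step is itself just array indexing: token ids select rows of an embedding matrix. A toy sketch (E, ids, and the sizes are made-up names for illustration):

NB. token ids select word vectors (rows) from an embedding matrix
E =: ? 5 4 $ 0     NB. toy vocabulary of 5 tokens, embedding dimension 4
ids =: 2 0 3       NB. a 3-token input sentence
ids { E            NB. 3 x 4 matrix of word vectors, ready for the network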
You would think that J or other array programming languages would be the obvious choice in this environment. Had AI been built on top of the CPUs and memory that come in a standard but high-end computer, it probably would be. But desktop computers have always been schizophrenic in the types of operations they perform. Graphics are rendered by sending operations to a driver program that interfaces to a graphics card, and that graphics card produces output on a screen for humans to interact with. This controller/peripheral paradigm splits functionality. J (and most array languages) operate on the CPU to perform general-purpose calculations. GPU programming became the domain of graphics operations, done by interfacing with a C/C++ driver program that operates the GPU. Further separating out the graphics space is the fact that for gaming and ray tracing you need to program the GPU directly. There is only one way to do this: through a shading language that gets compiled into instructions the graphics card can understand. This gets fragmented again because graphics cards from different manufacturers support different shading languages.
My belief is that had the graphics card industry not been so fragmented, an array language may have been able to provide direct language support for graphics cards. Instead, shading languages loosely resemble the C language, each major manufacturer has their own version, and drivers are written in C. Gaming engines fragmented this further by writing the engine in C or C++, with each engine providing its own scripting language or support for some other popular language.
Now AI comes along, needs the use of the graphics card, and has gone the same route as the gaming engine field. AI libraries are written to sit on top of the graphics card (mainly NVIDIA cards), and then a scripting interface is written in a favorite high-level language. Due to its popularity and its ability to adapt via its foreign function interface, Python has become the high-level language of the AI space. Some of that is driven by NumPy, an array language implementation that adapts itself into the Python object space.
Basic Tokenizer
The BasicTokenizer just gets words from a string and separates them out into their own strings. In J, boxing is probably the easiest representation. At the April 2024 NYCJUG meeting, Devon McCormick presented a beginner's regatta that was the beginning of this process. In summary, he wanted to reduce the whitespace between words down to a single TAB character between each word. He had tried a couple of J idioms combined to make this happen, and then he came to the words verb ';:'.
   spaces2TAB=: 13 : '}:;TAB,~&.>;:y'
   spaces2TAB
[: }: [: ; (a.{~9) ,~&.> ;:
   samp=. 'Here is some text',(' ',TAB,' '),'with',TAB,'different',(' ',TAB),'kinds of whitespace embedded in it.  '
   ;:samp
┌────┬──┬────┬────┬────┬─────────┬─────┬──┬──────────┬────────┬──┬───┐
│Here│is│some│text│with│different│kinds│of│whitespace│embedded│in│it.│
└────┴──┴────┴────┴────┴─────────┴─────┴──┴──────────┴────────┴──┴───┘
   TAB ,~&.> ;:samp
┌─────┬───┬─────┬─────┬─────┬──────────┬──────┬───┬───────────┬─────────┬───┬────┐
│Here │is │some │text │with │different │kinds │of │whitespace │embedded │in │it. │
└─────┴───┴─────┴─────┴─────┴──────────┴──────┴───┴───────────┴─────────┴───┴────┘
NB. Now Devon's function:
   spaces2TAB samp
Here is some text with different kinds of whitespace embedded in it.
A quick aside: Devon uses the 'each' adverb to place a TAB character at the end of every boxed string, then drops the last TAB of the sentence. I have used the insert adverb ('/') in the past for this sort of thing, albeit in a kludgy way. In J, the insert adverb inserts verbs between items and executes the resulting form. Using a tacit or an anonymous function, you can hard-code a boxed noun in between each boxed string coming out of the words verb (';:'). I bring this up because, while this works, it seems that there must be a more straightforward way to do this in J. However, I haven't been able to think of it.
NB. using Devon's sample with an initial TAB
   samp2=. TAB,'Line with initial ',TAB,' tab'
   samp2
Line with initial  tab
   ;:samp2
┌────┬────┬───────┬───┐
│Line│with│initial│tab│
└────┴────┴───────┴───┘
NB. some ways to insert a noun
   {{x,(<TAB),y}}/;: samp2
┌────┬─┬────┬─┬───────┬─┬───┐
│Line│ │with│ │initial│ │tab│
└────┴─┴────┴─┴───────┴─┴───┘
   ([ , (<TAB) , ])/;: samp2
┌────┬─┬────┬─┬───────┬─┬───┐
│Line│ │with│ │initial│ │tab│
└────┴─┴────┴─┴───────┴─┴───┘
   ([ , (<' anystring ') , ])/;: samp2
┌────┬───────────┬────┬───────────┬───────┬───────────┬───┐
│Line│ anystring │with│ anystring │initial│ anystring │tab│
└────┴───────────┴────┴───────────┴───────┴───────────┴───┘
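One loop-free candidate, for comparison, is to stitch the boxed noun onto every word with ,. and then ravel, dropping the trailing copy:

NB. stitch the noun onto each box, ravel, drop the trailing copy
   }: , (;: samp2) ,. <TAB
┌────┬─┬────┬─┬───────┬─┬───┐
│Line│ │with│ │initial│ │tab│
└────┴─┴────┴─┴───────┴─┴───┘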
Basic tokenization
Devon pointed out back in April the limitations of the monadic ';:': "the state machine defaults to the J parser so any sequence of text is broken apart into distinct J tokens; in the case of most text, this mostly corresponds to words." Specifically, it is using the lexical analysis of J to separate out tokens. If you want to separate out English punctuation into its own element, then ';:' may or may not do it, depending on whether the character is a recognized J token or not.
NB. ;: doesn't quite 'cut' it as a basic tokenizer - pretty close though
   ;: 'However, if I put in some text with J-syntax characters. Then: you get the following:'
┌───────┬─┬──┬─┬───┬──┬────┬────┬────┬─┬─┬──────┬───────────┬─────┬───┬───┬───┬──────────┐
│However│,│if│I│put│in│some│text│with│J│-│syntax│characters.│Then:│you│get│the│following:│
└───────┴─┴──┴─┴───┴──┴────┴────┴────┴─┴─┴──────┴───────────┴─────┴───┴───┴───┴──────────┘
Devon's example inspired me to bite the bullet and look into the dyadic use of ';:', which is the full state machine operator. In basic tokenizing you need to settle on a domain for the alphabet of the words you will recognize. Since you may want to accept more than one language, you need more than the ability to recognize just the English alphabet. Whitespace is limited to a handful of characters and punctuation to a couple of handfuls more. It turns out we can get pretty far with a basic tokenizer in J by just assuming that anything that isn't in our punctuation set and isn't in the whitespace set is in fact a letter of interest in our alphabet.
NB. state machine to box on whitespace
NB. punctuation string
punc =: '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~'

NB. input character classes
mBT =: 256$0                          NB. X other
mBT =: 1 (9,10,13,a.i.' ') }mBT       NB. S whitespace
mBT =: 2 (a. i. punc) }mBT            NB. P punctuation
mBT =: 3 (a. i. '1234567890') }mBT    NB. N numbers
NB. 4 F End of input

NB. state transition table
sBT=: _2]\"1 }.".;._2 (0 : 0)
' X    S    P    N    F  ']0
1 1  2 0  3 1  4 1  5 0  NB. 0 initial
1 0  2 2  3 2  4 2  5 2  NB. 1 other
1 1  2 1  3 1  4 1  5 1  NB. 2 whitespace
1 2  2 2  3 2  4 2  5 2  NB. 3 punctuation
1 2  2 2  3 2  4 0  5 2  NB. 4 numbers
5 6  5 6  5 6  5 6  5 6  NB. 5 final
)

BT =: (0;sBT;mBT;0 _1 0 4) & ;:

   BT 'However, if I put in some text with J-syntax characters. Then: you get the following:'
┌───────┬─┬──┬─┬───┬──┬────┬────┬────┬─┬─┬──────┬──────────┬─┬────┬─┬───┬───┬───┬─────────┬─┐
│However│,│if│I│put│in│some│text│with│J│-│syntax│characters│.│Then│:│you│get│the│following│:│
└───────┴─┴──┴─┴───┴──┴────┴────┴────┴─┴─┴──────┴──────────┴─┴────┴─┴───┴───┴───┴─────────┴─┘
Just to bring this back around full circle, we can actually use this sequential machine implementation to solve Devon's original problem:
   BT samp
┌────┬──┬────┬────┬────┬─────────┬─────┬──┬──────────┬────────┬──┬──┬─┐
│Here│is│some│text│with│different│kinds│of│whitespace│embedded│in│it│.│
└────┴──┴────┴────┴────┴─────────┴─────┴──┴──────────┴────────┴──┴──┴─┘
   BT samp2
┌────┬────┬───────┬───┐
│Line│with│initial│tab│
└────┴────┴───────┴───┘
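Devon's TAB join works on top of BT unchanged; the only difference is that BT now splits the trailing period into its own TAB-separated token:

NB. Devon's join applied to BT output ('.' is now its own token)
   }: ; TAB ,~&.> BT samp
Here is some text with different kinds of whitespace embedded in it .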
Once the basic tokenizer is working, the next step is to apply one of the three popular subword tokenizers: byte-pair encoding, word-piece tokenization, or sentence-piece tokenization. Word-piece seems pretty popular, and looking at the llama.cpp code, it is fairly straightforward to implement in J. The only issue is that it requires looping control statements in J. I haven't found a true J way to implement this.
Now for Something Completely Different - a brute force word-piece tokenizer
I originally came across some Julia code that implements a word-piece tokenizer in a nested loop. It's a short piece of code that you can find at https://github.com/tmcguirefl. Once I understood the Julia code, I was able to find the equivalent code in llama.cpp. The following is the J implementation based off the llama.cpp code:
NB. the following map implementation is from the old jprogramming mailing list
NB. subject line: [Jprogramming] Dictionary with Strings as Keys
NB. author: Ian Clark
NB. link: http://www.jsoftware.com/pipermail/programming/2017-April/047176.html
NB. original lookup definition:
NB. lookup=: 4 : '>1{ x {~ ({."1 x) i. <y'
NB.
NB. Bill Lam
map =: _2 ]\'';0;'snow';1;'##board';2;'##ing';3

NB. lookup has been changed to return 0 by taking modulo of index
NB. the 0 index of the map is a null entry that will return 0 for the value
lookup =: {{>1{ x {~ (#x)|({."1 x) i. <y}}

NB. generate substring indexes
sseq =: [ + [: i. -~

NB. Brute force wordpiece tokenization translated to J from Julia implementation at:
NB. https://github.com/SeanLee97/BertWordPieceTokenizer.jl/blob/main/src/BertWordPieceTokenizer.jl
NB. and
NB. from the megatron bert-tokenizer code in python
NB. https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/tokenizer/bert_tokenization.py
max_chars_per_word =: 200   NB. word-length cutoff; bert_tokenization.py uses 200

tokenize =: 3 : 0
word =. y
tokens =. <''
'i j is_bad' =. 0 0 0
if. max_chars_per_word < #y do. <'[UNK]' return. end.
while. i < #y do.
  j =. #y
  cur_sub =. ''
  while. j > i do.
    sub =. (i sseq j){word
    smoutput sub
    if. i > 0 do. sub =. '##',sub end.
    tokenid =. map&lookup sub
    if. tokenid ~: 0 do. cur_sub =. sub break. end.
    j =. j - 1
  end.
  if. cur_sub -: '' do. is_bad =. 1 break. end.
  tokens =. tokens,<cur_sub
  i =. j
end.
if. is_bad do. <'[UNK]' return. end.
}. tokens
)
I placed an smoutput statement in this code so you can see how the processing is performed. I use a match (-: operator) lookup just as a proof of concept.
NB. output of the brute force word-piece tokenizer
   tokenize 'snowboarding'
snowboarding
snowboardin
snowboardi
snowboard
snowboar
snowboa
snowbo
snowb
snow
boarding
boardin
boardi
board
ing
┌────┬───────┬─────┐
│snow│##board│##ing│
└────┴───────┴─────┘
   tokenize 'slowboarding'
slowboarding
slowboardin
slowboardi
slowboard
slowboar
slowboa
slowbo
slowb
slow
slo
sl
s
lowboarding
lowboardin
lowboardi
lowboard
lowboar
lowboa
lowbo
lowb
low
lo
l
owboarding
owboardin
owboardi
owboard
owboar
owboa
owbo
owb
ow
o
wboarding
wboardin
wboardi
wboard
wboar
wboa
wbo
wb
w
boarding
boardin
boardi
board
ing
┌─────┬─────┬─────┬─────┬───────┬─────┐
│[UNK]│[UNK]│[UNK]│[UNK]│##board│##ing│
└─────┴─────┴─────┴─────┴───────┴─────┘
Trie in J
This trie implementation came about while I was investigating AI architectures, specifically the Transformer architecture. My major complaint is that the current state of AI is lacking in descriptive details. It is all very "blackbox"-ish. The real code hides beneath the Python subroutine APIs that everyone uses to access the underlying architecture, which is usually written in C/C++. On a cursory internet search there are plenty of articles and discussions about using the Python libraries, but very little comes up on what the underlying architecture looks like.
Digging deeper, I came across a Google blog post that describes the Transformer AI architecture. This architecture seems to excel at language translation. The main point of the Google article is how to break up text sentences so they can be fed into the underlying AI neural network. This is called tokenization. Words must be converted into numbers so they can be used by the neural network. There are different types of tokenization in AI: byte-pair, word-piece, and sentence-piece. Word-piece tokenization seemed popular and this was the subject of the Google blog article. They used a trie data structure to provide O(n) token lookup for breaking up words into meaningful pieces.
So now I had a small piece of AI architecture carved out for study. I thought: well, a trie is just a tree, J can implement trees with arrays, so there must be an implementation somewhere. Searching the J Software site, I found out that NYCJUG had discussed this topic briefly in 2013. The discussion was over someone's complaint about what is missing in J, and tries were mentioned. Devon included ????'s reply to that message in the NYCJUG notes, who said that J can usually implement what you need, and many times, if you rethink the problem, you may find that J can do it in another way that is simpler.
Trie Implementation
This follows the simple matrix version of a trie described in the paper: J. Aoe, K. Morimoto and T. Sato, ‘An efficient implementation of trie structures’, Software—Practice and Experience, 22, 695–721 (1992). https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=f2ad2a67218e458a979afafb311d1d7867f3f624
NB. trie implementation in J
NB. both full matrix and sparse matrix implementations in one file
NB. cocurrent 'jtrie'

NB. constants and a view function
NB. taken from Roger Stokes' Learning J chapter 30
NB. https://www.jsoftware.com/help/learning/30.htm
INTEGERZERO =: 3 - 3
INTEGERONE =: 3 - 2
view =: 0 & $.

NB. create initializes state variables and the empty trie
NB. y - sparse flag (1 for sparse matrix; 0 for full matrix)
NB.     linked with the alphabet to use with this trie
create =: 3 : 0
'sparse alphabet' =: y  NB. e.g. 0;'#ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
if. sparse do.
  trierow =: $. (1,#alphabet)$INTEGERZERO
else.
  trierow =: (1,#alphabet)$INTEGERZERO
end.
trie =: trierow,trierow
current_state =: 1
max_rows =: 1
)

destroy =: 3 : 0
NB. release any resources acquired
codestroy  NB. release the instance, when it finishes execution
)

char_ins =: 3 : 0
next_state =. (<current_state,alphabet i. y){trie
smoutput next_state
if. INTEGERZERO = next_state do.
  NB. adding in a new transition
  trie =: trie,trierow
  max_rows =: max_rows+1
  trie =: max_rows (<current_state,alphabet i. y)}trie
  current_state =: max_rows
else.
  NB. transition for the letter already exists
  NB. just transition to the next state
  current_state =: next_state
end.
)

strval_ins =: 4 : 0
NB. encode the string x into the trie and place the value y
NB. in its end recognition state
_1 char_ins\x
trie =: y (<current_state,0)}trie
current_state =: 1
)

NB. inc_lookup - a verb to use with \ to ratchet the letters of a string
NB. through the trie
NB. usage to check a string: '_1 inc_lookup\<your word>'
NB. after running, current_state will be in a recognition state or
NB. state 0 indicating the word is not in the trie
inc_lookup =: 3 : 0
current_state =: (<current_state,alphabet i. y){trie
)

NB. flookup returns the next state based on the character presented
NB. this is set up to be used by fold:
NB.   1 ] F:. flookup 'bachelor'
NB. the big difference in this usage is that the fold is terminated as soon
NB. as a failed transition occurs. This should facilitate word-piece
NB. tokenizing.
flookup =: 4 : 0
NB. y is the current state
NB. x is the character to transition on
ns =. (<y,alphabet i. x){trie
if. ns = 0 do.
  NB. 2 reasons we get here:
  NB.   the string does not exist in the trie
  NB.   we may have recognized a portion of the string as being in the trie
  NB. use Z: to interrupt further folding
  1 Z: 1
  0 return.
end.
ns
)

NB. some routines for testing purposes
NB. a naive tokenization is just to test whole words against the trie
NB. for that we can just use the '\' adverb as follows
naive_tok =: 3 : 0
current_state =: 1
_1 inc_lookup\y
(<current_state,0){trie
)

NB. a naive tokenizer using fold
NB. this hides the fact that fold will stop as soon as you hit
NB. a transition to the 0 state. If you run the fold as:
NB.   1 ] F:. flookup '<string>'
NB. the output is the complete list of state transitions.
NB. if the last state is 0 the word is unrecognized,
NB. otherwise use the last state to look up the integer token
ftok =: 3 : 0
current_state =: 1
states =. current_state ] F:. flookup y
(<(_1 {. states),0){trie
)
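Here is a short session sketch exercising these verbs, assuming the definitions above are loaded into the current locale (the cocurrent line is left commented out, as in the listing). The alphabet, with a leading '#' reserving column 0 for stored token values, and the token ids 1 and 2 are my assumptions; char_ins will smoutput each state transition along the way. The lookups should yield the stored ids:

NB. assumed example: build a tiny trie and look words up
   create 0;'#abcdefghijklmnopqrstuvwxyz'
   'snow' strval_ins 1
   'snowboard' strval_ins 2
   ftok 'snow'
1
   ftok 'snowboard'
2
   ftok 'snowman'     NB. fold stops at the failed 'm' transition
0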
[Figure: graph view of the trie built by this implementation]
Handling Numbers
Finding out how to handle numbers in text is a little difficult to search for, since the point of tokenization is to translate text into numbers: a Google search turns up a ton of links on the latter and very few on the former. The best review I have been able to find so far is by Andrew Carr on Twitter: https://x.com/andrew_n_carr/status/1714326003030638848 where he reviews how numbers in LLMs are being handled. I tend to think a [NUM] token is the best way of handling this for now.
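As a sketch of that approach, the BT state machine above already emits digit runs as their own tokens, so collapsing them to a [NUM] token takes one verb over the boxes (isnum and num2tok are my names):

NB. replace all-digit tokens from BT with a single [NUM] token
isnum =: [: *./ e.&'1234567890'          NB. 1 if every character is a digit
num2tok =: (]`('[NUM]'"_)@.isnum)&.>     NB. swap numeric boxes, keep the rest
   num2tok BT 'pay 100 dollars by 2025'
┌───┬─────┬───────┬──┬─────┐
│pay│[NUM]│dollars│by│[NUM]│
└───┴─────┴───────┴──┴─────┘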