NYCJUG/2014-05-13
string replace, file streaming, optimization software, working with large files, innate language instinct, effect of using foreign language on decision, coding in education
Summary of NYCJUG Meeting on 20140513
We looked at some variations of string replacement, from how we might develop a simple version in J versus the standard one supplied. From this, we introduced the problem of applying something like string replacement across a file too large to be read into memory in one piece, then generalized this to the problem of applying an arbitrary verb across a large file in pieces.
We also considered some aspects of human cognition as it relates to language acquisition and how language shapes cognition. Continuing in this vein, we talked about the usefulness of including coding as a standard part of an education curriculum.
Meeting Agenda for NYCJUG 20140513
1. Beginner's Regatta: developing a simple string replacement verb: see "String Replacement".
2. Show-and-tell: various ways to replace a string in a file: see "Replace a String in a File". Also, see "Streaming Through Large Files".
3. Advanced topics: see "Optimization Software".
4. Learning, teaching and promoting J, et al.: cognitive aspects of language: see "Born to chat: Humans may have innate language instinct" and "Using a foreign language changes moral decisions". Learning to code: see "Reading, Writing, Arithmetic, and Lately, Coding", and why this is increasingly important: see excerpt from Forrester Computing white paper.
Beginner's Regatta: Simple String Replacement
We have a standard routine for replacing one string by another in a given text:
   whereDefined 'stringreplace'
C:\Program Files\J701\system\main\stdlib.ijs   NB. YMMV
Here's an example of how it works:
   ('aa';'XXX') stringreplace 'abcaabbccaabbbccc'
abcXXXbbccXXXbbbccc
Let's see how we might develop a simple version of this. First, we'll name the pieces using the example above.
   'old new txt'=. 'aa';'XXX';'abcaabbccaabbbccc'
   whold=. old E. txt             NB. Find the old string
   (' '-.~":whold),:txt           NB. Show together
00010000010000000
abcaabbccaabbbccc
   ]ixold=. (I. whold)+/i.#old    NB. Start/stop indexes of old
3  4
9 10
Now that we have the indexes of the old strings within the text, we can remove them:
   ]rmold=. (0) ixold}1$~#txt     NB. 0s to remove old
1 1 1 0 0 1 1 1 1 0 0 1 1 1 1 1 1
   rmold#txt
abcbbccbbbccc
The tricky parts are making room to insert the new text and keeping track of where it should go in the reduced, intermediate text.
   rmold#1|.-.rmold               NB. Shift 1 to mark starting points of
0 0 1 0 0 0 1 0 0 0 0 0 0        NB. where to insert.
   (#new)*rmold#1|.-.rmold        NB. Size of insertions
0 0 3 0 0 0 3 0 0 0 0 0 0
   >:(#new)*rmold#1|.-.rmold
1 1 4 1 1 1 4 1 1 1 1 1 1
Here we've built an expansion vector with ones for the letters left untouched and larger numbers to allow room for both the original character at that point and the new string we'll be inserting.
   ]expand=: >:(#new)*rmold#1|.-.rmold   NB. Expand for insertion
1 1 4 1 1 1 4 1 1 1 1 1 1
   expand#rmold#txt
abccccbbcccccbbbccc
Now we need to figure out where to place the new strings...
   expand#rmold#1|.rmold
1 1 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1
This isn't quite right: we want the first zero to be a one to preserve the existing character before the insertion.
   13 : 'y+.1,}:y'                NB. generate tacit expression
] +. 1 , }:
   (] +. 1 , }:) expand#rmold#1|.rmold
1 1 1 0 0 0 1 1 1 1 0 0 0 1 1 1 1 1 1
Create the Boolean with ones corresponding to the insertion points for the new strings.
   ]insnew=. -.(] +. 1 , }:) expand#rmold#1|.rmold   NB. Insert new
0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0
   ]insnew=. I. insnew            NB. Use indexes
3 4 5 10 11 12
Ensure we replicate the new string enough times to correspond to how many times we're inserting a copy.
   new$~#insnew
XXXXXX
Put it all together to see that it works:
   (new$~#insnew) insnew}expand#rmold#txt
abcXXXbbccXXXbbbccc
Gather the relevant portions of code from above to define a simple string replacement verb.
simpleReplace=: 4 : 0
   'old new'=. x [ txt=. y                NB. Remove old text, insert new text.
   whold=. old E. txt                     NB. Find old strings.
   ixold=. (I. whold)+/i.#old             NB. Indexes of old
   rmold=. (0) ixold}1$~#txt              NB. Remove old
   expand=. >:(#new)*rmold#1|.-.rmold     NB. Expand to make room for new
   insnew=. I.-.(]+.1,}:) expand#rmold#1|.rmold   NB. Insert new
   (new$~#insnew) insnew}expand#rmold#txt
)
Compare our new version to the existing one:
   ('aa';'XXX') stringreplace 'abcaabbccaabbbccc'
abcXXXbbccXXXbbbccc
   ('aa';'XXX') simpleReplace 'abcaabbccaabbbccc'
abcXXXbbccXXXbbbccc
See where it falls short:
   ('aa';'XXX') stringreplace 'aBaaaBBaaaa'   NB. More complex because
aBXXXaBBXXXXXX                               NB. self-overlap.
   ('aa';'XXX') simpleReplace 'aBaaaBBaaaa'
aBXXXBBXXX
We see that our simple version gives a different answer in the case where the old string is found in self-overlapping instances. That is, "aa" is found twice in the string "aaa": once starting at position zero and again at position one.
This may be sufficiently unusual that we are willing to live with the difference. However, there is also a bonus with our simple version: unlike the original, this also works with numeric vectors:
   (1 1;9 9 9) simpleReplace 1 2 3 1 1 2 2 3 3 1 1 4 4
1 2 3 9 9 9 2 2 3 3 9 9 9 4 4
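For comparison with other languages, here is the same remove-then-insert idea sketched in Python. This is our own illustrative translation (the function name is invented), shown for character strings only; it agrees with simpleReplace on the examples above, including the self-overlap case, though boundary cases such as a match at the very start of the text may differ.

```python
# Sketch of the remove-then-insert replacement in Python.
def simple_replace(old, new, txt):
    n = len(old)
    # starts: where old begins (like old E. txt, overlaps included)
    starts = [i for i in range(len(txt)) if txt[i:i + n] == old]
    removed = {i + j for i in starts for j in range(n)}   # like ixold
    out = []
    for i, ch in enumerate(txt):
        if i in removed:          # drop characters of the old string
            continue
        out.append(ch)
        if i + 1 in removed:      # kept char just before a removed run:
            out.append(new)       # insert the new string after it
    return ''.join(out)
```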
Show-and-tell
We looked at a wide array of suggestions for text replacement in various (non-J) languages in response to a StackOverflow question. The responses ranged from large bodies of code (JScript) to very succinct expressions (Perl and, surprisingly, the Windows Batch command language). These are of interest to compare with J in terms of readability.
Streaming Through Large Files
Of greater interest is a more general solution: developing a J adverb to apply something like a string replacement verb across a large file without reading the entire file into memory. The code explained below can be found in its entirety here.
An occasional disadvantage of an array-oriented language like J is that we like to work on an entire array at a time but sometimes the data is too large for this to work well. Often we have large amounts of data in a file, for instance:
   fsize 'CIQ_G_History_New.txt'
185661834
   10^.fsize 'CIQ_G_History_New.txt'
8.26872
Here’s one way we might start on this problem: by specifying a verb to work through the file until we’ve accumulated at least a full “line” of data, here defaulted to being delimited by LF (10{a.).
NB.* getFirstLine: get 1st line of tab-delimited file, along w/info
NB. to apply this repeatedly to get subsequent lines.
getFirstLine=: 3 : 0
   (10{a.) getFirstLine y                 NB. Default to LF line-delimiter.
:
   if. 0=L. y do. y=. 0;10000;y;'' end.
   'st len flnm accum'=. 4{.y             NB. Starting byte, length to read, file name,
   len=. len<.st-~fsize flnm              NB. any previous accumulation.
   continue=. 1                           NB. Flag indicates OK to continue (1) or no
   if. 0<len do. st=. st+len              NB. header found (_1), or still accumulating (0).
       if. x e. accum=. accum,fread flnm;(st-len),len do.
           accum=. accum{.~>:accum i. x [ continue=. 0
       else. 'continue st len flnm accum'=. x getFirstLine st;len;flnm;accum
       end.
   else. continue=. _1 end.               NB. Ran out of file w/o finding x.
   continue;st;len;flnm;accum
NB.EG hdr=. <;._1 TAB,(CR,LF) -.~ >_1{getFirstLine 0;10000;'bigFile.txt'   NB. Assumes 1e4>#(1st line).
)

   6!:2 'hdr=. >_1{getFirstLine ''CIQ_G_History_New.txt'''
0.423876
   $hdr
285
   $<;._1 TAB,hdr
24
   5{.<;._1 TAB,hdr
+----+-------+-------------+------+-------+
|Date|$ticker|$company_name|$cusip|DAT_ITM|
+----+-------+-------------+------+-------+
The idea for extending this is that, contrary to what its name indicates, we could continue to read lines from the file by feeding the pointer returned by one invocation into the next. However, we have already developed such a routine, covered in another meeting: an adverb called "doSomething". The advantage of an adverb is that it leaves open a slot for a verb to do whatever arbitrary thing we want with a file, within limits, as we shall soon see. This adverb is designed to work on text files that are conceptually large matrices starting with a header line naming the columns.
This is implemented in the adverb by treating the first line of the file specially: essentially, we save this line to pass to subsequent invocations of the adverb after the first one. For this reason, as can be seen below, there are two major logic branches in the code for the adverb: the first branch segregates the first line into the variable "hdr", which is then returned as part of the output, to be passed on to subsequent invocations as the processing moves through the file, and is also offered as part of the argument to the verb supplied to "doSomething".
Basic Adverb for Processing a Large File
doSomething=: 1 : 0
   'curptr chsz max flnm leftover hdr'=. 6{.y
   if. curptr>:max do. ch=. curptr;chsz;max;flnm
   else.
       if. 0=curptr do. ch=. readChunk curptr;chsz;max;flnm
           chunk=. leftover,CR-.~>_1{ch                   NB. Work up to last complete line.
           'chunk leftover'=. (>:chunk i: LF) split chunk NB. LF-delimited lines
           'hdr body'=. (>:chunk i. LF) split chunk       NB. Assume 1st line header.
           hdr=. }:hdr                                    NB. Retain trailing partial line as "leftover".
       else. chunk=. leftover,CR-.~>_1{ch=. readChunk curptr;chsz;max;flnm
           'body leftover'=. (>:chunk i: LF) split chunk
       end.
       u body;<hdr
   end.
   (4{.ch),leftover;<hdr
NB.EG CTR [ ((10{a.)&(4 : 'CTR=: CTR + x +/ . = >0{y')) doSomething ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt' [ CTR=: 0
)
The example use shown in the “NB.EG” comment simply counts the number of LFs in a file, accumulating the result into the global variable “CTR”. Here’s an example of running this on a large file, along with a timing:
   load '~Code/workOnLargeFile.ijs'
   6!:2 '((10{a.)&(4 : ''CTR=: CTR + x +/ . = >0{y'')) doSomething ^:_ ] 0x;1e6;(fsize bigfl);bigfl [ CTR=: 0 [ bigfl=. ''CIQ_G_History_New.txt'''
4.38501
   CTR
1371898
   shell 'wc CIQ_G_History_New.txt'       NB. Verify result using "wc"
 1371899  32901273 185661834 CIQ_G_History_New.txt
The difference of one is due to the header being treated separately.
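For comparison, the header-aware chunked streaming behind "doSomething" can be sketched in a scalar language. The Python below is our own illustration (names invented), not a translation of the adverb's full bookkeeping: it reads fixed-size chunks, carries the trailing partial line forward, peels off the header once, and applies an arbitrary function to each batch of complete lines.

```python
def stream_lines(path, fn, chunk_size=1 << 20):
    """Read `path` in fixed-size chunks, carrying any trailing partial
    line forward as `leftover`, peeling off the first line as a header,
    and applying `fn` to each chunk of complete lines."""
    leftover, header = b'', None
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            chunk = leftover + data
            cut = chunk.rfind(b'\n') + 1          # end of last complete line
            chunk, leftover = chunk[:cut], chunk[cut:]
            if header is None:                    # treat 1st line specially
                hcut = chunk.find(b'\n') + 1
                header, chunk = chunk[:hcut], chunk[hcut:]
            fn(chunk, header)
    if leftover:                                  # flush trailing partial line
        fn(leftover, header)
    return header
```

Counting newlines with this, as in the "NB.EG" example, is just `fn = lambda chunk, header: ...chunk.count(b'\n')...`, and the count likewise excludes the header line.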
Issues With Using Adverb
Following is an example of using "doSomething" for a real problem that raises some issues about how this idea might be better built. The problem to be solved is that a large data file contains some entries with duplicate keys, which consist of the first two columns. To locate the duplicated records, we process the file to extract the contents of these first two columns.
To do this, we first define a verb to return a matrix given a character vector with LF-delimited lines and TAB-separated fields.
   usuMatify=: (TAB,LF)&([: <;._1&> (0{[) ,&.> [: <;._2 ] (],[#~[~:[:{:])~ 1{ [)"0 1
   usuMatify ('hi',TAB,'ho'),LF,'har',TAB,'har'
+---+---+
|hi |ho |
+---+---+
|har|har|
+---+---+
The verb phrase evaluated first, on the right ( ],[#~[~:[:{:] ), ensures that there is a trailing line-delimiter (generalized internally as " 1{[ ", the 1th character of the left argument) since this is expected by the line-partitioning phrase " <;._2 ". The tacit version of this right-most phrase was generated using J's "explicit-to-tacit" facility like this: 13 : 'y,x#~x~:{:y' .
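For comparison, the same matrix-from-delimited-text operation can be sketched in Python (our own illustration; the function name is invented):

```python
# Split LF-delimited lines into TAB-separated fields, supplying a
# trailing line-delimiter if it is missing, as usuMatify does.
def matify(text, fsep='\t', lsep='\n'):
    if not text.endswith(lsep):
        text += lsep                  # ensure trailing delimiter
    return [line.split(fsep) for line in text.split(lsep)[:-1]]
```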
   flnm=. 'CIQ_G_History_New.txt'
   accumKeys=: 3 : 'KEYS=: KEYS,2{."1 usuMatify y'
   6!:2 '(accumKeys >0{y) doSomething ^:_ ] 0x;5e6;(fsize flnm);flnm [ KEYS=: 0 2$'''''
30.8413
   $KEYS
1450459 2
   3{.KEYS
+--------+----+
|20140207|AIR |
+--------+----+
|20140207|AFAP|
+--------+----+
|20140207|AAL |
+--------+----+
   6!:2 'KEYS=: ,&.>/"1 KEYS'      NB. Combine keys into single item.
0.887262
   $KEYS
1450459
   $~.KEYS
1384524
   (-/,])(#,#@:~.) KEYS            NB. # duplicates, total # records, # unique recs
65935 1450459 1384524
When we want to remove the duplicates from the file, we find a difficulty with using our adverb which arises because of the external information we must use in the verb: the compression vector. We create this vector this way:
   RMDUPS=: whUnq KEYS
Where
NB.* whUnq: 1 corresponds to unique item, 0 to duplicate one.
   whUnq
3 : '(/:/:y){1,2~:/\/:~y'
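The same marking of unique versus duplicate items can be sketched in Python. Because J's grade is stable, whUnq flags the earliest occurrence of each key as the one to keep; a set-based scan does the same (the function name here is our own):

```python
# 1 for the first occurrence of each key, 0 for later duplicates.
def wh_unique(keys):
    seen, flags = set(), []
    for k in keys:
        flags.append(0 if k in seen else 1)
        seen.add(k)
    return flags
```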
We implement the verb to remove duplicates using a number of global variables, not the most elegant way to do things.
rmDupes=: 3 : 0
   ixs=. RECCTR+i.#mm=. usuMatify y
   (unmatify (ixs{RMDUPS)#mm) fappend NEWFL
   RECCTR=: RECCTR+#ixs
)
We set up the necessary globals and run like this:
   fsize flnm=. 'CIQ_VM.txt'
413733741
   hdr=. >_1{getFirstLine flnm
   hdr fwrite NEWFL=: 'CIQ_VM_new.txt'    NB. Initialize new file with header.
283
   6!:2 '([: rmDupes [: > 0 { ]) doSomething ^:_ ] 0x;5e6;(fsize flnm);flnm [ RECCTR=: 0'
214.309
This last phrase shows that we can process the entire file in about 214 seconds.
Verify that the new file is smaller and does not have duplicate keys:
   fsize NEWFL
394026934
   6!:2 '(3 : ''KEYn=: KEYn,2{."1 usuMatify >0{y'') doSomething ^:_ ] 0x;5e6;(fsize NEWFL);NEWFL [ KEYn=: 0 2$'''''
74.5355
   6!:2 'KEYn=: ,&.>/"1 KEYn'
0.904184
   (-/,])(#,#@:~.) KEYn
0 1384592 1384592
One side-effect of processing these pieces of the file is left-over memory allocation:
   7!:3''
 128 2769329
 256    3200
 512      39
1024      65
   */"1]7!:3''
354474240 819200 19456 66560
   +/*/"1]7!:3''
355379456
Advanced topics: Optimization Software
Scott Locklin posted this note:
from:    Scott Locklin <scott_locklin@yahoo.com>
to:      programming@jsoftware.com
date:    Fri, Feb 22, 2013 at 12:50 AM
subject: Re: [Jprogramming] deoptim and lbfgs minimization
For what it is worth, lbfgs works on linux. I stuck it on github, along with fortran source and Makefile in case it helps others. https://github.com/locklin/j-lbfgs
I also took the liberty of using the same code to do the box constrained version of the algorithm. The example is no good, and I think the loop needs to be fixed up, but it does call lbfgsb, and notice when you pass it infeasible initial conditions. Perhaps it is useful to someone: https://github.com/locklin/j-lbfgsb
I also got the flann library to mostly work for locality hashing, k-means trees and kd-trees for near neighbor searches. There's an issue with passing a raw pointer around (something I'd like to do for multiple online queries to a pre-sorted tree), so use it at your own risk. If anyone can spot what I'm doing wrong on the "buildtree/searchtree" examples, I'd appreciate it. It does calculate nearest neighbors, and very quickly. https://github.com/locklin/j-nearest-neighbor
I'll have a look at 64 bit lapack, as PCA/eigenvectors are important to me.
LBFGS
According to this,
The L-BFGS method solves the unconstrained minimization problem,
minimize F(x), x = (x1, x2, ..., xN),
only if the objective function F(x) and its gradient G(x) are computable.
The well-known Newton's method requires computation of the inverse of the hessian matrix of the objective function. However, the computational cost for the inverse hessian matrix is expensive especially when the objective function takes a large number of variables. The L-BFGS method iteratively finds a minimizer by approximating the inverse hessian matrix by information from last m iterations. This innovation saves the memory storage and computational time drastically for large-scaled problems.
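To make the description concrete, here is a minimal sketch of L-BFGS in Python, using the standard two-loop recursion over the last m (s, y) pairs and a crude backtracking line search. This is our own illustration of the algorithm, not the library code Locklin wrapped:

```python
import math

def lbfgs(f, g, x0, m=5, iters=100, tol=1e-8):
    """Minimize f given its gradient g, approximating the inverse
    Hessian from only the last m steps (the "limited memory" idea)."""
    x = list(x0)
    s_hist, y_hist = [], []          # recent steps and gradient changes
    gx = g(x)
    for _ in range(iters):
        if math.sqrt(sum(v * v for v in gx)) < tol:
            break
        # Two-loop recursion: r approximates (inverse Hessian) * gradient.
        q, alphas = list(gx), []
        for s, y in zip(reversed(s_hist), reversed(y_hist)):
            rho = 1.0 / sum(si * yi for si, yi in zip(s, y))
            a = rho * sum(si * qi for si, qi in zip(s, q))
            alphas.append((a, rho, s, y))
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if s_hist:                   # initial Hessian scaling
            s, y = s_hist[-1], y_hist[-1]
            gamma = sum(si * yi for si, yi in zip(s, y)) / sum(yi * yi for yi in y)
        else:
            gamma = 1.0
        r = [gamma * qi for qi in q]
        for a, rho, s, y in reversed(alphas):
            b = rho * sum(yi * ri for yi, ri in zip(y, r))
            r = [ri + (a - b) * si for ri, si in zip(r, s)]
        d = [-ri for ri in r]        # search direction
        # Crude backtracking search; real code uses Wolfe conditions.
        t, fx = 1.0, f(x)
        while t > 1e-12 and f([xi + t * di for xi, di in zip(x, d)]) > fx:
            t *= 0.5
        xn = [xi + t * di for xi, di in zip(x, d)]
        gn = g(xn)
        sk = [b2 - b1 for b2, b1 in zip(xn, x)]
        yk = [b2 - b1 for b2, b1 in zip(gn, gx)]
        if abs(sum(si * yi for si, yi in zip(sk, yk))) > 1e-12:  # curvature guard
            s_hist.append(sk)
            y_hist.append(yk)
            if len(s_hist) > m:      # drop the oldest pair
                s_hist.pop(0)
                y_hist.pop(0)
        x, gx = xn, gn
    return x
```

Note that only 2m vectors are stored, never the full Hessian, which is the saving the description above refers to.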
The other techniques Locklin mentions - k-means trees and kd-trees - are general methods for dealing with large amounts of multi-dimensional data. These methods allow us to specify clusters of points that are near each other to speed up locating nearest neighbors.
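As an illustration of the kd-tree idea, here is a minimal build-and-search sketch in Python (our own example, not the flann API): the tree cycles through split axes at each level, and the search prunes the far subtree whenever the splitting plane lies farther away than the best distance found so far.

```python
def build(points, depth=0):
    """Build a kd-tree by cycling through split axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2           # median point becomes the node
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1),
            axis)

def nearest(node, target, best=None):
    """Return (point, squared_distance) of target's nearest neighbor."""
    if node is None:
        return best
    point, left, right, axis = node
    d2 = sum((a - b) ** 2 for a, b in zip(point, target))
    if best is None or d2 < best[1]:
        best = (point, d2)
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, best)
    if diff * diff < best[1]:        # far side may still hold a closer point
        best = nearest(far, target, best)
    return best
```

For N points this makes nearest-neighbor queries roughly logarithmic on average instead of the linear scan a naive search requires.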
Learning and Teaching J
Language Studies
We looked at the results of some studies on language acquisition. One, which asserts that people instinctively organize language according to an internal semantic hierarchy, is seen by some as supporting the Chomskyan idea of an innate, Universal Grammar facility in human brains, though others point out that it could just as easily be an expression of general organizing principles of cognition. Another study measured blood flow in the brains of 24 newborn infants as they listened to recordings of nonsense syllables. This found that these infants appeared to distinguish between syllables that are more or less easily pronounceable in spite of presumably having had little exposure to speech. This was also seen as supporting the Chomskyan idea of innate language facilities, which would have implications for computer language design.
We also looked at an article, about another study, titled "Using a foreign language changes moral decisions". This article reports that a study by psychologists from the University of Chicago and Pompeu Fabra University in Barcelona found "...that people using a foreign language take a relatively utilitarian approach to moral dilemmas, making decisions based on assessments of what's best for the common good. That pattern holds even when the utilitarian choice would produce an emotionally difficult outcome, such as sacrificing one life so others could live." This raises interesting ideas about the utility of reasoning with a (foreign) computational notation.
Materials
- File:ForresterITSpeedEnterprise.pdf
- File:String Replacement.pdf
- File:Replace a String in a File.pdf
- File:Streaming Through Large Files.pdf
- File:Optimization Software.pdf
- File:Born to chat - innate language instinct.pdf
- File:Using a foreign language changes moral decisions.pdf
- File:Coding being added to elementary-school curricula.pdf