User:Devon McCormick/Code/WorkOnLargeFiles
A slightly earlier version of the code below is explained in some detail here. The major update since that explanation was written is the addition of the "passedOn" parameter to the adverb "doSomething", which applies a verb across a large file in pieces. "doSomething" assumes that the file's first row holds column headers and that rows are LF-delimited (any CRs are removed from each block as it is read). The header row is supplied to the verb with every block, not just the first, so if we need to look up a column by name, we can use the header row as a reference for the column location.
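As a minimal sketch of looking up a column by name from the header row, the lines below use a made-up header and a made-up block of body text; the names "hdrEg", "bodyEg" and the column names are assumptions for illustration only. The boxing idiom is the same one that "accumDts2File" uses.

```j
NB. Hypothetical header line and one block of complete body lines.
hdrEg=: 'id',TAB,'date',TAB,'price'
bodyEg=: '1',TAB,'01/02/2015',TAB,'10.5',LF,'2',TAB,'01/05/2015',TAB,'11',LF

cols=: <;._1 TAB,hdrEg               NB. Boxed column names.
rows=: <;._1&> TAB,&.> <;._2 bodyEg  NB. Table of boxed cells, one row per line.
ix=: cols i. <'price'                NB. Column position found by name (2 here).
prices=: ix {"1 rows                 NB. Boxed text of that column.
nums=: ".&> prices                   NB. Same column as numbers.
```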
The "passedOn" parameter lets the verb called by "doSomething" hand information to its next invocation: whatever the verb returns for one block is given back to it as "passedOn" along with the following block. This might include things like file statistics or a running row count.
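A running row count might be sketched as follows. The verb "countRows" is hypothetical, not part of the script; it only assumes the argument layout body;hdr;<passedOn that "doSomething" supplies. The two calls simulate successive blocks by hand so the example runs without a file.

```j
NB. countRows: hypothetical operand for doSomething.
NB. y is body;hdr;<passedOn; the result becomes the next passedOn.
countRows=: 3 : 0
   'body hdr sofar'=. 3{.y
   sofar + +/ LF = body      NB. Add this block's LF count to the running total.
)

NB. Thread the count through two made-up blocks.
n1=: countRows ('a',LF,'b',LF);'hdr';<0
n2=: countRows ('c',LF);'hdr';<n1

NB. With the script loaded, the NB.EG pattern in doSomething suggests a call like
NB.   countRows doSomething ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt';'';'';0
NB. where the last item of the result would hold the count.
```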
An example of using this code to apply verbs across a large, tab-delimited file is embodied in this code.
NB.* workOnLargeFile.ijs: apply arbitrary verb across large file in blocks.

NB.* doSomething: do something to a large file in sequential blocks, by lines.
NB. Args: pointer to current location in file, size of chunk to read each time,
NB. size of file, name of file, [piece of a chunk left over from previous
NB. call, file header (first line), result of previous call to be passed on
NB. to next one.
doSomething=: 1 : 0
   'curptr chsz max flnm leftover hdr passedOn'=. 7{.y
   if. curptr>:max do. ch=. curptr;chsz;max;flnm
   else. if. 0=curptr do. ch=. readChunk curptr;chsz;max;flnm
           chunk=. leftover,CR-.~>_1{ch                    NB. Work up to last complete line.
           'chunk leftover'=. (>:chunk i: LF) split chunk  NB. LF-delimited lines
           'hdr body'=. (>:chunk i. LF) split chunk        NB. Assume 1st line is header.
           hdr=. }:hdr                                     NB. Retain trailing partial line as "leftover".
       else. chunk=. leftover,CR-.~>_1{ch=. readChunk curptr;chsz;max;flnm
           'body leftover'=. (>:chunk i: LF) split chunk
       end.
       passedOn=. u body;hdr;<passedOn  NB. Allow u's work to be passed on to next invocation
   end.
   (4{.ch),leftover;hdr;<passedOn
NB.EG ((10{a.)&(4 : '(>_1{y) + x +/ . = >0{y')) doSomething ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt';'';'';0  NB. Count LFs in file.
)

NB.* getFirstLine: get 1st line of tab-delimited file, along w/info
NB. to apply this repeatedly to get subsequent lines.
getFirstLine=: 3 : 0
   (10{a.) getFirstLine y          NB. Default to LF line-delimiter.
:
   if. 0=L. y do. y=. 0;10000;y;'' end.
   'st len flnm accum'=. 4{.y      NB. Starting byte, length to read, file name,
   len=. len<.st-~fsize flnm       NB.  any previous accumulation.
   continue=. 1                    NB. Flag indicates OK to continue (1) or no
   if. 0<len do. st=. st+len       NB.  header found (_1), or still accumulating (0).
       if. x e. accum=. accum,fread flnm;(st-len),len do.
           accum=. accum{.~>:accum i. x [ continue=. 0
       else. 'continue st len flnm accum'=. x getFirstLine st;len;flnm;accum
       end.
   else. continue=. _1 end.        NB. Ran out of file w/o finding x.
   continue;st;len;flnm;accum
NB.EG hdr=. <;._1 TAB,(CR,LF) -.~ >_1{getFirstLine 0;10000;'bigFile.txt' NB. Assumes 1e4>#(1st line).
)

readChunk=: 3 : 0
   'curptr chsz max flnm'=. 4{.y
   if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
   else. chunk=. '' end.
   (curptr+chsz2);chsz2;max;flnm;chunk
NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
)

readChunk_egUse_=: 0 : 0
   ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
   chunk=. CR-.~>_1{ch0
   'chunk leftover'=. (>:chunk i: LF) split chunk
   'hdr body'=. split <;._1&> TAB,&.><;._2 chunk
   body=. body#~-.a: e.~ body{"1~hdr i. <'PRCCD - Price - Close - Daily'
   unqids=. ~.ids=. ;&.><"1 body{"1~ hdr i. '$gvkey';'$iid'
   dts=. MDY2ymdNum&>0{"1 body
   (unqids textLine ids (<./,>./) /. dts) fappend 'IDsDateRanges.txt'
)

accumDts2File=: 4 : 0
   'body hdr'=. y
   hdr=. <;._1 TAB,hdr
   'lkupPxs lkupID'=. hdr i. 2{.x [ outflnm=. >_1{x
   body=. <;._1&> TAB,&.><;._2 body
   body=. body#~-.a: e.~ body{"1~lkupPxs
   unqids=. ~.ids=. body{"1~lkupID
   dts=. MDY2ymdNum&>0{"1 body
   (unqids textLine ids (<./,>./) /. dts) fappend outflnm
NB.EG ('PRCCD - Price - Close - Daily - USD';'$issue_id';'IDsDateRanges.txt') accumDts2File body;<hdr
)

NB.* MDY2ymdNum: 'mm/dd/yyyy' -> yyyymmdd
MDY2ymdNum=: [: ". [: ; _1 |. [: <;._1 ] ,~ [: {. [: ~. '0123456789' -.~ ]
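The way "doSomething" carries a trailing partial line from one block to the next can be seen in isolation in the sketch below. The blocks are made up, and "cutAtLastLF" is a hypothetical helper written with primitives in place of "split"; like the code above, it assumes each block it is given contains at least one LF.

```j
NB. Two hypothetical raw blocks; the block boundary falls in mid-line.
blk1=: 'id',TAB,'px',LF,'1',TAB,'10'
blk2=: '.5',LF,'2',TAB,'11',LF

NB. cutAtLastLF: (complete lines);(trailing partial line) of text y.
NB. Assumes y contains at least one LF.
cutAtLastLF=: 3 : 0
   n=. >: y i: LF
   (n {. y) ; n }. y
)

'c1 left1'=: cutAtLastLF blk1         NB. left1 holds the cut-off start of a line.
'c2 left2'=: cutAtLastLF left1,blk2   NB. Prepending left1 restores the whole line.
```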
-- Devon McCormick 2015-01-14T17:07:38-0200