User:Devon McCormick/Code/breakupBigFiles2CD.ijs
This code is explained in some detail here.
NB.* breakupBigFiles2CD.ijs: break up large files to fit onto CDs - 720e6
NB. bytes - or whatever size desired.
NB. 20090826: use extended integers instead of floating point or integer pairs.

coclass 'bbf2cd'
load '~system/packages/files/bigfiles.ijs'
NB. load '~Code/bigfiles.ijs'
coinsert 'jbf'
load 'task'               NB. To run command in "assembleBrokenFiles".

NB.* breakUpBigFile: break up file >0{y into number of pieces using
NB.* assembleBrokenFiles: put pieces back together->big file.
NB.* jfi: just files from dir listing
NB.* sequenceGap: find missing integers in supposedly contiguous sequence.
NB.* cvtIntPair: integer pair converter (both ways); fails above <:2x^64.
NB.* cvt1to2: convert single number into pair of integers.
NB.* cvt2to1: convert pair of integers into single number.
NB.* cvtLargeInt: convert (fp) num from 2^31 to <:2^32 to (signed) int.
NB.* doSeveral: use "each" to break up several files from indir to outdir,
NB.* buildBigNumdFl: build large file with (text representation of) number N

NB. The following is still necessary because the underlying OS-specific
NB. calls use integer pairs to overcome the size limitation of integers.
NB. The overlying J code is not so constrained as extended integers can be
NB. arbitrarily large but will fail for sizes >: 2^64 because of the
NB. limitations of the underlying system call.
explainNecessaryLongIntegers=: 0 : 0
The "bigfiles" functions require integer arguments which exceed the scope of
individual signed integers, so we must use pairs of signed integers to allow
the APIs to recognize the numbers properly.  In other words, we use a pair of
signed, 32-bit, integers to represent a 64-bit integer.

For example, the largest signed 32-bit integer is normally <:2^31 (2147483647).
So, to represent a larger integer such as 2147483649, we use
   cvtIntPair 2147483649
0 _2147483647
The high-order bits are in the first integer of the resulting pair (0).  The
second integer is negative because the highest order bit is the sign bit in a
signed integer.

So, 2^32, which is 1 followed by 32 zeroes in binary, appears thusly:
   cvtIntPair 2^32
1 0
)

NB. This warning is obsolete following the change to extended integers:
NB. **Danger: the following routine uses floating point numbers as large
NB. integers, assuming we can convert while retaining precision.  This
NB. should be OK for sizes less than about 10^15
NB. but I haven't really checked it thoroughly to determine the correct limits.

NB.* breakUpBigFile: break up file >0{y into number of pieces using
NB. dir/name >1{y reading in chunks of 0{x, filling up to 1{x byte
NB. output files.  Handles large files (>2^31 bytes).
breakUpBigFile=: 3 : 0
    1e7 720e6 breakUpBigFile y    NB. Work w/1e7 bytes at-a-time, write 720e6
:
    'flnm outflnm'=. y
    sufflen=. ->:outflnm i.&.|. '.'
    'outsuff outpre'=. sufflen split outflnm
    totflsz=. bfsize flnm
    'rdctr flctr'=. x: 0 0        NB. Need extended integers to read large file.
    flbase=. actsz=. 0
    'maxchsz maxperfl'=. x: x
    if. -. nameExists 'BREAKATLINE' do. BREAKATLINE=. 0 end.
NB. Max (0-origin) file counter length-># digits->leading zeros in output file
NB. name number portion.
    maxctrlen=. >.10^.maxperfl%~totflsz
    while. rdctr<totflsz do.
        outflnm=. outpre,(maxctrlen lead0s flctr),outsuff
        '' fwrite outflnm         NB. Initialize output file
        chsz=. maxchsz
        flmax=. maxperfl+flbase=. flbase+actsz
        flmax=. flmax<.totflsz
        actsz=. adj=. 0
        while. rdctr<flmax+adj do.
            chsz=. chsz<.flmax-rdctr
            ch=. bixreadx flnm;rdctr,chsz
            adj=. 0
            if. flmax<:rdctr+chsz do.
                if. BREAKATLINE do.
                    if. LF~:{:ch do. ch=. ch}.~adj=. -<:(#ch)-ch i: LF end.
                end.
            end.
            assert. (chsz+adj)=ch bappend outflnm    NB. Write as much as read?
            actsz=. actsz+chsz+adj
            rdctr=. x: rdctr+chsz+adj
        end.
        smoutput (': ',~":qts ''),'Wrote ',outflnm,'; length = ',":rdctr-flbase
        flctr=. >:flctr
    end.
NB.EG breakUpBigFile 'F:\Video\RailwayChP2.avi';'C:\Video\RailwayChP2avi.dat'
)
NB. 13!:3 'breakUpBigFile : 14 20 25 28'

NB.* assembleBrokenFiles: put pieces back together->big file.
assembleBrokenFiles=: 3 : 0
    'flnm partfls'=. y
    sufflen=. (#-~'.'i:~])partfls
    'outsuff outpre'=. sufflen split partfls
NB. The "broken" files to be assembled should have names like, e.g.
NB. 'Outfl0.dat';'Outfl1.dat';'Outfl2.dat'..., so most of the following work
NB. is to extract and validate the numbers at the end of "Outfl" before the
NB. suffix ".dat": they should all be valid numbers and in sequence.
NB. We do a lot of work validating these names because they should have been
NB. generated according to the above function "breakUpBigFile".  We expect
NB. the inputs to this function to conform as it is the inverse of that one.
    outpath=. outpre{.~>:outpre i: PATHSEP_j_
    flnmprelen=. #>{:<;._1 PATHSEP_j_,outpre
    ofls=. {."1 jfi dir outpre,'*',outsuff
    flnums=. (flnmprelen}.sufflen}.])&.>ofls
    if. 0 e. whvn=. isValNum&>flnums do.
        smoutput 'Excluding file',('s'#~1~:0 +/ . =whvn),' with invalid '
        smoutput 'numeric portion: ','.',~punclist ofls#~-.whvn
        'ofls flnums'=. (<whvn)#&.>ofls;<flnums
    end.
NB. It might be an error to proceed with an incomplete sequence, but we will.
    misssq=. sequenceGap flnums=. /:~".&>flnums
    if. 0~:#misssq do.
        smoutput 'Proceeding with missing sequence number',('s'#~1~:#misssq),':'
        smoutput ' ','.',~punclist ":&.>misssq
    end.
    osuff=. ~.sufflen{.partfls
    cmd=. }:;((<outpre),&.>":&.>flnums),&.><osuff,'+'
    smoutput 'Running command: ','...',~cmd=. 'copy /b ',cmd,' ',flnm
    shell cmd                 NB. This could take a while depending on file size.
NB. copy /b BigFlPart0.dat+BigFlPart1.dat+BigFlPart2.dat+BigFlPart3.dat BFl.txt
)
NB. 13!:3 'assembleBrokenFiles 16 19 24 32'    NB. Useful breakpoints

NB. These two utility fns, included for completeness, are from other libraries.
jfi=: 3 : '(-.''d''e.&>4{"1 y)#y'            NB.* jfi: just files from dir listing
nameExists=: 0:"_ <: [: 4!:0 <^:(L. = 0:)    NB.* nameExists: 1 if name exists

NB.* sequenceGap: find missing integers in supposedly contiguous sequence.
sequenceGap=: 3 : 0
    fullseq=. (<./y)+i.>:(>./-<./)y
    /:~fullseq-.y
NB.EG sequenceGap 13 14 18-.~10+i.10
)

NB.* cvtIntPair: integer pair converter (both ways); fails above <:2x^64.
cip=: cvt1to2 :.cvt2to1
cvtIntPair=: 3 : 0
    if. 2={:$y do. cvt2to1 y else. cvt1to2 y end.
)

NB.* cvt1to2: convert single number into pair of integers.
cvt1to2=: 3 : '<.(_4294967296x*qq>2147483648x)+qq=. 4294967296x 4294967296x#:y'
NB.* cvt2to1: convert pair of integers into single number.
cvt2to1=: 3 : '4294967296x#.|:4294967296x&||:y'

NB. These next 2 are based on Dave Mitchell's "bigfiles" code: for exegesis.
3 : 0 ''
    if. -.nameExists 'K31' do. K31=: 2^31 end.
)

NB.* cvtLargeInt: convert (fp) num from 2^31 to <:2^32 to (signed) int.
cvtLargeInt=: 3 : 0
    if. y>:K31 do. K31-~K31#:y else. y end.
NB.EG cvtLargeInt"0 ] 2147483647 2147483648 2147483649 4294967295 4294967296
NB. 2147483647 _2147483648 _2147483647 _1 _2147483648
NB. Note failure for final argument above (=2^32).
)

NB.* doSeveral: use "each" to break up several files from indir to outdir,
NB. auto-generate name prefixes and suffixes.
doSeveral=: 4 : 0
    'indir outdir'=. endSlash&.>x
    flnm=. y
    infl=. indir,flnm
    outfl=. outdir,(flnm-.'.'),'.dat'
NB. Break into 240e6 pieces because this is about 1/3 of a CD and using
NB. pieces smaller than 1/2 of CD gives more flexibility.
    1e7 240e6 breakUpBigFile infl;outfl
NB.EG (<'F:\bigfiles\';'C:\brokenFiles\') doSeveral&.>'file1.zun';'file2.foo'
)

NB.* endSlash: ensure path has ending slash.
endSlash=: 13 : 'y,PATHSEP_j_#~PATHSEP_j_~:{:y'

NB.* buildBigNumdFl: build large file with (text representation of) number N
NB. beginning at byte N; can be run to append to existing file.
NB. This large file is handy for testing large-file fns as file locations are
NB. designated by the file contents, i.e.
NB.    frdix 'C:\Temp\BigFile.txt';1000 20   NB. Read 20 bytes starting at 1000
NB. 1000 1005 1010 1015
NB. shows us that we properly started reading at file location 1000.
NB. Or, for an even larger file, we must use "bixreadx":
NB.    bixreadx 'C:\Temp\BigFile.txt';2147483649x 32
NB. 2147483649 2147483660 2147483671
NB.    bfsize 'C:\Temp\BigFile.txt'
NB. 2303603063

NB.* buildBigNumdFl: build large file with (text representation of) number N: faster.
buildBigNumdFl=: 3 : 0
    'nn bigfl'=. y                         NB. Append nn numbers
    if. -.fexist bigfl do. nn=. <:nn       NB. Initialize if no file.
        '0 ' fwrite bigfl
    end.                                   NB. Start counting at zero.
    while. 0<nn do.
        ctr=. bfsize bigfl                 NB. Reduce number of file writes
        len=. 2+<.10^.ctr                  NB. Length of nums now
        n2app=. 1e5<.nn<.>.len%~ctr-~10^<:len
        (' ',~":ctr+len*i.n2app) bappend bigfl
        nn=. nn-n2app
    end.
    bfsize bigfl
NB.EG buildBigNumdFl 2e8;'C:\Temp\BigFile.txt'
)

NB.-- Older version: this is off by one on the initial run because of the
NB. uncounted initialization.
explanationOfPeculiarDualNumArg=. 0 : 0
One peculiarity of the arguments to the function ''buildBigNumdFlSlower'' is
the use of two numbers, which are essentially multiplied together, to specify
how many integers to write out.  The numbers are, respectively, the number of
times through the outer and inner loops.  The inner loop builds the string in
memory; the string is only written to file at the end of the inner loop.  The
intention here is to allow the user to try different combinations to balance
file writing time versus memory allocation time to achieve the best throughput
for a given machine.
This could undoubtedly be made more efficient but I haven't bothered since it's
such a limited-use function.
)

buildBigNumdFlSlower=: 3 : 0
    'nouter ninner bigfl'=. y              NB. Append nouter*ninner numbers
    if. -.fexist bigfl do.                 NB. Need to initialize file?
        '0 ' fwrite bigfl
    end.                                   NB. Start counting at zero.
    while. _1<nouter=. <:nouter do.        NB. Build string in inner loop to
        fsz=. bfsize bigfl                 NB.  reduce number of file writes
        ctr=. 0 [ str=. ''                 NB.  (for better efficiency).
        while. ninner>:ctr=. >:ctr do.
            str=. str,' ',~0j0": fsz+#str
        end.
        str bappend bigfl
    end.
    0j0":bfsize bigfl
NB.EG buildBigNumdFlSlower 10000;1000;'C:\Temp\BigFile.txt'
)

bfsize=: 3 : 0
    if. t y do.
        try.
            fh=: CreateFileR (y,{.a.);GENERIC_READ;0;NULLPTR;OPEN_EXISTING;0;0
        catch.
            (13!:11 '');(13!:12 '') return.
        end.
        if. fh=_1 do. cderx'' return. end.
        F=. 1
    else.
        fh=. y
        F=. 0
    end.
    b=. ,2
    ts=. GetFileSizeR fh;b
    if. F do. CloseHandleR fh end.
NB.    K32#.|:K32&|b,ts
    cvt2to1 b,ts
)

cvt1to2_jbf_=: ([: ([: <. (_4294967296x * 2147483648x < ]) + ]) 4294967296 4294967296x #: ])

NB.* bixreadx: read (big) files using extended integers for indexing.
NB. bixreadx fname;startx[,len]
bixreadx_jbf_=: 13 : 'bixread (0{y),<x:^:_1 (cvt1to2 {.>1{y),}.>1{y'

NB.* bixwritex: write (big) files using extended integers for indexing.
NB. data bixwritex fname;startx[,len]
bixwritex_jbf_=: 13 : 'x bixwrite (0{y),<x:^:_1 (cvt1to2 {.>1{y),}.>1{y'
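
The integer-pair arithmetic described in explainNecessaryLongIntegers can be tried on its own, without loading bigfiles.ijs. What follows is only a minimal sketch and not part of the original script: the names W32, splitPair and joinPair are hypothetical, and they merely restate, for a single scalar, the arithmetic that cvt1to2 and cvt2to1 appear to perform.

```j
NB. Hypothetical helpers (not in the original script): a scalar-only
NB. restatement of the cvt1to2/cvt2to1 arithmetic, for illustration.
W32=: 4294967296x                   NB. 2^32: count of values in one 32-bit word

NB. splitPair: non-negative integer below 2^64 -> (high,low) pair.
splitPair=: 3 : 0
    qq=. (W32,W32) #: x: y          NB. two base-2^32 digits, each in 0..(2^32)-1
    qq - W32 * qq > 2147483648x     NB. digits above 2^31 wrap to negative values
)

NB. joinPair: (high,low) pair -> single extended integer.
joinPair=: 3 : 'W32 #. W32 | x: y'  NB. residue undoes the wrap; #. recombines

splitPair 2147483649x               NB. expected: 0 _2147483647
joinPair 0 _2147483647              NB. expected: 2147483649
joinPair splitPair 2x^40            NB. expected: 1099511627776
```

Keeping every intermediate value extended (the trailing x and the x: conversions) avoids any detour through floating point, which is the point of the 20090826 change noted at the top of the script.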