Scripts/Working with Big Files
This is Mostly Obsolete
With modern 64-bit J, the regular file verbs covered in "stdlib" - which are loaded by default - work fine on files larger than 2 GB. However, the following is still applicable for 32-bit systems. Also these definitions may still be useful as examples of using external DLLs from within J; some of the verbs may still be useful on their own as well. For instance, "bigNumdFl" is still handy for quickly building a usefully-defined file. To use this on a 64-bit system, I make the following global definitions after loading it since I can use the regular versions of file verbs:
bfsize=: fsize bappend=: fappend
Working with Big Files: Using Numbers Bigger than an Integer
Sometimes we have to deal with large files -- those larger than 2 gigabytes.
The immediate use of the functions in breakupBigFiles2CD.ijs is to allow us to break apart a file too large to fit onto a writable CD so that we can write the file onto CDs in pieces in order to move it or archive it. There is also a function to allow us to re-assemble the pieces into the original, large file.
Download the code File:BreakupBigFiles2CD.ijs (or you can take a look at it here). Be sure to load it
load '~Code/breakupBigFiles2CD.ijs'
(where "Code" was set in Edit/Configure/Folders to point to my code directory) and make the namespace transparently available:
coinsert 'bbf2cd'
before trying the following examples.
The pedagogical purpose of these functions is to demonstrate how to use the functions in bigfiles.ijs for dealing with files larger than 2 GB.
The main problem with using these functions is the restriction on the size of an integer: how do we address parts of a file with locations not representable by a standard 32-bit (or signed, 31-bit) integer?
In the breakupBigFiles2CD.ijs code, we use a function cvtIntPair to convert between a (floating point) number of sufficient magnitude and a pair of signed integers suitable for passing to the Windows APIs. However, since J does have extended integers, we can use these externally as long as we're careful to coerce them to int before passing them to the API.
That is, since the underlying bigfiles functions require integer arguments which exceed the scope of individual signed integers, we use pairs of signed integers, a datatype native to J, to allow the APIs to recognize the numbers properly.
For example, the largest signed integer is normally 2^31 (2147483648).
So, to represent the next largest integer, we use
cvtIntPair 2147483649 0 _2147483647
The high-order bits are in the first integer of the resulting pair (0). The second integer is negative because the highest order bit is the sign bit in a signed integer. So, 2^32, which is 1 followed by 32 zeroes in binary, appears thusly:
cvtIntPair 2^32 1 0
Note that cvtIntPair is its own inverse. If the argument is a pair of numbers, it converts them to a single number; a single number gets converted to a suitable pair of numbers.
NB.* cvtIntPair: integer pair converter (both ways); fails above <:2x^64. cip=: cvt1to2 :.cvt2to1 cvtIntPair=: 3 : 0 if. 2={:$y do. cvt2to1 y else. cvt1to2 y end. ) NB.* cvt1to2: convert pair of integers into single number. cvt1to2=: 3 : '<.(_4294967296x*qq>2147483648x)+qq=. 4294967296x 4294967296x#:y' NB.* cvt2to1: convert single number into pair of integers. cvt2to1=: 3 : '4294967296x#.|:4294967296x&||:y'
The self-inverse property of cvtIntPair also allows us to verify its correctness. The following examples show how the function applied to its output (should) give us the original input. However, for sufficiently large numbers, this scheme [used to] break down because of the limited precision of floating point numbers.
cvtIntPair 4e9 0 _294967296 cvtIntPair^:2 ] x: 4e9 4000000000
Since we use extended integers, it's not necessary to use formatting (16j0":) to see all the digits to verify that output looks the same as input.
Now, look at successively larger numbers to see where cvtIntPair breaks:
cvtIntPair^:2 ] 1234567890 1234567890 cvtIntPair^:2 ] 12345678901234 12345678901234 cvtIntPair^:2 ] 123456789012345 123456789012345 cvtIntPair^:2 ] 9234567890123456 9234567890123456
This last one still looks OK, but we've already passed the point of losing precision - the example above just happened to work out, as the following examples, using extended integers, or not, should make clear:
cvtIntPair^:2 ] 9123456789012345 NB. Last digit changes 9123456789012344 cvtIntPair^:2 ] 9123456789012345x NB. Or doesn't if done properly w/extended integer 9123456789012345
The good news is that we should be fine up to one less than 2^64:
2x^64 18446744073709551616 cvtIntPair^:2 ] 18446744073709551616x 0 cvtIntPair^:2 ] 18446744073709551615x 18446744073709551615 10^.2x^64 19.26592
Explanation of Selected Functions
We'll look at the following major three functions:
NB.* buildBigNumdFl: build large file with (text representation of) number N NB.* breakUpBigFile: break up file >0{y. into number of pieces using NB.* assembleBrokenFiles: put pieces back together->big file.
buildBigNumdFl
The first of these, buildBigNumdFl, builds a large file we can use to test the other two major functions of this suite. This file has a special format to make it handy for testing large-file functions: it's a text file with the file locations designated by the file contents. That is, starting at the 1000th byte is the text '1000'.
Obviously, this cannot work for every location since each number takes more than one byte but we just fill in each number as available. So, since the string '1000 ' (with a space separator at the end) takes 5 bytes, the next number has to be '1005' since it's starting in the 1005th byte. The initial byte is location zero so the first string in the file is '0 '.
Interestingly, at least the first several integer powers of 10 seem to be present: does this hold in general? For a look at some examples, see the end of this essay: Powers of Ten in a Self-Numbered File.
The format of the file makes it readily apparent, for instance, that our file-reading functions are working on the expected locations. Some examples follow.
(1!:11) 'C:\Temp\BigFile.txt';1000 20 |index error | (1!:11)'C:\Temp\BigFile.txt';1000 20
The standard indexed file read foreign conjunction does not work for this large file (except, as noted above, if you're running 64-bit version of J). For a file larger than 2 GB, we must use (our adaptation of) bigfiles function "bixreadx":
bixreadx 'C:\Temp\BigFile.txt';1000 20 1000 1005 1010 1015
This shows us that we properly started reading at file location 1000.
Now look at other locations:
bixreadx 'C:\Temp\BigFile.txt';2147483656x 32 2147483656 2147483667 2147483678 bfsize 'C:\Temp\BigFile.txt' 2196023059 NB. Read last few bytes in file: bixreadx 'C:\Temp\BigFile.txt';(2196023059x-33),33 2196023026 2196023037 2196023048
Here's an example of starting construction of a large file by writing 210 million integers which results in a file slightly larger than 2^31:
buildBigNumdFl 2.1e8;'C:\Temp\BigFile.txt'
Since it may take some time (20 minutes or more, depending on the speed of your machine) to build a file larger than 2 GB, the function is designed to pick up where it left off so you can build the file in pieces as you have the time to spare.
NB.* buildBigNumdFl: build large file with (text representation of) number N: faster. buildBigNumdFl=: 3 : 0 'nn bigfl'=. y NB. Append nn numbers if. -.fexist bigfl do. nn=. <:nn NB. Initialize if no file. '0 ' fwrite bigfl end. NB. Start counting at zero. while. 0<nn do. ctr=. bfsize bigfl NB. Reduce number of file writes len=. 2+<.10^.ctr NB. Length of nums now n2app=. 1e4<.nn<.>.len%~ctr-~10^<:len (' ',~":ctr+len*i.n2app) bappend bigfl nn=. nn-n2app end. bfsize bigfl NB.EG buildBigNumdFl2 2e8;'C:\Temp\BigFile.txt' )
breakUpBigFile
This is ostensibly the main function. The default arguments are chosen under the assumption that we want to put the resulting pieces of the large file onto writable CDs with a capacity of 720 million bytes. We resist the all-too-common convention of calling this 700 Megabytes because we would rather not obfuscate the meaning of well-established prefixes, which are based on decimal, rather than muddying their meaning with base-2 approximations.
NB.* breakUpBigFile: break up file >0{y into number of pieces using NB. dir/name >1{y reading in chunks of 0{x, filling up to 1{x byte NB. output files. Handles large files (>2^31 bytes). breakUpBigFile=: 3 : 0 1e7 720e6 breakUpBigFile y NB. Work w/1e7 bytes at-a-time, write 720e6 : 'flnm outflnm'=. y sufflen=. ->:outflnm i.&.|. '.' 'outsuff outpre'=. sufflen split outflnm totflsz=. bfsize flnm 'rdctr flctr'=. x: 0 0 NB. Need extended integers to read large file. flbase=. actsz=. 0 'maxchsz maxperfl'=. x: x if. -. nameExists 'BREAKATLINE' do. BREAKATLINE=. 0 end. NB. Max (0-origin) file counter length-># digits->leading zeros in output file NB. name number portion. maxctrlen=. >.10^.maxperfl%~totflsz while. rdctr<totflsz do. outflnm=. outpre,(maxctrlen lead0s flctr),outsuff '' fwrite outflnm NB. Initialize output file chsz=. maxchsz flmax=. maxperfl+flbase=. flbase+actsz flmax=. flmax<.totflsz actsz=. adj=. 0 while. rdctr<flmax+adj do. chsz=. chsz<.flmax-rdctr ch=. bixreadx flnm;rdctr,chsz adj=. 0 if. flmax<:rdctr+chsz do. if. BREAKATLINE do. if. LF~:{:ch do. ch=. ch}.~adj=. -<:(#ch)-ch i: LF end. end. end. assert. (chsz+adj)=ch bappend outflnm NB. Write as much as read? actsz=. actsz+chsz+adj rdctr=. x: rdctr+chsz+adj end. smoutput 'Wrote ',outflnm,'; length = ',":rdctr-flbase flctr=. >:flctr end. NB.EG breakUpBigFile 'F:\Video\RailwayChP2.avi';'C:\Video\RailwayChP2avi.dat' )
assembleBrokenFiles
This is not a very well-designed function but is included here for the sake of completeness. It works by creating a DOS binary copy command (using copy option /b) to re-assemble the smaller files into the larger one.
The function expends much effort validating the input file names. This is because we expect them to have been generated by the above breakUpBigFile function, hence we know what they should look like, and a missing or mis-placed piece would give us an incorrect assembly.
The known problems with the design of this function are that it relies on an OS-specific set of commands and is vulnerable to breaking if, for instance, a large file has been broken into so many pieces that the resulting command line is too large, requires too much interim memory, or is too slow to run.
NB.* assembleBrokenFiles: put pieces back together->big file. assembleBrokenFiles=: 3 : 0 'flnm partfls'=. y. sufflen=. (#-~'.'i:~])partfls 'outsuff outpre'=. sufflen split partfls NB. The "broken" files to be assembled should have names like, e.g. NB. 'Outfl0.dat';'Outfl1.dat';'Outfl2.dat'..., so most of the following work NB. is to extract and validate the numbers at the end of "Outfl" before the NB. suffix ".dat": they should all be valid numbers and in sequence. NB. We do a lot work validating these names because they should have been NB. generated according to the above function "breakUpBigFiles". We expect NB. the inputs to this function to conform as it is the inverse of that one. outpath=. outpre{.~>:outpre i: PATHSEP_j_ flnmprelen=. #>{:<;._1 PATHSEP_j_,outpre ofls=. {."1 jfi dir outpre,'*',outsuff flnums=. (flnmprelen}.sufflen}.])&.>ofls if. 0 e. whvn=. isValNum&>flnums do. smoutput 'Excluding file',('s'#~1~:0 +/ . =whvn),' with invalid ' smoutput 'numeric portion: ','.',~punclist ofls#~-.whvn 'ofls flnums'=. (<whvn)#&.>ofls;<flnums end. NB. It might be an error to proceed with an incomplete sequence, but we will. misssq=. sequenceGap flnums=. /:~".&>flnums if. 0~:#misssq do. smoutput 'Proceeding with missing sequence number',('s'#~1~:#misssq),':' smoutput ' ','.',~punclist ":&.>misssq end. osuff=. ~.sufflen{.partfls cmd=. }:;((<outpre),&.>":&.>flnums),&.><osuff,'+' smoutput 'Running command: ','...',~cmd=. 'copy /b ',cmd,' ',flnm shell cmd NB. This could take a while depending on file size. NB. copy /b BigFlPart0.dat+BigFlPart1.dat+BigFlPart2.dat+BigFlPart3.dat BFl.txt )
Powers of Ten in a Self-Numbered File
Does the self-numbered file produced by buildBigNumdFl always happen to line up to give us integral powers of ten? It seems like it for this first five in this sample:
bixreadx&>(<flnm);&.>(10-~10^>:i.9),&.>18+6,~2#+:i.4 0 2 4 6 8 10 13 16 91 94 97 100 104 8 992 996 1000 1005 9990 9995 10000 1000 988 99994 100000 10000 999990 999997 1000004 99988 9999996 10000004 1 9986 99999995 100000004 984 999999994 1000000004
But we see from looking at a window around the integral powers of ten that it fails when we get to one million. This sequence can be found in the On-line Encyclopedia of Integer Sequences.