NYCJUG/2019-01-15
Beginner's Regatta
Finding Hex Words
There's a sequence of hexadecimal digits known to assembly language programmers: DEAD BEEF. It's a silly use of hexadecimal digits to spell recognizable words. It's of little use other than perhaps to make it easy to spot memory values initialized to this but not subsequently over-written.
In the spirit of silliness, we set out to find other words that can be spelled using only the letters A through F, though the following works only in lower-case.
First, we read in a large file of English words and break it into words.
flnm=. 'e:\amisc\txt\Words\githubDwyl_english-words.txt' $wl=. <;._2 ] LF (] , [ #~ [ ~: [: {: ]) CR-.~fread flnm 466544
Next we select only those words represented in lower-case and having four or more letters.
Alpha_j_ NB. Standard variable in "j" namespace ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz wl=. wl#~*./&>wl e. &.><26}.Alpha_j_ NB. Only lower-case alphabetics $wl=. wl#~4<:&>#&>wl NB. Only 4 or more letters 340017 5{.wl NB. First 5 words +-----+------+----+-----+------+ |aahed|aahing|aahs|aalii|aaliis| +-----+------+----+-----+------+ _5{.wl NB. Last 5 words +---------+----------+-------+----------+------------+ |zwiebacks|zwieselite|zwitter|zwitterion|zwitterionic| +---------+----------+-------+----------+------------+
Finally, we further filter down to only words consisting completely of the letters a-f, write these to file, and display them.
hex=. 'abcdef' $hw=. wl#~ *./&>wl e.&.><hex 90 hw +----+-----+-----+-----+----+-----+----+----+----+------+-------+---... |abac|abaca|abada|abaff|abed|abede|acad|acca|acce|accede|acceded|ace... +----+-----+-----+-----+----+-----+----+----+----+------+-------+---... (;LF,~&.>hw) fwrite 'hexWords.txt' 539 q:90 2 3 3 5 8 12$96{.hw +------+------+------+------+------+-------+------+------+-------+-------+-------+-------+ |abac |abaca |abada |abaff |abed |abede |acad |acca |acce |accede |acceded|acea | +------+------+------+------+------+-------+------+------+-------+-------+-------+-------+ |aceae |aced |addda |added |adead |aface |afaced|affa |baaed |bacaba |bacca |baccae | +------+------+------+------+------+-------+------+------+-------+-------+-------+-------+ |bade |baff |baffed|bead |beaded|bebed |bedaff|bedded|bedead |bedeaf |beebee |beef | +------+------+------+------+------+-------+------+------+-------+-------+-------+-------+ |beefed|caba |cabaa |cabbed|cabda |cace |cadded|cade |cadee |caeca |caff |caffa | +------+------+------+------+------+-------+------+------+-------+-------+-------+-------+ |ceca |cecca |cede |ceded |dabb |dabba |dabbed|daff |daffed |dead |deaf |debe | +------+------+------+------+------+-------+------+------+-------+-------+-------+-------+ |decad |decade|decaf |decd |decede|deda |dedd |deed |deeded |deedeed|deface |defaced| +------+------+------+------+------+-------+------+------+-------+-------+-------+-------+ |defade|ebbed |ebcd |ecce |efface|effaced|faade |facade|facaded|face |faced |fade | +------+------+------+------+------+-------+------+------+-------+-------+-------+-------+ |faded |faff |feeb |feed |feeded|feff | | | | | | | +------+------+------+------+------+-------+------+------+-------+-------+-------+-------+
Show-and-tell
Setting a Random Seed
Sometimes when we are using random numbers, we don’t want to rely on the default setting for the random seed. This may be for a number of reasons such as spinning off a number of simultaneous processes that require different starting points.
Here is what J offers as random number generators (from https://code.jsoftware.com/wiki/Vocabulary/query):
1. J offers a choice of random-number generator (RNG):
Code | RNG Name | Time |
---|---|---|
1 | GB_Flip | 1 |
2 | Mersenne Twister (default) | 1 |
3 | DX-1597-4d | 3 |
4 | MRG32k3a | 8 |
0 | Combination of all the above RNGs | 12 |
The Time column gives an estimate of relatively how much time each one uses, where higher is slower.
Select the RNG with 9!:43 n where n comes from the table above, or use
9!:42 ''
to see which RNG is selected.
2. The sequence of random numbers depends on the state of the RNGs. This state, which includes the choice of RNG to use, can be read by
RNGstate =: 9!:44 ''
and restored by 9!:45 RNGstate. The state is a list of boxes whose format should not be disturbed.
3. The state of the RNG can be reset by
9!:1 initvalue
where initvalue is an integer atom (or, for the Mersenne Twister generator only, an integer list) that selects a starting state. Resetting to a given initvalue always returns the RNG to the same initial state. Phrase 9!:0 will return the most recently used initvalue.
For RNGs 3 and 4, initvalue must not be 0.
Basic Idea for Seeding
We take something like a timestamp
6!:0 ''
because it gives us a different value nearly every time we call it, and flip it so the least significant portion is first, then reduce it to a single number with some base encoding like this we can use to set the seed:
13#.|.6!:0'' 2.19267e7 9!:1 ] <.13#.|.6!:0''
We round down with “floor” because the argument to “9!:1” needs to be a whole number.
Comparing Different Methods
How well would this work if we were starting a number of processes nearly simultaneously? To figure this out, let’s look at the distribution of “<.13#.|.6!:0”. We will generate a number of random seeds with varying amounts of wait time between each to see how widely-distributed these seeds might be.
testRandSeedSet13=: 3 : 0 outflnm=. 'RndTest13.txt' while. _1<y=. <:y do. 6!:3]0.1*>:?20[ (LF,~12j0":13#.|.qts'') fappend outflnm end. )
We will compare this with two variants:
testRandSeedSet0=: 3 : 0 outflnm=. 'RndTest-7_11_13.txt' while. _1<y=. <:y do. 6!:3]0.1*>:?20[ (LF,~12j0":-/7 11 13#."(0 1)|.qts'') fappend outflnm end. ) testRandSeedSet3=: 4 : 0 outflnm=. x while. _1<y=. <:y do. 6!:3]0.1*>:?20[ (LF,~12j0":+/7 11 13 17#."(0 1)|.qts'') fappend outflnm end. )
The wider the distribution, the better, so testRandSeedSet13 looks the best here.
A few other attempts give us better results:
testRandSeedSet1=: 3 : 0 outflnm=. 'RndTest+7_11_13.txt' while. _1<y=. <:y do. 6!:3]0.1*>:?20[ (LF,~12j0":+/7 11 13#."(0 1)|.qts'') fappend outflnm end. ) testRandSeedSet3=: 4 : 0 outflnm=. x while. _1<y=. <:y do. 6!:3]0.1*>:?20[ (LF,~12j0":+/5 7 11#."(0 1)|.qts'') fappend outflnm end. ) testRandSeedSet6=: 4 : 0 outflnm=. x while. _1<y=. <:y do. 6!:3]0.1*>:?20 [ (LF,~12j0":setSeed'') fappend outflnm end. ) setSeed=: 3 : '<.(2^31)|13#.|.(6!:9]y),(6!:1]y),(6!:4]y),6!:0]y'
Plotting File Sizes
Here we use a histogram to get an idea of the distribution of the sizes of different kinds of files. We assume we have the file names in a vector of character strings FLNMS and the corresponding sizes in a numeric vector FLSZS. We create these by running code from here - parseDir.ijs - usually with buildTimedDiskInfo, used like this, for example, to gather information on the C: drive:
". buildTimedDiskInfo 'C' NB. Build vecs "FLNM_C_" etc.
Look at the file size vector:
+/FLSZS 307166294265 10^.+/FLSZS 11.4874
Load my statistics functions in order to run usus - the usual statistics: minimum, maximum, mean, and standard deviation.
load 'mystats' usus FLSZS 0 2.67859e10 267342 2.95951e7
Load the plot routines which are used by the histogram plotter plotHistoMulti - which may be found here:
load 'plot' ss=. 'ln(file sizes)' plotHistoMulti ^.>:FLSZS [ PCT=: 0
Write something to extract file suffixes (portion after final ".") so we can filter on file types:
getFlSuffix=: ] }.~ [: >: '.' i:~ ] getFlSuffix &.> 'hi.there';'none';'double.dots.txt';'file.txt';'file.csv';'short.el' +-----++---+---+---+--+ |there||txt|txt|csv|el| +-----++---+---+---+--+ $~.suffixes=. getFlSuffix &.> FLNMS 4525
Look at the frequency of occurrence of each file suffix using frtab:
frtab=: 3 : 0 cts=. #/.~ y NB. Count # of each unique item. if. -.isNum y do. NB. Special case enclosed, text mat, text vec, if. (L.=0:) y do. (<"0 cts),.<"0 ~.y else. if. 2=#$y do. (<"0 cts),.<"1 ~.y else. (<"0 cts),.<"0 ~.y end. end. else. cts,.~.y end. NB. and simple numeric vec. NB.EG (1 2 3 2 1,. 11 22 33 222 99) -: frtab 11 22 22 33 33 33 222 222 99 ) $frtab suffixes 4525 2 2{.frtab suffixes +----+--------+ |1 |+------+| | ||x86_64|| | |+------+| +----+--------+ |12 |+-----+ | | ||emacs| | | |+-----+ | +----+--------+
This does not look that useful, so sort the most frequently occurring first:
sufFrq=. (] 1}&.|:~ [: > 1}"1) frtab suffixes sufFrq=. sufFrq\:/0{"1 sufFrq NB. Suffix frequency of occurrence from high to low 5{.sufFrq +------+--------+ |557237| | +------+--------+ |39801 |dll | +------+--------+ |34876 |manifest| +------+--------+ |25413 |txt | +------+--------+ |25134 |h | +------+--------+
Building the Histogram
Try a preliminary histogram, notice a small problem with the suffixes being different cases, and fix this.
ss=. 'ln(.DLL File Sizes)' plotHistoMulti ^.>:FLSZS#~(tolower&.>suffixes)=<'dll' $~.suffixes 4525 $~.tolower&.>suffixes 4324 suffixes=. tolower&.>suffixes
Progressively build the histogram by checking what it looks like, then adding and changing parameters.
ss=. 'ln(.txt File Sizes)' plotHistoMulti ^.>:FLSZS#~suffixes=<'txt' BKTS=: -:i.50 whdll=. suffixes=<'dll' whtxt=. suffixes=<'txt' OTHERPLOTARGS=: 'key All DLLs TXTs' ss=. 'ln(File Sizes)' plotHistoMulti ^.&.>>:&.>(FLSZS#~-.whdll+.whtxt);(FLSZS#~whdll);<FLSZS#~whtxt whmanifest=. suffixes=<'manifest' whpix=. (suffixes=<'png')+.suffixes=<'jpg' OTHERPLOTARGS=: 'key All DLL TXT MANIFEST pics' ss=. 'ln(File Sizes)' plotHistoMulti ^.&.>>:&.>(FLSZS#~-.whdll+.whtxt+.whmanifest+.whpix);(FLSZS#~whdll);(FLSZS#~whtxt);(FLSZS#~whmanifest);<FLSZS#~whpix OTHERPLOTARGS=: 'key All DLL TXT pics' ss=. 'ln(File Sizes)' plotHistoMulti ^.&.>>:&.>(FLSZS#~-.whdll+.whtxt+.whpix);(FLSZS#~whdll);(FLSZS#~whtxt);<FLSZS#~whpix +/whinet=. (suffixes=<'htm')+.(suffixes=<'html')+.suffixes=<'php' 26794 OTHERPLOTARGS=: 'key DLL TXT pics internet' ss=. 'ln(File Sizes)' plotHistoMulti ^.&.>>:&.>(FLSZS#~whdll);(FLSZS#~whtxt);(FLSZS#~whpix);<FLSZS#~whinet
After several iterations like this, we've settled on the parameters we want, so we plot the log of the file sizes and manually add the underlying sizes corresponding to the logs in order to easily translate the sizes into a rough number of bytes. We have also manually added the basic code to re-produce the histogram onto the image.
Looking at this, we can tell, for example that most .txt files are between about 20,000 and 60,000 bytes in size; also, most DLLs are between 2,700 and 1.6 million bytes but there are about a thousand between 1.6 and 4.8 million bytes.