Beginner's Regatta

Finding Hex Words

There's a sequence of hexadecimal digits known to assembly language programmers: DEAD BEEF. It's a silly use of hexadecimal digits to spell recognizable words. It's of little use other than perhaps to make it easy to spot memory values initialized to this but not subsequently over-written.

In the spirit of silliness, we set out to find other words that can be spelled using only the letters A through F, though the following works only in lower-case.

First, we read in a large file of English words and break it into words.

   flnm=. 'e:\amisc\txt\Words\githubDwyl_english-words.txt'
   $wl=. <;._2 ] LF (] , [ #~ [ ~: [: {: ]) CR-.~fread flnm
466544

Next we select only those words represented in lower-case and having four or more letters.

   Alpha_j_                               NB. Standard variable in "j" namespace
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
   wl=. wl#~*./&>wl e. &.><26}.Alpha_j_   NB. Only lower-case alphabetics
   $wl=. wl#~4<:&>#&>wl                   NB. Only 4 or more letters
340017
   
   5{.wl                                  NB. First 5 words
+-----+------+----+-----+------+
|aahed|aahing|aahs|aalii|aaliis|
+-----+------+----+-----+------+
   _5{.wl                                 NB. Last 5 words
+---------+----------+-------+----------+------------+
|zwiebacks|zwieselite|zwitter|zwitterion|zwitterionic|
+---------+----------+-------+----------+------------+

Finally, we further filter down to only words consisting completely of the letters a-f, write these to file, and display them.

   hex=. 'abcdef'
   $hw=. wl#~ *./&>wl e.&.><hex
90
   hw
+----+-----+-----+-----+----+-----+----+----+----+------+-------+---...
|abac|abaca|abada|abaff|abed|abede|acad|acca|acce|accede|acceded|ace...
+----+-----+-----+-----+----+-----+----+----+----+------+-------+---...
   (;LF,~&.>hw) fwrite 'hexWords.txt'
539
   q:90
2 3 3 5

   8 12$96{.hw
+------+------+------+------+------+-------+------+------+-------+-------+-------+-------+
|abac  |abaca |abada |abaff |abed  |abede  |acad  |acca  |acce   |accede |acceded|acea   |
+------+------+------+------+------+-------+------+------+-------+-------+-------+-------+
|aceae |aced  |addda |added |adead |aface  |afaced|affa  |baaed  |bacaba |bacca  |baccae |
+------+------+------+------+------+-------+------+------+-------+-------+-------+-------+
|bade  |baff  |baffed|bead  |beaded|bebed  |bedaff|bedded|bedead |bedeaf |beebee |beef   |
+------+------+------+------+------+-------+------+------+-------+-------+-------+-------+
|beefed|caba  |cabaa |cabbed|cabda |cace   |cadded|cade  |cadee  |caeca  |caff   |caffa  |
+------+------+------+------+------+-------+------+------+-------+-------+-------+-------+
|ceca  |cecca |cede  |ceded |dabb  |dabba  |dabbed|daff  |daffed |dead   |deaf   |debe   |
+------+------+------+------+------+-------+------+------+-------+-------+-------+-------+
|decad |decade|decaf |decd  |decede|deda   |dedd  |deed  |deeded |deedeed|deface |defaced|
+------+------+------+------+------+-------+------+------+-------+-------+-------+-------+
|defade|ebbed |ebcd  |ecce  |efface|effaced|faade |facade|facaded|face   |faced  |fade   |
+------+------+------+------+------+-------+------+------+-------+-------+-------+-------+
|faded |faff  |feeb  |feed  |feeded|feff   |      |      |       |       |       |       |
+------+------+------+------+------+-------+------+------+-------+-------+-------+-------+

Show-and-tell

Setting a Random Seed

Sometimes when we are using random numbers, we don’t want to rely on the default setting for the random seed. This may be for a number of reasons such as spinning off a number of simultaneous processes that require different starting points.

Here is what J offers as random number generators (from https://code.jsoftware.com/wiki/Vocabulary/query):

1. J offers a choice of random-number generator (RNG):

Code	RNG Name	Time
1	GB_Flip	1
2	Mersenne Twister (default)	1
3	DX-1597-4d	3
4	MRG32k3a	8
0	Combination of all the above RNGs	12

The Time column gives an estimate of relatively how much time each one uses, where higher is slower.

Select the RNG with 9!:43 n where n comes from the table above, or use

 9!:42 ''

to see which RNG is selected.

2. The sequence of random numbers depends on the state of the RNGs. This state, which includes the choice of RNG to use, can be read by

RNGstate =: 9!:44 ''

and restored by 9!:45 RNGstate. The state is a list of boxes whose format should not be disturbed.

3. The state of the RNG can be reset by

9!:1 initvalue

where initvalue is an integer atom (or, for the Mersenne Twister generator only, an integer list) that selects a starting state. Resetting to a given initvalue always returns the RNG to the same initial state. Phrase 9!:0 will return the most recently used initvalue.

For RNGs 3 and 4, initvalue must not be 0.

Basic Idea for Seeding

We take something like a timestamp

6!:0 ''

because it gives us a different value nearly every time we call it, and flip it so the least significant portion is first, then reduce it to a single number with some base encoding like this we can use to set the seed:

   13#.|.6!:0''
2.19267e7
   9!:1 ] <.13#.|.6!:0''

We round down with “floor” because the argument to “9!:1” needs to be a whole number.

Comparing Different Methods

How well would this work if we were starting a number of processes nearly simultaneously? To figure this out, let’s look at the distribution of “<.13#.|.6!:0”. We will generate a number of random seeds with varying amounts of wait time between each to see how widely-distributed these seeds might be.

testRandSeedSet13=: 3 : 0
   outflnm=. 'RndTest13.txt'
   while. _1<y=. <:y do. 6!:3]0.1*>:?20[ (LF,~12j0":13#.|.qts'') fappend outflnm end.
)

We will compare this with two variants:

testRandSeedSet0=: 3 : 0
   outflnm=. 'RndTest-7_11_13.txt'
   while. _1<y=. <:y do. 6!:3]0.1*>:?20[ (LF,~12j0":-/7 11 13#."(0 1)|.qts'') fappend outflnm end.
)

testRandSeedSet3=: 4 : 0
   outflnm=. x
   while. _1<y=. <:y do. 6!:3]0.1*>:?20[ (LF,~12j0":+/7 11 13 17#."(0 1)|.qts'') fappend outflnm end.
)

The wider the distribution, the better, so testRandSeedSet13 looks the best here.

A few other attempts give us better results:

testRandSeedSet1=: 3 : 0
   outflnm=. 'RndTest+7_11_13.txt'
   while. _1<y=. <:y do. 6!:3]0.1*>:?20[ (LF,~12j0":+/7 11 13#."(0 1)|.qts'') fappend outflnm end.
)

testRandSeedSet3=: 4 : 0
   outflnm=. x
   while. _1<y=. <:y do. 6!:3]0.1*>:?20[ (LF,~12j0":+/5 7 11#."(0 1)|.qts'') fappend outflnm end.
)

testRandSeedSet6=: 4 : 0
   outflnm=. x
   while. _1<y=. <:y do. 6!:3]0.1*>:?20 [ (LF,~12j0":setSeed'') fappend outflnm end.
)

setSeed=: 3 : '<.(2^31)|13#.|.(6!:9]y),(6!:1]y),(6!:4]y),6!:0]y'

Plotting File Sizes

Here we use a histogram to get an idea of the distribution of the sizes of different kinds of files. We assume we have the file names in a vector of character strings FLNMS and the corresponding sizes in a numeric vector FLSZS. We create these by running code from here - parseDir.ijs - usually with buildTimedDiskInfo, used like this, for example, to gather information on the C: drive:

   ". buildTimedDiskInfo 'C' NB. Build vecs "FLNM_C_" etc.

Look at the file size vector:

   +/FLSZS
307166294265
   10^.+/FLSZS
11.4874

Load my statistics functions in order to run usus - the usual statistics: minimum, maximum, mean, and standard deviation.

   load 'mystats'
   usus FLSZS
0 2.67859e10 267342 2.95951e7

Load the plot routines which are used by the histogram plotter plotHistoMulti - which may be found here:

   load 'plot'
   ss=. 'ln(file sizes)' plotHistoMulti ^.>:FLSZS [ PCT=: 0

Write something to extract file suffixes (portion after final ".") so we can filter on file types:

   getFlSuffix=: ] }.~ [: >: '.' i:~ ]
   getFlSuffix &.> 'hi.there';'none';'double.dots.txt';'file.txt';'file.csv';'short.el'
+-----++---+---+---+--+
|there||txt|txt|csv|el|
+-----++---+---+---+--+
   $~.suffixes=. getFlSuffix &.> FLNMS
4525

Look at the frequency of occurrence of each file suffix using frtab:

   frtab=: 3 : 0
   cts=. #/.~ y          NB. Count # of each unique item.
   if. -.isNum y do.     NB. Special case enclosed, text mat, text vec,
       if. (L.=0:) y do. (<"0 cts),.<"0 ~.y else. 
           if. 2=#$y do. (<"0 cts),.<"1 ~.y else. (<"0 cts),.<"0 ~.y end.
       end.
   else. cts,.~.y end.   NB. and simple numeric vec.
NB.EG (1 2 3 2 1,. 11 22 33 222 99) -: frtab 11 22 22 33 33 33 222 222 99
)

   $frtab suffixes
4525 2
   2{.frtab suffixes
+----+--------+
|1   |+------+|
|    ||x86_64||
|    |+------+|
+----+--------+
|12  |+-----+ |
|    ||emacs| |
|    |+-----+ |
+----+--------+

This does not look that useful, so sort the most frequently occurring first:

   sufFrq=. (] 1}&.|:~ [: > 1}"1) frtab suffixes
   sufFrq=. sufFrq\:/0{"1 sufFrq      NB. Suffix frequency of occurrence from high to low
   5{.sufFrq
+------+--------+
|557237|        |
+------+--------+
|39801 |dll     |
+------+--------+
|34876 |manifest|
+------+--------+
|25413 |txt     |
+------+--------+
|25134 |h       |
+------+--------+

Building the Histogram

Try a preliminary histogram, notice a small problem with the suffixes being different cases, and fix this.

   ss=. 'ln(.DLL File Sizes)' plotHistoMulti ^.>:FLSZS#~(tolower&.>suffixes)=<'dll'
   $~.suffixes
4525
   $~.tolower&.>suffixes
4324
   suffixes=. tolower&.>suffixes

Progressively build the histogram by checking what it looks like, then adding and changing parameters.

   ss=. 'ln(.txt File Sizes)' plotHistoMulti ^.>:FLSZS#~suffixes=<'txt'
   BKTS=: -:i.50
   whdll=. suffixes=<'dll'
   whtxt=. suffixes=<'txt'
   OTHERPLOTARGS=: 'key All DLLs TXTs'
   ss=. 'ln(File Sizes)' plotHistoMulti ^.&.>>:&.>(FLSZS#~-.whdll+.whtxt);(FLSZS#~whdll);<FLSZS#~whtxt
   whmanifest=. suffixes=<'manifest'
   whpix=. (suffixes=<'png')+.suffixes=<'jpg'
   OTHERPLOTARGS=: 'key All DLL TXT MANIFEST pics'
   ss=. 'ln(File Sizes)' plotHistoMulti ^.&.>>:&.>(FLSZS#~-.whdll+.whtxt+.whmanifest+.whpix);(FLSZS#~whdll);(FLSZS#~whtxt);(FLSZS#~whmanifest);<FLSZS#~whpix
   OTHERPLOTARGS=: 'key All DLL TXT pics'
   ss=. 'ln(File Sizes)' plotHistoMulti ^.&.>>:&.>(FLSZS#~-.whdll+.whtxt+.whpix);(FLSZS#~whdll);(FLSZS#~whtxt);<FLSZS#~whpix
   +/whinet=. (suffixes=<'htm')+.(suffixes=<'html')+.suffixes=<'php'
26794
   OTHERPLOTARGS=: 'key DLL TXT pics internet'
   ss=. 'ln(File Sizes)' plotHistoMulti ^.&.>>:&.>(FLSZS#~whdll);(FLSZS#~whtxt);(FLSZS#~whpix);<FLSZS#~whinet

After several iterations like this, we've settled on the parameters we want, so we plot the log of the file sizes and manually add the underlying sizes corresponding to the logs in order to easily translate the sizes into a rough number of bytes. We have also manually added the basic code to re-produce the histogram onto the image.

Looking at this, we can tell, for example that most .txt files are between about 20,000 and 60,000 bytes in size; also, most DLLs are between 2,700 and 1.6 million bytes but there are about a thousand between 1.6 and 4.8 million bytes.

Advanced topics

[Jprogramming] Tagging Duplicate Values in a List

Learning and teaching J

NYCJUG/2019-01-15

Contents