NYCJUG/2023-01-10
Beginner's regatta
At the meeting, Dan told us of many other odd characters (byte combinations) that crop up regularly when you have to gather input from many varied sources. However, the lowly NBSP was enough to discuss in the beginner section.
Problem with Non-breaking Spaces
It started with a cry for help:
From: Arthur Anger <ArtAnger@comcast.net> Date: Wed, Dec 28, 2022 at 12:47 PM Subject: [Jgeneral] Spelling errors To: J-General <General@jsoftware.com>
I compose and edit my scripts using TextEdit on a Macintosh. Occasionally I copy a line or two from a J-session into a script, and find that it may cause a Spelling error when used. Those errors are always at the beginning of a line, where an invisible print-control character has crept in.
Recently, I got Spelling errors on most blank spaces when trying to load a script, and had to retype each one.
Can anyone suggest a reason and a cure?
Henry Rich had a suggestion:
From: Henry Rich <henryhrich@gmail.com> Date: Wed, Dec 28, 2022 at 12:52 PM Subject: Re: [Jgeneral] Spelling errors To: <general@jsoftware.com>
You probably have nonbreaking spaces (codepoint 0xa0) that needs to be translated to 0x20. Other possibilities are smart quotes. Why don't you write a script that translate those characters on the clipboard? Others may want to use it too.
A Solution
It turned out to be not quite as simple as Henry had suggested though his diagnosis was correct. The complication arises from the fact that the non-breaking space was in Art's file as a Unicode byte sequence. Here is how we narrowed down the problem to a few characters.
Running my J session under emacs, I loaded the file Art had sent me:
load 'D:\amisc\J\Hk_LF.ijs' |spelling error | sff=: +/"2 ff , (|.!.0 ff) ,: (|.!.0 |.!.0 ff) | ^ | reinigen=:3 :0 |[-48] D:\amisc\J\Hk_LF.ijs
If we look at an image of this session, it is immediately clear which characters are suspect:
We see cyan underscores in the line flagged for the "spelling error". Since we are in an emacs session, we can select a suspicious-looking underscore and look it up in a.. This gives us a pair of values, confirming that this is a multi-byte Unicode sequence.
As we can see in the image above, simply replacing this pair of characters with a regular space character and writing the result to file allows us to load it without getting the spelling error.
Show-and-tell
Recently I have been posting past NYCJUG meetings to the J wiki. There are quite a few meetings with interesting topics, the information for which is languishing on my hard drive.
One recent set of notes from 2006 included this slight attempt to find solutions in J for the equation a^2 + b^2 + c^2 = (a - b)(b - c)(c - a).
In that section, we didn't try to solve the problem as stated which was to find integer solutions to this equation. Instead, we developed some iterative code which attempts to search for a close solution for non-integers as this can be searched incrementally in an arbitrary neighborhood.
Here is code to incrementally move from an arbitrary point to one closer to an exact solution.
NB.* findPoints.ijs: find points that are solutions for the given equality. eqn0=: 3 : '+/*:y' NB. Set up equations eqn1=: 3 : '*/2-/\y,{.y' diffeqns=: 3 : '(eqn0 y)-eqn1 y' NB. and differencer.
First we set up the two parts of the equation as eqn0 and eqn1, then we subtract one from the other in diffeqns so we have a function that approaches zero as we get closer to a solution.
The main routine takes an existing set of points or creates some random ones if none are given.
findPoints=: 3 : 0 'rndn np'=. 2{.y if. rndn-:' ' do. rndn=. (_1e2+?1e5$201)%>:?1e5$1e2 [ np=. 25 end. diffs=. 3 diffeqns\rndn close1s=. np{./:|diffs NB. Pick closest results as seeds=. close1s{3[\rndn NB. starting points... for_around. <"1 ] 1 _1*/~0.1^>:i.4 do. pts=. 0 3$0 for_seed. seeds do. nn=. >,&.>/&.>/seed +&.>/ <steps (>around),11 dd=. diffeqns&>nn pts=. pts,>nn{~<($dd)#:(,|dd)i.<./,|dd end. seeds=. pts end. pts ) steps=: 3 : 0 NB.* steps: vector of numbers from num, to num, in numsteps steps. 'from to numsteps'=. y from+(to-from)*(numsteps-1)%~i.numsteps )
We see that findPoints works by testing a lot of points and selecting those where diffeqns is closest to zero. These are used in the loop as starting points from which we generate a small grid of points around each one to select a new point even closer to the solution.
Converging to a Solution
The argument to the for_around. statement deserves some elucidation. This sets the value of the loop variable around to be each of these four successive intervals:
<"1 ] 1 _1*/~0.1^>:i.4 +--------+----------+------------+--------------+ |0.1 _0.1|0.01 _0.01|0.001 _0.001|0.0001 _0.0001| +--------+----------+------------+--------------+
Each of these intervals is used to generate a set of points around the initial seed points in the expression >,&.>/&.>/seed +&.>/ <steps (>around),11.
For example, for an arbitrary seed point and an interval from _0.1 to 0.1, we get 11 points spaced around the initial one.
seed=. _1.4 0.7 _1.9 around=. 0.1 _0.1 >seed +&.>/ <steps (>around),11 _1.3 _1.31818 _1.33636 _1.35455 _1.37273 _1.39091 _1.40909 _1.42727 _1.44545 _1.46364 _1.48182 _1.5 0.8 0.781818 0.763636 0.745455 0.727273 0.709091 0.690909 0.672727 0.654545 0.636364 0.618182 0.6 _1.8 _1.81818 _1.83636 _1.85455 _1.87273 _1.89091 _1.90909 _1.92727 _1.94545 _1.96364 _1.98182 _2
For each seed, we generate and evaluate points in a grid around the initial one. The best of these are selected for the next round which has a tighter around interval in order to get increasingly closer to a possible solution.
Back to the Original Problem
Interestingly, this method works well if we consider the original problem which restricted us to integer solutions. In fact, we can evaluate a set of integers directly without trying to converge on a non-integral solution. However, a three-dimensional grid of points can get arbitrarily large so we have to pick some maximum size of the cube with which we want to work.
Once we have tested the values in a particular cube, we can work on all the adjacent cubes, and so on.
Original Cube
We can generate points for our original cube (centered about the origin) like this:
$pts=. >(i:50),&.>/(i:50),&.>/i:50 101 101 101 3 7!:5<'pts' 3.35544e7
The points take up over 300 megabytes of memory. Do they look like what we expect?
2 2 2 3{.pts _50 _50 _50 _50 _50 _49 _50 _49 _50 _50 _49 _49 _49 _50 _50 _49 _50 _49 _49 _49 _50 _49 _49 _49 _2 _2 _2 3{.pts 49 49 49 49 49 50 49 50 49 49 50 50 50 49 49 50 49 50 50 50 49 50 50 50
Applying the objective function, we get these results.
$dd=. diffeqns&>pts 101 101 101 3
We locate the zeros and find their indexes by converting the integer indexes from I. into three-dimensional coordinates with ($dd)#: , then selecting the points to which these indexes correspond.
(<"1 ($dd)#: I. 0=,dd){pts _21 _14 _7 _14 _7 _21 _7 _21 _14 _2 _1 1 _1 0 1 _1 1 _2 _1 1 2 0 0 0 0 1 _1 1 _2 _1 1 _1 0 1 2 _1 2 _1 1 7 14 21 14 21 7 21 7 14
Verifying one of the points
diffeqns 14 21 7 0
Expansion Cubes
Thinking about how to build cubes of points adjacent to our original one highlights the power of array languages. Starting with the corners, we know that a cube has eight of them so what is the offset from our original cube for each of the eight corners?
Adding each of these rows gives us the 8 corner cubes.
]cornerShifts=. _100 100{~#:i.8 _100 _100 _100 _100 _100 100 _100 100 _100 _100 100 100 100 _100 _100 100 _100 100 100 100 _100 100 100 100
Adding each of these rows gives us the 6 face cubes.
]faceShifts=. ,/0 1 2|."(0 1)/ _100 100 ,"(0 1) 0 0 _100 0 0 100 0 0 0 0 _100 0 0 100 0 _100 0 0 100 0
These are the offsets which give us the 12 edge-aligned cubes.
]edgeShifts=. ,/0 1 2|."(0 1)/0,._100 100{~#:i.4 0 _100 _100 0 _100 100 0 100 _100 0 100 100 _100 _100 0 _100 100 0 100 _100 0 100 100 0 _100 0 _100 100 0 _100 _100 0 100 100 0 100
We see that a cube has 8 corners, 6 faces, and 12 edges so there are 26 cubes that are adjacent to this one. Since we are going from one cube to 3x3x3 cubes, this is the correct number of additional cubes. Also, per Euler, we know that F + V - E = 2, so these numbers make sense. We are using 100 and _100 because the original cube ranges from _50 to 50 on each dimension, so moving 100 from the origin in each dimension takes us to the limit of the existing cube with an overlap of one point.
Looking at the corner cubes:
$dd=. diffeqns"1 pts+"1/ ] cornerShifts 101 101 101 8 (<"1 ($dd)#:I. 0=,dd){pts+"1/ ] cornerShifts 55 85 100 60 75 105 _125 _100 _75 75 100 125 75 105 60 85 100 55 _105 _75 _60 100 55 85 _100 _85 _55 _100 _75 _125 100 125 75 105 60 75 _85 _55 _100 _75 _125 _100 125 75 100 _75 _60 _105 _60 _105 _75 _55 _100 _85 diffeqns _85 _55 _100 0
Evaluating the edge and the face cubes gives us no solutions.
(<"1 ($dd)#:I. 0=,dd){pts+"1/edgeShifts <./|,dd NB. Closest distance 1
We have no exact matches. How many are only off by one?
$(<"1 ($dd)#:I. 1=|,dd){pts+"1/edgeShifts 18 3
Similarly for the face cubes, there are no exact matches but we have some points off by only 2.
6!:2 'dd=. diffeqns"1 pts+"1/ ] faceShifts' 9.67004 (<"1 ($dd)#:I. 0=,dd){pts+"1/ ] faceShifts <./|,dd 2 (<"1 ($dd)#:I. 2=|,dd){pts+"1/faceShifts _33 _30 _89 _30 _89 _33 89 30 33 _89 _33 _30 30 33 89 33 89 30
Advanced topics
Recently there has been some discussion in the J community about things that seem to be missing from the language. Often this particular discussion focuses on data structures like trees. There are a few essays on the wiki about trees and I have not heard of any objections to those ways of working with trees so it seems that these approaches are adequate.
What are some other data structures or ways of processing that are widely available in other languages but seem to have to equivalent in J? One that has been mentioned often is the hash table. How would we build one in J?
I should note that this is a bare introduction to hash tables that overlooks many of the issues commonly associated with implementing hash-tables, such as handling overloaded keys and bundling insertion and deletion capabilities.
Hash Table
The key idea of a hash table is that if we can generate a unique key for each item in a large array and if the key is much smaller than an array element, we should be able to look up elements in the array more quickly than if we have to compare the full element.
What should we use for our large array? The work we've done on poker gives us the idea of looking up poker hands in a table comprising all possible hands. To do this, we need a function to generate all possible combinations of a given size from a set. A search of the wiki yields this link which gives us this definition:
comb=: 4 : 0 M. NB. All size x combinations of i.y if. (x>:y)+.0=x do. i.(x<:y),x else. (0,.x comb&.<: y),1+x comb y-1 end. )
A quick test shows us what to expect.
$3 comb 5 10 3 3!5 10 3 comb 5 0 1 2 0 1 3 0 1 4 0 2 3 0 2 4 0 3 4 1 2 3 1 2 4 1 3 4 2 3 4
To generate the table of all possible 5-card hands, we do this:
$RH5=: 5 comb 52 2598960 5 5!52 NB. How many should we expect? 2598960
Look at a few sample hands:
3{.RH5 0 1 2 3 4 0 1 2 3 5 0 1 2 3 6 _3{.RH5 46 47 49 50 51 46 48 49 50 51 47 48 49 50 51 showCards 100 1000 10000 100000 1000000{RH5 +---+--+--+---+--+ |K♣ |7♣|4♣|3♣ |2♣| +---+--+--+---+--+ |10♠|8♥|4♣|3♣ |2♣| +---+--+--+---+--+ |2♣ |2♥|A♣|10♦|3♣| +---+--+--+---+--+ |A♥ |9♥|8♣|4♠ |2♣| +---+--+--+---+--+ |6♣ |6♦|J♥|4♦ |3♠| +---+--+--+---+--+
This should give us unique hash values.
makeHash=: 52 #."(0 1) ] 6!:2 'hash5=. makeHash RH5' 10.7526 usus hash5 NB. minimum, maximum, mean, and standard deviation 146176 3.5053e8 5.96886e7 5.22142e7 (#-:#@:~.) hash5 1
This last expression compares the tally of the table to the tally of its nub. If they are equal, we have unique hashes.
#RH5 2598960 ixs=. 2598960?#hash5 NB. Some random rows to look up 6!:2 'wh=. RH5 i. ixs{RH5' 2.09659 wh-:ixs NB. Are the answers correct? 1
The basic lookup took about two seconds for about 2.6 million items. How does the hash compare?
6!:2 'wh=. hash5 i. ixs{hash5' 0.123261 wh-:ixs NB. Are the answers correct? 1
Time to Free Memory
When we extend this exercise to a table of six-card hands we see that the hash lookup is still much faster than the naïve one. However, we encountered some slowness in J's response to the creation of this hash vector.
6!:2 'RH6=: 6 comb 52' NB. All 6-card hands 0.66211 6!:2 'hash6=. 52#."(0 1) RH6' [ smoutput qts'' 2023 1 9 17 59 55.97 qts'' usus hash6 22.7858 2023 1 9 18 17 41.905 7.60116e6 1.78399e10 2.60503e9 2.36078e9 20358520 2023 1 9 17 59 55.97 TSDiff 2023 1 9 18 17 41.905 +--------+-------------------+ |_1065.93|0 0 0 0 _17 _45.935| +--------+-------------------+ (#-:#@:~.)hash6 1
Notice how the inputs following the generation of hash6 are flush with the left margin. This is because we entered them immediately after the expression for generating the hashes but it took more than the 22.8 seconds of compute time for J to be ready to accept further input. So the difference between the two timestamps - over 1,000 seconds - reflects a period where J was non-responsive even though the computation was finished.
Here we see how memory is slowly returned after peaking during the hash generation.
ixs=. 2598960?#hash6 NB. Same number of samples as in the 5-card case 6!:2 'hash6 i. ixs{hash6' 0.537749 6!:2 'RH6 i. ixs{RH6' 17.7146
Once again, we see that the hash lookup is much faster than the basic J expression.
Learning and Teaching J
There is this essay about algorithms with which every programmer should have experience. These are common algorithms much as hashing is so it would be interesting to see how well suited J is for coding them. There is already code for some of these which I note below.
Topological sort
We use topological sort to sort items that are dependent on other items. Common scenarios include determining the order of tasks for build systems, the evaluation order of cells in a spreadsheet, which classes a student should take each semester to graduate, and a variety of scheduling problems.
Here is the same graph represented two different ways; the latter one is topologically sorted.
More information:
Used in J code for writing a WAV file
Using topological sort in J for an Advent of Code problem
Topological sorting (Wikipedia)
Topological sort interactive visualization (web)
Recursive descent parsing
We have examples of this in J here and here.
This one felt like unlocking a super power when it clicked for me. If you ever need to ingest a recursive data format or make a programming language, this is the tool for you. Sketch out a grammar then map each grammar rule to a code function. Bam! It feels like it codes itself.
You could have a simple compiler whipped up in a few hours with it.
More information:
Regex lexer
Essay by Oleg Kobchenko includes an example of parsing JSON
Let's make a Teeny Tiny compiler (author's tutorial series)
Crafting Interpreters and Amazon
Recursive descent parser (Wikipedia)
Myers string diff
Processing strings is such a common programming task, but I can usually get away with brute forcing whatever it is I need to calculate. We've all learned about Levenshtein distance, but what if you need to show the differences between two strings in a sensible way?
The default algorithm used by git for showing diffs is the Myers diff algorithm.
More information:
The Myers diff algorithm
Myers diff algorithm - Code and interactive visualization
Bloom filter
Apparently Bloom filters are used internally in J.
Given the remark here about "a really big hash table", this might be a follow-on to the hash table work above.
The most known unknown data structure. In fact, I polled some friends at Big Tech about what data structure they'd suggest I add to this list and all of them said bloom filter!
It really only helps with large-scale problem, such as when you need a really big hash table, but it is a great introduction to probabilistic data structures. With very little memory, it can tell you if an item is not present in the table, otherwise it can only tell you maybe the item is present.
Google asked me two interview problems involving these back when I was in college. Have I ever needed this in practice? Nope!
More information:
Bloom filter (Wikipedia)
Let's implement a bloom filter
Bloom filters by example
Piece table
When I first tried making my own text editor, I stored all of the text in an array. But that hits performance problems right away (e.g., when the user inserts text anywhere but the end). Years later I even got asked how to implement a performant text buffer at an internship interview.
What should I have used? Well, there are a few options with different tradeoffs: rope, gap buffer, or piece table. With a piece table, you track the edits performed on text rather than only keeping the resulting text. It makes a lot of other features trivial to add (e.g., undo support and incremental saving).
Piece tables aren't only useful for text editors though. You can use it whenever you have a large buffer of data that can be arbitrarily modified.
More information:
Piece table (Wikipedia)
VS Code's text buffer story
The piece table
Splay tree
How about a self-optimizing data structure? Splay trees are binary trees that tend to have the more recently accessed elements closer to the root, and it does so without maintaining any additional metadata. Each time an element is accessed, it gets moved to the root.
It's a tree with a built-in cache and it is beautifully simple to implement! As a student, I had a really hard time implementing red-black trees (we wrote the code on paper, ugh) so I was pleasantly surprised when splay trees just clicked for me.
More information:
Splay tree interactive visualization
Splay tree (Wikipedia)
Implementation in TypeScript (GitHub)
Miscellaneous Other
If you are still craving more, here are the honorable mentions that I cut from the list:
- Trie - the lack of this has been cited as a shortcoming of J
- Segment tree
- Operational transformation
- Fibonacci heap
- Skip list
- Union find
Try these out, and let me know if I missed any!