NYCJUG/2024-06-11
Beginner's regatta
We had an interesting discussion of multi-threading on the J forum this month. It reminded us that J can do its own multithreading, even on fairly basic operations, simply by starting a number of worker threads, e.g. by running 0&T.@''^:4 '' prior to a compute-intensive task like matrix multiplication.
Speeding Up Matrix Multiply
This email started a discussion on different ways to speed up a large matrix multiplication.
From: Thomas McGuire <tmcguirefl@gmail.com> Date: Wed, May 29, 2024 at 2:16 AM Subject: [Jforum] J matrix multiplication performance To: <forum@jsoftware.com>
I had forked J9.6 and added the apple M1/M2 patch Elijah Stone had placed in the old forum. I did see about a 33% decrease in the time it took to do the following multiply:
   A =: 3000 4000?.@$0
   B =: 4000 2000?.@$0
   C =: A +/ . * B
I compiled with USE_OPENMP=1 and used brew to download libomp. I just thought I would see a more dramatic change. Does anyone have any experience with this? This takes about 4.7 seconds when I time it.
Also on the old forum archive was a discussion about using threading to speed this up. But on my regular J9.6 beta 8 (downloaded from the install directory at the jsoftware site), I see no appreciable performance enhancement. I do see that it engages more than 1 core.
The threading discussion indicated that using:
0&T.@''^:4 ''
should provide a noticeable speedup in the calculation. But it takes about 6.3 seconds on my M2 Max MacBook, threads or no threads.
Any insight would be appreciated.
---
From: Elijah Stone <elronnd@elronnd.net> Date: Wed, May 29, 2024 at 2:29 AM
openmp should not affect anything in this case
33% was in the same ballpark as the speedup i saw from using the apple matrix multiply
---
From: Thomas McGuire <tmcguirefl@gmail.com> Date: Wed, May 29, 2024 at 5:29 AM
A correction: it is PyTorch that has incorporated access to the Apple silicon GPUs, not Numpy.
Tom McGuire
---
From: Thomas McGuire <tmcguirefl@gmail.com> Date: 5/29/24 00:56 (GMT-07:00)
My understanding of the Accelerate framework that Apple provides is that this operation is taking place on a matrix multiply coprocessor available to certain cores on the M2 chip. Not quite what I was hoping for.
In researching this, it looks as though Numpy has taken some dedicated shaders Apple made for the Metal API and is able to force some matrix multiplies onto the GPU. Do you know anything about doing that?
I’m not sure what I would do with it yet but it seems pretty cool that my high level J code could run on the GPU without me having to use separate API calls and go through the tedium of setting up a DLL interface.
Thanks for your late night reply, btw (at least in my timezone)
Tom McGuire
---
From: Ak <akinboo@gmail.com> Date: Wed, May 29, 2024 at 10:00 AM
How many threads have you spun up? (1T.)
Are you using threadpools other than threadpool '0'?
How many cores are available? (8 T.)
What instruction are you using to run the command?
Ak
---
My Results
I ran the above exercise on my (non-Apple) machine which boasts an Intel(R) Core(TM) i9-10900F CPU running at 2.80GHz; it has 10 cores, which gives me 20 logical processors. Interestingly, J shows that I have 20 logical processors and up to 63 threads:
   8 T.''
20 63
Setting up the fairly large inputs and running the matrix multiply gives us a base timing for only a single thread.
   A =: 3000 4000?.@$0
   B =: 4000 2000?.@$0
   1 T. ''                  NB. How many worker threads do we have?
0
   (10) 6!:2 'A +/ . * B'   NB. Average time over 10 iterations
1.34811
Now I start four threads and time it again:
   0&T.@''^:4 ''
   1 T. ''                  NB. Check how many threads we have.
4
   (10) 6!:2 'A +/ . * B'
0.303385
This improves performance by more than a factor of four, which is pretty impressive. I think this reflects running on five threads: the original base thread and four additional worker threads.
Adding more threads gives even faster execution but with diminishing speedups.
   0&T.@''^:8 ''            NB. Add 8 more threads
12
   1 T. ''                  NB. Giving a total of 12 threads
12
   (10) 6!:2 'A +/ . * B'   NB. Even faster
0.201836
This shows a speedup from the original by a factor of almost seven, so it's not proportionally as much of a speedup as we saw with the first four threads added.
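To see where the returns diminish on a given machine, we can sweep the thread count. Here is a minimal sketch (the verb timings is our own name; it assumes A and B are defined as above and that we start with no worker threads):

   timings=: 3 : 0
     r=. ''
     for. i. y do.
       0 T. ''                          NB. Add one worker thread to pool 0
       r=. r , (10) 6!:2 'A +/ . * B'   NB. Average of 10 runs at this count
     end.
     r
   )
   timings 12   NB. Timings for 1 through 12 worker threads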
Even More Speed
This discussion evolved into one about working with compiled modules, but that is outside the scope of a beginner's section, so we cover the rest of the discussion in the Advanced Topics section below.
Show-and-tell
We look at some preliminary work demonstrating J to an audience of hackers. This is part of a talk I plan to give at the Hackers On Planet Earth conference - HOPE XV - this July in New York.
Adding Text to an Image
Recently I found an image on the web that I wanted to save, but the explanation for the image was fairly involved, so I wanted to save the text of the explanation with the image as a single file.
Let's say it was this image of a painting by Pieter Bruegel the Elder:
There is some text associated with this image, like this:
   #text=. 0 : 0
The Peasant Wedding Banquet, 1568 by Peter Bruegel the Elder
The bride is sitting under her bridal crown; it is unclear which of the others is the bridegroom.
The feast is taking place in the barn, the wall behind the guests consisting of stacked-up straw or corn.
Two ears of corn with a rake call to mind the work that harvesting involves.
The plates are being carried around on a door taken off its hinges.
The principal form of nourishment in those days consisted of bread, porridge, and soup.
)
502
How do we keep the text together with the image? We could add an area below the picture and paste the text in there as an image, but this looks clunky. As a test, I simply appended the text to the file like this:
   text fappend 'The Peasant Wedding Banquet.jpg'
502
As it turns out, this works just fine. When I double-click on the file name, it opens as an image but the text remains appended to it:
   text -: _502{.fread 'The Peasant Wedding Banquet.jpg'
1
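The retrieval above relies on knowing that the text is exactly 502 characters long. If we also append a fixed-width length field, the file becomes self-describing; here is a minimal sketch, where the 12-digit width is an arbitrary choice of ours (assume we start from a fresh copy of the image file):

   (text , _12 {. ": # text) fappend 'The Peasant Wedding Banquet.jpg'
   f=. fread 'The Peasant Wedding Banquet.jpg'
   len=. 0 ". _12 {. f          NB. The last 12 bytes hold the text length
   text -: (- len) {. _12 }. f  NB. Drop the tag, take that many bytes from the end
1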
Appending Images
Seeing how well this worked, and how easy it was to do in J, I experimented with doing other things with image files, like appending one image to another.
   (fread 'The Little Tower of Babel by Bruegel.jpg') fappend 'The Peasant Wedding Banquet.jpg'
351773
However, when I double-click on the resulting file, only the initial image shows up. The second image is in the file but hidden by the first one.
This got me thinking about steganography, which is the hiding of data within a file.
Initial Attempts at Steganography
In a sense, my initial appending of the text to the image file qualifies as steganography because, unless we know where to look, the text is hidden within that file. However, looking at the file in any sort of text editor, not as an image, quickly reveals the hidden text, so it's not very well hidden.
Here we see a view of the image file within the emacs editor:
The "hidden" text is plainly visible at the end of the file.
This got me thinking about better ways to hide text in an image file. We could encode the plain text before appending it to the file so it's not so evident, but this complicates things because we have to know where the text is and how to decode it.
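For example, even a throwaway byte rotation by a secret key makes the appended text illegible in an editor. A minimal sketch, where the key 137 is arbitrary:

   key=. 137
   enc=. a. {~ 256 | key + a. i. text      NB. Rotate each byte by the key
   dec=. a. {~ 256 | (- key) + a. i. enc   NB. Rotating back recovers the text
   text -: dec
1

Of course, this only defeats casual inspection; anyone who suspects the scheme can break it.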
Some Embedding Experiments
It should be easy to hide a very small amount of text undetectably in a very large image, but it would be harder to hide a large amount of text in a smaller image. With this in mind, I started with a fairly small image as shown here:
This file is fairly small, about 180k for the 226 by 330 image.
   #fread 'GreatlyReducedGWoK.png'
180914
   $read_image 'GreatlyReducedGWoK.png'
226 330
For our text, let's use this version of the Gettysburg address:
   text=. 0 : 0
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth.
Abraham Lincoln
November 19, 1863
)
   $text
1512
Another thing to consider is that jpeg files are lossy whereas other formats are not. If we hope to retrieve text we hide in a file, it may be safer to use a lossless format like .bmp or .png.
Here we look at the same file in three different formats to see that the lossless ones are the same but the lossy one differs.
   $gwokb=. read_image 'GreatlyReducedGWoK.bmp'
226 330
   $gwokj=. read_image 'GreatlyReducedGWoK.jpg'
226 330
   $gwokp=. read_image 'GreatlyReducedGWoK.png'
226 330
   gwokb -: gwokj   NB. BMP differs from JPG
0
   gwokb -: gwokp   NB. But the two lossless ones are the same
1
Put Text at Beginning
Let's try the very simplest thing first: simply put the text at the start of the image.
   gwokp1337=. ($gwokp)$(a. i. text),(#text)}.,gwokp   NB. Turn text into numbers, add to beginning of image.
   gwokp1337 write_image 'GreatlyReducedGWoK1337.png'
298320
   gwokp1337=. read_image 'GreatlyReducedGWoK1337.png'
   text -: a.{~(#text){.,gwokp1337   NB. Ensure we can retrieve the text
1
The problem with this becomes evident if we compare the new image with the original because they look different:
We see evidence of the "hidden" text as the absence of the top of the picture.
Distributing Text Evenly
Let's instead evenly distribute the text throughout the image to see what that looks like.
   #,gwokp
74580
   74580%#text   NB. Distribute it evenly -> how many pixels per character?
49.3254
   NB. So we could distribute it evenly, with rounding, every 49.3254 pixels.
   >./ixs=. <.0.5+](74580%#text)*i.#text   NB. Check highest index number
74531
   $ixs
1512
   gwokp13372=. ($gwokp)$(a. i. text) ixs},gwokp   NB. Insert text evenly
   gwokp13372 write_image 'gwokp13372.png'
298320
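Since the spacing is deterministic, extraction just recomputes the same indices and indexes into the saved image; a quick check, assuming the variables above are still defined:

   text -: a. {~ ixs { , read_image 'gwokp13372.png'   NB. Recover the embedded text
1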
This is a little better but there is still a noticeable distortion of the image:
We see the text showing up as regularly-spaced white dots.
Distributing Text Randomly
Let's try to hide it better by putting the text in random locations. We will have to deal with setting the random seed to a known value to subsequently extract the text but let's first just see how this would look.
   $ixs=. (#text)?@$#,gwokp
1512
   usus ixs   NB. Min, max, mean, standard deviation
38 74469 37925.2 21683.3
   gwokp13373=. ($gwokp)$(a. i. text) ixs},gwokp   NB. Insert text randomly
   gwokp13373 write_image 'gwokp13373.png'
298320
This is a little better but the hidden text still distorts the image:
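As for recovering the randomly placed text later: setting the seed with 9!:1 before generating the indices makes them reproducible, and using deal (dyadic ?) instead of roll guarantees the indices are distinct, so no character overwrites another. A sketch, with 16807 as an arbitrary agreed-upon seed:

   9!:1 ] 16807                NB. Set the random seed to an agreed value
   $ixs=. (#text) ? #,gwokp    NB. Deal: 1512 distinct indices
1512
   9!:1 ] 16807                NB. The same seed...
   ixs -: (#text) ? #,gwokp    NB. ...regenerates the same indices for extraction
1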
Getting More Sophisticated: Hiding in 3D
One thing to consider is that a color image may also be seen as a three-dimensional array if we separate out the RGB planes. Fortunately, we have a utility to convert between the two and three-dimensional representations.
rgb=: 4 : '(x$256)&#:`(256&#.)@.((x = [: {: $) *. 3 = [: # $) y'
This converts between the two differently dimensioned versions and is its own inverse.
   $3 rgb gwokp
226 330 3
   $3 rgb 3 rgb gwokp
226 330
This changes the shape of our array and triples its number of elements, potentially giving us more hiding places. Let's go back to evenly distributing the text, but in the 3D image.
   #,3 rgb gwokp
223740
   (#text)%~#,3 rgb gwokp
147.976
   ixs=. <.0.5+((#text)%~#,3 rgb gwokp)*i.#text
   gwokp13374=. ($3 rgb gwokp)$(a. i. text) ixs},3 rgb gwokp   NB. Insert text evenly
   gwokp13374 write_image 'gwokp13374.png'
This gives a slight improvement but is still noticeable in the image:
We still have regularly-spaced dots but at least they vary in color so are not quite as noticeable.
Future Directions
It seems we should develop more sophisticated hiding methods. One interesting possibility would be to combine more sophistication with the "text appended to the end of the file" method to append J code to retrieve arbitrarily complex embeddings of text in an image.
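One classic step up in sophistication is least-significant-bit embedding: spread the text across the low-order bits of many channel values, where a change of at most 1 in a value of 0-255 is imperceptible. Here is a minimal sketch of the idea using the 3D form (our own illustration, not a tested utility; the output filename is hypothetical):

   bits=. , (8#2) #: a. i. text    NB. 8 bits per character: 12096 bits in all
   v=. , 3 rgb gwokp               NB. 223740 flattened channel values
   n=. #bits
   front=. n {. v
   hid=. ($3 rgb gwokp) $ (bits + front - 2 | front) , n }. v  NB. Overwrite LSBs only
   text -: a. {~ #. _8 ]\ 2 | n {. , hid   NB. Low bits, regrouped 8 per character
1
   (3 rgb hid) write_image 'gwokpLSB.png'  NB. Back to the 2D packed form for writing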
Advanced topics
We continue the discussion of multi-threading, started in the Beginner's Regatta above, and move on to using compiled modules to achieve better performance.
Pooling Threads
We look at the improvement gained by pooling threads. Notice that the first input array is ten times the size used in the previous examples.
From: Ak <akinboo@gmail.com> Date: Wed, May 29, 2024 at 3:53 PM
An idea.
(0 T. ])"0 (2# 1 2 3 4) NB. Spinup 8 threads in 4 pools. aa=: 30000 4000?.@$0 NB. Larger array bb=: 4000 2000?.@$0 5 (6!:2,7!:2@])' cc=: aa +/ . * bb' 5 (6!:2,7!:2@])' cc=: ; aa t.'''' +/ . * bb' 5 (6!:2,7!:2@])' cc=: ; aa +/ . * t. (<''worker'';1) bb'
With a larger set I notice a reported (~5%) time improvement and a dramatic space improvement, 5.5e8 to 3 KB, for 5 runs each.
Ak
---
The 5% time improvement is nothing special; it has been remarked on the J forum that a speedup of less than, say, a factor of two is unlikely to be very significant. The space improvement does appear to be substantial, but I could not replicate these results on my PC.
   aa=: 30000 4000?.@$0   NB. Larger array
   bb=: 4000 2000?.@$0
   1 T. ''
0
   10 (6!:2,7!:2@])'aa +/ . * bb'   NB. No worker threads
13.0457 5.36873e8
   (0 T. ])"0 (2# 1 2 3 4)   NB. Spin up 8 threads in 4 pools.
1 2 3 4 5 6 7 8
   10 (6!:2,7!:2@])'aa +/ . * bb'   NB. 8 worker threads in 4 pools
13.287 5.36873e8
Henry Rich responded to the pooling improvement (which I did not see) with this explanation:
From: Henry Rich <henryhrich@gmail.com> Date: Wed, May 29, 2024 at 4:00 PM
Interesting. If you have N threads in threadpool 0, x +/ . * y automatically gives each thread (#x)%N rows of x to calculate the product of. This is pretty good until the number of rows gets 'too small', probably ~128 or so.
Henry Rich
---
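By Henry's rule of thumb, the example above is nowhere near the 'too small' regime, as a quick check shows. Note also that the automatic splitting he describes applies to threadpool 0, whereas my replication above put its eight threads in pools 1 through 4, which may explain why it showed no speedup.

   30000 % 8   NB. Rows of aa per thread with 8 workers in pool 0: far above ~128
3750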
Compiled Code Improvements
I did not try to replicate any of the following reported results as they are a bit involved.
Objective C on Apple Using ChatGPT
Thomas McGuire started us down this particular avenue for performance improvement.
From: Thomas McGuire <tmcguirefl@gmail.com> Date: Fri, Jun 7, 2024 at 8:42 AM
Taking Raul’s statement to heart I did some more digging to see if I could use Apple’s Metal Performance Shader (MPS) API within J. As proof of concept I used the FFI available in the ‘dll’ library.
Not being an Apple developer, the first problem I ran into with Apple's Metal API is that Apple likes to steer you toward Swift. There are some Metal APIs that can be used directly from C/C++, but not the MPS API. It is available to Objective-C, which, it turns out, means that if you write a typical C-style function call, it is accessible to J via the FFI facility in J.
There are some limitations. The MPS API only allows an 8000 x 8000 maximum for matrix multiplication. It also only uses 4-byte floats (aka float32). This is different from J, which uses double as the underlying floating-point representation. The Apple BLAS library also uses doubles, which makes it easier to integrate right into the source code, as Elijah Stone had done.
I have a confession to make. Not being a Swift or Objective-C programmer, I took an example Swift program, pulled out the salient features I needed, and then asked ChatGPT to translate that into Objective-C. It didn't take much hand modification from there to get the MPS version of matrix product up and running and tested in a C calling program. I asked ChatGPT to make some J test code and write the FFI, but it failed miserably and I had to do that myself.
If you have an M2 machine and are interested, I made it installable into J via the 'install' command from GitHub.
   install 'github:tmcguirefl/JmpsMP@main'
   load 'tmcguirefl/mpsMP/mpsmp.ijs'
then run:
   testMP 0
Here are my results, with impressive timing on the MPS multiply:
Regular J matrix product
1018.25 1013.55 999.638 1000.76 1015.25 1032.3 1006.25 1022.34 1019.53 1014.35
Now with Apple Metal Performance Shader API
1018.25 1013.55 999.638 1000.76 1015.25 1032.3 1006.25 1022.33 1019.53 1014.35
Standard J MP : 6.08373 6.71108e7
MPS MP : 0.066262 5.45267e8
Yes, that appears to be about a 100x speedup. So I guess we know why GPU processing is favored for neural network/AI stuff.
All the code is available on Github so I won’t reproduce it here except for the links.
Another Version of Objective C on Apple
The compilation route was extended with this version from LdBeth.
From: LdBeth <andpuke@foxmail.com> Date: Sat, Jun 8, 2024 at 12:17 PM
Actually, I think I have enough experience with ObjC to make this work.
I don't have an M2, only an Intel MacBook that runs macOS Monterey, but the Metal API is still available for the version of the OS I have installed. And I just managed to write an ObjC matrix multiplication example that works.
> It also only uses 4 byte floats (aka float32).
Supported data types are listed at
https://developer.apple.com/documentation/metalperformanceshaders/mpsdatatype?language=objc
Yes, there are complex16, complex32, float16, float32, uint32, uint64, but no float64.
ldbeth
More Details on Later Version
This was followed by some more detail, including some code listings.
From: LdBeth <andpuke@foxmail.com> Date: Sat, Jun 8, 2024 at 9:16 PM
Since the linked repo does not contain the source code of the shared lib for some reason, I just rolled my own, and, not a surprise, the performance gain is noticeable on an Intel Mac.
Here is the full code: File:Libmps.ijs. The command to compile the shared library is

   clang -framework Foundation -framework CoreGraphics -framework Metal -framework MetalPerformanceShaders -fobjc-arc -shared -o libmps.dylib libmps.m
---
The code above is a J wrapper around the Objective-C code; when run, it should write out the file libmps.m. The compiled library can be used from this J code: File:Mps.ijs.
Further Discussion on Using Compiled Code
From: Thomas McGuire <tmcguirefl@gmail.com> Date: Sun, Jun 9, 2024 at 7:23 AM Subject: Re: [Jforum] J matrix multiplication performance To: <forum@jsoftware.com>
Yeah, I had a brain freeze using my mail reader and ended up sending a good email to Raul personally instead of to the forum. When I fixed that, I forgot the links to the code. It took me a couple of days to figure that out and send the links in, sorry. But I like your Objective-C better than mine.
So the next question is how to integrate it under the covers. Elijah Stone's Apple libblas.dylib usage can be added at compile time by checking the architecture being built. This is nice in that the runtime only has the code for the particular architecture.
The harder part is adding MPSMatrixMultiplication to the J source code and deciding when to use MPS vs libblas for a particular matrix product. I was thinking of some combination of matrix size and print precision: under a certain size the interpreter would use the libblas or legacy approach; above that size, the mpsMatrixMult code. Then, if you wanted to force a libblas (or legacy) approach so you could gain precision over speed, the interpreter could look for a certain threshold in the print precision. With something like that I can program in J and let the interpreter decide how to carry out the calculation.
I have written some J code to overcome the 8000x8000 limitation MPS has by breaking the multiplication into pieces, using an algorithm I found on Wikipedia (see the divide-and-conquer section): https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm
My J implementation is attached: File:RMP.ijs. It's called rMP for recursive Matrix Product. A 12000x14000 by 14000x13000 matrix product took 4.26 seconds with my version of the mpsMP. When I get a chance I will use it with LdBeth's MPSMatrixMul code.
Tom McGuire
---
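For readers who do not want to fetch the attachment, here is a minimal sketch of the divide-and-conquer idea (our own illustration, not Tom's RMP.ijs): split the largest dimension in half until both operands fit under the size limit, then fall through to the base multiply, which in Tom's setting would be the MPS call.

   LIMIT=: 8000
   rmp=: dyad define
     'm k'=. $ x
     n=. {: $ y
     if. LIMIT >: m >. k >. n do. x +/ . * y return. end.  NB. Base case: both fit
     if. m >: k >. n do.              NB. m largest: split the rows of x
       h=. <. -: m
       ((h {. x) rmp y) , (h }. x) rmp y return.
     end.
     if. n >: k do.                   NB. n largest: split the columns of y
       h=. <. -: n
       (x rmp h {."1 y) ,. x rmp h }."1 y return.
     end.
     h=. <. -: k                      NB. k largest: sum two partial products
     ((h {."1 x) rmp h {. y) + (h }."1 x) rmp h }. y
   )

With LIMIT set to the MPS bound, (12000 14000 ?.@$ 0) rmp 14000 13000 ?.@$ 0 would decompose Tom's example into blocks that each fit under 8000 on a side.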
Using BLAS without Compiled Code
Finally, Bill Lam shows us how to avoid the compilation by invoking BLAS directly from within J.
From: bill lam <bbill.lam@gmail.com> Date: Sun, Jun 9, 2024 at 5:14 AM
For USE_OPENMP binaries, you can test with BLAS enabled. Save this to a script gm1.ijs:
   A=: 3000 4000?.@$0
   B=: 4000 2000?.@$0
   NP=: */ 3000 4000 2000
   NB. using blas
   0 (9!:58)"0 i.3   NB. +/ .* always use blas
   t1=: 6!:2'c2=: A+/ .*B'
   echo ' sec',~ ":t1
   echo 'blas ',' GFlop ',~ 0j3": 2*NP%(t1)*1e9
Start jconsole with the OMP_NUM_THREADS environment variable set:
bill:~% OMP_NUM_THREADS=1 jc ~/gm1.ijs
13.1143 sec
blas 3.660 GFlop
^D
bill:~% OMP_NUM_THREADS=4 jc ~/gm1.ijs
3.96833 sec
blas 12.096 GFlop
^D
bill:~% OMP_NUM_THREADS=8 jc ~/gm1.ijs
2.85146 sec
blas 16.833 GFlop
---
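As a sanity check on the reported numbers: a matrix product of shapes m by k and k by n costs about 2*m*k*n floating-point operations, which is where the 2*NP term in Bill's script comes from. The single-thread figure follows directly:

   NP=: */ 3000 4000 2000
   0j3 ": 2 * NP % 13.1143 * 1e9   NB. flops % (seconds * 1e9) = GFlop/s
3.660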
We see how increasing the number of threads improves performance. Unfortunately, Bill is using Linux, and I was not able to get this to work under Windows.
Learning and Teaching J
We look at some thoughts on how quickly development ought to be possible, and at a study of the parts of the brain used when coding.
Development Velocity
In this blog article, the writer surveyed 100 software engineers from some of the most well-known tech companies like Meta, Amazon, and so on. The single question he posed to them was "What’s stopping you from shipping faster?"
He was surprised to find these two common responses:
1. Almost no one had to think about it. People shot back with an immediate answer, as though I had just asked about a pain that was always on their mind, but never surfaced by another person.
2. There were a surprisingly high number of people for whom the reason had to do with build, compile, and deployment times. Devs really hate waiting for things.
Those of us in the interpreted languages community are probably not surprised by this since absence of compilation is a reason many of us prefer our dynamic environments.
Breaking down the responses and categorizing them brought up these common complaints.
Dependency bugs
Software is a delicate house of cards, with open-source libraries all the way down. In a lot of software, the source code consists mostly of dependencies, and very often your most frustrating bug is not even your fault. Jack, ex-Microsoft and ex-Bridgewater, lists his #1 barrier to shipping faster:
>Hitting random mysterious bugs with libraries that require reading tons of old stack overflow links and Github issues
Complicated codebase
This one is closest to my heart. In the last decade, as dev speed has evolved into a necessary competitive advantage, nearly every software team has come to suffer from code sprawl. The logic is sound:
Growing startup → must ship faster → no time to write docs → early engineers churn as company grows → new engineers need to navigate the mess.
Maria, an SDE at Amazon, says this about her team's codebase:
>There is so much undocumented in our service, including poor records of new features, nonexistent or outdated info on our dependencies, and even essential things like best practices for testing, that a lot of time is wasted in syncs trying to find the right information
Interestingly, this problem only gets worse with time -
>Nobody has time to update the documentation, which creates a vicious cycle where more features are shipped, causing the docs to be more outdated and useless, meaning no one updates or uses them, and so on. A lot of our docs haven’t been updated in years
There were also common complaints about the delays introduced by waiting for specifications to be finalized, waiting for stakeholder approval, and writing tests.
Another set of complaints, having to do with the development process, may also resonate with those of us used to developing in an interpreted environment:
Deployment/build speed
This one was quite surprising for me because I did not realize how many people suffered from this. I have a whole theory around the toxicity of idle time: time that is not a break but not productive either. Waiting for deployments is exactly that.
Deploy times can often run into hours. Here's what Aryan, a developer at Arista Networks, had to say about what slows him down:
> So for me at work it takes like 3-4 hours for a full build to complete that I can run all the tests for a PR on, and those tests can take anywhere from 1d to 3d to get done, even with a lot of optimization
So even a small change, like modifying a filename, can get tedious.
The article is worth reading in full because it brings to light some very common things that slow down development, not all of which we are saved from by using more dynamic environments.
Reading Code Differs from Reading Language
This article summarizes results of brain scans done on people who were working on code while being scanned. The conclusion is that "interpreting code activates a general-purpose brain network, but not language-processing centers."
This is interesting to J developers because of our fascination with language in general, especially as shown by the language-oriented vocabulary adopted by J where we talk about "nouns", "verbs", and "adverbs".
As the article states, this result is somewhat surprising because
In some ways, learning to program a computer is similar to learning a new language. It requires learning new symbols and terms, which must be organized correctly to instruct the computer what to do. The computer code must also be clear enough that other programmers can read and understand it.
In spite of those similarities, MIT neuroscientists have found that reading computer code does not activate the regions of the brain that are involved in language processing. Instead, it activates a distributed network called the multiple demand network, which is also recruited for complex cognitive tasks such as solving math problems or crossword puzzles.
However, although reading computer code activates the multiple demand network, it appears to rely more on different parts of the network than math or logic problems do, suggesting that coding does not precisely replicate the cognitive demands of mathematics either.
“Understanding computer code seems to be its own thing. It’s not the same as language, and it’s not the same as math and logic,” says Anna Ivanova, an MIT graduate student and the lead author of the study.
One possible limitation of the study was that it was done on relatively inexperienced programmers.
The researchers say that while they didn’t identify any regions that appear to be exclusively devoted to programming, such specialized brain activity might develop in people who have much more coding experience.
“It’s possible that if you take people who are professional programmers, who have spent 30 or 40 years coding in a particular language, you may start seeing some specialization, or some crystallization of parts of the multiple demand system,” Fedorenko says. “In people who are familiar with coding and can efficiently do these tasks, but have had relatively limited experience, it just doesn’t seem like you see any specialization yet.”
It would be interesting to repeat this study with people who are more experienced to see if the results differ.