NYCJUG/2018-01-09
Meeting Agenda for NYCJUG 20180109
4. Learning and teaching J: still need to do "The Language Hoax" by John McWhorter.
In the world, who knows not to swim goes to the bottom. - G. Herbert, Outlandish Proverbs, 1640
Beginner's regatta
Learning How to Concatenate Rows
From: Nick S <simicich@gmail.com> Date: Mon, Jan 8, 2018 at 11:26 AM Subject: [Jprogramming] argument pairs To: Jsoftware Programming forum <programming@jsoftware.com>
I want to shuffle arrays, such that I pair them, a member of 1 with a member of the next, as:
   'abc' ,"_1 'def'
ad
be
cf
   ,'abc' ,"_1 'def'
adbecf
But what I want to do is to set things up so the same sort of pairing with this as a right argument:
6 3 $ 'defghijklmnopqrstu'
Now, honestly, I had worked on this for a very long time for such a simple thing. I had tried every variation I could think of. Then I thought about it for a couple of days, read the language doc again, and finally decided to ask the question here. So I was formulating the question and, of course, this is part of another problem, the details of which are irrelevant here, so I was trying to create an isolated example, and I typed this expecting it to fail. I thought I had tried this before, but I had mostly worked with ravel, not ravel items, because the statement in the book implied (to me, anyway) that the two verbs were equivalent except with respect to rank.
   ,'abc' ,."1 (6 3 $ 'defghijklmnopqrstu')
adbecfagbhciajbkclambncoapbqcrasbtcu
Exactly what I wanted. Of course, in practice, the dimensions will be variable but the length of the vector will always be equal to the second dimension of the y table. So the book says that ,. is the same as ,"_1 - but I can't figure out how to do what I did with , (ravel) no matter what I do with the ranks. I think this only works because of what it says in the description, that ,. (ravel items) joins items of x to items of y.
Am I wrong about that? Is there a way to do this item by item lacing with ravel? Or can it only be done with ravel items? If, in fact, the verbs are not equivalent except with respect to rank, is it reasonable to reword the statement to indicate that somehow so that others don't chase down the blind alley? I can certainly understand that an unranked ravel items is the same as a ravel items with the rank specified as _1.
From: Henry Rich <henryhrich@gmail.com> Date: Mon, Jan 8, 2018 at 11:39 AM
,. is equivalent to ,"_1 :
   ,'abc' ,"_1"1 (6 3 $ 'defghijklmnopqrstu')
adbecfagbhciajbkclambncoapbqcrasbtcu
Did you try two ranks?
Henry Rich
From: Nick S <simicich@gmail.com> Date: Mon, Jan 8, 2018 at 12:14 PM
I tried things like ,"_1 1, and variations on that but I didn't realize you could stack ranks like that, although I guess I should have. I guess that I understand very little about ranks, about all I know how to do reliably is to select columns instead of rows or compress a table in the non-default direction. Since it is a conjunction, though, I should have understood that I could have done this, since if it was a verb like
testv =: ,"_1
I would have expected to be able to say testv"1 -- which is exactly the same thing.
Thanks again. The explanation of rank in nuVoc is clearer to me than the explanation in the old voc, which I am still generally using. This is a good reason to change.
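For readers more at home in Python, the lacing that ,'abc' ,."1 performs on the 6 by 3 table can be sketched roughly as follows. This is only an illustrative analogue and not part of the forum thread; the function name lace is made up for the example.

```python
def lace(x, table):
    """Pair each item of x with the matching item of every row, then flatten."""
    return "".join(a + b for row in table for a, b in zip(x, row))


letters = "defghijklmnopqrstu"
# Equivalent of 6 3 $ letters: six rows of three characters each.
table = [letters[i:i + 3] for i in range(0, len(letters), 3)]

laced = lace("abc", table)
print(laced)
```

The outer loop over rows plays the part of the "1 rank, and the inner zip plays the part of ,"_1 (pairing item with item).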
Show-and-tell
Concatenating CSV Files with Differing Headers
A client has generated several CSV files for different time periods. Each file has industry groups as its headers but these groups have changed over time, so the headers differ from file to file. How do we put all of these together with the superset of the headers, keeping the columns from each aligned under its proper header?
A J Solution
Here’s a J solution: first we set the name of the directory and get the names of the files based on their common prefix:
NB.* CSVaggregator.ijs: Aggregate many similar CSV files but with some change in headers between them,
NB. so we aim to produce one aggregate file with the superset of the columns as the x-axis,
NB. and all the time series concatenated horizontally as the y-axis.

EG_usage_=: 0 : 0
   dd=. 'C:\amisc\Clarifi\Arialytics\'   NB. Top-level folder
   flnms=. {."1 dir dd,'SubindD*.csv'    NB. All relevant CSV files to be combined
   load 'dsv'                            NB. Delimiter-separated values reader/writer
)

NB.* getFirstLine: try very hard to get the first line of a text file.
getFirstLine=: 3 : 0
   (10{a.) getFirstLine y           NB. Default to LF line-delimiter.
:
   if. 0=L. y do. y=. 0;10000;y;'' end.
   'st len flnm accum'=. 4{.y       NB. Starting byte index, length to read, file name,
   len=. len<.st-~fsize flnm        NB.  any previous accumulation.
   continue=. 1                     NB. Flag indicates OK to continue (1) or no
   if. 0<len do. st=. st+len        NB.  header found (_1), or still accumulating (0).
       if. x e. accum=. accum,fread flnm;(st-len),len do.
           accum=. accum{.~>:accum i. x [ continue=. 0
       else.
           'continue st len flnm accum'=. x getFirstLine st;len;flnm;accum
       end.
   else. continue=. _1 end.         NB. Ran out of file w/o finding x.
   continue;st;len;flnm;accum
NB.EG hdr=. <;._1 TAB,(CR,LF) -.~ >_1{getFirstLine 0;10000;'bigFile.txt' NB. Assumes 1e4>#(1st line).
)
Now we make two passes through the files: first to gather and consolidate all the headers, then to aggregate the files under the superset of column names.
aggAllCSVs=: 3 : 0
   'dd flnms'=. y
   firstLns=. ;_1{&>getFirstLine&.>(<dd),&.>flnms   NB. All column headers
   firstLns=. (',') (I. LF=firstLns)}firstLns       NB. Join together as one big CSV
   firstLns=. <;._1 ',',firstLns-.CR                NB. Remove spurious trailing CR, itemize
   firstLns=. (' '={.&>firstLns)}.&.>firstLns       NB. Remove spurious leading spaces
   firstLns=. ~.firstLns-.a:                        NB. Unique column headers without empties
   firstLns=. ({.firstLns),/:~}.firstLns            NB. Alphabetize all but 1st col
   flAll=. ,:firstLns                               NB. Start w/all headers
   for_flnm. flnms do.
       fl=. (',';'') readdsv flnm
       fl=. ((' '={.&>0{fl)}.&.>0{fl) 0}fl          NB. Remove spurious leading spaces
       fl=. ((0{fl)-.&.><CRLF) 0}fl                 NB. Remove spurious CRs & LFs
       fl1st=. 0{fl                                 NB. Get this file’s headers.
       assert. (#fl1st)=+/fl1st e. firstLns         NB. Expansion will not get length error
       fl=. ((]-.>./) fl1st i. firstLns ){"1 fl     NB. Order these columns according to all
       fl=. (firstLns e. 0{fl)#^:_1"1 fl            NB. Expand for columns missing in this one.
       flAll=. flAll,}.firstLns 0}fl                NB. Tack new stuff on end.
   end.
)
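The two passes can be hard to follow in the J code above, so here is a small Python sketch of the same idea: build the superset of headers (first column kept in place, the rest alphabetized), then place each file's cells under that superset with blanks where a column is missing. The function names and the sample headers are invented for illustration, and strip() is a simplification of the J code's handling of leading spaces.

```python
def superset_headers(header_rows):
    """Unique, non-empty headers: first column kept first, the rest sorted."""
    seen = []
    for row in header_rows:
        for name in row:
            name = name.strip()  # simplification: the J code drops one leading space
            if name and name not in seen:
                seen.append(name)
    return seen[:1] + sorted(seen[1:])


def align_rows(all_headers, file_headers, rows):
    """Put each row's cells under the superset headers, blank where missing."""
    position = {name: i for i, name in enumerate(file_headers)}
    return [[row[position[h]] if h in position else "" for h in all_headers]
            for row in rows]


# Hypothetical headers from two periods; only the date column is shared by name.
headers_a = ["Date", "Banks", "Autos"]
headers_b = ["Date", "Autos", "Software"]

all_headers = superset_headers([headers_a, headers_b])
aligned = align_rows(all_headers, headers_b, [["2008-01-02", "0.01", "0.03"]])
print(all_headers)
print(aligned)
```

The blank string plays the role of the empty boxes that the J expansion (#^:_1) inserts for columns a file lacks.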
A Python Solution
However, this Python code from StackOverflow does the job perhaps more simply:
#* MergeCSVs.py: merge CSV files based on column names, not positions: in the
# merged CSV, a cell should be empty if it comes from a line which did not have
# the column of that cell.
# From https://stackoverflow.com/questions/26599137/merge-csvs-in-python-with-different-columns
import csv

inputs = ["SubIndD_19850101to19971231_Return - daily - sub industry.csv",
          "SubIndD_19980101to20071231_Return - daily - sub industry.csv",
          "SubIndD_20080101to20180103_Return - daily - sub industry.csv"]

# Determine the field names from the top line of each input file: since
# we need to specify all the field names in advance to DictWriter, we loop
# through all CSV files twice: once to find all the headers, and once to
# process the columns.
fieldnames = []
for filename in inputs:
    with open(filename, "r") as f_in:
        reader = csv.reader(f_in)
        headers = next(reader)
        for h in headers:
            if h not in fieldnames:
                fieldnames.append(h)

# Then copy the data
with open("out.csv", "ab") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=fieldnames)
    for filename in inputs:
        with open(filename, "r") as f_in:
            reader = csv.DictReader(f_in)  # Uses the field names in this file
            for line in reader:
                # At this point, line is a dict with the field names as keys, and the
                # column data as values. You can specify what to do with blank or unknown
                # values in the DictReader and DictWriter constructors.
                writer.writerow(line)
However, as given, this code fails to write out the aggregate file with the headers, so I had to add this paragraph before the “Then copy the data” section:
# Start with aggregate header
CSVfldNms=','.join(fieldnames)
CSVfldNms=CSVfldNms+'\n'
initHdr=open("out.csv","w")
initHdr.write(CSVfldNms)
initHdr.close()
So, all in all, the two sets of routines are about the same size, with the J code surprisingly exceeding the Python at 48 to 39 lines. The respective utility, usability, or readability of the two solutions is beyond our scope here.
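As an aside, on Python 3 the header row can be written with csv.DictWriter.writeheader(), and the csv module expects files opened in text mode with newline="". The following is a minimal sketch under those assumptions, not the code used at the meeting; the function name merge_csvs and the two tiny sample files are invented for the demonstration.

```python
import csv
import os
import tempfile


def merge_csvs(inputs, out_path):
    """Merge CSV files by column name; cells a file lacks are left empty."""
    # Pass 1: collect every header, in order of first appearance.
    fieldnames = []
    for filename in inputs:
        with open(filename, newline="") as f_in:
            for h in next(csv.reader(f_in)):
                if h not in fieldnames:
                    fieldnames.append(h)
    # Pass 2: write the header once, then copy the rows by name.
    with open(out_path, "w", newline="") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=fieldnames)
        writer.writeheader()
        for filename in inputs:
            with open(filename, newline="") as f_in:
                for line in csv.DictReader(f_in):
                    writer.writerow(line)
    return fieldnames


# Demonstration with two made-up files in a temporary directory.
workdir = tempfile.mkdtemp()
first = os.path.join(workdir, "a.csv")
second = os.path.join(workdir, "b.csv")
with open(first, "w", newline="") as f:
    f.write("Date,Banks\n1985-01-02,0.5\n")
with open(second, "w", newline="") as f:
    f.write("Date,Software\n1998-01-02,0.7\n")

merged_path = os.path.join(workdir, "out.csv")
merge_csvs([first, second], merged_path)
with open(merged_path, newline="") as f:
    merged_rows = list(csv.reader(f))
print(merged_rows)
```

Opening the output with "w" rather than appending also means a rerun starts from a clean file.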
Advanced topics
Simple GUI Tip: Overwriting Last Line in JQt Session
From: Ian Clark <earthspotty@gmail.com> Date: Sat, Jan 6, 2018 at 1:31 AM Subject: Re: [Jprogramming] overwite? To: programming@jsoftware.com
In jqt you can take control of the term window, fetch its current contents, cut them back and rewrite the term window. Likewise other windows in the jqt IDE.
This link is a reference to help you do that: http://code.jsoftware.com/wiki/Guides/Window_Driver/Session_Manager
As Nick S says, there is also an addon which demonstrates the technique, viz.
   load '~addons/general/misc/prompt.ijs'
   '2001 5 23' prompt 'start date: '
start date: 2001 5 12      ---overwriting 23 with 12
2001 5 12
And here's a rough'n'ready program of mine to cut back lines or bytes:
reterm=: 3 : 0
NB. rewrite contents of term window
NB. cutting back by y bytes (y<0)
NB. or by y lines (y>0)
z=. >{:{.wd'sm get term'
if. y<0 do.
  z=. y }. z
else.
  for_i. i.y do.
    z=. z {.~ z i: LF
  end.
end.
wd'sm set term text *',z
)
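The string handling inside reterm (leaving aside the wd calls that fetch and set the window text) can be sketched in Python as below. This is only a rough analogue for readers unfamiliar with }. and i: and the name cut_back is made up for the example.

```python
def cut_back(text, y):
    """Drop the last -y characters if y < 0, else cut back y lines."""
    if y < 0:
        return text[:y]
    for _ in range(y):
        last_lf = text.rfind("\n")
        if last_lf == -1:
            break  # no line feed left: nothing more to cut
        text = text[:last_lf]  # keep everything before the last line feed
    return text


session = "line one\nline two\nline three"
print(cut_back(session, 1))
print(cut_back(session, -6))
```

As in the J version, each positive step removes the last line feed and everything after it, so the text is left ending on the previous line.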
From: roger stokes <rogerstokes811@gmail.com> Date: Sat, Jan 6, 2018 at 5:08 AM
Ian,
Many thanks for that very valuable information
Regards
From: Ian Clark <earthspotty@gmail.com> Date: Sat, Jan 6, 2018 at 6:50 AM
You're very welcome, Roger.
Yeah… only discovered it myself recently – although J602 had comparable facilities.
Window Driver, aka JWD or wd, repays an evening's study. The first page to consult is here:
http://www.jsoftware.com/jwiki/Guides/Window%20Driver/Command%20Reference
…all the rest can be accessed from the Other Documents section.
If all you need is a utf-8 text window or two, why muck about with your own Qt windows when you can make the IDE sit up and beg? JWD really takes the pain out of Qt. It's so intuitive it's ideal for J beginners to pick up and be creative with. For a bit of 1970s nostalgia, it can even do character-based animations.
Learning and Teaching J: Language in General and for Programming
We look at the start of an essay about linguistic evolution, then a study of the bug-proneness of languages, ultimately grouped into types of language and types of bugs.
The Rise and Fall of the English Sentence
The surprising forces influencing the complexity of the language we speak and write.
BY JULIE SEDIVY, NOVEMBER 16, 2017
“[[[When in the course of human events it becomes necessary for one people [to dissolve the political bands [which have connected them with another]] and [to assume among the powers of the earth, the separate and equal station [to which the laws of Nature and of Nature’s God entitle them]]], a decent respect to the opinions of mankind requires [that they should declare the causes [which impel them to the separation]]].” —Declaration of Independence, opening sentence
An iconic sentence, this. But how did it ever make its way into the world? At 71 words, it is composed of eight separate clauses, each anchored by its own verb, nested within one another in various arrangements. The main clause (a decent respect to the opinions of mankind requires …) hangs suspended above a 50-word subordinate clause that must first be unfurled. Like an intricate equation, the sentence exudes a mathematical sophistication, turning its face toward infinitude.
To some linguists, Noam Chomsky among them, sentences like these illustrate an essential property of human language. These scientists have argued that recursion, a technique that allows chunks of language such as sentences to be embedded inside each other (with no hard limit on the number of nestings) is a universal human ability, perhaps even the one uniquely human ability that supports language. It’s what allows us to create—literally—an infinite variety of novel sentences out of a limited inventory of words.
But that leads to a curious puzzle: Complex sentences are not ubiquitous among the world’s languages. Many languages have little use for them. They prefer to string together simple clauses. They may even lack certain words such as relative pronouns that and which or connectors like if, despite, and although—these words make it possible to link clauses together into larger sentences. Allegedly, the Pirahã language along the Maici River of Brazil lacks recursion altogether. According to linguist Dan Everett, Pirahã speakers avoid linguistic nesting of all kinds, even in structures such as John’s brother’s house. (Instead, they would say something like: Brother’s house. John has a brother. It is the same one.)
This can’t be pinned on biological evolution. All evidence suggests that humans around the world are born with more or less the same brains. Abundant childhood exposure to a language with layered sentences practically guarantees their mastery. Even adult Pirahã speakers, who have remained unusually isolated from European languages, pick up the trick of complex syntax, provided that they spend enough time interacting with speakers of Brazilian Portuguese, a language that offers an adequate diet of embedded structures. Sentences like the opening line of the Declaration of Independence simply do not occur in conversation.
More useful is the notion of linguistic evolution. It’s the languages themselves, rather than the brains, that have evolved along different paths. And just as different species are shaped by adaptations to specific ecological niches, certain linguistic features—like sentence complexity—survive and thrive under some circumstances, whereas other features take hold and spread within very different niches.
Brief Summary
This essay then proceeds to argue the case that the invention of writing has resulted in an acceleration of the evolution of language. The major part of the argument is that language complexity, as measured by recursion and depth of nesting of clauses, is significantly higher in cultures with written language than in those which are primarily oral.
This is offered as evidence:
The invention of writing sparked certain innovations such that by 1800 B.C., Akkadian texts already exhibited complex sentences that rival the prose of Henry James in their complexity. One such sentence (from Hammurabi’s Code of Law) proceeds like this: [If, [after the sheep and goats come up from the common irrigated area [when the pennants announcing the termination of pasturing are wound around the main city gate], the shepherd releases the sheep and goats into a field], the shepherd shall guard the field].
The argument continues that, even with such an ancient example of complex language, there is a measurable increase in the complexity of language through history.
Drivers of Evolution
Written language initially was mostly used to record spoken language but, as it evolved to aim at readers rather than listeners, textual language diverged from spoken language, leading to new grammatical tools supporting text but not speech. However, there are many well-written speeches - the Gettysburg Address springs to mind - that started as text.
Re-iterating a theme well-known to functional array-language adherents, the author also sees the trend toward greater complexity of language as being driven
...under the whips of two tyrants: time and memory. Our memories aren’t nearly capacious enough to allow us to compose and precompile each sentence before beginning to utter its first syllable. Instead, speaking is like driving with a general sense of the destination, but no clear route planned—we utter the first syllables of a sentence while taking a leap of faith that we’ll be able to choose the right words en route and formulate phrases adequately as the words tumble out of our mouths and bring us to an intersection in our thoughts that demands our next move. This puts an upper bound on complexity. But written text, which can be more deliberately planned out and revised, is able to transcend this.
In fact "...eye-tracking studies show that when we read, we break free of linear time and seize control over the flow of information, our eye movements lurching along at inconsistent speeds and frequently jumping back to earlier parts of a sentence which, during speech, would already be auditory vapor. Such freedoms invite the most glorious excesses of recursion in text."
There is some evidence that reading aids thinking. The author refers to a psychological study in which a researcher "...and her colleagues targeted relative clauses in the passive voice (the dog that was hit by the car), which are exceedingly rare in speech but more abundant in text, even that written for children. They found that heavy readers in the 8-12-year-old range produced such structures more often than children who read less. Even among adults, the production of these sentences was highly correlated with how much text they consumed, suggesting that avid readers are far more likely to transmit complex sentences to future generations."
However, I'm not convinced that promoting the passive voice is a good thing, still less that its use should count as evidence of more complex thinking.
Troubling Implications for Future Thinking
The essay raises the specter of how the language necessary for complex thought could fade away, speculating that if "...certain structures are too rare in speech to be reliably mastered by learners and passed on, then they may fade out within a community of non-readers. Naturally, this raises the question: Could syntactic complexity in literate languages diminish over time, if new technologies (podcasts, video lectures, and audiobooks) tether language more tightly to speech and its inherent limitations?"
In fact texts analyzed by Brock Haussamen reveal that "...the average sentence length in written English has shrunk since the 17th century from between 40-70 words to a more modest 20, with a significant paring down of the number of subordinate and relative clauses, passive sentences, explicit connectors between clauses, and off-the-beaten-path sentence structures."
Contrasting with this potentially gloomy outlook, the essay also points out that literacy is now "a universal basic necessity", so the general level of the ability to handle complex thoughts is probably higher than it has ever been, though the prevalence of absurd conspiracy theories could argue against this as an entirely good thing. Also, one might argue that Haussamen's evidence of shorter sentences in modern times may reflect ongoing linguistic evolution toward a terser form, which leads us to our next section.
Packing a Lot of Meaning into a Little Space
Another way that languages encourage more complex meanings is by packing many meanings into a single word.
Oral languages may avoid pushing the limits of syntax not just because they are bound to speech, but also because they have other ways to express complex meanings. Linguists take great pains to point out that languages with simple sentences erupt with complexity elsewhere: They typically pack many particles of meaning into a single word. For example, the Mohawk word sahonwanhotónkwahse conveys as much meaning as the English sentence “She opened the door for him again.” In English, you need two clauses (one embedded inside the other) to say “He says she’s leaving,” but in Yup’ik, a language spoken in Alaska, you can use a single word, “Ayagnia.”
This is, of course, what many of us see as the core strength of the symbolic, terse array languages.
The essay goes on to give examples of how very complex languages can be, but we have said enough about it here.
Which Languages are Bug Prone?
The article "Which Languages Are Bug Prone", written by Janet Swift, reports on a study of the effect of programming languages on software quality that appeared in October's Communications of the ACM. This most-read news item of 2017 covers some of the study's major findings relating to the prevalence of bugs.
Researchers Baishakhi Ray, Daryl Posnett, Premkumar Devanbu and Vladimir Filkov used data from GitHub for a large-scale empirical investigation into the ever present debate among programmers as to which language is best for a given task. They combined multiple regression modeling with visualization and text analytics, to study the effect of language features such as static versus dynamic typing and allowing versus disallowing type confusion on software quality.
The short version of their conclusions is given in the abstract:
Language design does have a significant, but modest effect on software quality. Most notably, it does appear that disallowing type confusion is modestly better than allowing it, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages.
The object of the exercise was to shed light on the idea that the choice of programming language affects both the coding process and the resulting program, with the emphasis being on static versus dynamic typing:
Advocates of strong, static typing tend to believe that the static approach catches defects early; for them, an ounce of prevention is worth a pound of cure. Dynamic typing advocates argue, however, that conservative static type checking is wasteful of developer resources, and that it is better to rely on strong dynamic type checking to catch type errors as they arise. These debates, however, have largely been of the armchair variety, supported only by anecdotal evidence.
For this investigation the team chose the top 19 programming languages from GitHub, adding Typescript as a 20th and identified the top 50 projects written primarily in each language. They then discarded any project with fewer commits than 28 (the first quartile) and any language used in a multi-language project with fewer than 20 commits in that language. [¿J didn't make the cut? ¡Que lastima!]
Which Languages Made the Cut
As the above table shows, this provided the study with 728 projects developed in 17 languages. The projects spanned 18 years of history and included 29,000 different developers, 1.57 million commits, and 564,625 bug fix commits.
Classes of Languages
Next the team defined language classes, distinguishing between three programming paradigms: procedural, scripting and functional; two categories of type checking: static and dynamic; whether implicit type conversion is disallowed or allowed; and managed memory as opposed to unmanaged. [J would fall into the categories functional, dynamic, implicit (type conversion allowed), managed]
Using keyword search for 10% of bug fix messages to train a bug classifier, the researchers identified both cause and impact for each bug-fix commit.
Bug Classes vs Language Types
The first question to be addressed was "Are some languages more defect-prone than others?" and this was done using a regression model to compare the impact of each language on the number of defects with the average impact of all languages, against defect fixing commits:
At the top of this table are variables used as controls for factors that are likely to be correlated. Project age is included as older projects will generally have a greater number of defect fixes; the number of developers involved and the raw size of the project are also expected to affect the number of bugs and finally the number of commits is bound to. All four were found to have significant positive coefficients. The languages with the strongest positive coefficients, meaning they are associated with a greater number of defect fixes, are C++, C, and Objective-C, also PHP and Python. On the other hand, Clojure, Haskell, Ruby and Scala all have significant negative coefficients implying that these languages are less likely than average to result in defect fixing commits. With regard to language classes, functional languages are associated with fewer defects than either procedural or scripting languages.
The researchers next turn their attention to Defect Proneness, the ratio of bug fix commits over total commits per language per domain and produce a heat map where darker color indicates more prone to bugs:
From the above heat map they conclude that there is no general relationship between application domain and language defect proneness. However looking at the relation between language class and bug category indicates that:
Defect types are strongly associated with languages; some defect type like memory errors and concurrency errors also depend on language primitives. Language matters more for specific categories than it does for defects overall.
This heat map shows a strong relationship between the Proc-Static-Implicit-Unmanaged class and both concurrency and memory errors. It also shows that Static languages are in general more prone to failure and performance errors; these are followed by Functional-Dynamic-Explicit-Managed languages, such as Erlang.
Summing up the findings, the conclusions of the report are:
The data indicates that functional languages are better than procedural languages; it suggests that disallowing implicit type conversion is better than allowing it; that static typing is better than dynamic; and that managed memory usage is better than unmanaged. Further, that the defect proneness of languages in general is not associated with software domains. Additionally, languages are more related to individual bug categories than bugs overall.
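The defect-proneness measure used in the study, the ratio of bug-fix commits to total commits, is simple enough to sketch in a few lines of Python. The function name is made up for illustration; the only figures used are the overall totals quoted earlier (564,625 bug-fix commits out of roughly 1.57 million commits), so the result is approximate because the total is rounded.

```python
def defect_proneness(bug_fix_commits, total_commits):
    """Ratio of bug-fix commits to all commits (0.0 when there are no commits)."""
    if total_commits == 0:
        return 0.0
    return bug_fix_commits / total_commits


# Overall totals reported for the study; the study itself computes the
# ratio per language and per domain rather than overall.
overall = defect_proneness(564_625, 1_570_000)
print(f"{overall:.1%}")
```

So a little over a third of all commits in the sample were classed as bug fixes.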
For Fun
It's time for some simple visual displays!
Flipping Images
We play with simple image manipulation routines to achieve some possibly interesting effects.
Flipping on Two Axes
In an effort to force symmetry of an image, I hit upon the idea of joining an image to its vertical mirror image, along the vertical axis, then joining this to its horizontal mirror image along the horizontal axis. For example:
   (] ,. |."1 ) i. 2 2                   NB. Vertical flip and join
0 1 1 0
2 3 3 2
   (] , |. ) i. 2 2                      NB. Horizontal flip and join
0 1
2 3
2 3
0 1
   (] , |. ) @: (] ,. |."1 ) i. 2 2      NB. Both flips and joins
0 1 1 0
2 3 3 2
2 3 3 2
0 1 1 0
As you can see, this imposes symmetry on a matrix. So, encapsulating this in a J verb:
dblDown=: (],|.)@:(],.|."1)
and using it on an image:
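For comparison, the same double mirroring can be sketched in Python on a list of rows. This is just an illustrative analogue of dblDown for a 2D matrix; the name dbl_down is invented here.

```python
def dbl_down(matrix):
    """Join each row to its mirror image, then join the result to its flip."""
    wide = [row + row[::-1] for row in matrix]  # like ] ,. |."1
    return wide + wide[::-1]                    # like ] , |.


small = [[0, 1], [2, 3]]  # equivalent of i. 2 2
for row in dbl_down(small):
    print(row)
```

The result is symmetric about both its horizontal and vertical centre lines, whatever the input.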
Seeing What Works
Playing around with alternatives, we find several that give the right shape but the wrong result before eventually finding one that is correct:
   $1 0 2|:(],|.) img
528 1056 3
   (1 0 2|:(],|.) img) write_image 'flip0t23d.png'
   ((],|.) &. (1 0 2&|:) img) write_image 'flip0t43d.png'
Trying different combinations of flips, we find a more succinct way to express the 3D case where we maintain spatial integrity in the color dimension but are free to re-orient any of the others:
   (],"2 |."2) i. 2 4 3   NB. Try on simpler 3D array first to eyeball result.
 0  1  2
 3  4  5
 6  7  8
 9 10 11
 9 10 11
 6  7  8
 3  4  5
 0  1  2

12 13 14
15 16 17
18 19 20
21 22 23
21 22 23
18 19 20
15 16 17
12 13 14
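A rough Python analogue of (],"2 |."2) on nested lists may help in checking such results by eye: each plane is followed by its own rows in reverse order, while the innermost triples (the color values, in the image case) are left intact. The name mirror_rows is made up for this sketch.

```python
def mirror_rows(array3d):
    """Append to each plane its own rows in reverse order; triples stay intact."""
    return [plane + plane[::-1] for plane in array3d]


# Equivalent of i. 2 4 3: two planes of four rows of three numbers each.
cube = [[[p * 12 + r * 3 + c for c in range(3)] for r in range(4)]
        for p in range(2)]

mirrored = mirror_rows(cube)
for plane in mirrored:
    for row in plane:
        print(row)
    print()
```

Each plane doubles from four rows to eight, so the overall shape goes from 2 4 3 to 2 8 3.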