NYCJUG/2007-03-13
Overview of J Wiki introductory material, notation FAQ, linear Diophantine equations
NYC JUG Meeting of 20070313
In this meeting, we discussed how the introductory material on the J Wiki might be better organized and what could be added to provide more motivation for learning J, the idea of a FAQ to address recurring questions about "notation", solving linear Diophantine equations, and improving the performance of a long-running adverb.
Meeting Agenda for NYC JUG 20070313
Working with the J Wiki
-----------------------

1. Beginner's regatta: overview of intro material on J Wiki - how might it be better organized? Where is the motivational material? See section 4 below.

2. Show-and-tell: in progress on J Wiki: notationFAQ, linearDiophantineEquations.

3. Advanced topics: improving the performance of an adverb: from "getVarInfo" to "gavInfo": from 150 hours to 12 hours. Possible enhancements: how to divide more evenly in 2 dimensions? Alternately, how to represent subdivisions within existing divisions?

4. Learning and teaching J: what motivational examples can we put on the J Wiki? How can we make FAQs more transparent, more geared toward actual beginner questions?

+.--------------------+.

Like waves of the sea, events have two sides: either you ride them out or they ride you down.
- Arabic proverb
Proceedings
We debated the layout of the J Wiki in the "Beginner's regatta", discussed what should be in a J language FAQ specifically in relation to notational "oddities", looked at some work on solving linear Diophantine equations, briefly discussed performance improvement of an adverb used in working on the Netflix Challenge, and discussed what examples we could provide to motivate people to learn J.
Beginner's regatta
The consensus on the J Wiki was that it was a bit crowded and not sufficiently visually interesting. The initial page does not make one think "Really? Tell me more!"
We debated the idea of a splash page, common on many sites, but decided this was more annoying than compelling. One example of compelling J some of us like is the cover of Howard Peelle's J book: it's festooned with short J definitions having familiar names like "pascal" and "pythagoras".
Show and Tell
See [[[NYCJUG/notationFAQ]]] for the answer to a commonly asked question about one of J's prominent notational conventions: why is the default order of operation right-to-left?
See [[[NYCJUG/linearDiophantineEquations]]] for the discussion on these equations.
Advanced Topics
I developed the adverb "getVarInfo" to apply an arbitrary function (verb) to a list of variables on file. The variables are specified by the 99 names in the vector "UVN" and live in the directory specified by "VDIR". This is for handling a very large array by breaking it into a hundred or so pieces. Instead of applying a function to the array directly, we do so indirectly using this adverb, which applies the supplied verb "u" (its left argument) to each of the pieces on file.
So, an expression like dts=: (3&{) getVarInfo&.>(<VDIR);&.>UVN applies 3&{ to get the date row from each matrix on file.
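As a rough illustration of this calling pattern, here is a minimal sketch in which the variables live in the session rather than on file. The adverb "applyNamed" and the variables "var1" and "var2" are hypothetical stand-ins, not part of the original code, and the sketch uses the current "u" and "y" spelling rather than "u." and "y.".

```j
NB. Hypothetical in-memory stand-ins for the filed variables.
var1=: i. 4 3
var2=: 10 + i. 4 3

NB. applyNamed: apply verb u to the value of the variable named in y.
applyNamed=: 1 : 0
  u ". y
)

NB. Apply 3&{ (fetch row 3) to each named variable in turn.
rows=: (3&{) applyNamed &.> 'var1';'var2'
```

Here "rows" is a boxed list holding row 3 of each matrix, which is the same shape of result that "dts" has in the expression above.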
Original Adverb
NB.* getVarInfo: apply arbitrary function to each (filed) var named.
getVarInfo=: 1 : 0
   'dd varnm'=. y.                  NB. Vars dir, var names.
   rc=. dd unfileVar_WS_ varnm      NB. Get var from file
   if. >{.rc do.
       rc=. 1;u. ".varnm            NB. Do something to it
       [4!:55 <varnm                NB. Erase when done to conserve space
   end.
   rc
NB.EG ({."1,.{:"1) getVarInfo &.>(<'C:\data\');&.>'var1';'var2';'var3'
NB.EG dts=: (3&{) getVarInfo&.>(<VDIR);&.>MVN
)
The function "unfileVar_WS_" above is from "WS.ijs" found at [[[Scripts/File J Variables]]]. This function allows us to read and write a J variable from and to a file.
Modified Adverb
The newer version is a conjunction: it takes two verbs instead of one. This allows us to apply the function of interest (see "accumCMRatings" below) to a group of arrays at a time: several variables are first joined by an appropriate concatenation verb, ",." in this case, and the function of interest is then applied once to the combined array. For some functions, it's much faster to work on larger pieces.
NB.* gavInfo: apply arbitrary fnc "u" to all file-vars joined by fnc "v".
gavInfo=: 2 : 0
   for_dv. y do.
       'dd varnm'=. >dv             NB. Vars dir, var names.
       rc=. dd unfileVar_WS_ varnm
       if. >{.rc do.
           if. -.nameExists 'cumvals' do. cumvals=. ".varnm
           else. cumvals=. cumvals v ".varnm end.
           4!:55 <varnm
       end.
   end.
   1;u cumvals
NB.EG (_2 ({."1,.{:"1) gavInfo ,.)\(<'C:\data\');&.>'var1';'var2';'var3';'var4'
)
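The join-then-apply idea can be sketched without any files. In the following, the conjunction "joinThen" and the variables "blk1" and "blk2" are hypothetical names chosen for illustration. Unlike "gavInfo", the sketch does not read from file, does not check for read failures, and does not erase anything.

```j
NB. Hypothetical in-memory stand-ins for the filed variables.
blk1=: i. 2 2
blk2=: 10 + i. 2 2

NB. joinThen: join the values of all variables named in y with verb v,
NB. then apply verb u once to the combined array.
joinThen=: 2 : 0
  acc=. ". > {. y
  for_nm. }. y do.
    acc=. acc v ". > nm
  end.
  u acc
)

NB. Stitch the two matrices side by side, then sum down the columns.
sums=: (+/ joinThen ,.) 'blk1';'blk2'
```

The verb "u" runs once on the combined 2-by-4 array rather than once per piece, which is where the saving comes from when "u" has a high fixed cost per call.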
Timings
The timings below, first for the original adverb "getVarInfo" then for the newer adverb "gavInfo", are misleading as presented. The first version was taking so long (it had been running for days and was only about halfway through the files) that I came up with the newer version while the older one was still running. I then moved the unprocessed files, 32 of the 99, to an alternate directory and ran the new version on those.
Once both versions finished, I combined the results.
Session Using Original Adverb getVarInfo
   CTMAT=: 0$~CMClassVars ''        NB. Initialize the global we'll be updating
   6!:2 'accumCMRatings getVarInfo&.>(<VDIR);&.>UVN'
409856.93
   0 60 60#:409856.93               NB. Number of seconds as hours, minutes, and seconds
113 50 56.93
Note that this timing was for the first 67 files whereas the following is for the remaining 32 files. A little forethought in the design of this code allowed it to fail gracefully when I pulled the rug out from under the first adverb by removing some of its files after it had started running. Note that this graceful failure, combined with good modularization, also helps for coarse-grained parallelism: I can run multiple instances of this code on distinct sets of files on separate machines or different cores of the same processor.
Now, we finish the job using the newer adverb "gavInfo" which we apply to blocks of 8 variables from file at a time using the scan adverb "\" with a negative number to specify non-overlapping windows. The value "8" happens to divide evenly into the 32 files I had remaining but this doesn't matter for the result. A short block at the end would have been processed just fine.
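The non-overlapping windows can be seen with a small example. The name "items" and the window size of 4 are arbitrary choices for illustration; the actual run used 8.

```j
NB. Ten hypothetical items standing in for file names.
items=: i. 10

NB. A negative left argument to u\ gives non-overlapping windows.
sizes=: _4 #\ items      NB. length of each window
blocks=: _4 <\ items     NB. the windows themselves, boxed
```

Since 4 does not divide 10 evenly, the last window is short: "sizes" is 4 4 2, and the verb is simply applied to the two leftover items.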
Separate Session Using Updated Adverb gavInfo
   CTMAT=: 0$~CMClassVars ''        NB. Initialize the global
   VD2=: 'C:\Data\Netflix\AltVDir\' NB. Specify alternate file variables' directory
   6!:2 '_8 (accumCMRatings gavInfo ,.)\(<VD2);&.>UVN'
14266
   0 60 60#:14266
3 57 46
We see that the first 67 files needed almost 114 hours, whereas the final 32 files took less than 4 hours. So, extrapolating to the full set of 99 files, the first version would have taken about 168 hours versus about 12 for the newer adverb.
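The extrapolation can be reproduced from the two timings, assuming that the time per file is roughly constant within each run. The names "secs", "nfiles", "perfile", "esthrs" and "speedup" are introduced here for illustration only.

```j
secs=: 409856.93 14266           NB. elapsed seconds from the two 6!:2 timings
nfiles=: 67 32                   NB. files handled in each session
perfile=: secs % nfiles          NB. seconds per file for each adverb
esthrs=: (99 * perfile) % 3600   NB. projected hours for all 99 files
speedup=: %/ perfile             NB. old time per file over new time per file
```

Under that assumption, "esthrs" comes to a little over 168 hours and a little over 12 hours, and "speedup" is between 13 and 14.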
Verb and Sub-functions Used with Adverbs
NB.* accumCMRatings: group (cust,movie) ratings by averages partitions.
accumCMRatings=: 3 : 0
NB.    VDIR unfileVar_WS_ 'umurd0'
   cn=. classifyCMRatings ptnVar y
   cr=. cn </. 2{y                  NB. Customer-movie ratings by CM-class
   cn=. ~.cn                        NB. Class number/partition
NB.    ctmat=. (NCC,NMC)$0          NB. Count # ratings/class
   nd=. >:<.10^.NCC*NMC             NB. Max # digits in total class number
   for_fnum. i.#cn do.              NB. Rating info into appropriate CM-class file-var
       vnm=. 'cmclass',(-nd){.(nd$'0'),":fnum{cn
       VDIR unfileVar_WS_ vnm
       (vnm)=: (".vnm),>fnum{cr
       VDIR fileVar_WS_ vnm
       4!:55 <vnm
       ix=. <(NCC,NMC)#:fnum{cn
       CTMAT=: ((#>fnum{cr)+ix{CTMAT) ix}CTMAT
   end.
NB. accumCMRatings getVarInfo&.>(<VDIR);&.>UVN[CTMAT=: 0$~CMClassVars ''
)

NB.* classifyCMRatings: find Cust-Movie class assuming "cbpv" and "mclass".
classifyCMRatings=: 3 : 0
   cc=. <:+/cbpv </ mean&>2{&.>y    NB. Customer class based on avg movie rating
   mc=. mclass{~(0{mvnums) i. ;0{&.>y
   cc=. cc#~(1{$)&>y                NB. Customer class/rating entry
   cn=. mc+NMC*cc                   NB. Class number: (cust, movie)
)

NB.* countBiClass: count # (cust,movie) per bidimensional equi-rating groups.
countBiClass=: 3 : 0
   if. 0=#y do. y=. 10 10 201 end.
   'cnb mnb bsz'=. y                NB. Cust # breakpoints, Movie # brkpts, block sz
   CMClassVars cnb,mnb              NB. Vars: NCC, NMC, cbpv, mbpv, cclass, mclass
   ctmat=. (cnb,mnb)$0
   for_cb. i.>.NCUST%bsz do.
       len=. bsz<.NCUST-cb
       'ct ix'=. <"1 |:frtab ,(mnb*cclass{~cb+i.len)+/mclass
       ix=. <"1](cnb,mnb)#:ix
       ctmat=. (ct+ix{ctmat) ix}ctmat
   end.
   ctmat
)

NB.* findEqualBinBreakpoints: find distinct values to partition vec equally.
findEqualBinBreakpoints=: 3 : 0
NB.    avgbp=. /:~avgmr=. %/2{.MMR
   'nbp vals'=. y.
   vals=. /:~vals
   anpb=. nbp%~#vals                NB. Average number per bin
NB. Locate index of 1st instance of breakpoint value
   bpi=. vals i. vals{~<.0.5+anpb*i.>.anpb%~#vals
   bp=. bpi{vals
   bp=. bp#~whUnq bp
   bp;(<./,>./,mean,stddev) 2-~/\bp NB. Stats on breakpoint differences
NB.EG 'bpv stats'=. findEqualBinBreakpoints 100;%/2{.MUR
)