NYCJUG/2007-03-13

Overview of J Wiki introductory material, notation FAQ, linear Diophantine equations

NYC JUG Meeting of 20070313

In this meeting, we talked about how the introductory material on the J Wiki might be better organized and what could be added to provide more motivation for learning J; the idea of a FAQ to address recurring questions about "notation"; solving linear Diophantine equations; and improving the performance of a long-running adverb.

Meeting Agenda for NYC JUG 20070313

             Meeting Agenda for NYC JUG 20070313
             -----------------------------------

                    Working with the J Wiki
                    -----------------------
1. Beginner's regatta: overview of intro material on J Wiki -
how might it be better organized?  Where is the motivational material?
See section 4 below.


2. Show-and-tell: in progress on J Wiki: notationFAQ,
linearDiophantineEquations.


3. Advanced topics: improving the performance of an adverb:
from "getVarInfo" to "gavInfo": from 150 hours to 12 hours.

Possible enhancements: how to divide more evenly in 2 dimensions?
Alternately, how to represent subdivisions within existing divisions?


4. Learning and teaching J: what motivational examples can we put
on the J Wiki?  How can we make FAQs more transparent, more geared
toward actual beginner questions?

            +.--------------------+.

Like waves of the sea,
events have two sides:
either you ride them out
or they ride you down.
  - Arabic proverb

Proceedings

We debated the layout of the J Wiki in the "Beginner's regatta", discussed what should be in a J language FAQ specifically in relation to notational "oddities", looked at some work on solving linear Diophantine equations, briefly discussed performance improvement of an adverb used in working on the Netflix Challenge, and discussed what examples we could provide to motivate people to learn J.

Beginner's regatta

The consensus on the J Wiki was that it was a bit crowded and not sufficiently visually interesting. The initial page does not make one think "Really? Tell me more!"

We debated the idea of a splash page, common on many sites, but decided this was more annoying than compelling. One example of compelling J some of us like is the cover of Howard Peelle's J book: it's festooned with short J definitions having familiar names like "pascal" and "pythagoras".

Show and Tell

See [[NYCJUG/notationFAQ]] for the answer to a commonly asked question about one of J's prominent notational conventions: why is the default order of operation right-to-left?

See [[NYCJUG/linearDiophantineEquations]] for the discussion on these equations.

Advanced Topics

I developed the adverb "getVarInfo" to apply an arbitrary function (verb) to a list of variables on file specified by the 99 names in the vector "UVN" in the directory specified by "VDIR". This is for handling a very large array by breaking it into a hundred or so pieces. Instead of applying a function to the array directly, we do so indirectly using this adverb which applies the supplied left function "u" to each of the pieces on file.

So, an expression like dts=: (3&{) getVarInfo&.>(<VDIR);&.>UVN applies 3&{ to get the date row from each matrix on file.
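
To see what the right argument in such an expression looks like, here is a minimal sketch that runs without any files. The directory and variable names are made up, and "showPair" is a hypothetical stand-in for the verb derived from the adverb: it only unpacks its argument the same way and joins the two parts.

```j
NB. Illustration only: made-up directory and names; showPair is hypothetical.
pairs=: (<'C:\data\') ;&.> 'var1';'var2';'var3'   NB. one (dir;name) pair per box

showPair=: 3 : 0
'dd varnm'=. y      NB. same unpacking as in getVarInfo
dd,varnm
)

showPair&.> pairs   NB. three boxes: C:\data\var1, C:\data\var2, C:\data\var3
```

Each box of "pairs" holds a two-item boxed list, so applying a verb under open (&.>) hands it one directory and one name at a time.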

Original Adverb

NB.* getVarInfo: apply arbitrary function to each (filed) var named.
getVarInfo=: 1 : 0
   'dd varnm'=. y                  NB. Vars dir, var names.
   rc=. dd unfileVar_WS_ varnm     NB. Get var from file
   if. >{.rc do. rc=. 1;u ".varnm  NB. Do something to it
       [4!:55 <varnm               NB. Erase when done to conserve space
   end.
   rc
NB.EG ({."1,.{:"1) getVarInfo &.>(<'C:\data\');&.>'var1';'var2';'var3'
NB.EG dts=: (3&{) getVarInfo&.>(<VDIR);&.>MVN
)

The function "unfileVar_WS_" above is from "WS.ijs", found at [[Scripts/File J Variables]]. That script lets us write a J variable to a file (with "fileVar_WS_") and read it back (with "unfileVar_WS_").

Modified Adverb

The newer version of this adverb takes two verbs instead of one. This allows us to apply the function of interest (see "accumCMRatings" below) to a group of arrays at a time by specifying an appropriate concatenation, ",." in this case, to several variables before applying the function of interest. For some functions, it's much faster to work on larger pieces.

NB.* gavInfo: apply arbitrary fnc "u" to all file-vars joined by fnc "v".
gavInfo=: 2 : 0
   for_dv. y do. 'dd varnm'=. >dv  NB. Vars dir, var names.
       rc=. dd unfileVar_WS_ varnm
       if. >{.rc do.
           if. -.nameExists 'cumvals' do. cumvals=. ".varnm
           else. cumvals=. cumvals v ".varnm end.
           4!:55 <varnm
       end.
   end.
   1;u cumvals
NB.EG _2 (({."1,.{:"1) gavInfo ,.)\(<'C:\data\');&.>'var1';'var2';'var3';'var4'
)
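
The join-then-apply idea can be tried in memory. This is only a sketch under assumptions: the boxed matrices in "pieces" stand in for the variables on file, and "firstLast" is just a name I give here to the verb from the example comment. Note that gavInfo accumulates left to right whereas insert (/) works right to left; for an associative join like ",." the result is the same.

```j
NB. Sketch only: in-memory boxes stand in for the variables on file.
pieces=: (i.2 3);(10+i.2 2);(20+i.2 1)   NB. three matrices with 2 rows each
joined=: > ,.&.>/ pieces                  NB. stitch side by side: shape 2 6
firstLast=: {."1 ,. {:"1                  NB. verb from the example comment
firstLast joined                          NB. first and last columns: 0 20 ,: 3 21
```

The verb of interest runs once on the 2-by-6 block instead of three times on the smaller pieces.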

Timings

The timings below, first for the original adverb "getVarInfo" and then for the newer adverb "gavInfo", are misleading as presented. The first version was taking so long (it had been running for days and was only about halfway through the files) that I came up with the newer version while the older one was still running. I then moved the unprocessed files, 32 of the 99, to an alternate directory and ran the new version on those.

Once both versions finished, I combined the results.

Session Using Original Adverb getVarInfo

   CTMAT=: 0$~CMClassVars ''     NB. Initialize the global we'll be updating
   6!:2 'accumCMRatings getVarInfo&.>(<VDIR);&.>UVN'
409856.93
   0 60 60#:409856.93            NB. Number of seconds as hours, minutes, and seconds
113 50 56.93

Note that this timing was for the first 67 files whereas the following is for the remaining 32 files. A little forethought in the design of this code allowed it to fail gracefully when I pulled the rug out from under the first adverb by removing some of its files after it had started running. Note that this graceful failure, combined with good modularization, also helps for coarse-grained parallelism: I can run multiple instances of this code on distinct sets of files on separate machines or different cores of the same processor.

Now, we finish the job using the newer adverb "gavInfo" which we apply to blocks of 8 variables from file at a time using the scan adverb "\" with a negative number to specify non-overlapping windows. The value "8" happens to divide evenly into the 32 files I had remaining but this doesn't matter for the result. A short block at the end would have been processed just fine.
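
As a small illustration of how the sign of the left argument to "\" changes the windows, here is a sketch on a toy vector rather than on the file names:

```j
3 +/\ i.8     NB. overlapping windows of 3: 3 6 9 12 15 18
_3 +/\ i.8    NB. non-overlapping windows: 3 12 13 (last window is short)
_3 <\ i.8     NB. boxing shows the windows: 0 1 2 ; 3 4 5 ; 6 7
```

The short final window (6 7) is handled like any other, which is why the block size need not divide the number of files evenly.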

Separate Session Using Updated Adverb gavInfo

   CTMAT=: 0$~CMClassVars ''         NB. Initialize the global
   VD2=: 'C:\Data\Netflix\AltVDir\'  NB. Specify alternate file variables' directory
   6!:2 '_8 (accumCMRatings gavInfo ,.)\(<VD2);&.>UVN'
14266
   0 60 60#:14266
3 57 46

We see that we needed almost 114 hours for the first 67 files, whereas we processed the final 32 files in less than 4 hours. So, extrapolating to the full set of 99 files, the first version would have taken about 168 hours versus about 12 for the newer adverb.
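
The extrapolation is simple proportional scaling from the two timings above. It assumes every file takes about the same time to process, which may not hold exactly. The names "hrs1" and "hrs2" are introduced here only for illustration.

```j
hrs1=: 409856.93 % 3600   NB. hours for 67 files, original adverb
hrs2=: 14266 % 3600       NB. hours for 32 files, newer adverb
hrs1 * 99 % 67            NB. about 168.2 hours for all 99 files
hrs2 * 99 % 32            NB. about 12.3 hours for all 99 files
(hrs1 % 67) % hrs2 % 32   NB. per-file time ratio, about 13.7
```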

Verb and Sub-functions Used with Adverbs

NB.* accumCMRatings: group (cust,movie) ratings by averages partitions.
accumCMRatings=: 3 : 0
NB. VDIR unfileVar_WS_ 'umurd0'
   cn=. classifyCMRatings ptnVar y
   cr=. cn </. 2{y       NB. Customer-movie ratings by CM-class
   cn=. ~.cn             NB. Class number/partition
NB.   ctmat=. (NCC,NMC)$0   NB. Count # ratings/class
   nd=. >:<.10^.NCC*NMC  NB. Max # digits in total class number
   for_fnum. i.#cn do.   NB. Rating info into appropriate CM-class file-var
       vnm=. 'cmclass',(-nd){.(nd$'0'),":fnum{cn
       VDIR unfileVar_WS_ vnm
       (vnm)=: (".vnm),>fnum{cr
       VDIR fileVar_WS_ vnm
       4!:55 <vnm
       ix=. <(NCC,NMC)#:fnum{cn
       CTMAT=: ((#>fnum{cr)+ix{CTMAT) ix}CTMAT
   end.
NB. accumCMRatings getVarInfo&.>(<VDIR);&.>UVN[CTMAT=: 0$~CMClassVars ''
)

NB.* classifyCMRatings: find Cust-Movie class assuming "cbpv" and "mclass".
classifyCMRatings=: 3 : 0
   cc=. <:+/cbpv </ mean&>2{&.>y   NB. Customer class based on avg movie rating
   mc=. mclass{~(0{mvnums) i. ;0{&.>y
   cc=. cc#~(1{$)&>y               NB. Customer class/rating entry
   cn=. mc+NMC*cc                  NB. Class number: (cust, movie)
)

NB.* countBiClass: count # (cust,movie) per bidimensional equi-rating groups.
countBiClass=: 3 : 0
   if. 0=#y do. y=. 10 10 201 end.
   'cnb mnb bsz'=. y          NB. Cust # breakpoints, Movie # brkpts, block sz
   CMClassVars cnb,mnb        NB. Vars: NCC, NMC, cbpv, mbpv, cclass, mclass
   ctmat=. (cnb,mnb)$0
   for_cb. bsz*i.>.NCUST%bsz do.   NB. cb is the starting offset of each block
       len=. bsz<.NCUST-cb
       'ct ix'=. <"1 |:frtab ,(mnb*cclass{~cb+i.len)+/mclass
       ix=. <"1](cnb,mnb)#:ix
       ctmat=. (ct+ix{ctmat) ix}ctmat
   end.
   ctmat
)

NB.* findEqualBinBreakpoints: find distinct values to partition vec equally.
findEqualBinBreakpoints=: 3 : 0
NB. avgbp=. /:~avgmr=. %/2{.MMR
   'nbp vals'=. y
   vals=. /:~vals
   anpb=. nbp%~#vals          NB. Average number per bin
NB. Locate index of 1st instance of breakpoint value
   bpi=. vals i. vals{~<.0.5+anpb*i.>.anpb%~#vals
   bp=. bpi{vals
   bp=. bp#~whUnq bp
   bp;(<./,>./,mean,stddev) 2-~/\bp     NB. Stats on breakpoint differences
NB.EG 'bpv stats'=. findEqualBinBreakpoints 100;%/2{.MUR
)