NYCJUG/2007-01-09
NYCJUG meeting, BEST, beginner's laments, introductions to J, Netflix prize, RMSE
NYC JUG Meeting of 20070109
This meeting was held in a different venue from usual as José Quintana and his company BEST (Bayesian Enhanced Strategic Trading) kindly hosted us in their Hoboken offices. In addition to providing us with our special guest Roger Hui, BEST also set out a lavish spread of pizza, sushi, and beer. Also, it was nice to have the facilities of their conference room with Tom Costigliola manning the wireless keyboard to give us a live J session on the large TV.
The NYC JUG would like to thank José and BEST for their generous sponsorship of our meeting.
We had particularly good attendance for this meeting, no doubt because of Roger's presence. We had Zach Reiter come all the way from Pennsylvania.
Taking the meeting in the order of the agenda, which we actually followed fairly closely this time, we start with our beginner's section.
Beginner's laments
1. Beginner's laments: we changed the title of this section as this better reflects the tenor of our discussion. Why isn't there more beginner-level reference material - not J introductions or primers but something along the lines of Sandra Pakin's "APL\360 Reference Manual"?
We passed around a copy of Pakin's reference manual. It's concise - though not as concise as the J vocabulary page - and well-organized. The major organizing principle for the explanations is to start with a description of the function followed by examples of its use. Actually, many of the detailed help pages linked from the vocabulary page follow this paradigm.
Another lament is that the conventions for documentation like the J Dictionary are not obvious and not spelled out. This was quickly dealt with as Roger pointed out that, by backing up from the J vocabulary page, we quickly find the explanatory material. The start of this can be found at [1]. Also, the Primer explains how to use the J Dictionary, though this explanation tends to be spread out across several pages.
Perhaps we need some kind of introduction to the introductions? It doesn't hurt to repeat things in different ways as long as the plenitude of information is organized. Also, the J Wiki is a powerful resource since it allows us each to make a small contribution as we see fit. For instance, if someone has trouble discerning the conventions of the dictionary the way they are currently organized, this person could create his own page with the conventions all spelled out in one place. This is also handy for self-organization and learning.
Beginner Resources Now Available
We also passed around a copy of Howard Peele's fine J book "Mathematical Computing in J, Volume 1 - Introduction". This thorough textbook is designed to present J in small pieces and offers many exercises. Howard's book is noted in the bibliography found in the help system: Essays/Bibliography. Knowing how to find this bibliography is also helpful.
There are also the on-line books: "J for C Programmers" from Henry Rich, "Learning J" by Roger Stokes, as well as the several other introductions to be found in the J Help area. Other useful introductory material can be found in Cliff Reiter's File:Brief ref03.pdf as well as in the File:BriefJrefc Burke.pdf written by both Cliff Reiter and Chris Burke. These are both well-written, brief introductions to J.
Some criticisms of the beginner's material can be found in a compilation of some e-mails from 2005: File:JforBeginners-Criticisms.txt.
Show-and-tell
2. Show-and-tell: Netflix progress: movie clustering examples - difficulty with the movie "Buffy" versus the TV series -> low RMSE versus many movies in common. See "rmseAdjustmentsSuggs.txt", and "NakedTruthRating.txt".
In this section, John Randall and I detailed some of our progress and problems using J to work on the Netflix prize [2].
In brief, the mail-order video rental company Netflix is running a competition, with a million-dollar prize, to improve their algorithms for recommending movies to their customers based on the customers' ratings of other movies. They provide a substantial amount of "training" data: as a simple array, it can be looked at as a 4 by 100 million matrix of integers. At least, this is how I set it up. There are a few, smaller arrays, like a list of movie titles, but this one large array is hard to deal with as a single object because of its size. Many people put it into a database, which is a logical choice, but this is not the most satisfying way to deal with an array for a J user. We've tried memory-mapped files but even these fall short because of the size of the whole array.
I've dealt with it by breaking it into pieces, numbered J variables, that I write to files using my variable-to-file package "WS". This is not entirely satisfactory but it suffices. I've written an adverb that allows me to apply an arbitrary verb to each variable on file, in effect treating them as the single, large variable.
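The idea of applying one verb across all the pieces can be sketched roughly as follows. This is not the actual adverb or the "WS" package: the names piece0, piece1, piece2, names and onpieces are hypothetical, and the pieces here are ordinary in-memory variables standing in for variables read from file.

```j
NB. Hypothetical stand-in: three in-memory pieces instead of variables on file.
piece0=: 1 2 3
piece1=: 4 5
piece2=: 6 7 8 9
names=: 'piece0';'piece1';'piece2'

NB. onpieces: adverb; apply verb u to the value of each named piece in turn,
NB. returning one boxed result per piece.
onpieces=: 1 : 'u@:".&.> y'

; # onpieces names        NB. items per piece: 3 2 4
; +/ onpieces names       NB. sum of each piece: 6 9 30
+/ ; +/ onpieces names    NB. sum over the "whole" array: 45
```

Only one piece needs to be in memory at a time if the lookup of each name is replaced by a read from file; combining the per-piece results (here with a final +/) is what lets the pieces be treated as a single variable.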
Clustering Customers and Movies
Once we've figured out how to work with the large amount of data, the main question is what to do with it. One approach that seems obvious is to cluster customers into groups with similar tastes. This will give us a good estimate of a customer's movie ratings based on the cluster into which he falls. However, this is easier said than done. One problem is knowing if we have a good cluster.
An obvious way to measure the goodness of a cluster is to compare the RMSEs (root mean squared error) of the movie ratings of one customer versus another for the same group of movies. An important reason to use RMSE is that this is the measure on which Netflix judges the closeness of your ratings estimates to the (unknown to you) actual ratings in the target customer/movie set. However, there's also another way to have a general idea of the goodness of clusters: look at movie clusters.
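As a minimal sketch of this measure, assuming each movie's ratings are held as a list of customer ids with a parallel list of ratings (the names and data below are made up for illustration, not taken from the Netflix set), the RMSE over the customers two movies have in common could be computed like this:

```j
NB. rmse: root mean squared error between two equal-length lists of ratings.
rmse=: 4 : '%: (+/ % #) *: x - y'

NB. Hypothetical data: customer ids and the ratings they gave to two movies.
custA=: 11 12 13 14 15
rateA=: 5 4 3 5 1
custB=: 12 14 15 16
rateB=: 4 3 1 2

common=: custA ([ -. -.) custB     NB. customers who rated both: 12 14 15
ra=: rateA {~ custA i. common      NB. their ratings of movie A: 4 5 1
rb=: rateB {~ custB i. common      NB. their ratings of movie B: 4 3 1

(# common) , ra rmse rb            NB. sample size and RMSE: 3 1.1547
```

The same verb works for comparing two customers over their common movies by swapping the roles of the id lists.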
An important insight John had was that the problem of customers rating movies is in some sense symmetric: we could also think of movies as rating customers. This leads to the notion of clustering movies in the same way we cluster customers. It seems logical that certain groups of people like certain types of movies. Also, since we have the movie titles, whereas customers are only identified by a number, it's easy to get an idea if the clusters we derive make sense.
For example, in the following, I compare two different clusterings against each other by looking at which movies cluster with "Buffy, the Vampire Slayer". I like this example because there is both a "Buffy" movie and a subsequent TV series based on the same idea. However, the movie was less popular than the TV series, hence is less correlated to the TV series than one might think. At the same time, the successive seasons of the TV series tend to be closely related.
In the first section below, I look at the cluster I get from the variable "newmvclst" for movie ID 9929, which is "Buffy the Vampire Slayer" (the TV series), Season 1.
   NB. An example of (good and better) movie clustering:
   >titles{~UMVIDS i. 9929,12{.2{clst=: >newmvclst{~<1,~nids i. 9929
Buffy the Vampire Slayer: Season 1
Buffy the Vampire Slayer: Season 1
Alias: Season 2
Alias: Season 1
CSI: Season 2
Friends: Season 9
His Girl Friday
Dial M for Murder
The Day the Earth Stood Still
Six Feet Under: Season 3
The Rescuers
The Iron Giant: Special Edition
Bedknobs and Broomsticks
   8{."1 clst
14718    3421    3728    2726    2319    2122    1625    2051
    0 1112366 1158103 1289241 1295163 1326877 1334051 1337674
 9929    5292    1495   15861    9909   13378    4284   15037

   NB. Even better:
   ids0=. ;0{"1 mvclst
   >titles{~UMVIDS i. 9929,11{.2{clst0=: >mvclst{~<1,~ids0 i. 9929
Buffy the Vampire Slayer: Season 1
Buffy the Vampire Slayer: Season 2
Buffy the Vampire Slayer: Season 3
Buffy the Vampire Slayer: Season 4
Buffy the Vampire Slayer: Season 5
Firefly
Alias: Season 2
The X-Files: Season 2
Alias: Season 3
The X-Files: Season 1
Stargate SG-1: Season 1
CSI: Season 4
   8{."1 clst0
  9024   8835   8378   8053    3550    3421    3119    2845
493980 527565 551912 590289 1077556 1112366 1147645 1157083
  5092  17682  12937   8226    2114    5292   14742    5738
The latter clustering is based on another variable "mvclst" which returns better results. You see the first 5 seasons of the show clustered together, followed by the series "Firefly". If you know anything about "Buffy", you know that it is the creation of Joss Whedon who is also the creator of "Firefly", so this makes sense.
The RMSE Trade-off for Clusters
By way of comparison, the cluster for the movie version of "Buffy" is not as clear-cut and seems less plausible.
   NB. An example of (poor) movie clustering:
   >titles{~UMVIDS i. 10996,12{.2{clst=: >mvclst{~mvc i. 10996
Buffy the Vampire Slayer: The Movie
Stakeout
Air America
The Dead Pool
Honeymoon in Vegas
Necessary Roughness
The Gauntlet
Night Shift
Gung Ho
Addicted to Love
Sea of Love
Nick of Time
Firefox
   NB. The cluster var: 0{count of customers in common; 1{1e6*RMSE of ratings
   NB. (by common customers) of movie; 2{movie id number:
   5{."1 clst
   2223    2158    1447    2196    2437
1099421 1100594 1101297 1109037 1110069
  16366    9131   12376    5906    8679
This poorer cluster reflects a problem with using the RMSE-based measure, especially on less popular movies: there's a trade-off between low RMSE (lower RMSE = less error = better fit) and large sample size. In the example above, you can see that the five "best" elements in the cluster are large samples - as shown in row 0 of "clst" - but have poor RMSEs - all greater than one (the actual RMSE is multiplied by one million and rounded to keep the array all-integer).
Note how the prior example, in the section
   8{."1 clst0
  9024   8835   8378   8053    3550    3421    3119    2845
493980 527565 551912 590289 1077556 1112366 1147645 1157083
  5092  17682  12937   8226    2114    5292   14742    5738
has sample sizes in the thousands and RMSEs around one-half for the first four movies in the cluster (seasons 2 through 5 compared to season 1). Ideally, we want low RMSEs based on large samples, as in this "Buffy" - seasons 1 through 5 - example, but cannot always get this. The attachment File:RmseAdjustmentsSuggs.txt explores this trade-off in more detail.
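To make the comparison concrete, the integer RMSE rows from the two listings can be scaled back to actual RMSEs. The decoding below uses only the numbers printed above; the verb "scale" is a hypothetical name and assumes the original encoding rounded to the nearest integer, which may not be exactly how the arrays were built.

```j
NB. Row 1 of clst0 (TV series) and of clst (movie), scaled back to RMSEs.
1e_6 * 493980 527565 551912 590289              NB. 0.49398 0.527565 0.551912 0.590289
1e_6 * 1099421 1100594 1101297 1109037 1110069  NB. 1.09942 1.10059 1.1013 1.10904 1.11007

NB. Assumed encoding: multiply by one million and round to the nearest integer.
scale=: [: <. 0.5 + 1e6 * ]
scale 0.49398 1.099421                          NB. 493980 1099421
```

Keeping the RMSE as a scaled integer lets counts, RMSEs and movie ids sit together in one all-integer table, at the cost of having to remember the factor of one million when reading row 1.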
There's also an issue with how consistent ratings are and how they tend to be biased upwards since most people don't watch, hence don't rate, movies they think they won't like. See File:NakedTruthRating.txt for an example of how some ratings are obvious outliers.
The wide world of not-J
3. The wide world of not-J: J versus Matlab: what are their respective attractions and shortcomings? Where are the well-documented interfaces to Visual Basic for forms, etc., maximizing known strengths of other languages? Why no .NET version of J? See Oleg's "Guides/.NET Interop".
John Randall spoke briefly on the topic of the attractions of Matlab in a university environment. The language is able to satisfy the needs of a wide variety of constituents, from those who program very little to those who get deeply into the code. A large part of this is because of the facilities offered by Matlab's add-on packages. Finance people who want to run standard things, like a portfolio optimization, find it easy to do this without delving too deeply into the language. There are other packages for people mainly concerned with signal processing, and many others.
J offers some of this but perhaps not as nicely packaged. This is a shortcoming that may be addressed by some of the recent efforts in the JAL (J Add-on Library) and pacman (the package manager). Also, this is another place where we can leverage the power of the J Wiki. I often find it invaluable to have a few annotated examples of how to use something, whether it be a powerful primitive adverb like "oblique" ("/.") or a package like the images add-on.
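As one small illustration of what such an annotated example might look like (this one is not from the meeting), here is oblique applied to small tables. Used monadically, u/. applies the verb u to each anti-diagonal of a table.

```j
NB. i. 3 3 is the table    0 1 2
NB.                        3 4 5
NB.                        6 7 8
NB. Its obliques (anti-diagonals) are: 0 / 1 3 / 2 4 6 / 5 7 / 8
+//. i. 3 3              NB. sum of each oblique: 0 4 12 12 8

NB. Polynomial multiplication: form the table of coefficient products,
NB. then add along the obliques to collect terms of equal degree.
NB. (1 + 2x + x^2) times (1 + x) is 1 + 3x + 3x^2 + x^3.
+//. 1 2 1 */ 1 1        NB. 1 3 3 1
```

Replacing +/ with < shows the obliques themselves as boxes, which is a handy way to check what a verb will be applied to.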
Interfaces to Other Languages and Environments
Someone pointed out a shortcoming of the J GUI in that it doesn't implement all possible widgets. Members of NYCJUG have discussed this in the past and suggested that, since people seem to like Visual Basic's form designer, maybe J should be able to work with VB forms.
Another environment with which it would be good to work is Microsoft's .NET since this seems to be growing in popularity. There is something on the J Wiki about .NET interoperability which may address this: Guides/.NET Interop.
Another idea to address front-end issues is to use Microsoft Excel. This is useful for people who are already accustomed to Excel and already have their data in spreadsheets. There are a couple of tools for dealing with spreadsheets in J: some basic OLE code and the more elaborate Tara, an extensive suite of tools for working with spreadsheets in the biff8 format.
Odds and Ends
There is a command-line option to allow the new "break" mechanism to work on specific J sessions instead of breaking all of them: [3].
In response to traffic on the forum about a function to allow user input of a number from the keyboard within a J session:
getnumber=: 999 ". 1!:1@1:@(1!:2&2)
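Reading this verb from right to left: 1!:2&2 writes its argument (the prompt) to the screen, 1!:1@1: then reads a line from the keyboard, and 999 ". converts that line to numbers, substituting 999 for anything that is not a valid number. The keyboard part cannot be shown in a script, but the conversion step can be tried on its own with literal strings:

```j
NB. Dyadic ". converts text to numbers; the left argument is the default
NB. used for any token that cannot be read as a number.
999 ". '42'          NB. 42
999 ". 'abc'         NB. 999
999 ". '3 oops 5'    NB. 3 999 5
```

A caller of getnumber would therefore see 999 when the user types something non-numeric, so the default should be a value that cannot be mistaken for legitimate input.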
There also was some discussion about a model setting in form design around this point but my notes are incomplete and I don't recall what the upshot was. Does anyone who attended the meeting remember what this was about?