NYCJUG/2014-08-05
Kaggle competition, Higgs boson, J conference 2014, k-NN, k-nearest-neighbors, machine learning
Meeting Agenda for NYCJUG 20140805
1. Beginner's regatta: starting the Kaggle competition for the Higgs Boson challenge: see "Higgs Kaggle Basics".
2. Show-and-tell: discuss recent J conference.
3. Advanced topics: k-NN: k-Nearest Neighbors method - how to implement in J? See "Machine Learning Example - k-NN in F# and OCaml" and "Comparing k-NN in Rust".
4. Learning, teaching and promoting J, et al.: lack of IDE as a virtue? See "Getting Back to Coding" and "Just Let Me Code"; also, "Software Fails the Test of Time".
Proceedings
We discussed a number of approaches to machine learning in the context of a Kaggle competition to make sense of Higgs boson data.
Beginner's regatta
Higgs Kaggle Basics
The "Higgs Boson Machine Learning Challenge" is a Kaggle competition described here. It states:
The goal of the Higgs Boson Machine Learning Challenge is to explore the potential of advanced machine learning methods to improve the discovery significance of the experiment. No knowledge of particle physics is required. Using simulated data with features characterizing events detected by ATLAS, your task is to classify events into "tau tau decay of a Higgs boson" versus "background."
The competition started at 8:23 pm, Monday 12 May 2014 UTC and ends 11:59 pm, Monday 15 September 2014 UTC.
The problem is outlined in detail in this PDF.
A discussion of this competition on the J-Forum led to this note from Scott Locklin:
from: Scott Locklin <scott@lugos.name>
to: devonmcc@gmail.com
date: Fri, Aug 1, 2014 at 4:02 AM
subject: Higgs Kaggle
Hey, have you made any progress on this? I haven't looked at the data at all, but if you think it will submit to KNN, the libFLANN hooks I added with Bill Lam's help should speed things up for you considerably.
I haven't written a nice kernel regression object system for flann yet, but it's in the works (I more or less have to copy some work I did in Lisp) if you need such a thing. I don't have time to make any real contribution…
KNN tricks: local regression on KNN subsets works well sometimes. One thing which really ramps up KNN is to weight the features, either via online pruning (tough to do right) or just mutual information. KNN by itself is cursed with the naive bayes assumption that all the features are equally important; with some sensible weighting, you can get magic.
Another thing I'm fiddling with is CUR decomposition, which would be a more SVD/PCA type approach. It's trivial in J, and I'll be sticking it in my githubs when it's done. It uses lapack SVD for now, so it won't work on really big data or sparse arrays, but it's another TBD for me. Like I said, I haven't even looked at the data, so I have no idea what could be of use, but I know flann is a good NN library.
best regards,
-SL
See the Materials section below for more information on this FLANN library.
Competition Basics
The competition supplies two major sets of data: a training set and a test set. The training set is a .CSV file of 250,000 rows of 30 numbers each, representing measurements taken from the ATLAS experiments. This set has known "answers": a vector of 250,000 labels distinguishing measurements that represent a signal from those representing noise. The other set, the test set, is 550,000 rows of 30 values with unknown answers. The point of the competition is to use the training set to train algorithms that can distinguish signal from noise in the test set.
As of this NYCJUG meeting, I had J code only for the most basic operations: reading in the data and building a proposed set of signal-versus-noise answers for the test set in a form suitable for submission to the Kaggle site for evaluation. Submissions are evaluated as a single number, the "AMS statistic", which scores the correct answers against the false positives and false negatives.
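For reference, the metric as defined in the competition documentation is

   AMS = \sqrt{2\left((s + b + b_r)\,\ln\left(1 + \frac{s}{b + b_r}\right) - s\right)}

where s and b are the weighted counts of selected signal and background events and b_r = 10 is a regularization term; the last line of the calcAMS verb below is a direct transcription of this expression.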
This code is shown below, followed by a sample session in which it is used to build an entry for submission to the contest.
Initial J Code for Kaggle "Higgs Boson" Competition
require 'tables/dsv'

NB.* getTrainingData: break apart .csv file of training data into useful
NB. arrays.
getTrainingData=: 3 : 0
   'trntit trn'=. split (',';'') readdsv y   NB. Split title row from data
   trn=. }:"1 trn [ lbls=. ;_1{"1 trn        NB. Separate character labels from numbers
   trn=. }."1 trn [ evid=. 0{"1 trn          NB. Pull off event IDs as vec
   trn=. ".&>trn                             NB. Data as simple numeric mat
   trn=. }:"1 trn [ wts=. _1{"1 trn          NB. Pull off weights column as vec
   trntit;lbls;wts;evid;<trn
NB.EG 'trntit lbls wts evid trn'=. getTrainingData 'Data/training.csv'
)

NB.* getTestData: read .csv file of test data into useful arrays.
getTestData=: 3 : 0
   'tsttit tst'=. split (',';'') readdsv y   NB. Split title row from data
   tst=. }."1 tst [ evidt=. 0{"1 tst         NB. Pull off event IDs as vec
   tst=. ".&>tst                             NB. Test data as simple numeric mat
   tsttit;evidt;<tst
NB.EG 'tsttit evidt tst'=. getTestData 'Data/test.csv'
)

NB.* calcAMS: calculate AMS metric based on R code in AMSscore.R.
calcAMS=: 3 : 0
   'wts actual guesses'=. y          NB. Weights vector, actual values, predicted values.
   NB. Sum signal weights according to predicted label 's' or 'b'.
   's b'=. guesses +//. wts*actual='s'
   's b'=. ('s'~:{.guesses)|.s;b     NB. Correct order from guess order
   br=. 10                           NB. Regularization term to reduce variance of measure.
   %:+:s-~(s+b+br)*^.>:s%b+br
)
Example Code Usage
Example J session showing how to use the above code to create a submission file.
   NB. 1!:44 pp=. {path to code}
   'trntit lbls wts evid trn'=. getTrainingData 'Data/training.csv'
   'tsttit evidt tst'=. getTestData 'Data/test.csv'

   NB. Initial attempt: simple regression.
   coeffs=. (lbls='s')%.trn                 NB. Regress s=1 on factors
   ests=. trn+/ . * coeffs                  NB. Estimates based on regression
   's'+/ . = lbls                           NB. Number of signals
85667
   lbls #/. lbls                            NB. # 's' vs. 'b'
85667 164333
   ('s'={.lbls)|.lbls #/. lbls              NB. Put 'b' # first
164333 85667
   threshold=. (/:~ests){~-'s'+/ . = lbls   NB. Guess threshold for correct # of each signal
   (ests>:threshold) #/. ests               NB. Verify correct number 'b' vs. 's'
164333 85667
   guess1=. 'bs'{~ests>:threshold           NB. Estimates to labels: 's' signal, 'b' background.
   calcAMS wts;lbls;guess1                  NB. Measure of goodness (higher is better)
19.8468
   NB. Build submission file.
   submission=. ('EventId';'RankOrder';'Class'),(":&.>350000+i.#ests),.(":&.>>:/:/:ests),.<"0 guess1
   submission writedsv 'Data/NEJedi0.csv';',';''
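One step is not spelled out in the transcript above: the submission file numbers events from 350000, i.e. the test set, so the estimates fed into it would presumably be recomputed from the test matrix rather than the training matrix, along the lines of the following (my reading, not shown in the original session):

   ests=. tst +/ . * coeffs                 NB. Apply the training-fit regression coefficients to the test features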
Show-and-tell
We discussed the J conference this year, including my own contribution here as well as some others hosted at the Journal of J.
Advanced topics
Elaborating on techniques relevant to the Kaggle competition, we looked at implementations of nearest-neighbor algorithms, specifically k-nearest neighbors (k-NN). There is a good comparison of implementations in OCaml and F# here, as well as one in the Rust language.
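To make the method concrete, here is a minimal brute-force k-NN sketch in J. It was not shown at the meeting; the verb name, the choice of squared Euclidean distance and the simple majority vote are my own assumptions, and the full distance table restricts it to subsets of the data that fit in memory (for the full 550,000-row test set, the FLANN hooks Locklin mentions above are the more realistic route). Per-feature weights of the kind he suggests can be folded in by scaling the columns of both matrices before calling the verb.

NB.* knnClassify: brute-force k-nearest-neighbor classifier (sketch).
NB. x: training data matrix linked with its character label vector ('s'/'b').
NB. y: k (number of neighbors) linked with a test data matrix.
knnClassify=: 4 : 0
   'data labels'=. x
   'k test'=. y
   dists=. +/"1 *: test -"1/ data   NB. Squared Euclidean distances: test rows vs. training rows
   near=. k {."1 /:"1 dists         NB. Indices of the k nearest training rows for each test row
   votes=. +/"1 's' = near{labels   NB. How many of the k neighbors are labeled signal
   'bs' {~ votes > -:k              NB. 's' on a strict majority, otherwise 'b'
NB.EG guesses=. ((10000{.trn);10000{.lbls) knnClassify 5;100{.tst
)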
Learning and Teaching J
There seems to be increasing discontent with the massiveness of common programming environments, judging by the responses to Andrew Binstock's essay in Dr. Dobb's, called Getting Back to Coding. He starts by noting:
Last week's editorial (Just Let Me Code!) went viral and received literally hundreds of comments.... ...readers were very familiar with the frustration I expressed of not being able to code as much as I like because I'm spending time taming the tools that support my work: the IDE, the VCS, the defect tracker, test tools, and especially the build system. ... My conclusion was that this tool-level complexity (as opposed to the complexity of the code itself) had turned programming into an operational activity rather than a creative one.
This frustration is familiar to many of us in the J community, oriented as we are to a succinct notation that lets us get quickly to the heart of a problem. It is particularly noticeable in cases where, as in the Advanced Topics section above, we have supposedly working implementations of algorithms we would like to render in J but must first wade through the unnecessary complexity and overhead imposed by other, less fortunate languages. The problem is compounded if one attempts to validate such code by running it: the time spent simply getting someone else's algorithm to compile and run is seldom trivial.
This shortcoming is all the more poignant when we consider the points raised in this comment from a Slashdot discussion on age discrimination in the tech industry - Software fails the test of time - in which the author attests to how little progress software as such has made over the past 45 or so years. This rings especially true to those of us from the APL world, where we feel like we've been faint voices in the wilderness, our good ideas widely ignored for decades now. It reminds me of the quote from Howard Aiken:
Don't worry about people stealing your ideas. If your ideas are any good, you'll have to ram them down people's throats.