NYCJUG/2006-11-14
Beginner's regatta
The full rules for the Netflix Challenge cover about 10 pages, so we will not go into detail on them, but anyone interested can look here. Initially we are most concerned with the size of the datasets and what they look like.
Netflix Challenge Dataset Overview
Data Item | # Records |
---|---|
Movies | 17,770 |
Customers | 480,189 |
Training records | 100,480,507 |
Probe records | 1,408,395 |
Graphical View of Datasets
We look at plots of statistics about the datasets.
Year of Movie
Here we see the number of movies in the database by year.
Rating per Movie
Here is the distribution of the ratings by movie. The axis is on a log scale, but the lower axis translates the logs back to the original values.
Rating per User
Here is the distribution of the ratings by movie viewer. Again the axis is on a natural-log (ln) scale, but the green numbers above it give the mapping back to the original values.
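As a minimal sketch of the kind of counts behind these two plots, the following uses made-up movie ids (not the real data) and current J syntax. The names `mv`, `ids`, `cnt` and `lncnt` are hypothetical and chosen only for this illustration; the same steps would apply to customer ids.

```j
NB. Toy data (made up): one movie id per rating record.
mv=: 3 1 3 2 3 1 3

ids=: ~. mv          NB. distinct movie ids, in order of first appearance
cnt=: #/.~ mv        NB. number of ratings for each distinct id
ids ,. cnt           NB. two-column table: id, count

lncnt=: ^. cnt       NB. natural log of the counts, as used for a log-scale axis
^ lncnt              NB. exponentiating maps the logs back to the original counts
```

Key (`/.`) groups the records by id, so `#/.~` counts the records in each group without any explicit loop.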
RMSE Details of Some Baseline Forecasts
Here we compare the RMSE (root mean squared error) of some simple forecasting methods based on weighted averages of the ratings by customer and by movie.
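A small sketch of the calculation, in current J syntax and with made-up numbers, may help. The verb `blend` and the nouns `actual`, `cavg`, `mavg` and `guess` are hypothetical names for this illustration only; the 0.65 customer weight mirrors the 65%/35% split used by `baseC65M35AvgsRate` in the code later on this page.

```j
NB. rmse: root mean squared error between two equal-length rating series.
rmse=: 4 : '%: (+/ *: x - y) % # y'

NB. blend: weighted average of customer and movie mean ratings (hypothetical helper).
NB. x is the weight on the customer average; the movie average gets 1-x.
NB. y is a 2-row table: customer averages atop movie averages.
blend=: 4 : '(x * 0 { y) + (1 - x) * 1 { y'

actual=: 4 3 5 2              NB. made-up true ratings
cavg=: 3.5 3 4.5 2.5          NB. made-up mean rating of each customer
mavg=: 4 3.5 4 3              NB. made-up mean rating of each movie

guess=: 0.65 blend cavg ,: mavg
actual rmse guess             NB. lower is better
```

Changing the left argument of `blend` to 1 or 0 gives the pure customer-average and pure movie-average forecasts, so the same two verbs cover all of the simple baselines compared here.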
Data Handling Code
Here is an attempt to process data more quickly by implementing a simple caching scheme using variables written to file.
This code is obsolete for a couple of reasons. One reason is that it was written for an older version of J in which the default right and left arguments, x and y, were written with a final dot, or as x. and y.. The other reason is that this dataset was large enough in 2006 that it was necessary to break it into pieces in order to fit it into memory. With contemporary machines, ca. 2022, the data used in pieces can be combined into a single array for each set of data.
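To illustrate the first point, here is a sketch of how one small verb from the code on this page, the date converter `cvtNDt2`, would be respelled for a current J system. The old definition is shown only as a comment; the explicit `3 : '...'` form is used here in place of the original `13 :` form.

```j
NB. Old spelling, with dotted argument names, as it appears on this page:
NB.   cvtNDt2=: 13 : '".y.-.''-'''

NB. The same verb with the current argument name y:
cvtNDt2=: 3 : '". y -. ''-'''

cvtNDt2 '2005-09-06'     NB. 'YYYY-MM-DD' text to the integer YYYYMMDD
```

The body is otherwise unchanged: remove the hyphens from the text, then convert what is left to a number.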
It is important to remember that the NC dataset was a bit large for most home PCs of the time, which had perhaps one or two gigabytes of RAM. This is why we were forced to treat it in pieces.
Since the data was too large to keep in memory all at once, we try caching it by file name, where the files are arbitrarily large chunks of the data sorted either by movie or by customer. Each file holds a J variable that is instantiated in the current namespace using the WS code found here. These variables are pieces of a larger array, split up to keep down memory usage.
NB.* fileBlocks.ijs: control access to file by caching and pre-processing blocks of text.

coclass 'FB'
coinsert 'fldir' [ load 'filefns'
coinsert 'base'

NB.* FWPI: global file name, value cache.
NB.* BLKSZ: arbitrary amount of text to read in.
NB.* rmFileEntry: remove entry from global FWPI by file name.
NB.* getNextFileBlock: get next movie: numbers data block for named file.
NB.* readNewBlocks: pre-process and cache text to numbers, feeding next datum.
NB.* txt2MN2: boxed chars 'movie num:';'num1';'nums'... to numeric mvnum;nums

BLKSZ=: 16000    NB.* BLKSZ: arbitrary amount of text to read in.

NB.* rmFileEntry: remove entry from global FWPI by file name.
rmFileEntry=: 3 : 0
   y.=. boxopen y.
   if. nameExists 'FWPI' do.
       if. y. e. 0{FWPI do.    NB. If we know about this file
           wh=. y. i.~0{FWPI
           FWPI=: (<<<wh){"1 FWPI    NB. Remove entry
       end.
   end.
   1    NB. Always succeed if completed
)

NB.* getNextFileBlock: get next movie: numbers data block for named file.
getNextFileBlock=: 3 : 0
NB.   flnm=. dd,'probeRatings.txt'
   y.=. boxopen y.
   if. -.nameExists 'FWPI' do.    NB. File Blocks
       FWPI=: 4 1$(3{.y.),<0    NB. File name, Whole, Partial blocks, Index
   end.
   if. y. e. 0{FWPI do.    NB. If we know about this file
       wh=. y. i.~0{FWPI
       cblk=. >1{wh{"1 FWPI    NB. Get complete blocks
       if. 0=#cblk do.    NB. if any or,
           cblk=. readNewBlocks wh;y.;BLKSZ
       end.
   else.    NB. Start tracking new file.
       wh=. 1{$FWPI    NB. New entry goes at end.
       FWPI=: FWPI,.(3{.y.),<0
       cblk =. readNewBlocks wh;y.;BLKSZ
   end.
   FWPI=: (<}.cblk) (<1,wh)}FWPI
   if. 0<#cblk do. ret=. 0{cblk else. '' end.
NB.EG prbmu=. getNextFileBlock_FB_ 'C:\Netflix\probe.txt'
NB.EG prbrtg=. getNextFileBlock_FB_ 'C:\Netflix\probeRatings.txt'
)

t1_getNextFileBlock_test_=: 3 : 0
   BLKSZ=: 12 [ svbs=. BLKSZ
   tmpfl=. 'gNFB.tst',~getTempDir ''
   tst=. '1:',LF,'111',LF,'2:',LF,'211',LF,'222',LF,'3:',LF,'311',LF
   tst=. tst,'322',LF,'333',LF
   tst fwrite <tmpfl
   assert. (1;,111)-:getNextFileBlock tmpfl
   assert. (2;,211 222)-:getNextFileBlock tmpfl
   assert. (3;,311 322 333)-:getNextFileBlock tmpfl
   BLKSZ=: svbs
)

NB.* readNewBlocks: pre-process and cache text to numbers, feeding next datum.
readNewBlocks=: 3 : 0
   'wh flnm bs'=. y.
   flnm=. openbox flnm
   fsz=. fsize >0{wh{"1 FWPI    NB. Account for file size to
   st=. >3{wh{"1 FWPI    NB. Start from last point
   bs=. bs<.fsz-st    NB. Reduce blocksize if necessary.
   fblk=. 0 [ cblk=. ''    NB. Filled block flag; default empty block
   eofl=. 0=bs    NB. End-of-file flag
   while. -.eofl+.fblk do.    NB. End of file or full block?
       pt=. >2{wh{"1 FWPI    NB. Use any left-over partial text.
       blk=. 1!:11]flnm;st,bs    NB. Read from file.
       lfl=. blk i: LF    NB. Last Full Line
       st=. lfl+st    NB. New starting point in file at line
       FWPI=: (<st) (<3,wh)}FWPI
       if. st>:<:fsize flnm do.    NB. End-of-file or near end
           if. LF~:{:blk do. lfl=. >:lfl end.
           blk=. pt,lfl{.blk    NB. Partial text and all remaining
           FWPI=: (<'') (<2,wh)}FWPI    NB. No more partial text
           FWPI=: (<fsize flnm) (<3,wh)}FWPI    NB. At end-of-file
           eofl=. 1
       else.
           blk=. pt,lfl{.blk    NB. Use partial text from previous
           eofb=. LF i:~blk{.~blk i: ':'    NB. Find end of full block
           FWPI=: (<eofb}.blk) (<2,wh)}FWPI    NB. Save remaining partial
           blk=. eofb{.blk
       end.
       blk=. blk-.CR,' '
       mvrt=. <;._1 (LF#~LF~:{.blk),blk    NB. each line->box
       cblk=. txt2MN2&>(':' e.&>mvrt)<;.1 mvrt    NB. Num,':' starts new block.
       fblk=. 0<#cblk
   end.
   cblk
NB.EG mvnums=. readNewBlocks 1;'C:\Netflix\probeRatings.txt';16000
)

NB.* txt2MN2: boxed chars 'movie num:';'num1';'nums'... to numeric mvnum;nums
txt2MN2=: 3 : 0
   y.=. y-.&.><' :',CR
   split ,".&>(0{y.) 0}y.
NB.EG txt2MN2 '1:';'4';'3';'5';'5'
)

coclass 'base'
Show-and-tell
Here we look at two approaches for manipulating the large datasets for the Netflix challenge. First we will look at what John Randall came up with.
John's Preliminary Code for the Netflix Challenge
Here is what John has so far. He starts out with some documentation on how to use the code. You can see that he addresses the difficulty of keeping these datasets in memory by using memory-mapped files.
NB.* JRNflixChallenge.ijs: John Randall's Netflix Challenge data manipulation

0 : 0
Initial attempt at loading the data.  The lines

loadtraining NMOVIES
loadprobe ''

will load the training and probe data.  There are also
some utilities for dealing with mapped files.

The data files are written to the directory JDATA, and are as follows.

m,c,r: lists of the movie, customer and ratings in the training set.
       Date is currently ignored.
mri:   Since the movies in the training set are in order, this gives
       run starts and lengths.
pi:    an integer list, representing the probe data, where i{pi is the
       index of the ith probe datum in the training data.
)
Next he defines some basic globals.
9!:7 '+++++++++|-'    NB. Set box-drawing characters to ASCII
ts=: 6!:2, 7!:2@]    NB. Report time and space used by expression
require 'files jmf'
Then he sets up globals to define locations of the datasets.
NB. Netflix download file paths
setJRPaths=: 3 : 0
DOWNLOAD=: ('/',PATHSEP_j_) replace y.
QUAL=:DOWNLOAD,'qualifying.txt'
PROBE=:DOWNLOAD,'probe.txt'
TRAINING=:DOWNLOAD,'training_set',PATHSEP_j_
NB.EG setJRPaths '/home/john/netflix/download/'
)

setJRPaths '\Data\Netflix\'
He shows us some statistics and sets up utilities for memory-mapped files.
NB. Data file statistics
NMOVIES=:17770    NB. # of movies
NCUST=:480189    NB. # of customers
NTRAINING=:100480507    NB. # of records in training db
NPROBE=:1408395    NB. # of records in probe db

NB. Directory root where mapped files are stored.
JDATA=: DOWNLOAD,'jdata',PATHSEP_j_

NB. Utilities for mapped files.
NB. All refer to JDATA directory

NB. map 'name': read only access.
NB. This is our default access, unlike the jmf conventions.
map=:[: map_jmf_ ] ; ('';1) ;~ JDATA , ]
unmap_z_=:unmap_jmf_
unmapall_z_=:unmapall_jmf_

NB. <array> writejmf <filename>
NB. This is the only way we create mapped files at the user level.
NB. They are subsequently opened read-only.
writejmf=:4 : 0
data=.x
fn=.JDATA,y
size=.4* */$ data
createjmf_jmf_ fn;size
map_jmf_ 'JMF';fn
JMF=:0 $ ~ 0,}. $ data
JMF=:data
unmap_jmf_ 'JMF'
)

NB. Array access in mapped file: from={
NB. <indices> from <mapped file name>
NB. Written with variable JMF to avoid collisions.
from=:4 : 0
map_jmf_ 'JMF';(JDATA,y);'';1
r=.x { JMF
unmap 'JMF'
r
)
He defines some utilities for run-length encoding and loading data.
NB. runs y
NB. Utility for indices in run length encoding.
NB. If y is a numeric list, runs y
NB. returns pos,.len
NB. where i{pos is the beginning of the ith run
NB. and i{len is its length.
NB. The identity y-:len # pos { y should hold
runs=:(}:,.2&(-~/\))@:I.@:(1,1,~ 2&(~:/\))

NB. x lz y: pad y with leading zeros to size x
lz=:13 : '((x-#y)#''0''),y'

NB. tfn y gives full file name for movie y
tfn=:13 : 'TRAINING,''mv_'',(7 lz ": y),''.txt'''

NB. training y returns table m,c,r for movie y
NB. At present, we are ignoring dates.
training=:13 : 'y,"0 1 ".&> 0 2{"1 ;: }. ''m'' freads tfn y'
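To see what `runs` returns, it can be tried on a short list. The following sketch repeats John's definition so that it is self-contained and applies it to made-up movie ids; the names `m`, `rl`, `pos` and `len` are hypothetical and used only for this illustration.

```j
NB. Definition repeated from the listing above.
runs=: (}: ,. 2&(-~/\))@:I.@:(1 , 1 ,~ 2&(~:/\))

m=: 5 5 5 7 7 9            NB. made-up movie ids, already in order
rl=: runs m                NB. one row per run: start position, length
'pos len'=: |: rl          NB. split the table into its two columns
m -: len # pos { m         NB. the identity stated in the comments
```

For this list `rl` has the rows `0 3`, `3 2` and `5 1`: the run of 5s starts at index 0 and has length 3, and so on. This is the structure stored in the `mri` file, which lets the records for one movie be located without scanning the whole training list.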
Finally, he defines the data loaders.
NB. loadtraining y creates training data for first y movies.
NB. loadtraining NMOVIES creates full set of training data.
NB. Arrays are assembled in memory, then written.
NB. Could be faster, but only called once.
loadtraining=:3 : 0
m=.0$0
c=.0$0
r=.0$0
for_i. >:i.y do.
  if. 0=100|i do. smoutput i end.
  t=.training i
  m=.m,{."1 t
  c=.c,1{"1 t
  r=.r,{:"1 t
end.
m writejmf 'm'    NB. movie data
c writejmf 'c'    NB. customer data
r writejmf 'r'    NB. ratings data
NB. movie run length index
map 'm'
(runs m) writejmf 'mri'
unmap 'm'
)

NB. Load probe data: loadprobe ''
loadprobe=:3 : 0
probe=.<;._2 freads PROBE
start=.(':'={:)&> probe
c=.0&".&>&.> start <;._1 probe
m=.(".@:}:)&>start # probe
p=.;m (4 : 'x&,"0 &.>y')"0 c
NB. p writejmf JDATA,'p'
M=.{."1 p
C=.{:"1 p
mfret=.1,2&(~:/\) M
NB. assert. (mfret#M)-:(~.M)
mnub=.mfret#M
cbox=.mfret <;.1 C
NB. Now use loop to minimize memory
r=.0$0
for_i. i.#mnub do.
  m=.i{mnub
  indices=.({.+i.@{:)(<:m) from 'mri'
  r=.r, indices {~ (indices from 'c')i. >i { cbox
end.
r writejmf 'pi'
)
Devon's Preliminary Code
This code is largely concerned with pulling the information out of the text files in the Netflix Challenge dataset to create useful variables in J. It uses the WS functions, as found here, to break large datasets into sequentially numbered variables - like murd0 to murd99 - which are kept on file and loaded as needed.
This code, like the fileBlocks.ijs code above, is obsolete for the same reasons.
With those caveats, here is the old code for your enjoyment; much of the explanation about the data is still valid.
NB.* mungeNetflixData.ijs: get Netflix data files into J vars.

TheUsual=: 0 : 0
load (dd=. 'C:\Data\Netflix\'),'mungeNetflixData.ijs'
load dd,'mungeNetflixData.ijs'
)

NB.* VDIR: variables-on-file directory
NB.* UVN: User var names
NB.* MVN: Movie var names
NB.* NUMITEMS: number of data items (MURD as below).
NB.* rmse: root-mean squared error: fit criterion between 2 ratings series.
NB.* baseCMAvgsRate: base case rater: use mean movie & cust rating as guess.
NB.* baseCAvgRate: base case rater: use mean cust rating regardless of movie.
NB.* baseMAvgRate: base case rater: use mean movie rating regardless of cust.
NB.* makeRatingsFile: apply a rating function (movie rate custs)->ratings->file
NB.* buildFVIndex: build index into vars on file.
NB.* putNextFileBlock: put next movie: numbers data block at end of named file.
NB.* startRRMCluster: start clustering users based on rarely-rated movies.
NB.* rarestRated: find movies with fewest ratings.
NB.* getMovieRecs: get all Movie, User, Rating, Date records for movie y..
NB.* getUserRecs: get all Movie, User, Rating, Date records for user y..
NB.* onemore: or bool w/next bit
NB.* oneless: or bool w/prev bit
NB.* updateVarInfo: replace each (filed) var named after applying function.
NB.* txt2MN2: text of "movie: num1 num2..." file->numeric (movie num);nums.
NB.* bucketBreakpoints: set values at which to break up a range into equal-sized
NB.* varUserRatings: find variance of ratings for each user: assume uAvgs global
NB.* varMvRatings: find variance of ratings for each movie: assume mAvgs global
NB.* meanUserRatings: find mean ratings for each user: total and count.
NB.* meanMovieRatings: find mean ratings per movie: total, count, std dev.
NB.* createProbeRatingFile: read file of movie: users -> movie: scores
NB.* writeProbeRatingFile: write ratings file given MURD var.
NB.* getProbeFile: get file (movie1: users; movie2: users...)->movies<;.1 users
NB.* selProbeScores: given MURD var, look up movie, user pairs in global MVUS.
NB.* userRatings: get user ratings from 1 var.
NB.* ordVars: order variables and re-save.
NB.* userStats: statistics on user ratings.
NB.* reordM2UFiles: re-order MURD (Movie, User, Rating, Date) files by User (or other row).
NB.* initUserKeyedFiles: initialize user-keyed files for reordM2UFiles.
NB.* loadMMF: attempt load of memory-mapped file: runs out of room>34e6 items.
NB.* getVarInfo: apply arbitrary function to each (filed) var named.
NB.* output1Piece: put out one piece of normalize (DB-formed) data: var "MURD".
NB.* normizeTSData: normalize training set data->Movie num, User, Rating, Date.
NB.* varizeTSFiles: convert training set files into J vars and save to file.
NB.* howbig: Size of variable given its name.
NB.* mungeTrainingSet: convert a training set text file into vars.
NB.* mungeMovieTitles: put movie titles file into an array.

coinsert 'fldir' [ load 'filefns'
load 'jmf stats files dir mystats'
dd=: 'C:\Data\Netflix\'
load dd,'fileBlocks.ijs'

NB.* "MURD[n]" variables are Movie, User, Rating, Date (all ints) tuplets as
NB. 4-row mats broken into (about 100) usefully small pieces. These vars
NB. are saved in files using the "WS" namespace function "fileVar"; they are
NB. retrieved using "unfileVar". "umurd[n]" variables are the same info but
NB. grouped primarily by user whereas "MURD[n]" are grouped by movie.

VDIR=: '\Data\Netflix\VarsDir\'    NB.* VDIR: variables-on-file directory
NUMITEMS=: 100480507    NB.* NUMITEMS: number of data items (MURD as below).

NB. "UVN" and "MVN" are namelists of "Movie, User, Rating, Date" vars broken
NB. into <:16MB vars on file. Both var sets have same data but grouped by
NB. user or movie.
UVN=: flpfx&.>tolower&.>{."1 jfi dir VDIR,'umurd*.dat'    NB.*UVN: User var names
nn=. ".&>UVN#~&.>UVN e.&.><DIGITS
UVN=: UVN{~/:nn    NB. Put in numeric order.
MVN=: flpfx&.>tolower&.>{."1 jfi dir VDIR,'murd*.dat'    NB.*MVN: Movie var names
nn=. ".&>MVN#~&.>MVN e.&.><DIGITS
MVN=: MVN{~/:nn    NB. Put in numeric order.

NB. Get vars w/unique user ids, movie ids, start/stop MURD vars (UVN and MVN).
(<VDIR) unfileVar_WS_&.>'UMVIDS';'UUIDS';'IXMV';'IXUV'

NB.* rmse: root-mean squared error: fit criterion between 2 ratings series.
rmse=: 4 : '%:(#y.)%~+/*:x.-y.'

NB.* baseC65M35AvgsRate: base case rater: rate using 35% movie & 65% cust rating.
baseC65M35AvgsRate=: 4 : 0
NB. Ensure these four vars exist before running this for performance.
NB.   if. -.nameExists 'MWT' do. MWT=: 0.35 end.
NB.   if. -.nameExists 'CWT' do. CWT=: 0.65 end.
NB.   if. -.nameExists 'UMS' do. VDIR unfileVar_WS_ 'UMS' end.
NB.   if. -.nameExists 'MMS' do. VDIR unfileVar_WS_ 'MMS' end.
   cavgs=. (0{UMS){~UUIDS i. y.    NB. Average user rating
   mavgs=. (#y.)$(0{MMS){~UMVIDS i. x.    NB. Average movie rating
   (CWT*cavgs)+MWT*mavgs    NB. Weighted average of 2.
)

NB.* baseCMAvgsRate: base case rater: use mean movie & cust rating as guess.
baseCMAvgsRate=: 4 : 0
NB. Ensure these two exist before running this for performance.
NB.   if. -.nameExists 'UMS' do. VDIR unfileVar_WS_ 'UMS' end.
NB.   if. -.nameExists 'MMS' do. VDIR unfileVar_WS_ 'MMS' end.
   cavgs=. (0{UMS){~UUIDS i. y.    NB. Average user rating
   mavgs=. (#y.)$(0{MMS){~UMVIDS i. x.    NB. Average movie rating
   -:cavgs+mavgs    NB. Simple average of 2.
)

NB.* baseCAvgRate: base case rater: use mean cust rating regardless of movie.
baseCAvgRate=: 4 : 0
   if. -.nameExists 'UMS' do. VDIR unfileVar_WS_ 'UMS' end.
   (0{UMS){~UUIDS i. y.    NB. Guess is average user rating.
)

NB.* baseMAvgRate: base case rater: use mean movie rating regardless of cust.
baseMAvgRate=: 4 : 0
   if. -.nameExists 'MMS' do. VDIR unfileVar_WS_ 'MMS' end.
   (#y.)$(0{MMS){~UMVIDS i. x.    NB. Guess is average movie rating.
)

NB.* makeRatingsFile: apply a rating function (movie rate custs)->ratings->file
makeRatingsFile=: 1 : 0
   'infl outfl dtflg'=. 3{.y.,<'0'    NB. Some files have a date to be removed.
   recctr=. 0
   rmFileEntry_FB_ infl
   while. 0~:#fb=. getNextFileBlock_FB_ infl do.
       'mv usrs'=. fb
       if. dtflg do. usrs=. usrs#~0 1$~$usrs end.
       ratings=. 0.1 roundNums mv u. usrs
       (mv,ratings) putNextFileBlock outfl
       recctr=. >:recctr
   end.
   recctr
NB.EG rc=. baseCMAvgsRate makeRatingsFile (dd,'probe.txt');(dd,'rCMp.txt');0
)

NB.* buildFVIndex: build index into vars on file.
buildFVIndex=: 3 : 0
   stend=. ({."1,.{:"1) getVarInfo&>y.
   if. *./rc=. >{."1 stend do.
       3{.>,.&.>/(i.#stend),&.>1{"1 stend
   else. (-.rc)#1{"1 stend end.
NB.EG buildFVIndex (<VDIR);&.>MVN
)

NB.* putNextFileBlock: put next movie: numbers data block at end of named file.
putNextFileBlock=: 4 : 0
   (;LF,~&.>(':',~":{.x.);":&.>}.x.) fappend y.
NB.EG putNextFileBlock_FB_
)

NB.* makeProbeAnswerFile: given probe file (movie: customers...) return ratings
makeProbeAnswerFile=: 3 : 0
   'infl outfl'=. y.
   rmFileEntry_FB_ infl [ ferase outfl    NB. Start at beginning
   recctr=. 0
   while. 0~:#fb=. getNextFileBlock_FB_ infl do.
       'mv usrs'=. fb
       'rc mu'=. getMovieRecs mv
       (mv,(usrs i.~ 1{mu){2{mu) putNextFileBlock outfl
       recctr=. >:recctr
   end.
   recctr
)

NB.* startRRMCluster: start clustering users based on rarely-rated movies.
startRRMCluster=: 3 : 0
NB. MMR: Mean movie ratings: 0{sum of ratings; 1{# ratings; 2{std dev ratings.
   if. -.nameExists 'MMR' do. VDIR unfileVar_WS_ 'MMR' end.
   'n2r outpfx'=. y.
   for_ii. i.#UVN do.
       VDIR unfileVar_WS_ vn=. 'umurd',":ii
       outvn=. outpfx,":ii
       (outvn)=: n2r rarestRated ".vn
       VDIR fileVar_WS_ outvn
       4!:55&><&.>vn;outvn
   end.
NB. startRRMCluster 20;'rr20mc'
)

NB.* rarestRated: find movies with fewest ratings.
rarestRated=: 4 : 0
   uptn=. 1,2~:/\1{y.    NB. y. is murd var sorted by 1{
   mrbu=. uptn<;.1 ] |:0 2{y.    NB. Movie & rating by user
NB. movie popularity: low count means movie rarely rated.
   mp=. ((<UMVIDS)i.&.>0{"1&.>mrbu){&.><1{MMR
   nm=. x.<.#&>mrbu    NB. x <. Number of movies rated by user
   lrix=. nm{.&.>/:&.>mp    NB. Indexes of some of the most rarely rated.
   (<"0 uptn#1{y.),.|:&.>lrix{&.>mrbu
)

NB.* getMovieRecs: get all Movie, User, Rating, Date records for movie y..
getMovieRecs=: 3 : 0
   if. y. e. UMVIDS do.
       vn=. (0 i.~y.>1{IXMV){0{IXMV
       VDIR unfileVar_WS_ varnm=. 'murd',":vn
       vv=. ".varnm
       4!:55 <varnm
       recs=. 1;vv#"1~y.=0{vv    NB. Succeed
   else. recs=: 0;'Movie ID ',(":y.),' not found.' end.    NB. Fail
   recs
NB.EG recs=. getMovieRecs 1234    NB. ret code;4-row MURD mat.
)

NB.* getUserRecs: get all Movie, User, Rating, Date records for user y..
getUserRecs=: 3 : 0
   if. y. e. UUIDS do.
       vn=. (0 i.~y.>2{IXUV){0{IXUV
       VDIR unfileVar_WS_ varnm=. 'umurd',":vn
       vv=. ".varnm
       4!:55 <varnm
       recs=. 1;vv#"1~y.=1{vv    NB. Succeed
   else. recs=: 0;'User ID ',(":y.),' not found.' end.    NB. Fail
   recs
NB.EG recs=. getUserRecs 2451718    NB. ret code;4-row MURD mat.
)

NB. onemore=: 13 : 'y.+.0,}:y.'    NB.* onemore: or bool w/next bit
NB. oneless=: 13 : 'y.+.}.y.,0'    NB.* oneless: or bool w/prev bit

rebuildUMURDfromURD=: 3 : 0
   dd=. y.
   maxi=. 3-~2^20    NB. Max items (cols) per output var: more would
   cumval=. 4 0$0    NB. ->var size >16M; next size is 33M.
   outctr=. _1    NB. Output file number: increment before each use.
   for_ii. UMVIDS do.    NB. For each unique movie id #...
       varnm=. ' '-.~'urd',":ii
       rc=. dd unfileVar_WS_ varnm
       cumval=. cumval,.ii,tmpval=. ".varnm    NB. Movie atop User, Rating, Date
       4!:55 <varnm
       if. maxi<1{$cumval do.
           cumval=. (-1{$tmpval)}."1 cumval    NB. Drop piece that went over max
           outvar=. ' '-.~'umurd',":outctr=. >:outctr
           (outvar)=: cumval
           rc=. dd fileVar_WS_ outvar
           4!:55 <outvar
           cumval=. ii,tmpval    NB. Retain dropped piece for next var.
       end.
   end.
   if. 0<1{$cumval do.
       outvar=. ' '-.~'umurd',":outctr=. >:outctr
       (outvar)=: cumval
       rc=. dd fileVar_WS_ outvar
       4!:55 <outvar
   end.
   #UMVIDS
)

NB.* updateVarInfo: replace each (filed) var named after applying function.
updateVarInfo=: 1 : 0
   'dd varnm'=. y.    NB. Vars dir, var names.
   rc=. dd unfileVar_WS_ varnm
   if. >{.rc do.
       (varnm)=: u. ".varnm
       rc=. dd fileVar_WS_ varnm
       [4!:55 <varnm
   end.
   rc
NB. rcs=. sortUserVars updateVarInfo&>(<VDIR);&.>UVN
)

sortUserVars=: 3 : 0
   y.{"1~/:|:1 0{y.    NB. Sort by user, movie ID.
)

NB.* txt2MN2: text of "movie: num1 num2..." file->numeric (movie num);nums.
txt2MN2=: 3 : 0
   split ".&>(}:&.>' '-.~&.>0{y.) 0}y.
NB.EG txt2MN2 '1:';'4';'3';'5';'5';'4';'4';'4';'3';'4';'5';'4';'5';'4';'3'
)

numPerBucket=: 3 : 0
   'bpv frqs'=. y.
   +/(frqs </}.bpv,>:>./frqs)*.frqs >:/ bpv
)

NB.* bucketBreakpoints: set values at which to break up a range into equal-sized
NB. buckets as much as possible, given number of buckets and frequencies of elements.
bucketBreakpoints=: 3 : 0
   'nb frqs'=. y.    NB. Number of buckets, frequencies
   bp01=. nb%~i.nb
   bpi=. <.0.5+bp01*#frqs
   bpv=. bpi{/:~frqs    NB. Simple way
   bpi3=. |:0>.(#frqs)<.bpi+"(0 1)_1 0 1    NB. or average 3 values.
   bpv=. 3%~+/bpi3{/:~frqs
)

NB.* varUserRatings: find variance of ratings for each user: assume uAvgs global
varUserRatings=: 3 : 0
   if. -.nameExists 'VUR' do. VUR=: 0$~#UUIDS end.
   usix=. ~.allix=. UUIDS i. 1{y.
   usvar=. (1{y.) +//. *:(allix{uAvgs)-~2{y.    NB. Accum sum of sq of diffs
   #VUR=: (usvar) usix}VUR
NB.EG [varUserRatings&>(<'C:\Netflix\');&.>'uvar1';'uvar2' [ 4!:55 <'VUR'
)

NB.* varMvRatings: find variance of ratings for each movie: assume mAvgs global
varMvRatings=: 3 : 0
   if. -.nameExists 'VMR' do. VMR=: 0$~#UMVIDS end.
   mvix=. ~.allix=. UMVIDS i. 0{y.
   mvvar=. (0{y.) +//. *:(allix{mAvgs)-~2{y.    NB. Accum sum of sq of diffs
   #VMR=: (mvvar) mvix}VMR
NB.EG [varMvRatings&>getVarInfo(<'C:\Netflix\');&.>'uvar1';'uvar2' [ 4!:55 <'VMR'
)

NB.* meanUserRatings: find mean ratings for each user: total and count.
meanUserRatings=: 3 : 0
   if. -.nameExists 'MUR' do. MUR=: 0$~3,#UUIDS end.
   usrsum=. (1{y.) +//. 2{y.    NB. Accumulate sums and
   usrct=. (1{y.) #/. 1{y.    NB. count of items
   usrix=. UUIDS i. ~.1{y.
   usrsd=. (1{y.) stddev/. 2{y.    NB. std dev of ratings
   #MUR=: (usrsum,usrct,:usrsd) usrix}"1 MUR
NB.EG [meanUserRatings&>(<'C:\Netflix\');&.>'uvar1';'uvar2' [ 4!:55 <'MUR'
)

NB.* meanMovieRatings: find mean ratings per movie: total, count, std dev.
meanMovieRatings=: 3 : 0
   if. -.nameExists 'MMR' do. MMR=: 0$~3,#UMVIDS end.
   mvrsum=. (0{y.) +//. 2{y.    NB. Accumulate sums and
   mvrct=. (0{y.) #/. 0{y.    NB. count of items
   mvrsd=. (0{y.) stddev/. 2{y.    NB. std dev of ratings
   mvix=. UMVIDS i. ~.0{y.
   #MMR=: (mvrsum,mvrct,:mvrsd) mvix}"1 MMR
NB.EG meanMovieRatings getVarInfo&>(<VDIR);&.>MVN' [ 4!:55 <'MMR'
)

NB.* createProbeRatingFile: read file of movie: users -> movie: scores
createProbeRatingFile=: 3 : 0
   svprb=. >selProbeScores getVarInfo&.>(<VDIR);&.>MVN
NB. 20061024: confirm that result is in same order as input...
   svprb=. 1{"1 svprb
   svprb=. >,.&.>/svprb
   svprb=. svprb{"1~/:0{svprb
)

NB.* writeProbeRatingFile: write ratings file given MURD var.
writeProbeRatingFile=: 3 : 0
NB.   svprb=: createProbeRatingFile ''
   'flnm prb'=. y.
   mvptn=. 1,2~:/\0{prb
   mvnums=. ':',~&.>":&.>mvptn#0{prb
   ratings=. ":&.>2{prb
   ratings=. mvptn<;._1 ratings
   outfl=. ;(mvnums,&.>LF),&.>ratings,&>&.>LF
   outfl=. (,outfl)-.' '
   outfl fwrite flnm
NB. writeProbeRatingFile (dd,'probeRating.txt');<createProbeRatingFile ''
)

NB.* getProbeFile: get file (movie1: users; movie2: users...)->movies<;.1 users
getProbeFile=: 3 : 0
NB.   prb=: f2v dd,'probe.txt'
   prb=: f2v y.
   whmvid=. ':'e.&>prb
   mvids=. ".&>}:&.>whmvid#prb
   uipm=. ;&.>whmvid<;._1 ".&.>(-whmvid)}.&.>prb
   mvids;<uipm
)

NB.* selProbeScores: given MURD var, look up movie, user pairs in global MVUS.
selProbeScores=: 3 : 0
NB.   MVUS=: mvids+&>1e_7*&.>uipm    NB. See getProbeFile for "mvids" and "uipm"
   ykey=. 1 1e_7+/ . * 2{.y.    NB. Combine movie, user into floating #.
   selmu=. ykey e. MVUS
NB.   MVUS=: MVUS#~-.MVUS e. ykey    NB. Faster to avoid this data movement.
   selmu#"1 y.
)

NB. Not used - too space-wasteful to combine movie, user into complex number.
selectProbeRatings=: 3 : 0
   (prum e.~ 1 0j1 +/ . *2{.y.)#"1 y.
)

NB.* userRatings: get user ratings from 1 var.
userRatings=: 3 : 0
   gv=. /:1{y.
   y.=. gv{"1 y.
   ptn=. 1,2~:/\1{y.
   ratings=. ptn<;.1 ] 2{y.
)

NB.* ordVars: order variables and re-save.
ordVars=: 4 : 0
   rc=. x. unfileVar_WS_ y.
   if. >{.rc do.
       var=. ".y.
       (y.)=: (/:1{var){"1 var
       rc=. x. fileVar_WS_ y.
       if. >{.rc do. 4!:55 <y.
       else. smoutput 'Could not save ','.',~y. end.
   else. smoutput 'Could not read ','.',~y. end.
   rc
)
NB.   gv=. /:1{y.
NB.   y.=. gv{"1 y.

NB.* userStats: statistics on user ratings.
userStats=: 3 : 0
   ptn=. 1,2~:/\1{y.
   umdstats=. ((#,<./,>./)&>ptn<;.1 ] 2{y.),.(<./,>./)&>ptn<;.1 ] 3{y.
   ums=. (mean,stddev)&>ptn<;.1]2{y.
   umdstats;ums
)

NB.* reordM2UFiles: re-order MURD (Movie, User, Rating, Date) files by User (or other row).
reordM2UFiles=: 3 : 0
   'indir ovars ivars uids'=. y.    NB. Vars dir, out var names, in var
   uids=. /:~uids    NB. names, unique IDs to sort on.
   if. -.nameExists 'WR' do. WR=. 1 end.    NB. Sort on Which Row?
   ix=. <.0.5+(nf%~<:#uids)*>:i.nf=. #ovars    NB. Partition ids equally
   maxepp=. ix{uids    NB. Maximum Element Per Partition
   for_mvnm. ivars do.
       mvnm=. >mvnm    NB. Read all movie files for each
       rc=. indir unfileVar_WS_ mvnm    NB. group of users.
       if. >{.rc do.
           mv=. ".mvnm
           whptn=. +/maxepp </ WR{mv    NB. Partition index/item
           for_ptn. /:~~.whptn do.    NB. Do each user partition
               rc=. indir unfileVar_WS_ uvnm=. >ptn{ovars
               if. >0{rc do.
                   (uvnm)=: (".uvnm),.(I. ptn=whptn){"1 mv
                   if. >0{rc=. indir fileVar_WS_ uvnm do. 4!:55 <uvnm
                   else. smoutput rc,<'unfileVar_WS_ ',uvnm end.
               else. smoutput rc,<'fileVar_WS_ ',uvnm end.
           end.
       else. smoutput rc end.
       4!:55 <mvnm
   end.
)

NB.* initUserKeyedFiles: initialize user-keyed files for reordM2UFiles.
initUserKeyedFiles=: 3 : 0
   'dd flprfx nf'=. y.    NB. Dir, file (var) prefix, number of files to init.
   nms=. ' '-.~&.>(<flprfx),&.>":&.>i.nf
   ".&.>nms,&.><'=: 4 0$0'
   (<dd) fileVar_WS_&.>nms
   [4!:55&><"0 nms
   nms
NB.EG nnms=. initUserKeyedFiles VDIR;'nwMURD';100
)

NB.* loadMMF: attempt load of memory-mapped file: runs out of room>34e6 items.
loadMMF=: 3 : 0
   'indir fnm'=. y.
   fnm=. indir,fnm,'.mmf'
   createjmf_jmf_ fnm;16*NUMITEMS    NB. 16 = 4 ints/item * 4 bytes/int
   JINT map_jmf_ 'mmurds';fnm
   mmurds=: 0$~NUMITEMS,4
   flpfx=. 'MURD'
   fls=. {."1 jfi dir indir,flpfx,'*.DAT'
   strow=. 0    NB. NUMITEMS rows x 4 cols: start at row 0
   for_fl. fls do.
       fl=. >fl
       vflnm=. fl{.~fl i. '.'
       rc=. indir unfileVar_WS_ vflnm
       if. >{.rc do.
           len=. 1{$".vflnm
           mmurds=: (|:".vflnm) (strow+i.len)}mmurds
           strow=. strow+len
           [4!:55 <vflnm
       else. smoutput 'Failed to read var "','".',~vflnm end.
   end.
   fnm;<$mmurds
)
NB. unmapall_jmf_ '' [ unmap_jmf_ fnm

NB.* getVarInfo: apply arbitrary function to each (filed) var named.
getVarInfo=: 1 : 0
   'dd varnm'=. y.    NB. Vars dir, var names.
   rc=. dd unfileVar_WS_ varnm
   if. >{.rc do.
       rc=. 1;u. ".varnm
       [4!:55 <varnm
   end.
   rc
NB.EG ({."1,.{:"1) getVarInfo &.>(<'C:\data\');&.>'var1';'var2';'var3'
NB.EG dts=: (3&{) getVarInfo&.>(<VDIR);&.>MVN
)

NB.* output1Piece: put out one piece of normalize (DB-formed) data: var "MURD".
output1Piece=: 3 : 0
   'indir dat'=. y. [ flpfx=. 'MURD'
   fls=. {."1 jfi dir indir,flpfx,'*.DAT'
   fnums=. ".&>(#flpfx)}.&.>fls{.~&.>fls i.&.>'.'
   if. 0=#fnums do. fnum=. 0 else. fnum=. >:>./fnums end.
NB.   mnm=. 'MURD',' '-.~":<:<.flctr%blksz    NB. Number block
   mnm=. 'MURD',' '-.~":fnum    NB. Number block
   (mnm)=: dat
   if. >{.indir fileVar_WS_ mnm do.    NB. Remove vars if successfully
       1 [ 4!:55 &.><&.>'MURD';<mnm    NB. filed.
   else. 0 [ smoutput 'Failed to file "','".',~mnm end.
)

NB.* normizeTSData: normalize training set data->Movie num, User, Rating, Date.
normizeTSData=: 3 : 0
   'indir blksz'=. y.
   indir=. endSlash indir [ inpfx=. 'urd'
   fls=. {."1 jfi dir indir,inpfx,'*.dat'
   for_flctr. i.#fls do.
       fl=. >flctr{fls
       mnum=. ".(#inpfx)}.flpfx=. fl{.~fl i. '.'
       rc=. indir unfileVar_WS_ tolower flpfx
       if. >{.rc do.
           varnm=. >2{>1{rc
           if. 0=blksz|flctr do.    NB. Break output into blocks to
               if. nameExists <'MURD' do.    NB. avoid memory problems
                   output1Piece indir;<MURD
               end.
               MURD=: 4 0$0    NB. Start new block.
           end.
           MURD=: MURD,.mnum,".varnm
           [4!:55 <varnm    NB. Erase var
       else. smoutput 'Could not unfile "','".',~flpfx end.
   end.
   if. 0<#MURD do. output1Piece indir;<MURD end.
)

NB.* varizeTSFiles: convert training set files into J vars and save to file.
varizeTSFiles=: 3 : 0
   'indir outdir'=. endSlash&.>y.
   fls=. {."1 jfi dir indir,'mv*.txt'
   szs=. 0$~#fls
   duperatings=. 0$~#fls
   for_flctr. i.#fls do.
       fl=. >flctr{fls
       mvdata=. mungeTrainingSet indir,fl
       if. >{.mvdata do.
           'rc mnum urd'=. mvdata
           (newnm=. 'urd',":mnum)=: urd
           if. >{.outdir fileVar_WS_ newnm do.
               szs=. (howbig 'urd') flctr}szs
               un=. #&>~.&.><"1 urd
               duperatings=. (({.un)~:1{$urd) flctr}duperatings
           else. smoutput 'Could not file ','.',~newnm end.
           [4!:55 <newnm
       else. smoutput 'Problem munging ','.',~fl end.
   end.
   szs;<duperatings
NB.EG 'C:\Data\Netflix\training_set\' varizeTSFiles 'C:\Data\Netflix\VarsDir\'
)

howbig=: 7!:5@:<    NB.* howbig: Size of variable given its name.

NB. Pair of date converters ('YYYY-MM-DD'->YYYYMMDD) - first is more
NB. careful (doesn't assume 2-digit months and days); second is much faster.
cvtNDt=: 13 : '".;_4 _2 _2{.&.>(<''00''),&.>dsp&.><;._1 ''-'',y.'
cvtNDt2=: 13 : '".y.-.''-'''

NB.* mungeTrainingSet: convert a training set text file into vars.
mungeTrainingSet=: 3 : 0
   'mnum urd'=. split f2v y.
   mnum=. ".':'-.~;mnum
   urd=. <;._1 &> ',',&.>urd
   urd=. urd#~0 +./ .~: |:#&>urd
NB. 1.6304726 <-> 6!:2 'ad2=. cvtNDt2&> 2{"1 urd'
NB. 12.787668 <-> 6!:2 'ad=. cvtNDt&> 2{"1 urd'
NB. 1 <-> ad-:ad2
   ad2=. cvtNDt2&> 2{"1 urd
   ad=. cvtNDt&> 2{"1 urd
NB. This tells us if there is an inconsistency in the YYYY-MM-DD date format.
   if. -. ad-:ad2 do. rc=. 0;'Date mismatch';ad;<ad2 return. end.
   ivn=. -.|:isValNum&>2{."1 urd
   whbad=. ,(i.#ivn),&.>I."1 ivn
   if. 0~:#whbad do. rc=. 0;('Invalid number','s'#~1~:#whbad);<whbad return. end.
   urd=. ad,~|:".&>2{."1 urd
   1;mnum;<urd    NB. Good return; Movie Number; User number, Rating, Date
NB.EG mvdata=. mungeTrainingSet '\Data\Netflix\training_set\mv_0005317.txt'
)

NB.* mungeMovieTitles: put movie titles file into an array.
mungeMovieTitles=: 3 : 0
NB. Put each LF-delimited line into a cell, then break apart cells based on
NB. comma delimiters.
   movies=. ><;._1&.>',',&.><;._1 LF,fread y.
NB. After breaking columns on commas, must fix titles w/embedded commas
NB. so we can put whole title into single column.
   titonly=. 2}."1 movies
   titonly=. ;&.> <"1 (0~:#&>titonly)#&.>',',~&.>dsp&.>titonly-.&.>CR
   titonly=. }:&.>titonly
   wh=. I.','+./ . e.&>titonly    NB. Find titles with embedded commas.
   titonly=. ((<',';', ') replace&.>wh{titonly) wh}titonly
   movies=. }:(2{."1 movies),.titonly
   mvhdr=. 'num';'year';'title'
NB. Convert numeric fields to numbers:
   mm=. |:".&.>2{."1 movies
   wh=. I. 1~:#&>1{mm    NB. Where are bad year numbers?
   mm=. |:2{."1 movies
   mm=. ((<'0') wh}1{mm)1}mm    NB. Bad year->0
   mnhdr=. 2{.mvhdr [ mn=. ".&>mm
   titles=. movies{"1~mvhdr i. <'title'
   mnhdr;mn;<titles    NB. Movie nums header, nums, titles
NB.EG mungeMovieTitles 'C:\Data\Netflix\movie_titles.txt'
NB.EG 'mnhdr mn titles'=. mungeMovieTitles dd,'movie_titles.txt'
)

Examples of Using the Code
Here we show how we can manipulate the data.
<pre>
   I.','+./ . e.&>titonly
71 263 349 365 393 465 581 599 669 671 728 775 826 833 890 912 943 972 1009 1014 1057 1094 1169 1209 1285 1519 1558 1628 1646 1653 1657 1673 1744 1751 2106 2175 2177 2360 2409 2427 2432 2461 2597 2624 2627 2690 2721 2771 2795 2829 2853 2877 2888 2946 3022 ...
   71 263 349{titonly
+------------------------------------------------+-----------------------------------------+----------------------------------+
|At Home Among Strangers,A Stranger Among His Own|Angelina Ballerina: Lights,Camera,Action!|Dr. Quinn,Medicine Woman: Season 3|
+------------------------------------------------+-----------------------------------------+----------------------------------+
   wh=. I.','+./ . e.&>titonly
   titonly=. ((<',';', ') replace&.>wh{titonly) wh}titonly
   71 263 349{titonly
+-------------------------------------------------+-------------------------------------------+-----------------------------------+
|At Home Among Strangers, A Stranger Among His Own|Angelina Ballerina: Lights, Camera, Action!|Dr. Quinn, Medicine Woman: Season 3|
+-------------------------------------------------+-------------------------------------------+-----------------------------------+
   $movies=. }:(2{."1 movies),.titonly
17770 3
   _5{.movies
+-----+----+----------------------------------------------------------+
|17766|2002|Where the Wild Things Are and Other Maurice Sendak Stories|
+-----+----+----------------------------------------------------------+
|17767|2004|Fidel Castro: American Experience                         |
+-----+----+----------------------------------------------------------+
|17768|2000|Epoch                                                     |
+-----+----+----------------------------------------------------------+
|17769|2003|The Company                                               |
+-----+----+----------------------------------------------------------+
|17770|2003|Alien Hunter                                              |
+-----+----+----------------------------------------------------------+
   (10{.wh){movies
+---+----+-------------------------------------------------+
|72 |1974|At Home Among Strangers, A Stranger Among His Own|
+---+----+-------------------------------------------------+
|264|2002|Angelina Ballerina: Lights, Camera, Action!      |
+---+----+-------------------------------------------------+
|350|1993|Dr. Quinn, Medicine Woman: Season 3              |
+---+----+-------------------------------------------------+
|366|2004|Still, We Believe: The Boston Red Sox Movie      |
...
|600|1966|What's Up, Tiger Lily?                           |
+---+----+-------------------------------------------------+
|670|2002|He Loves Me, He Loves Me Not                     |
+---+----+-------------------------------------------------+
|672|1991|He Said, She Said                                |
+---+----+-------------------------------------------------+
   wh=. I. 1~:#&>1{mm
   wh{movies
+-----+----+-------------------------------------------+
|4388 |NULL|Ancient Civilizations: Rome and Pompeii    |
+-----+----+-------------------------------------------+
|4794 |NULL|Ancient Civilizations: Land of the Pharaohs|
+-----+----+-------------------------------------------+
|7241 |NULL|Ancient Civilizations: Athens and Greece   |
+-----+----+-------------------------------------------+
|10782|NULL|Roti Kapada Aur Makaan                     |
...
|15918|NULL|Hote Hote Pyaar Ho Gaya                    |
+-----+----+-------------------------------------------+
|16678|NULL|Jimmy Hollywood                            |
+-----+----+-------------------------------------------+
|17667|NULL|Eros Dance Dhamaka                         |
+-----+----+-------------------------------------------+
NB. Jimmy Hollywood 1994; Roti Kapada Aur Makaan 1974; Hote Hote Pyaar Ho Gaya 1999
)
</pre>
Advanced topics
The group discussed techniques to explore for working with this data to make forecasts.
Learning and Teaching J
We discussed whether J has enough beginner material or if it could be improved. The Beginner's Regatta section of our meetings is oriented toward beginners as a way to provide more points of entry into J.
There are three types of information:
* Need to know
* Nice to know
* Get a life
- Shadow Wrought (586631)