NYCJUG/2006-11-14

From J Wiki

Beginner's regatta

The full rules for the Netflix Challenge run to about 10 pages, so we will not go into detail on them; anyone interested can look here. Initially, we are most concerned with the size of the datasets and what they look like.

Netflix Challenge Dataset Overview

Data Item            # Records
Movies                  17,770
Customers              480,189
Training records   100,480,507
Probe records        1,408,395

Graphical View of Datasets

We look at plots of statistics about the datasets.

Year of Movie

Here we see the number of movies in the database by year.

PlotNumberOfMoviesByYearInNetflixData.jpg

Rating per Movie

Here is the distribution of the ratings by movie. The axis is on a natural-log (ln) scale, but the lower axis translates the logs back to the original values.

HistogramOflnRatingPerMovie.jpg

Rating per User

Here is the distribution of the ratings by movie viewer. Again the axis is on an ln scale, but the green numbers above it give the mapping back to the original values.

HistogramOfRatingsByUsers.jpg

RMSE Details of Some Baseline Forecasts

Here we compare the RMSE (root mean squared error) of some simple forecasting methods based on weighted averages of the ratings by customer and by movie.

PerMovieRMSEfor3Methods.jpg
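As a small illustration of the RMSE criterion and of what a weighted-average forecast looks like, here is a sketch in current J syntax. The rmse verb is a modern spelling of the one defined further down this page; the ratings and averages are made-up values, not taken from the Netflix data, and the 0.65/0.35 weights mirror the baseC65M35AvgsRate verb below.

```j
rmse=: 4 : '%: (#y) %~ +/ *: x - y'   NB. root of the mean of squared differences

actual=: 4 3 5 2                 NB. hypothetical true ratings
cavgs=:  3.5 3 4.5 3             NB. hypothetical per-customer mean ratings
mavgs=:  4 3.5 4 2.5             NB. hypothetical per-movie mean ratings

simple=:   -: cavgs + mavgs                 NB. equal-weight average of the two
weighted=: (0.65 * cavgs) + 0.35 * mavgs    NB. 65% customer, 35% movie

actual rmse simple               NB. error of the equal-weight forecast
actual rmse weighted             NB. error of the 65/35 forecast
2 4 rmse 1 3                     NB. both errors are 1, so the RMSE is 1
```

A lower RMSE means the forecast tracks the actual ratings more closely, which is how the methods in the plot are ranked.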

Data Handling Code

Here is an attempt to process data more quickly by implementing a simple caching scheme using variables written to file.

This code is obsolete for a couple of reasons. First, it was written for an older version of J in which the default left and right arguments, x and y, were written with a final dot, as x. and y. Second, the dataset was large enough in 2006 that it had to be broken into pieces in order to fit into memory. On contemporary machines, ca. 2022, the pieces can be combined into a single array for each set of data.
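To make the argument-naming difference concrete, here is one of the shorter verbs from later on this page, the date converter cvtNDt2, shown in its original form as a comment and then respelled for current J. This is only a sketch of the mechanical change; the sample date is made up.

```j
NB. Old style, as in the listings on this page:
NB.   cvtNDt2=: 13 : '".y.-.''-'''
NB. Current style: the same verb with y in place of y.
cvtNDt2=: 3 : '". y -. ''-'''

cvtNDt2 '2005-09-06'          NB. 20050906
```

The same substitution (x. to x, y. to y, u. to u) applies throughout the older listings.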

It is important to remember that, at the time, the Netflix Challenge dataset was a bit large for most home PCs, which had perhaps one or two gigabytes of RAM; this is why we were forced to treat it in pieces.
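A rough size estimate shows why. Assuming each training record is held as four 4-byte integers (movie, user, rating, date), which is the layout the loadMMF verb further down assumes, the arithmetic is:

```j
NUMITEMS=: 100480507          NB. training records, from the overview table
bytes=: 16 * NUMITEMS         NB. 4 ints per record * 4 bytes per int
bytes                         NB. 1607688112
bytes % 2^30                  NB. size in GiB, roughly 1.5
```

That is about 1.5 GiB for the raw integers alone, before any working copies, which leaves little or no headroom on a machine with one or two gigabytes of RAM.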

So, since the data was too large to keep in memory all at once, we try caching it by file name, where the files are arbitrarily large chunks of the data sorted either by movie or by customer. Each file holds a J variable that is instantiated in the current namespace using the WS code found here. These variables are pieces of a larger array, kept separate to hold down memory usage.
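The text files being cached consist of a "movie:" header line followed by one number per line, as in the test data used by t1_getNextFileBlock_test_ below. The following is a minimal sketch, in current J syntax, of turning such text into (movie number; numbers) pairs. It reads a whole string at once and does no caching, so it is not the method used by fileBlocks.ijs; the name parse is hypothetical.

```j
txt=: '1:',LF,'111',LF,'2:',LF,'211',LF,'222',LF   NB. sample text in the probe-file layout

lines=: <;._2 txt               NB. box each LF-terminated line
hdr=: (':' = {:)&> lines        NB. 1 where a line is a 'movie:' header
blocks=: hdr <;.1 lines         NB. start a new group of lines at each header

NB. parse: boxed lines of one group -> (movie number);(list of numbers)
parse=: 3 : '(". }: > {. y) ; ". > }. y'

parse&> blocks                  NB. one row per movie: number, then its values
```

The caching code below does the same kind of conversion, but a block of text at a time, carrying any partial group over to the next read.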

NB.* fileBlocks.ijs: control access to file by caching and pre-processing blocks of text.

coclass 'FB'
coinsert 'fldir' [ load 'filefns'
coinsert 'base'

NB.* FWPI: global file name, value cache.
NB.* BLKSZ: arbitrary amount of text to read in.
NB.* rmFileEntry: remove entry from global FWPI by file name.
NB.* getNextFileBlock: get next movie: numbers data block for named file.
NB.* readNewBlocks: pre-process and cache text to numbers, feeding next datum.
NB.* txt2MN2: boxed chars 'movie num:';'num1';'nums'... to numeric mvnum;nums

BLKSZ=: 16000                 NB.* BLKSZ: arbitrary amount of text to read in.

NB.* rmFileEntry: remove entry from global FWPI by file name.
rmFileEntry=: 3 : 0
   y.=. boxopen y.
   if. nameExists 'FWPI' do.
       if. y. e. 0{FWPI do.        NB. If we know about this file
           wh=. y. i.~0{FWPI
           FWPI=: (<<<wh){"1 FWPI  NB. Remove entry
       end.
   end.
   1                               NB. Always succeed if completed
)

NB.* getNextFileBlock: get next movie: numbers data block for named file.
getNextFileBlock=: 3 : 0
NB.   flnm=. dd,'probeRatings.txt'
   y.=. boxopen y.
   if. -.nameExists 'FWPI' do.     NB. File Blocks
       FWPI=: 4 1$(3{.y.),<0       NB. File name, Whole, Partial blocks, Index
   end.
   if. y. e. 0{FWPI do.            NB. If we know about this file
       wh=. y. i.~0{FWPI
       cblk=. >1{wh{"1 FWPI        NB. Get complete blocks
       if. 0=#cblk do.             NB.  if any or,
           cblk=. readNewBlocks wh;y.;BLKSZ
       end.
   else.                           NB. Start tracking new file.
       wh=. 1{$FWPI                NB. New entry goes at end.
       FWPI=: FWPI,.(3{.y.),<0
       cblk =. readNewBlocks wh;y.;BLKSZ
   end.
   FWPI=: (<}.cblk) (<1,wh)}FWPI
   if. 0<#cblk do. ret=. 0{cblk else. '' end.
NB.EG prbmu=. getNextFileBlock_FB_ 'C:\Netflix\probe.txt'
NB.EG prbrtg=. getNextFileBlock_FB_ 'C:\Netflix\probeRatings.txt'
)

t1_getNextFileBlock_test_=: 3 : 0
   BLKSZ=: 12 [ svbs=. BLKSZ
   tmpfl=. 'gNFB.tst',~getTempDir ''
   tst=. '1:',LF,'111',LF,'2:',LF,'211',LF,'222',LF,'3:',LF,'311',LF
   tst=. tst,'322',LF,'333',LF
   tst fwrite <tmpfl
   assert. (1;,111)-:getNextFileBlock tmpfl
   assert. (2;,211 222)-:getNextFileBlock tmpfl
   assert. (3;,311 322 333)-:getNextFileBlock tmpfl
   BLKSZ=: svbs
)

NB.* readNewBlocks: pre-process and cache text to numbers, feeding next datum.
readNewBlocks=: 3 : 0
   'wh flnm bs'=. y.
   flnm=. openbox flnm
   fsz=. fsize >0{wh{"1 FWPI       NB. Account for file size to
   st=. >3{wh{"1 FWPI              NB. Start from last point
   bs=. bs<.fsz-st                 NB. Reduce blocksize if necessary.
   fblk=. 0 [ cblk=. ''            NB. Filled block flag; default empty block
   eofl=. 0=bs                     NB. End-of-file flag
   while. -.eofl+.fblk do.         NB. End of file or full block?
       pt=. >2{wh{"1 FWPI          NB. Use any left-over partial text.
       blk=. 1!:11]flnm;st,bs      NB. Read from file.
       lfl=. blk i: LF             NB. Last Full Line
       st=. lfl+st                 NB. New starting point in file at line
       FWPI=: (<st) (<3,wh)}FWPI
       if. st>:<:fsize flnm do.    NB. End-of-file or near end
           if. LF~:{:blk do. lfl=. >:lfl end.
           blk=. pt,lfl{.blk       NB. Partial text and all remaining
           FWPI=: (<'') (<2,wh)}FWPI         NB. No more partial text
           FWPI=: (<fsize flnm) (<3,wh)}FWPI NB. At end-of-file
           eofl=. 1
       else.
           blk=. pt,lfl{.blk                 NB. Use partial text from previous
           eofb=. LF i:~blk{.~blk i: ':'     NB. Find end of full block
           FWPI=: (<eofb}.blk) (<2,wh)}FWPI  NB. Save remaining partial
           blk=. eofb{.blk
       end.
       blk=. blk-.CR,' '
       mvrt=. <;._1 (LF#~LF~:{.blk),blk      NB. each line->box
       cblk=. txt2MN2&>(':' e.&>mvrt)<;.1 mvrt    NB. Num,':' starts new block.
       fblk=. 0<#cblk
   end.
   cblk
NB.EG mvnums=. readNewBlocks 1;'C:\Netflix\probeRatings.txt';16000
)

NB.* txt2MN2: boxed chars 'movie num:';'num1';'nums'... to numeric mvnum;nums
txt2MN2=: 3 : 0
   y.=. y.-.&.><' :',CR
   split ,".&>(0{y.) 0}y.
NB.EG txt2MN2 '1:';'4';'3';'5';'5'
)

coclass 'base'

Show-and-tell

Here we look at two approaches for manipulating the large datasets for the Netflix challenge. First we will look at what John Randall came up with.

John's Preliminary Code for the Netflix Challenge

Here is what John has so far. He starts out with some documentation on how to use the code. You can see that he addresses the difficulty of keeping these datasets in memory by using memory-mapped files.

NB.* JRNflixChallenge.ijs: John Randall's Netflix Challenge data manipulation

0 : 0
Initial attempt at loading the data.

The lines

   loadtraining NMOVIES
   loadprobe ''

will load the training and probe data.

There are also some utilities for dealing with mapped files.

The data files are written to the directory JDATA, and are as follows.

m,c,r: lists of the movie, customer and ratings in the training set.  Date is currently ignored.

mri: Since the movies in the training set are in order, this gives run starts and lengths.

pi: an integer list, representing the  probe data, where i{pi is the index of the ith probe datum in the training data.
)

Next he defines some basic globals.

9!:7 '+++++++++|-'            NB. Set box-drawing characters to ASCII
ts=: 6!:2, 7!:2@]             NB. Report time and space used by expression

require 'files jmf'

Then he sets up globals to define locations of the datasets.

NB. Netflix download file paths
setJRPaths=: 3 : 0
   DOWNLOAD=: ('/',PATHSEP_j_) replace y.
   QUAL=:DOWNLOAD,'qualifying.txt'
   PROBE=:DOWNLOAD,'probe.txt'
   TRAINING=:DOWNLOAD,'training_set',PATHSEP_j_
NB.EG setJRPaths '/home/john/netflix/download/'
)
   
setJRPaths '\Data\Netflix\'

He shows us some statistics and sets up utilities for memory-mapped files.

NB. Data file statistics
NMOVIES=:17770       NB. # of movies
NCUST=:480189        NB. # of customers
NTRAINING=:100480507 NB. # of records in training db
NPROBE=:1408395      NB. # of records in probe db

NB. Directory root where mapped files are stored.
JDATA=: DOWNLOAD,'jdata',PATHSEP_j_

NB. Utilities for mapped files.
NB. All refer to JDATA directory
NB. map 'name': read only access.
NB. This is our default access, unlike the jmf conventions.
map=:[: map_jmf_ ] ; ('';1) ;~ JDATA , ]
unmap_z_=:unmap_jmf_
unmapall_z_=:unmapall_jmf_

NB. <array> writejmf <filename>
NB. This is the only way we create mapped files at the user level.
NB. They are subsequently opened read-only.
writejmf=:4 : 0
   data=.x
   fn=.JDATA,y
   size=.4* */$ data
   createjmf_jmf_  fn;size
   map_jmf_ 'JMF';fn
   JMF=:0 $ ~ 0,}. $ data
   JMF=:data
   unmap_jmf_ 'JMF'
)

NB. Array access in mapped file: from={
NB. <indices> from <mapped file name>
NB. Written with variable JMF to avoid collisions.
from=:4 : 0
   map_jmf_ 'JMF';(JDATA,y);'';1
   r=.x { JMF
   unmap 'JMF'
   r
)

He defines some utilities for run-length encoding and loading data.

NB. runs y
NB. Utility for indices in run length encoding.
NB. If y is a numeric list, runs y
NB. returns pos,.len
NB. where i{pos is the beginning of the ith run
NB. and i{len is its length.
NB. The identity y-:len # pos { y should hold
runs=:(}:,.2&(-~/\))@:I.@:(1,1,~ 2&(~:/\))

NB. x lz y: pad y with leading zeros to size x
lz=:13 : '((x-#y)#''0''),y'

NB. tfn y gives full file name for movie y
tfn=:13 : 'TRAINING,''mv_'',(7 lz ": y),''.txt'''

NB. training y returns table m,c,r for movie y
NB. At present, we are ignoring dates.
training=:13 : 'y,"0 1 ".&> 0 2{"1 ;: }. ''m'' freads tfn y'
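To see what the runs utility produces, here is a small worked example in current J syntax. The verb is copied from the listing above; the list y is a made-up stand-in for the sorted movie column. The last line uses the same expression that loadprobe applies to a row of mri to turn a start and length into indices.

```j
runs=: (}: ,. 2&(-~/\))@:I.@:(1 , 1 ,~ 2&(~:/\))   NB. as defined above

y=: 5 5 7 7 7 9               NB. hypothetical sorted movie column
runs y                        NB. rows of start,length: 0 2 then 2 3 then 5 1
'pos len'=: |: runs y         NB. split the table into its two columns
y -: len # pos { y            NB. 1: the identity from the comments holds
({. + i.@{:) 2 3              NB. 2 3 4: indices covered by the run starting at 2
```

Because the training data are in movie order, this table lets all records for one movie be fetched by index arithmetic rather than by searching.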

Finally, he defines the data loaders.

NB. loadtraining y creates training data for first y movies.
NB. loadtraining NMOVIES creates full set of training data.
NB. Arrays are assembled in memory, then written.
NB. Could be faster, but only called once.
loadtraining=:3 : 0
   m=.0$0
   c=.0$0
   r=.0$0
   for_i. >:i.y do.
       if. 0=100|i do. smoutput i end.
       t=.training i
       m=.m,{."1 t
       c=.c,1{"1 t
       r=.r,{:"1 t
   end.
   m writejmf 'm'   NB. movie data
   c writejmf 'c'   NB. customer data
   r writejmf 'r'   NB. ratings data
   NB. movie run length index
   map 'm'
   (runs m) writejmf 'mri'
   unmap 'm'
)
   
NB. Load probe data: loadprobe ''
loadprobe=:3 : 0
   probe=.<;._2 freads PROBE
   start=.(':'={:)&> probe
   c=.0&".&>&.> start <;._1 probe
   m=.(".@:}:)&>start # probe
   p=.;m (4 : 'x&,"0 &.>y')"0 c
   NB. p writejmf JDATA,'p'
   M=.{."1 p
   C=.{:"1 p
   mfret=.1,2&(~:/\) M
   NB. assert. (mfret#M)-:(~.M)
   mnub=.mfret#M
   cbox=.mfret <;.1 C
   NB. Now use loop to minimize memory
   r=.0$0
   for_i. i.#mnub do.
       m=.i{mnub
       indices=.({.+i.@{:)(<:m) from 'mri'
       r=.r, indices {~ (indices from 'c')i. >i { cbox
   end.
   r writejmf 'pi'
)

Devon's Preliminary Code

This code is largely concerned with pulling the information out of the text files in the Netflix Challenge dataset to create useful variables in J. It uses the WS functions, as found here, to break large datasets into sequentially numbered variables - like murd0 to murd99 - which are kept on file and loaded as needed.

This code, like the fileBlocks.ijs code above, is obsolete for the same reasons.

With those caveats, here is some old, obsolete code for your enjoyment, though much of the explanation about the data is still valid.

NB.* mungeNetflixData.ijs: get Netflix data files into J vars.
TheUsual=: 0 : 0
load (dd=. 'C:\Data\Netflix\'),'mungeNetflixData.ijs'
load dd,'mungeNetflixData.ijs'
)

NB.* VDIR: variables-on-file directory
NB.* UVN: User var names
NB.* MVN: Movie var names
NB.* NUMITEMS: number of data items (MURD as below).

NB.* rmse: root-mean squared error: fit criterion between 2 ratings series.
NB.* baseCMAvgsRate: base case rater: use mean movie & cust rating as guess.
NB.* baseCAvgRate: base case rater: use mean cust rating regardless of movie.
NB.* baseMAvgRate: base case rater: use mean movie rating regardless of cust.
NB.* makeRatingsFile: apply a rating function (movie rate custs)->ratings->file
NB.* buildFVIndex: build index into vars on file.
NB.* putNextFileBlock: put next movie: numbers data block at end of named file.
NB.* startRRMCluster: start clustering users based on rarely-rated movies.
NB.* rarestRated: find movies with fewest ratings.
NB.* getMovieRecs: get all Movie, User, Rating, Date records for movie y..
NB.* getUserRecs: get all Movie, User, Rating, Date records for user y..
NB.* onemore: or bool w/next bit
NB.* oneless: or bool w/prev bit
NB.* updateVarInfo: replace each (filed) var named after applying function.
NB.* txt2MN2: text of "movie: num1 num2..." file->numeric (movie num);nums.
NB.* bucketBreakpoints: set values at which to break up a range into equal-sized
NB.* varUserRatings: find variance of ratings for each user: assume uAvgs global
NB.* varMvRatings: find variance of ratings for each movie: assume mAvgs global
NB.* meanUserRatings: find mean ratings for each user: total and count.
NB.* meanMovieRatings: find mean ratings per movie: total, count, std dev.
NB.* createProbeRatingFile: read file of movie: users -> movie: scores
NB.* writeProbeRatingFile: write ratings file given MURD var.
NB.* getProbeFile: get file (movie1: users; movie2: users...)->movies<;.1 users
NB.* selProbeScores: given MURD var, look up movie, user pairs in global MVUS.
NB.* userRatings: get user ratings from 1 var.
NB.* ordVars: order variables and re-save.
NB.* userStats: statistics on user ratings.
NB.* reordM2UFiles: re-order MURD (Movie, User, Rating, Date) files by User (or other row).
NB.* initUserKeyedFiles: initialize user-keyed files for reordM2UFiles.
NB.* loadMMF: attempt load of memory-mapped file: runs out of room>34e6 items.
NB.* getVarInfo: apply arbitrary function to each (filed) var named.
NB.* output1Piece: put out one piece of normalize (DB-formed) data: var "MURD".
NB.* normizeTSData: normalize training set data->Movie num, User, Rating, Date.
NB.* varizeTSFiles: convert training set files into J vars and save to file.
NB.* howbig: Size of variable given its name.
NB.* mungeTrainingSet: convert a training set text file into vars.
NB.* mungeMovieTitles: put movie titles file into an array.

coinsert 'fldir' [ load 'filefns'
load 'jmf stats files dir mystats'
dd=: 'C:\Data\Netflix\'
load dd,'fileBlocks.ijs'

NB.* "MURD[n]" variables are Movie, User, Rating, Date (all ints) tuplets as
NB. 4-row mats broken into (about 100) usefully small pieces.  These vars
NB. are saved in files using the "WS" namespace function "fileVar"; they are
NB. retrieved using "unfileVar". "umurd[n]" variables are the same info but
NB. grouped primarily by user whereas "MURD[n]" are grouped by movie.

VDIR=: '\Data\Netflix\VarsDir\'    NB.* VDIR: variables-on-file directory
NUMITEMS=: 100480507     NB.* NUMITEMS: number of data items (MURD as below).

NB. "UVN" and "MVN" are namelists of "Movie, User, Rating, Date" vars broken 
NB. into <:16MB vars on file.  Both var sets have same data but grouped by 
NB. user or movie.
UVN=: flpfx&.>tolower&.>{."1 jfi dir VDIR,'umurd*.dat' NB.*UVN: User var names
nn=. ".&>UVN#~&.>UVN e.&.><DIGITS
UVN=: UVN{~/:nn                    NB. Put in numeric order.
MVN=: flpfx&.>tolower&.>{."1 jfi dir VDIR,'murd*.dat'  NB.*MVN: Movie var names
nn=. ".&>MVN#~&.>MVN e.&.><DIGITS
MVN=: MVN{~/:nn                    NB. Put in numeric order.

NB. Get vars w/unique user ids, movie ids, start/stop MURD vars (UVN and MVN).
(<VDIR) unfileVar_WS_&.>'UMVIDS';'UUIDS';'IXMV';'IXUV'

NB.* rmse: root-mean squared error: fit criterion between 2 ratings series.
rmse=: 4 : '%:(#y.)%~+/*:x.-y.'

NB.* baseC65M35AvgsRate: base case rater: rate using 35% movie & 65% cust rating.
baseC65M35AvgsRate=: 4 : 0
NB. Ensure these four vars exist before running this for performance.
NB.   if. -.nameExists 'MWT' do. MWT=: 0.35 end.
NB.   if. -.nameExists 'CWT' do. CWT=: 0.65 end.
NB.   if. -.nameExists 'UMS' do. VDIR unfileVar_WS_ 'UMS' end.
NB.   if. -.nameExists 'MMS' do. VDIR unfileVar_WS_ 'MMS' end.
   cavgs=. (0{UMS){~UUIDS i. y.         NB. Average user rating
   mavgs=. (#y.)$(0{MMS){~UMVIDS i. x.  NB. Average movie rating
   (CWT*cavgs)+MWT*mavgs                NB. Weighted average of 2.
)

NB.* baseCMAvgsRate: base case rater: use mean movie & cust rating as guess.
baseCMAvgsRate=: 4 : 0
NB. Ensure these two exist before running this for performance.
NB.   if. -.nameExists 'UMS' do. VDIR unfileVar_WS_ 'UMS' end.
NB.   if. -.nameExists 'MMS' do. VDIR unfileVar_WS_ 'MMS' end.
   cavgs=. (0{UMS){~UUIDS i. y.         NB. Average user rating
   mavgs=. (#y.)$(0{MMS){~UMVIDS i. x.  NB. Average movie rating
   -:cavgs+mavgs                        NB. Simple average of 2.
)

NB.* baseCAvgRate: base case rater: use mean cust rating regardless of movie.
baseCAvgRate=: 4 : 0
   if. -.nameExists 'UMS' do. VDIR unfileVar_WS_ 'UMS' end.
   (0{UMS){~UUIDS i. y.       NB. Guess is average user rating.
)

NB.* baseMAvgRate: base case rater: use mean movie rating regardless of cust.
baseMAvgRate=: 4 : 0
   if. -.nameExists 'MMS' do. VDIR unfileVar_WS_ 'MMS' end.
   (#y.)$(0{MMS){~UMVIDS i. x.     NB. Guess is average movie rating.
)

NB.* makeRatingsFile: apply a rating function (movie rate custs)->ratings->file
makeRatingsFile=: 1 : 0
   'infl outfl dtflg'=. 3{.y.,<'0'  NB. Some files have a date to be removed.
   recctr=. 0
   rmFileEntry_FB_ infl
   while. 0~:#fb=. getNextFileBlock_FB_ infl do.
       'mv usrs'=. fb
       if. dtflg do. usrs=. usrs#~0 1$~$usrs end.
       ratings=. 0.1 roundNums mv u. usrs
       (mv,ratings) putNextFileBlock outfl
       recctr=. >:recctr
   end.
   recctr
NB.EG rc=. baseCMAvgsRate makeRatingsFile (dd,'probe.txt');(dd,'rCMp.txt');0
)

NB.* buildFVIndex: build index into vars on file.
buildFVIndex=: 3 : 0
   stend=. ({."1,.{:"1) getVarInfo&>y.
   if. *./rc=. >{."1 stend do.
       3{.>,.&.>/(i.#stend),&.>1{"1 stend
   else. (-.rc)#1{"1 stend end.
NB.EG buildFVIndex (<VDIR);&.>MVN
)

NB.* putNextFileBlock: put next movie: numbers data block at end of named file.
putNextFileBlock=: 4 : 0
   (;LF,~&.>(':',~":{.x.);":&.>}.x.) fappend y.
NB.EG putNextFileBlock_FB_ 
)

NB.* makeProbeAnswerFile: given probe file (movie: customers...) return ratings
makeProbeAnswerFile=: 3 : 0
   'infl outfl'=. y.
   rmFileEntry_FB_ infl [ ferase outfl            NB. Start at beginning
   recctr=. 0
   while. 0~:#fb=. getNextFileBlock_FB_ infl do.
       'mv usrs'=. fb
       'rc mu'=. getMovieRecs mv
       (mv,(usrs i.~ 1{mu){2{mu) putNextFileBlock outfl
       recctr=. >:recctr
   end.
   recctr
)

NB.* startRRMCluster: start clustering users based on rarely-rated movies.
startRRMCluster=: 3 : 0
NB. MMR: Mean movie ratings: 0{sum of ratings; 1{# ratings; 2{std dev ratings.
   if. -.nameExists 'MMR' do. VDIR unfileVar_WS_ 'MMR' end.
   'n2r outpfx'=. y.
   for_ii. i.#UVN do.
       VDIR unfileVar_WS_ vn=. 'umurd',":ii
       outvn=. outpfx,":ii
       (outvn)=: n2r rarestRated ".vn
       VDIR fileVar_WS_ outvn
       4!:55&><&.>vn;outvn
   end.
NB. startRRMCluster 20;'rr20mc'
)

NB.* rarestRated: find movies with fewest ratings.
rarestRated=: 4 : 0
   uptn=. 1,2~:/\1{y.         NB. y. is murd var sorted by 1{
   mrbu=. uptn<;.1 ] |:0 2{y. NB. Movie & rating by user
NB. movie popularity: low count means movie rarely rated.
   mp=. ((<UMVIDS)i.&.>0{"1&.>mrbu){&.><1{MMR     
   nm=. x.<.#&>mrbu           NB. x <. Number of movies rated by user
   lrix=. nm{.&.>/:&.>mp      NB. Indexes of some of the most rarely rated.
   (<"0 uptn#1{y.),.|:&.>lrix{&.>mrbu
)

NB.* getMovieRecs: get all Movie, User, Rating, Date records for movie y..
getMovieRecs=: 3 : 0
   if. y. e. UMVIDS do.
       vn=. (0 i.~y.>1{IXMV){0{IXMV
       VDIR unfileVar_WS_ varnm=. 'murd',":vn
       vv=. ".varnm
       4!:55 <varnm
       recs=. 1;vv#"1~y.=0{vv                               NB. Succeed
   else. recs=. 0;'Movie ID ',(":y.),' not found.' end.     NB. Fail
   recs
NB.EG recs=. getMovieRecs 1234  NB. ret code;4-row MURD mat.
)

NB.* getUserRecs: get all Movie, User, Rating, Date records for user y..
getUserRecs=: 3 : 0
   if. y. e. UUIDS do.
       vn=. (0 i.~y.>2{IXUV){0{IXUV
       VDIR unfileVar_WS_ varnm=. 'umurd',":vn
       vv=. ".varnm
       4!:55 <varnm
       recs=. 1;vv#"1~y.=1{vv                          NB. Succeed
   else. recs=. 0;'User ID ',(":y.),' not found.' end. NB. Fail
   recs
NB.EG recs=. getUserRecs 2451718  NB. ret code;4-row MURD mat.
)

NB. onemore=: 13 : 'y.+.0,}:y.'   NB.* onemore: or bool w/next bit
NB. oneless=: 13 : 'y.+.}.y.,0'   NB.* oneless: or bool w/prev bit

rebuildUMURDfromURD=: 3 : 0
   dd=. y.
   maxi=. 3-~2^20        NB. Max items (cols) per output var: more would
   cumval=. 4 0$0        NB.  ->var size >16M; next size is 33M.
   outctr=. _1           NB. Output file number: increment before each use.
   for_ii. UMVIDS do.    NB. For each unique movie id #...
       varnm=. ' '-.~'urd',":ii
       rc=. dd unfileVar_WS_ varnm
       cumval=. cumval,.ii,tmpval=. ".varnm  NB. Movie atop User, Rating, Date
       4!:55 <varnm
       if. maxi<1{$cumval do.
           cumval=. (-1{$tmpval)}."1 cumval  NB. Drop piece that went over max
           outvar=. ' '-.~'umurd',":outctr=. >:outctr
           (outvar)=: cumval
           rc=. dd fileVar_WS_ outvar
           4!:55 <outvar
           cumval=. ii,tmpval      NB. Retain dropped piece for next var.
       end.
   end.
   if. 0<1{$cumval do.
       outvar=. ' '-.~'umurd',":outctr=. >:outctr
       (outvar)=: cumval
       rc=. dd fileVar_WS_ outvar
       4!:55 <outvar
   end.
   #UMVIDS
)

NB.* updateVarInfo: replace each (filed) var named after applying function.
updateVarInfo=: 1 : 0
   'dd varnm'=. y.                 NB. Vars dir, var names.
   rc=. dd unfileVar_WS_ varnm
   if. >{.rc do. (varnm)=: u. ".varnm
       rc=. dd fileVar_WS_ varnm
       [4!:55 <varnm
   end.
   rc
NB. rcs=. sortUserVars updateVarInfo&>(<VDIR);&.>UVN
)

sortUserVars=: 3 : 0
   y.{"1~/:|:1 0{y.      NB. Sort by user, movie ID.
)

NB.* txt2MN2: text of "movie: num1 num2..." file->numeric (movie num);nums.
txt2MN2=: 3 : 0
   split ".&>(}:&.>' '-.~&.>0{y.) 0}y.
NB.EG txt2MN2 '1:';'4';'3';'5';'5';'4';'4';'4';'3';'4';'5';'4';'5';'4';'3'
)

numPerBucket=: 3 : 0
   'bpv frqs'=. y.
   +/(frqs </}.bpv,>:>./frqs)*.frqs >:/ bpv
)

NB.* bucketBreakpoints: set values at which to break up a range into equal-sized
NB. buckets as much as possible, given number of buckets and frequencies of elements.
bucketBreakpoints=: 3 : 0
   'nb frqs'=. y.                  NB. Number of buckets, frequencies
   bp01=. nb%~i.nb
   bpi=. <.0.5+bp01*#frqs
   bpv=. bpi{/:~frqs                         NB. Simple way
   bpi3=. |:0>.(#frqs)<.bpi+"(0 1)_1 0 1     NB.  or average 3 values.
   bpv=. 3%~+/bpi3{/:~frqs
)

NB.* varUserRatings: find variance of ratings for each user: assume uAvgs global
varUserRatings=: 3 : 0
   if. -.nameExists 'VUR' do. VUR=: 0$~#UUIDS end.
   usix=. ~.allix=. UUIDS i. 1{y.
   usvar=. (1{y.) +//. *:(allix{uAvgs)-~2{y. NB. Accum sum of sq of diffs
   #VUR=: (usvar) usix}VUR
NB.EG [varUserRatings&>(<'C:\Netflix\');&.>'uvar1';'uvar2' [ 4!:55 <'VUR'
)

NB.* varMvRatings: find variance of ratings for each movie: assume mAvgs global
varMvRatings=: 3 : 0
   if. -.nameExists 'VMR' do. VMR=: 0$~#UMVIDS end.
   mvix=. ~.allix=. UMVIDS i. 0{y.
   mvvar=. (0{y.) +//. *:(allix{mAvgs)-~2{y. NB. Accum sum of sq of diffs
   #VMR=: (mvvar) mvix}VMR
NB.EG [varMvRatings&>getVarInfo(<'C:\Netflix\');&.>'uvar1';'uvar2' [ 4!:55 <'VMR'
)

NB.* meanUserRatings: find mean ratings for each user: total and count.
meanUserRatings=: 3 : 0
   if. -.nameExists 'MUR' do. MUR=: 0$~3,#UUIDS end.
   usrsum=. (1{y.) +//. 2{y.       NB. Accumulate sums and
   usrct=. (1{y.) #/. 1{y.         NB.  count of items
   usrix=. UUIDS i. ~.1{y.
   usrsd=. (1{y.) stddev/. 2{y.    NB.  std dev of ratings
   #MUR=: (usrsum,usrct,:usrsd) usrix}"1 MUR
NB.EG [meanUserRatings&>(<'C:\Netflix\');&.>'uvar1';'uvar2' [ 4!:55 <'MUR'
)

NB.* meanMovieRatings: find mean ratings per movie: total, count, std dev.
meanMovieRatings=: 3 : 0
   if. -.nameExists 'MMR' do. MMR=: 0$~3,#UMVIDS end.
   mvrsum=. (0{y.) +//. 2{y.       NB. Accumulate sums and
   mvrct=. (0{y.) #/. 0{y.         NB.  count of items
   mvrsd=. (0{y.) stddev/. 2{y.    NB.  std dev of ratings
   mvix=. UMVIDS i. ~.0{y.
   #MMR=: (mvrsum,mvrct,:mvrsd) mvix}"1 MMR
NB.EG meanMovieRatings getVarInfo&>(<VDIR);&.>MVN [ 4!:55 <'MMR'
)

NB.* createProbeRatingFile: read file of movie: users -> movie: scores
createProbeRatingFile=: 3 : 0
   svprb=. >selProbeScores getVarInfo&.>(<VDIR);&.>MVN
NB. 20061024: confirm that result is in same order as input...
   svprb=. 1{"1 svprb
   svprb=. >,.&.>/svprb
   svprb=. svprb{"1~/:0{svprb
)

NB.* writeProbeRatingFile: write ratings file given MURD var.
writeProbeRatingFile=: 3 : 0
NB.   svprb=: createProbeRatingFile ''
   'flnm prb'=. y.
   mvptn=. 1,2~:/\0{prb
   mvnums=. ':',~&.>":&.>mvptn#0{prb
   ratings=. ":&.>2{prb
   ratings=. mvptn<;._1 ratings
   outfl=. ;(mvnums,&.>LF),&.>ratings,&>&.>LF
   outfl=. (,outfl)-.' '
   outfl fwrite flnm
NB. writeProbeRatingFile (dd,'probeRating.txt');<createProbeRatingFile ''
)

NB.* getProbeFile: get file (movie1: users; movie2: users...)->movies<;.1 users
getProbeFile=: 3 : 0
NB.   prb=: f2v dd,'probe.txt'
   prb=: f2v y.
   whmvid=. ':'e.&>prb
   mvids=. ".&>}:&.>whmvid#prb
   uipm=. ;&.>whmvid<;._1 ".&.>(-whmvid)}.&.>prb
   mvids;<uipm
)

NB.* selProbeScores: given MURD var, look up movie, user pairs in global MVUS.
selProbeScores=: 3 : 0
NB. MVUS=: mvids+&>1e_7*&.>uipm     NB. See getProbeFile for "mvids" and "uipm"
   ykey=. 1 1e_7+/ . * 2{.y.        NB. Combine movie, user into floating #.
   selmu=. ykey e. MVUS
NB.   MVUS=: MVUS#~-.MVUS e. ykey   NB. Faster to avoid this data movement.
   selmu#"1 y.
)

NB. Not used - too space-wasteful to combine movie, user into complex number.
selectProbeRatings=: 3 : 0
   (prum e.~ 1 0j1 +/ . *2{.y.)#"1 y.
)

NB.* userRatings: get user ratings from 1 var.
userRatings=: 3 : 0
   gv=. /:1{y.
   y.=. gv{"1 y.
   ptn=. 1,2~:/\1{y.
   ratings=. ptn<;.1 ] 2{y.
)

NB.* ordVars: order variables and re-save.
ordVars=: 4 : 0
   rc=. x. unfileVar_WS_ y.
   if. >{.rc do. var=. ".y.
       (y.)=: (/:1{var){"1 var
       rc=. x. fileVar_WS_ y.
       if. >{.rc do. 4!:55 <y.
       else. smoutput 'Could not save ','.',~y. end.
   else. smoutput 'Could not read ','.',~y. end.
   rc
)

NB.   gv=. /:1{y.
NB.   y.=. gv{"1 y.

NB.* userStats: statistics on user ratings.
userStats=: 3 : 0
   ptn=. 1,2~:/\1{y.
   umdstats=. ((#,<./,>./)&>ptn<;.1 ] 2{y.),.(<./,>./)&>ptn<;.1 ] 3{y.
   ums=. (mean,stddev)&>ptn<;.1]2{y.
   umdstats;ums
)

NB.* reordM2UFiles: re-order MURD (Movie, User, Rating, Date) files by User (or other row).
reordM2UFiles=: 3 : 0
   'indir ovars ivars uids'=. y.        NB. Vars dir, out var names, in var
   uids=. /:~uids                       NB.  names, unique IDs to sort on.
   if. -.nameExists 'WR' do. WR=. 1 end.     NB. Sort on Which Row?
   ix=. <.0.5+(nf%~<:#uids)*>:i.nf=. #ovars  NB. Partition ids equally
   maxepp=. ix{uids                     NB. Maximum Element Per Partition

   for_mvnm. ivars do. mvnm=. >mvnm     NB. Read all movie files for each
       rc=. indir unfileVar_WS_ mvnm    NB.  group of users.
       if. >{.rc do. mv=. ".mvnm
           whptn=. +/maxepp </ WR{mv    NB. Partition index/item
           for_ptn. /:~~.whptn do.      NB. Do each user partition
               rc=. indir unfileVar_WS_ uvnm=. >ptn{ovars
               if. >0{rc do.
                   (uvnm)=: (".uvnm),.(I. ptn=whptn){"1 mv
                   if. >0{rc=. indir fileVar_WS_ uvnm do.
                       4!:55 <uvnm
                   else. smoutput rc,<'fileVar_WS_ ',uvnm end.
               else. smoutput rc,<'unfileVar_WS_ ',uvnm end.
           end.
       else. smoutput rc end.
       4!:55 <mvnm
   end.
)

NB.* initUserKeyedFiles: initialize user-keyed files for reordM2UFiles.
initUserKeyedFiles=: 3 : 0
   'dd flprfx nf'=. y.   NB. Dir, file (var) prefix, number of files to init.
   nms=. ' '-.~&.>(<flprfx),&.>":&.>i.nf
   ".&.>nms,&.><'=: 4 0$0'
   (<dd) fileVar_WS_&.>nms
   [4!:55&><"0 nms
   nms
NB.EG nnms=. initUserKeyedFiles VDIR;'nwMURD';100
)

NB.* loadMMF: attempt load of memory-mapped file: runs out of room>34e6 items.
loadMMF=: 3 : 0
   'indir fnm'=. y.
   fnm=. indir,fnm,'.mmf'
   createjmf_jmf_ fnm;16*NUMITEMS  NB. 16 = 4 ints/item * 4 bytes/int
   JINT map_jmf_ 'mmurds';fnm
   mmurds=: 0$~NUMITEMS,4
   flpfx=. 'MURD'
   fls=. {."1 jfi dir indir,flpfx,'*.DAT'
   strow=. 0                       NB. NUMITEMS rows x 4 cols: start at row 0
   for_fl. fls do. fl=. >fl
       vflnm=. fl{.~fl i. '.'
       rc=. indir unfileVar_WS_ vflnm
       if. >{.rc do. len=. 1{$".vflnm
           mmurds=: (|:".vflnm) (strow+i.len)}mmurds
           strow=. strow+len
           [4!:55 <vflnm
       else. smoutput 'Failed to read var "','".',~vflnm
       end.
   end.
   fnm;<$mmurds
)
NB. unmapall_jmf_ '' [ unmap_jmf_ fnm

NB.* getVarInfo: apply arbitrary function to each (filed) var named.
getVarInfo=: 1 : 0
   'dd varnm'=. y.                 NB. Vars dir, var names.
   rc=. dd unfileVar_WS_ varnm
   if. >{.rc do. rc=. 1;u. ".varnm
       [4!:55 <varnm
   end.
   rc
NB.EG ({."1,.{:"1) getVarInfo &.>(<'C:\data\');&.>'var1';'var2';'var3'
NB.EG dts=: (3&{) getVarInfo&.>(<VDIR);&.>MVN
)

NB.* output1Piece: put out one piece of normalize (DB-formed) data: var "MURD".
output1Piece=: 3 : 0
   'indir dat'=. y. [ flpfx=. 'MURD'
   fls=. {."1 jfi dir indir,flpfx,'*.DAT'
   fnums=. ".&>(#flpfx)}.&.>fls{.~&.>fls i.&.>'.'
   if. 0=#fnums do.
       fnum=. 0
   else.
       fnum=. >:>./fnums
   end.
NB.   mnm=. 'MURD',' '-.~":<:<.flctr%blksz      NB. Number block
   mnm=. 'MURD',' '-.~":fnum                 NB. Number block
   (mnm)=: dat
   if. >{.indir fileVar_WS_ mnm do.          NB. Remove vars if successfully
       1 [ 4!:55 &.><&.>'MURD';<mnm             NB.  filed.
   else. 0 [ smoutput 'Failed to file "','".',~mnm
   end.
)

NB.* normizeTSData: normalize training set data->Movie num, User, Rating, Date.
normizeTSData=: 3 : 0
   'indir blksz'=. y.
   indir=. endSlash indir [ inpfx=. 'urd'
   fls=. {."1 jfi dir indir,inpfx,'*.dat'
   for_flctr. i.#fls do. fl=. >flctr{fls
       mnum=. ".(#inpfx)}.flpfx=. fl{.~fl i. '.'
       rc=. indir unfileVar_WS_ tolower flpfx
       if. >{.rc do.
           varnm=. >2{>1{rc
           if. 0=blksz|flctr do.             NB. Break output into blocks to
               if. nameExists <'MURD' do.    NB.  avoid memory problems
                   output1Piece indir;<MURD
               end.
               MURD=: 4 0$0                  NB. Start new block.
           end.
           MURD=: MURD,.mnum,".varnm
           [4!:55 <varnm                NB. Erase var
       else. smoutput 'Could not unfile "','".',~flpfx
       end.
   end.
   if. 0<#MURD do.
       output1Piece indir;<MURD
   end.
)

NB.* varizeTSFiles: convert training set files into J vars and save to file.
varizeTSFiles=: 3 : 0
   'indir outdir'=. endSlash&.>y.
   fls=. {."1 jfi dir indir,'mv*.txt'
   szs=. 0$~#fls
   duperatings=. 0$~#fls
   for_flctr. i.#fls do. fl=. >flctr{fls
       mvdata=. mungeTrainingSet indir,fl
       if. >{.mvdata do.
           'rc mnum urd'=. mvdata
           (newnm=. 'urd',":mnum)=: urd
           if. >{.outdir fileVar_WS_ newnm do.
               szs=. (howbig 'urd') flctr}szs
               un=. #&>~.&.><"1 urd
               duperatings=. (({.un)~:1{$urd) flctr}duperatings
           else. smoutput 'Could not file ','.',~newnm end.
           [4!:55 <newnm
       else. smoutput 'Problem munging ','.',~fl end.
   end.
   szs;<duperatings
NB.EG varizeTSFiles 'C:\Data\Netflix\training_set\';'C:\Data\Netflix\VarsDir\'
)

howbig=: 7!:5@:<         NB.* howbig: Size of variable given its name.
NB. Pair of date converters ('YYYY-MM-DD'->YYYYMMDD) - first is more
NB. careful (doesn't assume 2-digit months and days); second is much faster.
cvtNDt=: 13 : '".;_4 _2 _2{.&.>(<''00''),&.>dsp&.><;._1 ''-'',y.'
cvtNDt2=: 13 : '".y.-.''-'''

NB.* mungeTrainingSet: convert a training set text file into vars.
mungeTrainingSet=: 3 : 0
   'mnum urd'=. split f2v y.
   mnum=. ".':'-.~;mnum
   urd=. <;._1 &> ',',&.>urd
   urd=. urd#~0 +./ .~: |:#&>urd

NB. 1.6304726 <-> 6!:2 'ad2=. cvtNDt2&> 2{"1 urd'
NB. 12.787668 <-> 6!:2 'ad=. cvtNDt&> 2{"1 urd'
NB. 1 <-> ad-:ad2

   ad2=. cvtNDt2&> 2{"1 urd
   ad=. cvtNDt&> 2{"1 urd
NB. This tells us if there is an inconsistency in the YYYY-MM-DD date format.
   if. -. ad-:ad2 do.
       rc=. 0;'Date mismatch';ad;<ad2
       return.
   end.

   ivn=. -.|:isValNum&>2{."1 urd
   whbad=. ,(i.#ivn),&.>I."1 ivn
   if. 0~:#whbad do.
       rc=. 0;('Invalid number','s'#~1~:#whbad);<whbad
       return.
   end.

   urd=. ad,~|:".&>2{."1 urd
   1;mnum;<urd      NB. Good return; Movie Number; User number, Rating, Date
NB.EG mvdata=. mungeTrainingSet '\Data\Netflix\training_set\mv_0005317.txt'
)

NB.* mungeMovieTitles: put movie titles file into an array.
mungeMovieTitles=: 3 : 0
NB. Put each LF-delimited line into a cell, then break apart cells based on
NB. comma delimiters.
   movies=. ><;._1&.>',',&.><;._1 LF,fread y.

NB. After breaking columns on commas, must fix titles w/embedded commas
NB. so we can put whole title into single column.
   titonly=. 2}."1 movies
   titonly=. ;&.> <"1 (0~:#&>titonly)#&.>',',~&.>dsp&.>titonly-.&.>CR
   titonly=. }:&.>titonly

   wh=. I.','+./ . e.&>titonly     NB. Find titles with embedded commas.
   titonly=. ((<',';', ') replace&.>wh{titonly) wh}titonly
   movies=. }:(2{."1 movies),.titonly
   mvhdr=. 'num';'year';'title'

NB. Convert numeric fields to numbers:
   mm=. |:".&.>2{."1 movies
   wh=. I. 1~:#&>1{mm              NB. Where are bad year numbers?
   mm=. |:2{."1 movies
   mm=. ((<'0') wh}1{mm)1}mm       NB. Bad year->0
   mnhdr=. 2{.mvhdr [ mn=. ".&>mm
   titles=. movies{"1~mvhdr i. <'title'
   mnhdr;mn;<titles                NB. Movie nums header, nums, titles
NB.EG mungeMovieTitles 'C:\Data\Netflix\movie_titles.txt'
NB.EG 'mnhdr mn titles'=. mungeMovieTitles dd,'movie_titles.txt'
)
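
For readers on a current J system, the two date converters above can be sketched in modern notation, with y in place of y.. This is an illustrative rewrite, not the code used at the meeting: the names cvtFast and cvtCareful are invented here, and the careful version leaves out the dsp blank-stripping step of the original.

```j
NB. Hypothetical modern-J versions of cvtNDt2 and cvtNDt (names are ours).

NB. Fast: assumes a fixed-width 'YYYY-MM-DD' string; drop hyphens and parse.
cvtFast=: 3 : '". y -. ''-'''

NB. Careful: box the hyphen-separated fields, left-pad each with zeros,
NB. keep the last 4, 2 and 2 characters, then parse the joined digits.
cvtCareful=: 3 : 0
  parts=. <;._1 '-' , y
  ". ; _4 _2 _2 {.&.> (<'00') ,&.> parts
)
```

A date with unpadded fields, such as '2005-9-6', gives 20050906 from cvtCareful but 200596 from cvtFast. That is the kind of mismatch the ad-:ad2 check in mungeTrainingSet is meant to catch.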

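The block-at-a-time assembly in normizeTSData exists only to stay within the memory limits of the time. Its core step, prefixing a movie-number row to each 3-row user/rating/date table and stitching the tables side by side, can be sketched on toy data as follows. The values and the names urd1, urd2 and murd are made up for illustration.

```j
NB. Two made-up per-movie tables; rows are user, rating, date (YYYYMMDD).
urd1=: 3 2 $ 11 12 5 3 20050906 20051001
urd2=: 3 1 $ 11 4 20040102

NB. x , y with a scalar x adds a leading row filled with x (the movie number);
NB. ,. then stitches the resulting 4-row tables together column-wise.
murd=: (1 , urd1) ,. (2 , urd2)

NB. Select the columns belonging to movie 1.
movie1=: (1 = {. murd) #"1 murd
```

Here murd has shape 4 3 and its first row is 1 1 2, so each column carries its own movie number, much as a row would in a normalized database table.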
Examples of Using the Code
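Here we show how the movie-title data can be cleaned up: finding titles with embedded commas, restoring the space after each of those commas, and locating records whose year is missing.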
<pre>
   I.','+./ . e.&>titonly
71 263 349 365 393 465 581 599 669 671 728 775 826 833 890 912 943 972 1009 1014 1057 1094 1169 1209 1285 1519 1558 1628 1646 1653 1657 1673 1744 1751 2106 2175 2177 2360 2409 2427 2432 2461 2597 2624 2627 2690 2721 2771 2795 2829 2853 2877 2888 2946 3022 ...
   71 263 349{titonly
+------------------------------------------------+-----------------------------------------+----------------------------------+
|At Home Among Strangers,A Stranger Among His Own|Angelina Ballerina: Lights,Camera,Action!|Dr. Quinn,Medicine Woman: Season 3|
+------------------------------------------------+-----------------------------------------+----------------------------------+
   wh=. I.','+./ . e.&>titonly
   titonly=. ((<',';', ') replace&.>wh{titonly) wh}titonly
   71 263 349{titonly
+-------------------------------------------------+-------------------------------------------+-----------------------------------+
|At Home Among Strangers, A Stranger Among His Own|Angelina Ballerina: Lights, Camera, Action!|Dr. Quinn, Medicine Woman: Season 3|
+-------------------------------------------------+-------------------------------------------+-----------------------------------+
   $movies=. }:(2{."1 movies),.titonly
17770 3
   _5{.movies
+-----+----+----------------------------------------------------------+
|17766|2002|Where the Wild Things Are and Other Maurice Sendak Stories|
+-----+----+----------------------------------------------------------+
|17767|2004|Fidel Castro: American Experience                         |
+-----+----+----------------------------------------------------------+
|17768|2000|Epoch                                                     |
+-----+----+----------------------------------------------------------+
|17769|2003|The Company                                               |
+-----+----+----------------------------------------------------------+
|17770|2003|Alien Hunter                                              |
+-----+----+----------------------------------------------------------+
   (10{.wh){movies
+---+----+-------------------------------------------------+
|72 |1974|At Home Among Strangers, A Stranger Among His Own|
+---+----+-------------------------------------------------+
|264|2002|Angelina Ballerina: Lights, Camera, Action!      |
+---+----+-------------------------------------------------+
|350|1993|Dr. Quinn, Medicine Woman: Season 3              |
+---+----+-------------------------------------------------+
|366|2004|Still, We Believe: The Boston Red Sox Movie      |
...
|600|1966|What's Up, Tiger Lily?                           |
+---+----+-------------------------------------------------+
|670|2002|He Loves Me, He Loves Me Not                     |
+---+----+-------------------------------------------------+
|672|1991|He Said, She Said                                |
+---+----+-------------------------------------------------+

   wh=. I. 1~:#&>1{mm
   wh{movies
+-----+----+-------------------------------------------+
|4388 |NULL|Ancient Civilizations: Rome and Pompeii    |
+-----+----+-------------------------------------------+
|4794 |NULL|Ancient Civilizations: Land of the Pharaohs|
+-----+----+-------------------------------------------+
|7241 |NULL|Ancient Civilizations: Athens and Greece   |
+-----+----+-------------------------------------------+
|10782|NULL|Roti Kapada Aur Makaan                     |
...
|15918|NULL|Hote Hote Pyaar Ho Gaya                    |
+-----+----+-------------------------------------------+
|16678|NULL|Jimmy Hollywood                            |
+-----+----+-------------------------------------------+
|17667|NULL|Eros Dance Dhamaka                         |
+-----+----+-------------------------------------------+
   NB. Jimmy Hollywood 1994; Roti Kapada Aur Makaan 1974; Hote Hote Pyaar Ho Gaya 1999

</pre>
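
The comma repair shown above is needed because every comma in a line is treated as a field separator. An alternative, sketched here in modern J and not taken from the meeting code, is to cut each line only at its first two commas so that the title is never split. The verb names splitTitleLine and yearNum are hypothetical, and the sketch assumes each line has at least two commas and has already had any trailing CR removed.

```j
NB. Split one line of movie_titles.txt into num;year;title (all as text).
splitTitleLine=: 3 : 0
  c=. 2 {. I. ',' = y                 NB. positions of the first two commas
  num=. ({. c) {. y                   NB. text before the first comma
  year=. (>: {. c) }. ({: c) {. y     NB. text between the two commas
  title=. (>: {: c) }. y              NB. everything after the second comma
  num ; year ; title
)

NB. Year as a number, with the literal NULL mapped to 0 as in mungeMovieTitles.
yearNum=: 3 : 0
  if. y -: 'NULL' do. 0 else. ". y end.
)
```

For example, splitTitleLine '4388,NULL,Ancient Civilizations: Rome and Pompeii' yields three boxes holding 4388, NULL and the title, and yearNum 'NULL' gives 0.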

Advanced topics

The group discussed techniques worth exploring for making forecasts from this data.

Learning and Teaching J

We discussed whether J has enough beginner material or whether it could be improved. The Beginner's Regatta section of our meetings is aimed at beginners, as a way to provide more points of entry into J.


   There are three types of information:
       * Need to know
       * Nice to know
       * Get a life
     - Shadow Wrought (586631)