User:Devon McCormick/JDBWithNetflixChallengeData
Here's my experience so far attempting to use JDB on the Netflix Challenge dataset.
First, I load JDB, then create my folder, database, and table for Netflix data:
load '~addons/data/jdb/jdb.ijs' hf=: Open_jdb_ 'C:\Data\Netflix' hd=: Create__hf 'nfbase' ht=: Create__hd 'mcrd';0 : 0 movid int custno int rating byte rdate int )
Notice the "in-line" table definition beginning with "0 : 0" above. There are four fields: the movie ID number, the customer number, the customer rating of the movie, and the rating date.
Now, I attempt to load the Netflix data into the database but, first, a word about this data and how I currently work with it. Since there's too much data to handle all at once - about 100 million records - I've broken it into 100 numbered variables which I've written to file. Each variable is a four-row table of integers, as seen here in variable "murd0" retrieved from file:
VDIR \Data\Netflix\VarsDir\ VDIR unfileVar_WS_ 'murd0' NB. Movie, User, Rating, Date +-+----------------------------------------+ |1|+----------------------+---------+-----+| | ||\Data\Netflix\VarsDir\|MURD0.DAT|murd0|| | |+----------------------+---------+-----+| +-+----------------------------------------+ $murd0 4 982878 5{."1 murd0 1 1 1 1 1 1596531 1366860 1181550 1227322 2413320 5 4 3 4 4 20040123 20040126 20040201 20040206 20040206
As you can see, this is a matrix of integers with a little less than one million columns. In the table definition, I attempted to save some space by defining the rating as a byte since it can only have the values 1-5.
I'm attempting to load this data from these 100 variables into the database in the way I usually deal with this data - using an adverb that applies a verb to specified variables on file:
NB.* getVarInfo: apply arbitrary function to each (filed) var named. getVarInfo=: 1 : 0 'dd varnm'=. y. NB. Vars dir, var names. rc=. dd unfileVar_WS_ varnm if. >{.rc do. rc=. 1;u. ".varnm [4!:55 <varnm end. rc NB.EG ({."1,.{:"1) getVarInfo &.>(<'C:\data\');&.>'var1';'var2';'var3' NB.EG dts=: (3&{) getVarInfo&.>(<VDIR);&.>MVN )
I have a list "MVN" of the variable names "murd0" through "murd99", so after defining the database update verb like this:
insRecs=: 3 : 0 Insert__hd 'mcrd';<(<'012345'{~>2{y) 2}y=. <"1 y )
As you can see, I convert the rating to a character on-the-fly. So, I should be able to do this to load all the records into the database:
rc=. insRecs getVarInfo&.>(<VDIR);&.>MVN
However, after processing about 30 of these million column variables, I get this error:
|bad view: assert | 'bad view' assert 0[free fh,mh,0
Here's what the table directory looks like at this point:
dir 'C:\Data\Netflix\nfbase\mcrd\*.*' +--+-------------------+---------+---+------+ |c0|2008 10 19 21 15 40|136762360|rw-|-----a| +--+-------------------+---------+---+------+ |c1|2008 10 19 21 15 40|136762360|rw-|-----a| +--+-------------------+---------+---+------+ |c2|2008 10 19 21 15 40|136762360|rw-|-----a| +--+-------------------+---------+---+------+ |c3|2008 10 19 21 15 40|136762360|rw-|-----a| +--+-------------------+---------+---+------+ |c4|2008 10 19 21 15 40|34191553 |rw-|-----a| +--+-------------------+---------+---+------+ |c5|2008 10 19 21 15 40|136762360|rw-|-----a| +--+-------------------+---------+---+------+ |d0|2008 10 19 21 15 40|136762360|rw-|-----a| +--+-------------------+---------+---+------+ |d1|2008 10 19 21 9 24 |5284 |rw-|-----a| +--+-------------------+---------+---+------+ |d2|2008 10 19 21 9 24 |1296 |rw-|-----a| +--+-------------------+---------+---+------+ +/,>2{"1 dir 'C:\Data\Netflix\nfbase\mcrd\*.*' 854772293
There are some records loaded:
Reads__hd 'from mcrd where custno=25' +-----+------+------+--------+ |movid|custno|rating|rdate | +-----+------+------+--------+ |3427 |25 |2 |20050831| |4432 |25 |5 |20050831| |4656 |25 |3 |20050831| |4996 |25 |2 |20050831| |5317 |25 |4 |20050831| |6037 |25 |4 |20050831| |6786 |25 |5 |20050831| +-----+------+------+--------+
but not all of them:
0 0$''[VDIR unfileVar_WS_ 'murd99' 5{."1 murd99 17622 17622 17622 17622 17622 816549 359127 921587 2483371 1265459 3 4 2 4 5 20011216 20011218 20011218 20011221 20011222 NB. "17622" is first movie number in last var: Reads__hd 'from mcrd where movid=17622' +-----+------+------+-----+ |movid|custno|rating|rdate| +-----+------+------+-----+ +-----+------+------+-----+
The following appears to be the last variable from file loaded:
0 0$''[VDIR unfileVar_WS_ 'murd37' 5{."1 murd37 6722 6722 6722 6722 6722 1539813 302850 1337655 1003230 2012582 1 4 5 5 5 20030417 20030420 20030420 20030424 20030425 $&.>Reads__hd 'from mcrd where movid=6722' +------+------+------+------+ |5 |6 |6 |5 | +------+------+------+------+ |1105 1|1105 1|1105 1|1105 1| +------+------+------+------+
So, I was able to get a little over one-third of the data loaded. I suspect I'm running into system-related limits to paging or other memory space though I have a 4GB pagefile defined on this machine.
I got the following message from JDB Browser upon attempting to open "Data" tab:
Query: movid,custno,rating,rdate from mcrd |out of memory: getwhere | I.active No data
See further JDB/Netflix -- Oleg Kobchenko <<DateTime(2008-10-22T06:09:46Z)>>