NYCJUG/2017-08-08
When eras die, their legacies Are left to strange police. Professors in New England guard The glory that was Greece. - Clarence Day
Meeting Agenda for NYCJUG 20170808
Meeting Agenda for NYCJUG 20170808 ---------------------------------- 1. Beginner's regatta: see "Finding Differences between Files". 2. Show-and-tell: see "Mass Table Creation". 3. Advanced topics: Arthur Whitney's comment that "Google invented Map-Reduce in 2004 and Ken Iverson cleverly re-invented it in 1964" led me to investigate this concept more deeply; see "Write Your First MapReduce Program in 20 Minutes". 4. Learning and teaching J: see "Kdb+ Mastermind Challenge".
Beginner's regatta
We look at a technique for isolating differences between similar files.
Finding Differences between Files
Say we have some moderately large text files that were generated from different versions of a database:
Directory of C:\amisc\Somesys\Data\DiffsCompare 06/15/2017 10:19 AM 78,064,646 ver201703_.csv 06/15/2017 09:31 AM 80,718,136 ver201706_.csv
Preliminaries: Reading Data and Doing Base Comparisons
We want to know what are the specific differences between them, so we read them in and do some basic comparisons. One thing we know is that the first field is a date and the second is a company name and that these distinguish a record.
dd=. <'C:\amisc\Somesys\Data\' fls=. 'ver201703_.csv';'ver201706_.csv' 'fl0 fl1'=. fread&.>dd,&.>fls $fl0=. <;._1&>',',&.><;._2 ] LF (] , [ #~ [ ~: [: {: ]) CR-.~fl0 572009 14 $fl1=. <;._1&>',',&.><;._2 ] LF (] , [ #~ [ ~: [: {: ]) CR-.~fl1 585319 15
One thing we know is that the first field is a date and the second is a company name and that these distinguish a record.
dts0=. 0{"1 fl0 [ dts1=. 0{"1 fl1 $dts0-.dts1 0 $dts0-.~dts1 0
This shows us we have the same set of dates between the two files.
Now let's pick all the dates and names - column indexes 0 and 1 - from each file to form two sets of keys.
keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1 3{.keys0 +-----------------+----------------------------------+... |Date$company_name|02/29/2004FORMOSA CHEMICAL & FIBER|... +-----------------+----------------------------------+... ...----------------------------------+ ...02/28/2005FORMOSA CHEMICAL & FIBER| ...----------------------------------+
We see that there are keys that do not match:
$keys0-.keys1 29 $keys0-.~keys1 13339
So the first dataset has 29 keys the second one does not and the second has over 13,000 the first does not.
Taking the smaller mismatch, we find where these records are:
$keys0 i. keys0-.keys1 29
Check that there are not multiple instances of the mismatches:
$I. keys0 e. keys0-.keys1 29 wh0n1=. keys0 i. keys0-.keys1 load 'dsv'
Reducing the Datasets
We want to reduce both datasets to include only those records found in both based on a couple of key columns.
Extract the records with keys found in the “version 3” file but not in “version 6” and save them in case we want to look at them later.
(wh0n1{fl0) writedsv 'In03not06.csv';',';'' 3211
Remove the records with mismatched keys from the appropriate file:
$fl0=. (<^:3]wh0n1){fl0 571981 14
Regenerate both sets of keys – do both mainly because this line will be re-usable no matter which of the two files we changed.
keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1
See that one mismatch still remains:
$keys0-.keys1 1 keys0-.keys1 +-------------------------------+ |02/28/2014DAISHIN SECURITIES CO| +-------------------------------+
Repeat the process of removing and saving the mismatch, then removing that record from the appropriate file:
]wh0n1=. keys0 i. keys0-.keys1 35427 (wh0n1{fl0) writedsv 'In03not06-2.csv';',';'' 113 $fl0=. (<^:3]wh0n1){fl0 571980 14
Upon re-generating the keys, we see that all mismatches from “fl0” are gone but that some remain in the other file:
keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1 $keys0-.keys1 0 $keys0-.~keys1 13339 $keys1 i. keys0-.~keys1 13339 $I. keys1 e. keys0-.~keys1 13339
Repeat the preceding process of extracting records with mismatched keys, saving these, and removing the mismatches from the appropriate file:
wh1n0=. keys1 i. keys0-.~keys1 (wh1n0{fl1) writedsv 'In06not03.csv';',';'' 1853586 $fl1 585319 15 $fl1=. (<^:3]wh1n0){fl1 572152 15
Continue with the same process as before until there are only records with matching keys in the two files:
keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1 $keys0-.~keys1 172 wh1n0=. keys1 i. keys0-.~keys1 (wh1n0{fl1) writedsv 'In06not03-2.csv';',';'' 20607 $fl1=. (<^:3]wh1n0){fl1 571981 15 keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1 $keys0-.~keys1 1 wh1n0=. keys1 i. keys0-.~keys1 (wh1n0{fl1) writedsv 'In06not03-3.csv';',';'' 123 $fl1=. (<^:3]wh1n0){fl1 571980 15 keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1 $keys0-.~keys1 0
Verifying Keys Match but Other Data Differs
Now we have two files that have only records that can be matched to each other based on their keys, but first check that the remaining records still differ; we check only the first 13 columns because of some meaningless differences after this column:
(13{."1 fl0) -: 13{."1 fl1 0
Now we can pull out the mismatched records from one file:
6!:2 'd01=. (13{."1 fl0)-.13{."1 fl1' 742.056 $d01 34922 13
Do the same for the other file:
6!:2 'd10=. (13{."1 fl0)-.~13{."1 fl1' 753.685 $d10 34922 13
Write out the differing records in two files:
d01 writedsv 'DiffsIn_ver03.csv';',';'' 4465021 d10 writedsv 'DiffsIn_ver06.csv';',';'' 5295105
Show-and-tell
We explore how to generate large amounts of test data.
Mass Table Creation
We have a large, multi-column table of data items associated with a particular date and security identifier (known as a SEDOL). For performance reasons, we want to break out each data column as a separate table. The original table was a comma-delimited file like this:
fsize flnm=. 'c:/amisc/Somesys/clients/QXM_75_20160909- SEDOL.csv' 755992309 fread flnm;0,1000 datadate,SEDOL,CFROICHG,CFROIKM,PERTOBST,ADJVT,CFSURP,DEV,DRDIFF,DRUSED,GRUSED,ME,CFROIUSD,VCRATIO,WDRATIO,PCTFUTUR,FRUSED,BSTTPRC,PEFM,PEFY2M,EPSFGFY1,EPSFGFY2,EPSFY1,EPSFY2,PRCCMHLT,CFROISPF0,EPSGR5Y,SALEGRF0,PECURPRC,PEBITDAF0 11/11/2016,#N/A,-1.9,2.52,-50,-1.24,273.6,31,-1,3.01,18.47,1.007,1.4,3.59,4.3,67.93,66,5.94,-70.29,-170.71,-78,-59,-0.170009954,-0.070001753,11.95,-0.5,,14.5,-59.85,338.33 5/11/2012,2000019,-0.024,-0.593999982,-23.16500092,-0.345999986,18.56100082,32.76499939,-0.796999991,4.855000019,29.92399979,102.4560013,18.54999924,7.953999996,5.355999947,78.47000122,32.61299896,174.9389954,208.8809967,93.31099701,-21.58300018,123.852996...
This was loaded into a database table like this:
fread 'ClientDBLoad.txt';10007,11000 5/23/2008 2000019 DEV 62.38299942 5/23/2008 2000019 DRDIFF -0.616999984 5/23/2008 2000019 DRUSED 4.605999947 5/23/2008 2000019 GRUSED 22.2689991 5/23/2008 2000019 ME 32.67200089 …
The columns of this table are Date, SEDOL, Factor, Value.
We start by creating a template for table creation (this is in a MySQL database):
tmplt=. 0 : 0 create table {tbl} ( Date DATE, SEDOL varchar(7), Value DOUBLE) Engine=InnoDB DEFAULT CHARSET=utf8; insert into {tbl} select Date,SEDOL,Value from client where Factor='{tbl}'; )
We want to make a table for each of these factors:
#factors 28 factors +--------+-------+--------+-----+------+---+------+------+------+----+... |CFROICHG|CFROIKM|PERTOBST|ADJVT|CFSURP|DEV|DRDIFF|DRUSED|GRUSED|CFROI... +--------+-------+--------+-----+------+---+------+------+------+----+...
Create the SQL statements to do this using our template, substituting each factor name for the expression “{tbl}”:
templates=. (<tmplt) rplc &.><"1 (<'{tbl}'),.factors $templates 28 >0{templates create table CFROICHG ( Date DATE, SEDOL varchar(7), Value DOUBLE) Engine=InnoDB DEFAULT CHARSET=utf8; insert into CFROICHG select Date,SEDOL,Value from client where Factor='CFROICHG';
We write this out to a file to feed into the database, then use the above technique to generate the SQL to remove the factors for which we created a separate table from the parent table.
(;LF,~&.>templates) fwrite 'createFactorTables.sql' 5315 NB. delete from client where Factor='{tbl}'; tmplt=. 0 : 0 select NOW(); create table {tbl} ( Date DATE, SEDOL varchar(7), Value DOUBLE) Engine=InnoDB DEFAULT CHARSET=utf8; insert into {tbl} select Date,SEDOL,Value from client where Factor='{tbl}'; select NOW(); delete from client where Factor='{tbl}'; select NOW(); ) $templates=. (<tmplt) rplc &.><"1 (<'{tbl}'),.factors 28
We generate 28 customized SQL statements by replacing the text in our template with the factor names. Let's check the first one.
>0{templates select NOW(); create table CFROICHG ( Date DATE, SEDOL varchar(7), Value DOUBLE) Engine=InnoDB DEFAULT CHARSET=utf8; insert into CFROICHG select Date,SEDOL,Value from client where Factor='CFROICHG'; select NOW(); delete from client where Factor='CFROICHG'; select NOW();
The SQL we generate incorporates “NOW()” statements to show the current time before and after each step as a crude way to figure out how long each part of the process takes.
Advanced topics
[1].
Learning and Teaching J
Materials
Double brackets allow abbreviated (local) links but require a pipe to separate the associated text, like this:
Building a GUI in J
- Attachment: * attachment:
- Link: [https:]
- Picture:
- ↑ Example footnote
the monadic verb +x actually produces the complex conjugate of x.