NYCJUG/2017-08-08

From J Wiki
Jump to navigation Jump to search
When eras die, their legacies 
Are left to strange police. 
Professors in New England guard 
The glory that was Greece.
  - Clarence Day

Meeting Agenda for NYCJUG 20170808

             Meeting Agenda for NYCJUG 20170808
             ----------------------------------
1. Beginner's regatta: see "Finding Differences between Files".

2. Show-and-tell: see "Mass Table Creation".

3. Advanced topics: Arthur Whitney's comment that "Google invented
Map-Reduce in 2004 and Ken Iverson cleverly re-invented it in 1964" led me to
investigate this concept more deeply; see "Write Your First MapReduce Program
in 20 Minutes".

4. Learning and teaching J: see "Kdb+ Mastermind Challenge".

Beginner's regatta

We look at a technique for isolating differences between similar files.

Finding Differences between Files

Say we have some moderately large text files that were generated from different versions of a database:

Directory of C:\amisc\Somesys\Data\DiffsCompare
06/15/2017  10:19 AM        78,064,646 ver201703_.csv
06/15/2017  09:31 AM        80,718,136 ver201706_.csv

Preliminaries: Reading Data and Doing Base Comparisons

We want to know what are the specific differences between them, so we read them in and do some basic comparisons. One thing we know is that the first field is a date and the second is a company name and that these distinguish a record.

   dd=. <'C:\amisc\Somesys\Data\'
   fls=. 'ver201703_.csv';'ver201706_.csv'
   'fl0 fl1'=. fread&.>dd,&.>fls
   $fl0=. <;._1&>',',&.><;._2 ] LF (] , [ #~ [ ~: [: {: ]) CR-.~fl0
572009 14
   $fl1=. <;._1&>',',&.><;._2 ] LF (] , [ #~ [ ~: [: {: ]) CR-.~fl1
585319 15

One thing we know is that the first field is a date and the second is a company name and that these distinguish a record.

   dts0=. 0{"1 fl0 [ dts1=. 0{"1 fl1
   $dts0-.dts1
0
   $dts0-.~dts1
0

This shows us we have the same set of dates between the two files.

Now let's pick all the dates and names - column indexes 0 and 1 - from each file to form two sets of keys.

   keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1
   3{.keys0
+-----------------+----------------------------------+...
|Date$company_name|02/29/2004FORMOSA CHEMICAL & FIBER|...
+-----------------+----------------------------------+...
...----------------------------------+
...02/28/2005FORMOSA CHEMICAL & FIBER|
...----------------------------------+

We see that there are keys that do not match:

   $keys0-.keys1
29
   $keys0-.~keys1
13339

So the first dataset has 29 keys the second one does not and the second has over 13,000 the first does not.

Taking the smaller mismatch, we find where these records are:

   $keys0 i. keys0-.keys1
29

Check that there are not multiple instances of the mismatches:

   $I. keys0 e. keys0-.keys1
29
   wh0n1=. keys0 i. keys0-.keys1
   load 'dsv'

Reducing the Datasets

We want to reduce both datasets to include only those records found in both based on a couple of key columns.

Extract the records with keys found in the “version 3” file but not in “version 6” and save them in case we want to look at them later.

   (wh0n1{fl0) writedsv 'In03not06.csv';',';''
3211

Remove the records with mismatched keys from the appropriate file:

   $fl0=. (<^:3]wh0n1){fl0
571981 14

Regenerate both sets of keys – do both mainly because this line will be re-usable no matter which of the two files we changed.

   keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1

See that one mismatch still remains:

   $keys0-.keys1
1
   keys0-.keys1
+-------------------------------+
|02/28/2014DAISHIN SECURITIES CO|
+-------------------------------+

Repeat the process of removing and saving the mismatch, then removing that record from the appropriate file:

   ]wh0n1=. keys0 i. keys0-.keys1
35427
   (wh0n1{fl0) writedsv 'In03not06-2.csv';',';''
113
   $fl0=. (<^:3]wh0n1){fl0
571980 14

Upon re-generating the keys, we see that all mismatches from “fl0” are gone but that some remain in the other file:

   keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1
   $keys0-.keys1
0
   $keys0-.~keys1
13339
   $keys1 i. keys0-.~keys1
13339
   $I. keys1 e. keys0-.~keys1
13339

Repeat the preceding process of extracting records with mismatched keys, saving these, and removing the mismatches from the appropriate file:

   wh1n0=. keys1 i. keys0-.~keys1
   (wh1n0{fl1) writedsv 'In06not03.csv';',';''
1853586
   $fl1
585319 15
   $fl1=. (<^:3]wh1n0){fl1
572152 15

Continue with the same process as before until there are only records with matching keys in the two files:

   keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1
   $keys0-.~keys1
172
   wh1n0=. keys1 i. keys0-.~keys1
   (wh1n0{fl1) writedsv 'In06not03-2.csv';',';''
20607
   $fl1=. (<^:3]wh1n0){fl1
571981 15
   keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1
   $keys0-.~keys1
1
   wh1n0=. keys1 i. keys0-.~keys1
   (wh1n0{fl1) writedsv 'In06not03-3.csv';',';''
123
   $fl1=. (<^:3]wh1n0){fl1
571980 15
   keys0=. ;&.><"1]0 1{"1 fl0 [ keys1=. ;&.><"1]0 1{"1 fl1
   $keys0-.~keys1
0

Verifying Keys Match but Other Data Differs

Now we have two files that have only records that can be matched to each other based on their keys, but first check that the remaining records still differ; we check only the first 13 columns because of some meaningless differences after this column:

   (13{."1 fl0) -: 13{."1 fl1
0

Now we can pull out the mismatched records from one file:

   6!:2 'd01=. (13{."1 fl0)-.13{."1 fl1'
742.056
   $d01
34922 13

Do the same for the other file:

   6!:2 'd10=. (13{."1 fl0)-.~13{."1 fl1'
753.685
   $d10
34922 13

Write out the differing records in two files:

   d01 writedsv 'DiffsIn_ver03.csv';',';''
4465021
   d10 writedsv 'DiffsIn_ver06.csv';',';''
5295105

Show-and-tell

We explore how to generate large amounts of test data.

Mass Table Creation

We have a large, multi-column table of data items associated with a particular date and security identifier (known as a SEDOL). For performance reasons, we want to break out each data column as a separate table. The original table was a comma-delimited file like this:

   fsize flnm=. 'c:/amisc/Somesys/clients/QXM_75_20160909- SEDOL.csv'
755992309
   fread flnm;0,1000
datadate,SEDOL,CFROICHG,CFROIKM,PERTOBST,ADJVT,CFSURP,DEV,DRDIFF,DRUSED,GRUSED,ME,CFROIUSD,VCRATIO,WDRATIO,PCTFUTUR,FRUSED,BSTTPRC,PEFM,PEFY2M,EPSFGFY1,EPSFGFY2,EPSFY1,EPSFY2,PRCCMHLT,CFROISPF0,EPSGR5Y,SALEGRF0,PECURPRC,PEBITDAF0
11/11/2016,#N/A,-1.9,2.52,-50,-1.24,273.6,31,-1,3.01,18.47,1.007,1.4,3.59,4.3,67.93,66,5.94,-70.29,-170.71,-78,-59,-0.170009954,-0.070001753,11.95,-0.5,,14.5,-59.85,338.33
5/11/2012,2000019,-0.024,-0.593999982,-23.16500092,-0.345999986,18.56100082,32.76499939,-0.796999991,4.855000019,29.92399979,102.4560013,18.54999924,7.953999996,5.355999947,78.47000122,32.61299896,174.9389954,208.8809967,93.31099701,-21.58300018,123.852996...

This was loaded into a database table like this:

   fread 'ClientDBLoad.txt';10007,11000
5/23/2008	2000019	DEV	62.38299942
5/23/2008	2000019	DRDIFF	-0.616999984
5/23/2008	2000019	DRUSED	4.605999947
5/23/2008	2000019	GRUSED	22.2689991
5/23/2008	2000019	ME	32.67200089
…

The columns of this table are Date, SEDOL, Factor, Value.

We start by creating a template for table creation (this is in a MySQL database):

   tmplt=. 0 : 0
create table {tbl}
(   Date DATE,
    SEDOL varchar(7),
    Value DOUBLE) Engine=InnoDB DEFAULT CHARSET=utf8;
insert into {tbl} select Date,SEDOL,Value from client where Factor='{tbl}';
)

We want to make a table for each of these factors:

   #factors
28
   factors
+--------+-------+--------+-----+------+---+------+------+------+----+...
|CFROICHG|CFROIKM|PERTOBST|ADJVT|CFSURP|DEV|DRDIFF|DRUSED|GRUSED|CFROI...
+--------+-------+--------+-----+------+---+------+------+------+----+...

Create the SQL statements to do this using our template, substituting each factor name for the expression “{tbl}”:

templates=. (<tmplt) rplc &.><"1 (<'{tbl}'),.factors
   $templates
28
   >0{templates
create table CFROICHG
(   Date DATE,
    SEDOL varchar(7),
    Value DOUBLE) Engine=InnoDB DEFAULT CHARSET=utf8;
insert into CFROICHG select Date,SEDOL,Value from client where Factor='CFROICHG';

We write this out to a file to feed into the database, then use the above technique to generate the SQL to remove the factors for which we created a separate table from the parent table.

   (;LF,~&.>templates) fwrite 'createFactorTables.sql'
5315
   NB. delete from client where Factor='{tbl}';
   tmplt=. 0 : 0
select NOW();
create table {tbl}
(   Date DATE,
    SEDOL varchar(7),
    Value DOUBLE) Engine=InnoDB DEFAULT CHARSET=utf8;
insert into {tbl} select Date,SEDOL,Value from client where Factor='{tbl}';
select NOW();
delete from client where Factor='{tbl}';
select NOW();
)

      $templates=. (<tmplt) rplc &.><"1 (<'{tbl}'),.factors
28

We generate 28 customized SQL statements by replacing the text in our template with the factor names. Let's check the first one.

   >0{templates
select NOW();
create table CFROICHG
(   Date DATE,
    SEDOL varchar(7),
    Value DOUBLE) Engine=InnoDB DEFAULT CHARSET=utf8;
insert into CFROICHG select Date,SEDOL,Value from client where Factor='CFROICHG';
select NOW();
delete from client where Factor='CFROICHG';
select NOW();

The SQL we generate incorporates “NOW()” statements to show the current time before and after each step as a crude way to figure out how long each part of the process takes.

Advanced topics

[1].

Learning and Teaching J

Materials

Double brackets allow abbreviated (local) links but require a pipe to separate the associated text, like this: Building a GUI in J

  1. Example footnote

the monadic verb +x actually produces the complex conjugate of x.