Martin Saurer - Data Manipulation
Over the past few years, J has become an important programming language and environment for me for analyzing and manipulating data.
Introduction:
When implementing a bidirectional interface to integrate a 3D CAD application into a document management system (DMS), there are some nasty problems and quite a few performance killers. A CAD/DMS interface must handle many document-related processes (create, change, check-in/out, etc.). One of these processes is to read a whole assembly structure, best compared to the folder structure of a file system, and pass the data to the document creation process of the DMS. Some customers want to load 3D CAD assemblies with about 200’000 components; a mid-sized (everyday) assembly contains about 70’000. The underlying architecture of the third-party CAD/DMS interface is designed for assemblies of about 10’000 components, which is sufficient for most customers, but its performance becomes unacceptable when dealing with really large assemblies.
The Problem:
To read an assembly structure with 70’000 components, eliminate duplicate file entries (one document record is created per file), and write the optimized data into a database table (MS Access), the existing solution took several hours to complete. Eliminating the duplicates with a single SQL command was the biggest performance killer, apparently because the correlated subquery shown below re-examines the table for every row. We found a way to omit this SQL command, reducing the processing time to about 30 minutes. Not perfect, but OK for a first shot. Still, 30 minutes of processing time was unacceptable for some users.
The performance killer:
DELETE FROM MultilevelStructure AS A
WHERE A.ItemIndex >
   (SELECT MIN(B.ItemIndex)
      FROM MultilevelStructure AS B
     WHERE A.DocFile = B.DocFile
       AND ((A.ParDocFile = B.ParDocFile) OR
            (A.ParDocFile IS NULL AND B.ParDocFile IS NULL))
       AND A.ConfigName = B.ConfigName
       AND A.ItemIndex <> B.ItemIndex);
The Solution:
So I decided to give J a try. Instead of using the API of the CAD application to read the assembly structure, we used a function to write the whole assembly structure to a text file. This step completes in about one minute for roughly 70’000 components. The next step was to read and parse that file to build an array, eliminate duplicate file entries, and write the result to an MS Access database table via ODBC for further processing by the DMS. The DMS itself does not rely on MS Access; Access merely serves as a buffer to store some data temporarily.
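With the three verbs developed below (txt2tab to parse, remdub to deduplicate, tab2mdb to write), the whole pipeline reduces to a single line; a sketch, assuming the CAD application wrote the structure listing to a file named SVA.txt:

   tab2mdb remdub txt2tab 'SVA.txt'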
The Input:
Modellname                      BENENNUNG
---------------------------------------------
ANLAGE.ASM                      Gesamtmodell
  PLAN_2012-03.ASM              Plan Kubatur
    PLAN_2012-03_SK0001.PRT     .
    Gruppe G_WS
      PLAN_2012-03_WS01.PRT     .
      PLAN_2012-03_WS02.PRT     .
    Gruppe G_R1
      PLAN_2012-03_R101.PRT     .
…
The resulting J array:
+--------+-------+-----------------------+----------------+----------------------+--------+
|DocIndex|DocType|DocFile                |DocParent       |DocText               |DocLevel|
+--------+-------+-----------------------+----------------+----------------------+--------+
|0       |A      |ANLAGE.ASM             |                |SVA Gesamtmodell      |0       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|1       |A      |PLAN_2012-03.ASM       |ANLAGE.ASM      |SVA Grubenplan Kubatur|1       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|2       |P      |PLAN_2012-03_SK0001.PRT|PLAN_2012-03.ASM|.                     |2       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|3       |P      |Gruppe G_WS            |PLAN_2012-03.ASM|                      |2       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|4       |P      |PLAN_2012-03_WS01.PRT  |Gruppe G_WS     |.                     |3       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|5       |P      |PLAN_2012-03_WS02.PRT  |Gruppe G_WS     |.                     |3       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|6       |P      |Gruppe G_R1            |PLAN_2012-03.ASM|                      |2       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|7       |P      |PLAN_2012-03_R101.PRT  |Gruppe G_R1     |.                     |3       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|…       |…      |…                      |…               |…                     |…       |
+--------+-------+-----------------------+----------------+----------------------+--------+
Read and parse the input file:
txt2tab =: 3 : 0
NB. Initialization
TextFile    =. y
DocIndex    =. _1
ParentStack =. ''
LastLevel   =. _1

NB. Read text file
txt =. fread TextFile

NB. Split string txt into an array of lines
lns =. crlf splitstring txt

NB. Find the start column of the BENENNUNG field in the header line (SVA.txt)
ix1 =. 0
ix2 =. 'BENENNUNG' find >0{lns

NB. Cut the first two lines (header and ----)
lns =. (2+i.((0{$lns)-2)){lns

NB. Initialize the result with the header line
AllRecords =. HeaderLine
BufRecords =. ''
OneRecord  =. ''

NB. Loop over lines
for_i. i. $lns do.

  NB. Take one line
  lin =. >i{lns

  NB. Check that it is not an empty line
  if. 0 < $lin do.

    NB. Chop it into its components
    box =. lin chop ix2,(($lin)-ix2)

    NB. Set field values
    DocIndex  =. DocIndex + 1
    DocType   =. 'P'
    DocFile   =. rtrim (1+i.((0{$>0{box)-1)){>0{box
    DocParent =. ' '
    DocText   =. trim >1{box
    DocLevel  =. ((DocFile e. ' ') i. 0) % 2

    NB. Set DocType based on the file extension
    if. ('.asm' findi DocFile) < ($DocFile) do. DocType =. 'A' end.
    if. ('.drw' findi DocFile) < ($DocFile) do. DocType =. 'D' end.

    NB. Handle DocLevel (in fact we build a structure on line indent)
    if. DocLevel > LastLevel do.
      ParentStack =. ParentStack push DocFile
      LastLevel   =. DocLevel
      DocParent   =. pick ParentStack
    else.
      if. DocLevel < LastLevel do.
        ParentStack =. ParentStack popp (1+LastLevel-DocLevel)
        ParentStack =. ParentStack push DocFile
        LastLevel   =. DocLevel
        DocParent   =. pick ParentStack
      else.
        ParentStack =. ParentStack popp 1
        ParentStack =. ParentStack push DocFile
        LastLevel   =. DocLevel
        DocParent   =. pick ParentStack
      end.
    end.

    NB. Trim necessary fields
    DocFile   =. trim DocFile
    DocParent =. trim DocParent

    NB. Replace empty DocParent and DocText with a single blank
    NB. (branch-free version of:
    NB.    if. DocParent -: '' do. DocParent =. ' ' end.
    NB.    if. DocText   -: '' do. DocText   =. ' ' end. )
    DocParent =. DocParent,(0<-.#DocParent)$' '
    DocText   =. DocText ,(0<-.#DocText )$' '

    NB. Build OneRecord
    OneRecord =. DocIndex;DocType;DocFile;DocParent;DocText;DocLevel

    NB. Add OneRecord to BufRecords
    BufRecords =. BufRecords,OneRecord

    NB. We buffer 1000 records, then add them to the whole table.
    NB. This is much faster than appending records one by one,
    NB. because of memory management issues.
    if. (1000 | i) = 0 do.
      AllRecords =. AllRecords,BufRecords
      BufRecords =. ''
    end.
  end.
end.

NB. Add the last records and shape the flat list into a 6-column table
AllRecords =. AllRecords,BufRecords
AllRecords =. ((($AllRecords)%($HeaderLine)),($HeaderLine))$AllRecords
)
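The verbs in this article lean on a handful of small utility words (find, findi, chop, join, push, popp, pick, HeaderLine, and trim/rtrim) that are not shown here. The following is a minimal sketch of workable definitions; these are illustrative reconstructions, not the originals, and your own utility library may differ:

load 'strings'       NB. standard library: splitstring, dtb/dltb, tolower

crlf  =: 13 10 { a.  NB. CR,LF line terminator
trim  =: dltb        NB. delete leading and trailing blanks
rtrim =: dtb         NB. delete trailing blanks

NB. index where string x first starts in y (#y if absent),
NB. and a case-insensitive variant
find  =: E. i. 1:
findi =: tolower@[ find tolower@]

NB. cut string x into consecutive pieces of lengths y (blank-padded)
chop  =: 4 : 'y {.each (+/\ 0 , }: y) }.each < x'

NB. join boxed strings y with separator x
join  =: 4 : '(#x) }. ; (<x) ,each y'

NB. stack of boxed names: push a name, pop y names, and pick the
NB. element below the top (the parent), or '' if there is none
push  =: 4 : 'x , <y'
popp  =: 4 : '(- y <. #x) }. x'
pick  =: 3 : 'if. 2 > #y do. '''' else. > _2 { y end.'

NB. column names of the result table
HeaderLine =: 'DocIndex';'DocType';'DocFile';'DocParent';'DocText';'DocLevel'

With these in place, tab =: txt2tab 'SVA.txt' produces a table like the one shown above.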
Eliminating duplicate file entries:
dubl =: 4 : '~. ,. x {"1 y'

remdub =: 3 : 0
NB. y is the source array (table)
iarr =. y
NB. Unique values of column x=2 (DocFile)
sarr =. 2 dubl iarr
NB. Get column 2 of the input array
fils =. 2{|:iarr
NB. Get the index of the first occurrence of each element of sarr in fils
narr =. fils i. sarr
NB. Finally extract those records from iarr
oarr =. (,/narr){iarr
)
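A quick illustration on a toy table of the same shape (the rows are made up; column 2 holds DocFile):

   r0 =. 0 ; 'A' ; 'X.ASM' ; ' '     ; 'top' ; 0
   r1 =. 1 ; 'P' ; 'Y.PRT' ; 'X.ASM' ; '.'   ; 1
   r2 =. 2 ; 'P' ; 'X.ASM' ; 'Y.PRT' ; '.'   ; 2
   remdub r0 , r1 ,: r2   NB. keeps rows 0 and 1; the second X.ASM row is gone

Only the first occurrence of each DocFile survives, which is exactly what the document creation process needs.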
Of course, these J sentences do not do exactly the same as the SQL command, but they are sufficient to remove the duplicate file entries for further processing. This J word completes in less than one second on a table of 70’000 rows.
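(Such timings can be taken directly in J with the time foreign 6!:2; the argument below is illustrative:)

   6!:2 'remdub tab'   NB. elapsed seconds for one application of remdub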
Write the J array into an Access database using ODBC:
tab2mdb =: 3 : 0
NB. Input array (iarr) in y
iarr =. y
NB. Remove header line
iarr =. (1+i.(0{(($iarr)-1))){iarr
NB. Connect to database
dbco =. ddcheck ddcon 'dsn=DataBase'
NB. Delete contents of table
'delete from DocStruct' ddsql dbco
NB. Loop over array and insert records
for_i. i. (0{$iarr) do.
  irec =. i{iarr
  sqlt =. 'insert into DocStruct ('
  sqlt =. sqlt,(',' join HeaderLine)
  sqlt =. sqlt,') values ('
  sqlt =. sqlt, (": >0{irec), ','
  sqlt =. sqlt,'''', ( >1{irec),''','
  sqlt =. sqlt,'''', ( >2{irec),''','
  sqlt =. sqlt,'''', ( >3{irec),''','
  sqlt =. sqlt,'''', ( >4{irec),''','
  sqlt =. sqlt, (": >5{irec), ')'
  sqlt ddsql dbco
end.
NB. Disconnect and release the connection
empty dddis dbco
)
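One caveat with building INSERT statements by plain string concatenation: a file name or description containing a single quote would break the generated SQL. The code above does not guard against this; a minimal sketch of such a guard, doubling embedded quotes (the standard SQL escape), could look like this:

NB. double every single quote in a string
esc =: #~ >:@(''''&=)

NB. then build the statement with e.g. (esc >2{irec) instead of (>2{irec)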
The whole read-and-analyze process now requires about 3 minutes for 70’000 components, compared to 30 minutes before: using J, the time consumption for this process was reduced by a factor of 10!
Conclusion:
You may now ask: what is wrong with Access and SQL? Nothing as such, but we found that processing tables with an SQL command like the one above works well for fewer than 10’000 records; beyond that, the time consumption grows far faster than linearly. Perhaps on larger database systems like PostgreSQL, MySQL, Oracle, or SQL Server such a command would complete in a few seconds; I don't know, because I never tried it. In this case, the MS Access database cannot be replaced with another one, so the J solution is the best I have found so far.
Email: <<MailTo(martin DOT saurer AT SPAMFREE bluewin DOT ch)>>