Martin Saurer - Data Manipulation
Over the past few years, J has become an important programming language and environment for me for analyzing and manipulating data.
Introduction:
When implementing a bidirectional interface to integrate a 3D CAD application into a document management system (DMS), there are some nasty problems and quite a few performance killers. A CAD/DMS interface must handle many document-related processes (create, change, check-in/out, etc.). One of these processes is to read a whole assembly structure, best compared to the folder structure of a file system, and pass the data to the document creation process of the DMS. Some customers want to load 3D CAD assemblies with about 200’000 components; a mid-sized (everyday) assembly contains about 70’000. The underlying architecture of the third-party CAD/DMS interface is designed for assemblies of about 10’000 components, which is sufficient for most customers, but its performance becomes unacceptable when dealing with really large assemblies.
The Problem:
To read an assembly structure with 70’000 components, eliminate duplicate file entries (one document record is created per file), and write the optimized data into a database table (MS Access), the existing solution took several hours to complete. Eliminating the duplicates with a single SQL command was the biggest performance killer, apparently because the correlated subquery shown below re-examines the table for every row. We found a way to omit this SQL command, reducing the processing time to about 30 minutes. Not perfect, but OK for a first shot. Still, 30 minutes of processing time was unacceptable for some users.
The performance killer:
DELETE FROM MultilevelStructure AS A
WHERE A.ItemIndex >
   (SELECT MIN(B.ItemIndex)
      FROM MultilevelStructure AS B
     WHERE A.DocFile = B.DocFile
       AND ((A.ParDocFile = B.ParDocFile) OR
            (A.ParDocFile IS NULL AND B.ParDocFile IS NULL))
       AND A.ConfigName = B.ConfigName
       AND A.ItemIndex <> B.ItemIndex);
The Solution:
So I decided to give J a try. Instead of using the API of the CAD application to read the assembly structure, we used a function to write the whole assembly structure to a text file. This step completes in about one minute for roughly 70’000 components. The next step was to read and parse that file to build an array, eliminate duplicate file entries, and write the result to an MS Access database table via ODBC for further processing by the DMS. The DMS itself does not rely on MS Access; Access merely serves as a buffer to store some data temporarily.
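With the three verbs developed below (txt2tab to parse, remdub to deduplicate, tab2mdb to write), the whole pipeline reduces to a single line; a sketch, assuming the CAD application wrote the structure listing to a file named SVA.txt:

   tab2mdb remdub txt2tab 'SVA.txt'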
The Input:
Modellname                      BENENNUNG
---------------------------------------------
ANLAGE.ASM                      Gesamtmodell
  PLAN_2012-03.ASM              Plan Kubatur
    PLAN_2012-03_SK0001.PRT     .
    Gruppe G_WS
      PLAN_2012-03_WS01.PRT     .
      PLAN_2012-03_WS02.PRT     .
    Gruppe G_R1
      PLAN_2012-03_R101.PRT     .
…
The resulting J array:
+--------+-------+-----------------------+----------------+----------------------+--------+
|DocIndex|DocType|DocFile                |DocParent       |DocText               |DocLevel|
+--------+-------+-----------------------+----------------+----------------------+--------+
|0       |A      |ANLAGE.ASM             |                |SVA Gesamtmodell      |0       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|1       |A      |PLAN_2012-03.ASM       |ANLAGE.ASM      |SVA Grubenplan Kubatur|1       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|2       |P      |PLAN_2012-03_SK0001.PRT|PLAN_2012-03.ASM|.                     |2       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|3       |P      |Gruppe G_WS            |PLAN_2012-03.ASM|                      |2       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|4       |P      |PLAN_2012-03_WS01.PRT  |Gruppe G_WS     |.                     |3       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|5       |P      |PLAN_2012-03_WS02.PRT  |Gruppe G_WS     |.                     |3       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|6       |P      |Gruppe G_R1            |PLAN_2012-03.ASM|                      |2       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|7       |P      |PLAN_2012-03_R101.PRT  |Gruppe G_R1     |.                     |3       |
+--------+-------+-----------------------+----------------+----------------------+--------+
|…       |…      |…                      |…               |…                     |…       |
+--------+-------+-----------------------+----------------+----------------------+--------+
Read and parse the input file:
txt2tab =: 3 : 0
NB. Initialization
TextFile    =. y
DocIndex    =. _1
ParentStack =. ''
LastLevel   =. _1

NB. Read text file
txt =. fread TextFile

NB. Split string txt into an array of lines
lns =. crlf splitstring txt

NB. Find the start column of the BENENNUNG field in the header line (SVA.txt)
ix1 =. 0
ix2 =. 'BENENNUNG' find >0{lns

NB. Cut the first two lines (header and ----)
lns =. (2+i.((0{$lns)-2)){lns

NB. Initialize the result with the header line
AllRecords =. HeaderLine
BufRecords =. ''
OneRecord  =. ''

NB. Loop over lines
for_i. i. $lns do.

  NB. Take one line
  lin =. >i{lns

  NB. Check that it is not an empty line
  if. 0 < $lin do.

    NB. Chop it into its components
    box =. lin chop ix2,(($lin)-ix2)

    NB. Set field values
    DocIndex  =. DocIndex + 1
    DocType   =. 'P'
    DocFile   =. rtrim (1+i.((0{$>0{box)-1)){>0{box
    DocParent =. ' '
    DocText   =. trim >1{box
    DocLevel  =. ((DocFile e. ' ') i. 0) % 2

    NB. Set DocType based on the file extension
    if. ('.asm' findi DocFile) < ($DocFile) do. DocType =. 'A' end.
    if. ('.drw' findi DocFile) < ($DocFile) do. DocType =. 'D' end.

    NB. Handle DocLevel (in fact we build a structure on line indent)
    if. DocLevel > LastLevel do.
      ParentStack =. ParentStack push DocFile
      LastLevel   =. DocLevel
      DocParent   =. pick ParentStack
    else.
      if. DocLevel < LastLevel do.
        ParentStack =. ParentStack popp (1+LastLevel-DocLevel)
        ParentStack =. ParentStack push DocFile
        LastLevel   =. DocLevel
        DocParent   =. pick ParentStack
      else.
        ParentStack =. ParentStack popp 1
        ParentStack =. ParentStack push DocFile
        LastLevel   =. DocLevel
        DocParent   =. pick ParentStack
      end.
    end.

    NB. Trim necessary fields
    DocFile   =. trim DocFile
    DocParent =. trim DocParent

    NB. Replace empty DocParent and DocText with a single blank
    NB. (branch-free version of:
    NB.    if. DocParent -: '' do. DocParent =. ' ' end.
    NB.    if. DocText   -: '' do. DocText   =. ' ' end. )
    DocParent =. DocParent,(0<-.#DocParent)$' '
    DocText   =. DocText ,(0<-.#DocText )$' '

    NB. Build OneRecord
    OneRecord =. DocIndex;DocType;DocFile;DocParent;DocText;DocLevel

    NB. Add OneRecord to BufRecords
    BufRecords =. BufRecords,OneRecord

    NB. We buffer 1000 records, then add them to the whole table.
    NB. This is much faster than appending records one by one,
    NB. because of memory management issues.
    if. (1000 | i) = 0 do.
      AllRecords =. AllRecords,BufRecords
      BufRecords =. ''
    end.
  end.
end.

NB. Add the last records and shape the flat list into a 6-column table
AllRecords =. AllRecords,BufRecords
AllRecords =. ((($AllRecords)%($HeaderLine)),($HeaderLine))$AllRecords
)
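The verbs in this article lean on a handful of small utility words (find, findi, chop, join, push, popp, pick, HeaderLine, and trim/rtrim) that are not shown here. The following is a minimal sketch of workable definitions; these are illustrative reconstructions, not the originals, and your own utility library may differ:

load 'strings'       NB. standard library: splitstring, dtb/dltb, tolower

crlf  =: 13 10 { a.  NB. CR,LF line terminator
trim  =: dltb        NB. delete leading and trailing blanks
rtrim =: dtb         NB. delete trailing blanks

NB. index where string x first starts in y (#y if absent),
NB. and a case-insensitive variant
find  =: E. i. 1:
findi =: tolower@[ find tolower@]

NB. cut string x into consecutive pieces of lengths y (blank-padded)
chop  =: 4 : 'y {.each (+/\ 0 , }: y) }.each < x'

NB. join boxed strings y with separator x
join  =: 4 : '(#x) }. ; (<x) ,each y'

NB. stack of boxed names: push a name, pop y names, and pick the
NB. element below the top (the parent), or '' if there is none
push  =: 4 : 'x , <y'
popp  =: 4 : '(- y <. #x) }. x'
pick  =: 3 : 'if. 2 > #y do. '''' else. > _2 { y end.'

NB. column names of the result table
HeaderLine =: 'DocIndex';'DocType';'DocFile';'DocParent';'DocText';'DocLevel'

With these in place, tab =: txt2tab 'SVA.txt' produces a table like the one shown above.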
Eliminating duplicate file entries:
dubl =: 4 : '~. ,. x {"1 y'

remdub =: 3 : 0
NB. y is the source array (table)
iarr =. y
NB. Unique values of column x=2 (DocFile)
sarr =. 2 dubl iarr
NB. Get column 2 of the input array
fils =. 2{|:iarr
NB. Get the index of the first occurrence of each element of sarr in fils
narr =. fils i. sarr
NB. Finally extract those records from iarr
oarr =. (,/narr){iarr
)
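A quick illustration on a toy table of the same shape (the rows are made up; column 2 holds DocFile):

   r0 =. 0 ; 'A' ; 'X.ASM' ; ' '     ; 'top' ; 0
   r1 =. 1 ; 'P' ; 'Y.PRT' ; 'X.ASM' ; '.'   ; 1
   r2 =. 2 ; 'P' ; 'X.ASM' ; 'Y.PRT' ; '.'   ; 2
   remdub r0 , r1 ,: r2   NB. keeps rows 0 and 1; the second X.ASM row is gone

Only the first occurrence of each DocFile survives, which is exactly what the document creation process needs.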
Of course, these J sentences do not do exactly the same as the SQL command, but they are sufficient to remove the duplicate file entries for further processing. This J word completes in less than one second on a table of 70’000 rows.
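(Such timings can be taken directly in J with the time foreign 6!:2; the argument below is illustrative:)

   6!:2 'remdub tab'   NB. elapsed seconds for one application of remdub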
Write the J array into an Access database using ODBC:
tab2mdb =: 3 : 0
NB. Input array (iarr) in y
iarr =. y
NB. Remove header line
iarr =. (1+i.(0{(($iarr)-1))){iarr
NB. Connect to database
dbco =. ddcheck ddcon 'dsn=DataBase'
NB. Delete contents of table
'delete from DocStruct' ddsql dbco
NB. Loop over array and insert records
for_i. i. (0{$iarr) do.
  irec =. i{iarr
  sqlt =. 'insert into DocStruct ('
  sqlt =. sqlt,(',' join HeaderLine)
  sqlt =. sqlt,') values ('
  sqlt =. sqlt, (": >0{irec), ','
  sqlt =. sqlt,'''', ( >1{irec),''','
  sqlt =. sqlt,'''', ( >2{irec),''','
  sqlt =. sqlt,'''', ( >3{irec),''','
  sqlt =. sqlt,'''', ( >4{irec),''','
  sqlt =. sqlt, (": >5{irec), ')'
  sqlt ddsql dbco
end.
NB. Disconnect and release the connection
empty dddis dbco
)
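One caveat with building INSERT statements by plain string concatenation: a file name or description containing a single quote would break the generated SQL. The code above does not guard against this; a minimal sketch of such a guard, doubling embedded quotes (the standard SQL escape), could look like this:

NB. double every single quote in a string
esc =: #~ >:@(''''&=)

NB. then build the statement with e.g. (esc >2{irec) instead of (>2{irec)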
The whole read-and-analyze process now requires about 3 minutes for 70’000 components, compared to 30 minutes before: using J, the time consumption for this process was reduced by a factor of 10!
Conclusion:
You may now ask: what is wrong with Access and SQL? Nothing as such, but we found that processing tables with an SQL command like the one above works well for fewer than 10’000 records; beyond that, the time consumption grows far faster than linearly. Perhaps on larger database systems like PostgreSQL, MySQL, Oracle, or SQL Server such a command would complete in a few seconds; I don't know, because I never tried it. In this case, the MS Access database cannot be replaced with another one, so the J solution is the best I have found so far.
Email: <<MailTo(martin DOT saurer AT SPAMFREE bluewin DOT ch)>>