Data processing

From J Wiki



Applications typically have a GUI part and a data processing (DP) part. The DP part performs the actual calculations and data manipulation. A well-designed application is modular, which implies a clear separation between the GUI and DP parts.

In this section you will develop the DP part of a simple application. In the next section you will develop the GUI part.

The DP part of the application is specified as follows:

The input is the name of a text file. The output is a string that displays as a table containing the file name, a count of lines, a count of characters, and one row for each distinct character in the file with a count of how many times it appears. The rows of distinct characters should be sorted by their counts, largest first.

You'll be working with files, so load the file utilities.

   load 'files'


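Create a simple text file to use as test data. The 9 displayed after fwrite is the number of characters written, and fread then displays the file contents.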

   fn =. jpath '~user/text.txt' NB. ~user/ is required for path to place file in user folder
   data =. 'abc' , LF, 'bc' , LF, 'b' , LF
   data fwrite fn
9
   fread fn
abc
bc
b


You need to define a verb report that takes a filename as an argument and returns the specified result. You'll build pieces of the definition in the Terminal window and then put them all together into the definition in a script.

The input is a filename, and in the report verb it will have the name y, so start by working with y in the Terminal window.

   y =. jpath '~user/text.txt'


Read the file.

   d =. fread y


The report will have two columns. The first column will be the labels 'File:', 'Lines:', 'Chars:', and each distinct character in the file. The second column will be the value for that row. Since the data is a mixture of text and numbers it makes sense to build the result as boxed data.

Create a noun with the fixed labels.

   r =. 'File:' ; 'Lines:' ; 'Chars:'
   r
+-----+------+------+
|File:|Lines:|Chars:|
+-----+------+------+


The values for those labels are calculated as follows:

   y ; (+/ LF = d) ; #d
+-------------+-+-+
|user\text.txt|3|9|
+-------------+-+-+
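
The line count relies on comparing every character with LF and summing the resulting 1s. A minimal sketch of the intermediate steps, using the same test data held in a noun instead of being read from the file:

```j
d =. 'abc' , LF , 'bc' , LF , 'b' , LF
LF = d        NB. 0 0 0 1 0 0 1 0 1  (1 wherever d holds a line feed)
+/ LF = d     NB. 3  (number of line feeds)
# d           NB. 9  (number of characters, line feeds included)
```

Because the count is really a count of LF characters, a file whose last line has no trailing LF would be reported with one line fewer than it displays.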


The dyad ,. (stitch) can connect these two lists into a table.

   r =. r ,.  y ; (+/ LF = d) ; #d
   r
+------+-------------+
|File: |user\text.txt|
+------+-------------+
|Lines:|3            |
+------+-------------+
|Chars:|9            |
+------+-------------+
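
As a small illustration of stitch on its own, the following sketch joins two lists of boxes item by item. The names labels, values, and t are hypothetical and are used only for this example.

```j
labels =. 'File:' ; 'Lines:'
values =. 'text.txt' ; 3
t =. labels ,. values   NB. pair item 0 with item 0, item 1 with item 1
$ labels                NB. 2    (a list of two boxes)
$ t                     NB. 2 2  (a table: one row per label)
0 { t                   NB. the first row, 'File:' ; 'text.txt'
```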


The next thing is to add the rows with the characters and their frequency counts. The letter is the label and the count is the value, so it just adds more items to r. Let's postpone that part of the problem, and work instead on converting the boxed table to the string result required by the spec. Use a comment to mark the bit we are skipping over for now.

   NB. need to add frequency rows to r here


The numbers in the second column need to be converted to characters. The easiest way to do this is to apply the conversion to the contents of every box. Boxes that already contain characters are not affected, but any numbers will be converted.

   r  =. ":each r
   r
+------+-------------+
|File: |user\text.txt|
+------+-------------+
|Lines:|3            |
+------+-------------+
|Chars:|9            |
+------+-------------+


The display of r with all characters looks the same, but each box now contains characters.
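
One way to confirm the conversion is to ask for the type of each box before and after. This sketch assumes the standard library verb datatype is available, as it is in a default session, and uses a two-box list in place of the full table.

```j
r =. 'Lines:' ; 3
datatype each r            NB. literal and integer
datatype each ": each r    NB. literal and literal
```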

The next step is interesting and the details are left for you to puzzle out. It adds a TAB after each label and an LF after each value. In the final result the TAB separates the label from its value, and the LF causes a new line for the next label. The boxed display shows the TAB and LF as blanks, but they really are in there.

   r =. r ,each"1 1 TAB;LF
   r
+-------+--------------+
|File:  |user\text.txt |
+-------+--------------+
|Lines: |3             |
+-------+--------------+
|Chars: |9             |
+-------+--------------+
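
If you would like a hint, the sketch below works on a smaller two-row table with hypothetical contents. The rank "1 1 pairs each row of the table with the two-item list TAB;LF, so the first box in a row gets a TAB appended and the second gets an LF.

```j
r =. 2 2 $ 'File:' ; 'text.txt' ; 'Lines:' ; ,'3'
t =. r ,each"1 1 TAB;LF
> # each r     NB. lengths before: 5 8 in row 0, 6 1 in row 1
> # each t     NB. every box is one character longer
{: each t      NB. last character of each box: TAB in column 0, LF in column 1
```

Without the "1 1, the three-row table r in the main example and the two-item list TAB;LF would not agree in shape, and J would signal a length error.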


The monad ; (raze) opens all the boxes and assembles a string result.

   ; r
File:	user\text.txt
Lines:	3
Chars:	9


You are ready to define your verb report. Create a new script and save it as jpath '~user/textdp.ijs'. Putting together the Terminal experiments, add the following definition for report to the script.

report =: 3 : 0
d =. fread jpath y
r =. 'File:' ; 'Lines:' ; 'Chars:'
r =. r ,. y ; (+/ LF = d) ; #d
NB. need to add frequency rows to r here
r =. ":each r
r =. r ,each"1 1 TAB;LF
; r
)


Run the script and test report.

   report fn
File:	user\text.txt
Lines:	3
Chars:	9
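
The definition above assumes the file can be read. fread returns _1 when it cannot, so a guarded variant might look like the following sketch. The name report_safe and the wording of the message are hypothetical and not part of the primer.

```j
load 'files'

NB. report_safe: hypothetical variant of report that checks the fread result
report_safe =: 3 : 0
d =. fread jpath y
if. d -: _1 do.
  NB. fread failed; return a one-line message instead of a table
  'Cannot read file: ' , y , LF
  return.
end.
r =. 'File:' ; 'Lines:' ; 'Chars:'
r =. r ,. y ; (+/ LF = d) ; #d
r =. ":each r
r =. r ,each"1 1 TAB;LF
; r
)
```

Returning a string in both cases keeps the result type the same for whatever code displays it.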


Now calculate the frequency rows. You need a verb freq that returns a table of boxes in which the first column holds the distinct characters and the second column holds the number of times each appears in the file. The argument to freq is the file data, and inside freq it will have the name y, so start with y defined as the file data.

   y =. fread fn


The data can include TAB, CR, and LF characters, which should not be counted. The dyad -. (less) removes these unwanted characters.

   d =. y -. TAB,CR,LF
   d
abcbcb
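
The test file has no TAB or CR characters, so here is a sketch with made-up data that contains all three. The noun w is hypothetical.

```j
w =. 'a b' , CR , LF , 'c' , TAB , 'd' , CR , LF
# w                  NB. 10
w -. TAB , CR , LF   NB. a bcd
```

Note that the blank survives: only the characters listed in the right argument are removed, so spaces would still get a row in the report.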


The verb nubcount, defined as nubcount=: ~. ;"_1 #/.~, returns a table of boxes in which the first column contains the distinct items of its argument and the second column contains their counts.

 
   nubcount=: ~. ;"_1 #/.~  
   nc =. nubcount d
   nc
+-+-+
|a|1|
+-+-+
|b|3|
+-+-+
|c|2|
+-+-+
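
To see how nubcount arrives at this table, it can help to evaluate its parts separately. The definition is a fork of three verbs, so nubcount d is the same as (~. d) ;"_1 (#/.~ d). A minimal sketch:

```j
nubcount =: ~. ;"_1 #/.~
d =. 'abcbcb'
~. d                  NB. abc    (nub: distinct items in order of first appearance)
#/.~ d                NB. 1 3 2  (key: size of each group of equal items)
(~. d) ;"_1 #/.~ d    NB. links each distinct item with its count, one row per item
```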


To sort the items by the counts you need to get the counts into a list.

   > 1 {"1 nc 
1 3 2


The dyad \: (sort down) sorts the items of its left argument into descending order based on its right argument.

   nc \: > 1 {"1 nc
+-+-+
|b|3|
+-+-+
|c|2|
+-+-+
|a|1|
+-+-+
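
The same idea is easier to see on plain lists. In this sketch the monad \: (grade down) gives the permutation that the dyad applies to its left argument. The noun counts is hypothetical.

```j
counts =. 1 3 2
\: counts              NB. 1 2 0  (indices of the counts, largest first)
'abc' \: counts        NB. bca
(\: counts) { 'abc'    NB. bca    (the same result spelled with grade and from)
'abcd' \: 2 3 2 3      NB. bdac   (items with equal counts keep their original order)
```

The last line suggests what to expect in the report when two characters occur equally often: they should stay in order of first appearance.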


Put this all together and add the following definition to your script.

freq =: 3 : 0
d =. y -. TAB,CR,LF
nc =. nubcount d
nc \: > 1 {"1 nc
)
nubcount=: ~. ;"_1 #/.~


Run the script and test freq.

   freq fread fn
+-+-+
|b|3|
+-+-+
|c|2|
+-+-+
|a|1|
+-+-+


You can now use freq in your report verb. Replace the NB. comment line in report with:

r =. r , freq d


Run the script and test report.

   report fn
File:	user\text.txt
Lines:	3
Chars:	9
b	3
c	2
a	1
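
For reference, the pieces built on this page can be assembled into one script along the following lines. This is a sketch of how textdp.ijs might look at this point. Placing nubcount before freq is a choice made here for readability; J looks names up when a verb runs, so either order works.

```j
NB. textdp.ijs - data processing part, assembled from the steps above
load 'files'

NB. table of distinct items and their counts
nubcount =: ~. ;"_1 #/.~

NB. frequency table of the characters in y, highest count first
freq =: 3 : 0
d =. y -. TAB,CR,LF
nc =. nubcount d
nc \: > 1 {"1 nc
)

NB. report for the text file named y, as a TAB and LF delimited string
report =: 3 : 0
d =. fread jpath y
r =. 'File:' ; 'Lines:' ; 'Chars:'
r =. r ,. y ; (+/ LF = d) ; #d
r =. r , freq d
r =. ":each r
r =. r ,each"1 1 TAB;LF
; r
)
```

After loading the script, report jpath '~user/text.txt' should produce the six-line result shown above.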


Try it on other text files.

You have finished the data processing part.
