Help / Learning / Ch 18: Sets, Classes and Relations
>> << Pri JfC LJ Phr Dic Voc !: Rel NuVoc wd Help Learning J
|
Chapter 18: Sets, Classes and RelationsIn this chapter we look at more of the built-in functions of J. The connecting theme is, somewhat loosely, working with set, classes and relations. Suppose that, for some list, for the purpose at hand, the order of the items is irrelevant and the presence of duplicate items is irrelevant. Then we can regard the list as (representing) a finite set. In the abstract, the set 3 1 2 1 is considered to be the same set as 1 2 3. The word "class" we will use in the sense in which, for example, each integer in a list belongs either to the odd class or to the even class. By "relation" is meant a table of two or more columns, expressing a relationship between a value in one column and the corresponding value in another. A relation with two columns, for example, is a set of pairs. 18.1 Sets
18.1.1 MembershipThere is a built-in verb e. (lowercase e dot, called "Member"). The expresssion x e. y tests whether x matches any item of y, that is, whether x is a member of the list y. For example:
Evidently the order of items in y is irrelevant and so is the presence of duplicates in y.
We can test whether a table contains a particular row:
18.1.2 LessThere is a built-in verb -. (minus dot, called "Less"). The expression x -. y produces a list of the items of x except those which are members of y.
Evidently the order of items in y is irrelevant and so is the presence of duplicates in y. 18.1.3 NubThere is a built-in verb ~. (tilde dot, called "Nub"). The expression ~. y produces a list of the items of y without duplicates.
We can apply nub to the rows of a table:
18.1.4 Nub SieveThe verb "nub sieve" (~:) gives a boolean vector which is true only at the nub.
18.1.5 Functions for SetsThe customary functions on sets, such as set-union, set-intersection or set-equality, are easily defined using the built-in functions available. For example two sets are equal if all members of one are members of the other, and vice versa. seteq =: *./ @: (e. , e.~)
18.2 The Table AdverbRecall that the adverb / generates a verb; for example +/ is a verb which sums lists. More precisely, it is the monadic case of +/ which sums lists. The dyadic case of +/ generates a table. For example:
We see from this example that the verb + is applied to all possible combinations of an item of the left argument X with an item of the right argument Y. Here is another example: 'abc' =/ 'face' 0 1 0 0 0 0 0 0 0 0 1 0 The result shows, in the first row, the value of 'a' = 'face', in the second row the value of 'b' = 'face' and so on. If we are aiming (as in the examples above) to apply the verb to all possible combinations of items from the left and items from the right, then there is a condition which must be satisfied. The condition is this. First, recall from Chapter 07 that verbs have three ranks, monadic, left and right, shown by the expression verb b. 0 . The left rank of the verb here must be the same as the rank of an item of the left argument, and the right rank of the verb must be the same as the rank of an item of the right argument. This condition can be checked by a little function , an adverb:
check =: 1 : 0 : L =. 1 { u b. 0 NB. left rank of verb u R =. 2 { u b. 0 NB. right rank of verb u assert. L = # $ 0 { x [ 'check rank of item of x' assert. R = # $ 0 { y [ 'check rank of item of y' 1 NB. OK ) The condition is indeed satisfied in the examples above, because all the relevant ranks are zero: X + check Y 1 Here now is an example where the condition is not satisfied. Suppose items of X and Y are scalars, and we aim to combine them by forming boxed pairs, for example with this verb:
A first attempt:
This not the result we want, because, given X and Y, the left and right the ranks of fbp don't meet the condition, that is, they are not zero, fbp b. 0 _ _ _ and so the check will fail. X fbp check Y |assertion failure | L=#$0{x['check rank of item of x' The remedy is to specify the required ranks for the arguments of fbp . X fbp " 0 0 / Y +---+---+---+---+ |0 3|0 4|0 5|0 6| +---+---+---+---+ |1 3|1 4|1 5|1 6| +---+---+---+---+ |2 3|2 4|2 5|2 6| +---+---+---+---+ In summary, the general scheme is that if we have z =: x f/ y and f satisfies the rank condition (check above) , then z is a table such that (< i;j){z = (i{x) f (j{y) That is, z contains all possible pairings of an item of x with an item of y. 18.3 Table DecorationsSince we have been considering tables, it might be useful here to offer one or two cookbook examples of labelling the rows and columns of a table. Firstly, suppose the data is a table of numbers: ] data =: 3 3 $ i. 9 0 1 2 3 4 5 6 7 8 and that we aim to format the data with this function, which will give columns 8 digits wide: format =: 8j4 & ": suppose that the labels required are: rowlabels =: 'first'; 'second' ; 'third' collabels =: 'small'; 'medium' ; 'large' If we now compute these three values D =: < format data R =: < > rowlabels C =: < ; (_8 & {. ) (&. >) collabels (In the expression for C above, the value of_8 is for right-justifying the labels in a column-width of 8) Now we can show the data with row-labels: R,D +------+------------------------+ |first | 0.0000 1.0000 2.0000| |second| 3.0000 4.0000 5.0000| |third | 6.0000 7.0000 8.0000| +------+------------------------+ or with column-labels: C ,: D +------------------------+ | small medium large| +------------------------+ | 0.0000 1.0000 2.0000| | 3.0000 4.0000 5.0000| | 6.0000 7.0000 8.0000| +------------------------+ or both 2 2 $ a: , C , R ,D +------+------------------------+ | | small medium large| +------+------------------------+ |first | 0.0000 1.0000 2.0000| |second| 3.0000 4.0000 5.0000| |third | 6.0000 7.0000 8.0000| +------+------------------------+ 18.4 Classes
18.4.1 Self-ClassifyConsider the problem of finding the counts of letters occurring in a string (the frequency-distribution of letters). Here is one approach. We form a table testing each letter for equality with the nub.
The expression ((nub y) = / y) can be abbreviated as (= y). The monadic case of the built-in verb = is called "Self-classify").
If we sum each row of = y we obtain the counts, in the order of the letters in the nub.
The counts can be paired with the letters of the nub:
18.4.2 Classification SchemesGardeners classify soil-types as acid, neutral or alkaline, depending on the pH value. Suppose that a pH less than 6 is classed as acid, 6 to 7 is neutral, and more than 7 as alkaline. Here now is a verb to classify a pH value, returning A for acid, N for neutral and L for alkaline (or limy). classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))
The classify function we can regard as defining a classification scheme. The letters ANL, which are in effect names of classes, are called the keys of the scheme. 18.4.3 The Key AdverbGiven some data (a list, say), we can classify each item to produce a list of corresponding keys.
We can select and group together all the data in, say, class A (all the data with key A):
Now suppose we wish to count the items in each class. That is, we aim to apply the monadic verb # separately to each group of items all of the same key. To do this we can use the built-in adverb /. (slash dot, called "Key").
For another example, instead of counting the members we could exhibit the members, by applying the box verb <.
The verb we apply can discover for itself the class of each separate argument, by classifying the first member: Here the verb u produces a boxed list: the key and count: u =: (classify @: {.) ; #
The general scheme for the "Key" adverb is as follows. In the expression x u /. y, we take y to be a list, and x is a list of keys of corresponding items of y according to some classification scheme, and u is the verb to be applied separately to each class. The scheme is: x u /. y means (= x) (u @ #) y To illustrate: y =: 4 5 6 7 8 x =: classify y u =: <
We see that each row of =x selects items from y, and u is applied to this selection. 18.4.4 Letter-Counts RevisitedRecall the example of finding the counts of letters in a string.
Here is a variation. We note that we have in effect a classification scheme where we have as many different classes as different letters: each letter is (the key of) its own class. Thus we can write an expression of the form y u /. y. The applied verb u will see, each time, a list of letters, all the same. It counts them, with #, and takes the first, with {., to be a label for the class. u =: {. ; #
18.5 RelationsSuppose there are a number of publications, such as:
and we aim to catalog such publications. A suitable data structure for such a catalog might be a table relating authors to titles and another table relating titles to subjects. For example:
Such tables we may call "relations". The order of the rows is not significant. Here,for the sake of simplicity, we will stick to relations with two columns. Now we choose a representation for our relations. For a first approach, we choose tables of boxed strings. The authors-titles relation is: ] AT =: (". ;. _2) 0 : 0 'Smith' ; 'Pigs' 'Brown' ; 'Pets' 'Smith' ; 'Dogs' 'James' ; 'Dogs' ) +-----+----+ |Smith|Pigs| +-----+----+ |Brown|Pets| +-----+----+ |Smith|Dogs| +-----+----+ |James|Dogs| +-----+----+ and the titles-subjects relation is: ] TS =: (". ;. _2) 0 : 0 'Pigs' ; 'pigs' 'Pets' ; 'cats' 'Pets' ; 'dogs' 'Dogs' ; 'dogs' ) +----+----+ |Pigs|pigs| +----+----+ |Pets|cats| +----+----+ |Pets|dogs| +----+----+ |Dogs|dogs| +----+----+
18.5.1 Join of RelationsFrom the authors-titles relation AT and the titles-subjects relation TS we can compute an authors-subjects relation showing which author has written a title on which subject. We say that AT and TS are to be joined with respect to titles, and we would expect the join to look like this: +-----+----+ |Smith|pigs| +-----+----+ |Brown|cats| +-----+----+ |Brown|dogs| +-----+----+ |Smith|dogs| +-----+----+ |James|dogs| +-----+----+ The plan for this section is to look at a function for computing joins, then at an improved version, and then at the advantage of representing relations as tables of symbols rather than boxed strings. Finally we look at some performance comparisons. A method is as follows. We consider all possible pairs consisting of a row at from table AT and a row ts from table TS. Each pair at,ts is of the form: author; title; title; subject If title matches title, that is, item 1 matches item 2, then we extract author and subject, that is, items 0 and 3. Verbs for testing and extracting from at,ts pairs can be written as: test =: 1&{ = 2&{ extr =: 0 3 & { and these verbs can be plugged into a suitable conjunction to do the pairing. In writing this conjunction, we aim to avoid requiring the whole set of possible pairs to be present at the same time, since this set may be large. We also aim to avoid any duplicates in the result. Here is a first attempt. PAIR =: 2 : 0 : z =. 0 0 $ '' for_at. x do. for_ts. y do. if. u at,ts do. z =. z, v at,ts end. end. end. ~. z ) The join verb can now be written as: join =: test PAIR extr and we see:
The join verb as defined above is slow, because the test and extr verbs are applied to a single x,y pair at a time - they are scalar computations. Performance will be better if we can give these verbs as much data as possible to work on at one time. (This is a universal rule in J). Vector or array arguments are better. Here is a revised vector-oriented version of PAIR and join, which still avoids building the entire set of pairs. VPAIR =: 2 : 0 : z =. 0 0 $ '' for_at. x do. z =. z , |: v (#~"1 u) |: at , "1 y end. ~. z ) vjoin =: test VPAIR extr giving the same result as before:
Representing relations as tables of boxed strings, as above, is less than efficient. For a repeated value, the entire string is repeated. Values are compared by comparing entire strings. Now we look at another possibility. Rather than boxed strings, a relation can be represented by a table of symbols. 18.5.2 What are Symbols?Symbols are for efficient computation with string data. Symbols are a distinct data-type, in the same way that characters, boxes and numbers are distinct data-types. A symbol is a scalar which identifies, or refers to, a string. A symbol can be created by applying the built-in verb s: (lowercase s colon) to a boxed string. a =: s: <'hello' Now the variable a has a value of type symbol. We inspect this value in the usual way: a `hello and see that the value is displayed as the original string preceded by a left-quote. Even though a looks like a string when displayed, it is a scalar.
The original string is stored in a data-structure, maintained automatically by the J system, called the symbol-table. Strings are not duplicated within the symbol-table. Hence if another symbol b is created from the same string as a, then b is equal to a.
Notice that the comparison is simple scalar equality, with no need to compare the original strings. Our relations above can be converted to arrays of symbols, and joined as before.
Symbols are lexicographically ordered to reflect the ordering of the original strings. Hence tables of symbols can be sorted:
18.5.3 Measurements ComparedHere is a utility verb giving time in seconds to evaluate an expression, averaged over say 4 executions. time =: (8j5 & ":) @: (4 & (6!:2)) The examples of relations above are too small for meaningful performance measurements, so we make larger relations by replicating each say 100 times. AT =: 100 $ AT TS =: 100 $ TS sAT =: 100 $ sAT sTS =: 100 $ sTS There are 4 cases to compare: t1 =: time 'AT join TS' NB. scalar method, boxed strings t2 =: time 'sAT join sTS' NB. scalar method, symbols t3 =: time 'AT vjoin TS' NB. vector method, boxed strings t4 =: time 'sAT vjoin sTS' NB. vector method, symbols and we see: 3 3 $ ' '; 'strings'; 'symbols';'scalar';t1;t2; 'vector';t3;t4 +------+--------+--------+ | |strings |symbols | +------+--------+--------+ |scalar| 1.06800| 0.02214| +------+--------+--------+ |vector| 0.01449| 0.00124| +------+--------+--------+ In Chapter 31 we will return to the topic of performance in computing join of relations. 18.5.4 Saving and Restoring the Symbol TableSuppose that data is an array of symbols. ] data =: s: 2 2 $ 'hello'; 'blah';'blah';'goodbye' `hello `blah `blah `goodbye For a symbol in data its original string ('hello' for example) is stored only in the symbol table, not in data itself. The original string is needed to display the value of the symbol. Suppose that we write data to a file, aiming to read it back in a new session. At the beginning of a new session, the symbol table is empty. Thus we must save the symbol table from the earlier session, and reinstate it at the beginning of the new session. First, here are two utility functions to save a value to a file and retrieve it. (See Chapter 27 and Chapter 28 for more about data in files.) save =: 4 : '(3!:1 x ) 1!:2 < y ' retr =: 3 : '3!:2 (1!:1 < y )' Save the data to a file named, say, data.xyz data save 'data.xyz' The symbol table is not itself a variable, but the expression 0 s: 10 gives a value for it. We save this value to a file named, say, symtab.xyz (0 s: 10) save 'symtab.xyz' Start a new J session. The symbol table is initially empty, so begin by reinstating it from the file saved in the earlier session: 10 s: (retr 'symtab.xyz') 1 Now, with the correct symbol table in place, we can retrieve the array of symbols data from its file: DATA =: retr 'data.xyz' and we see that the symbols are correctly interpreted: DATA `hello `blah `blah `goodbye This is the end of Chapter 18 |
The examples in this chapter
were executed using J version 802 beta.
This chapter last updated 12 Sep 2014
Copyright © Roger Stokes 2014.
This material may be freely reproduced,
provided that acknowledgement is made.
>> << Pri JfC LJ Phr Dic Voc !: Rel NuVoc wd Help Learning J