Help / Learning / Ch 18: Sets, Classes and Relations

From J Wiki
Jump to navigation Jump to search


>> << Pri JfC LJ Phr Dic Voc !: Rel NuVoc wd Help Learning J


Chapter 18: Sets, Classes and Relations

In this chapter we look at more of the built-in functions of J. The connecting theme is, somewhat loosely, working with set, classes and relations.

Suppose that, for some list, for the purpose at hand, the order of the items is irrelevant and the presence of duplicate items is irrelevant. Then we can regard the list as (representing) a finite set. In the abstract, the set 3 1 2 1 is considered to be the same set as 1 2 3.

The word "class" we will use in the sense in which, for example, each integer in a list belongs either to the odd class or to the even class.

By "relation" is meant a table of two or more columns, expressing a relationship between a value in one column and the corresponding value in another. A relation with two columns, for example, is a set of pairs.

18.1 Sets

18.1.1 Membership

There is a built-in verb e. (lowercase e dot, called "Member"). The expresssion x e. y tests whether x matches any item of y, that is, whether x is a member of the list y. For example:

y=: 'abcde' 'a' e. y 'w' e. y 'ef' e. y
abcde 1 0 1 0

Evidently the order of items in y is irrelevant and so is the presence of duplicates in y.

z=: 'edcbad' 'a' e. z 'w' e. z 'ef' e. z
edcbad 1 0 1 0

We can test whether a table contains a particular row:

t =: 4 2 $ 'abcdef' 'cd' e. t
ab

cd
ef

ab
1

18.1.2 Less

There is a built-in verb -. (minus dot, called "Less"). The expression x -. y produces a list of the items of x except those which are members of y.

x =: 'consonant' y =: 'aeiou' x -. y
consonant aeiou cnsnnt

Evidently the order of items in y is irrelevant and so is the presence of duplicates in y.

18.1.3 Nub

There is a built-in verb ~. (tilde dot, called "Nub"). The expression

~. y produces

a list of the items of y without duplicates.

nub =: ~. y =: 'hook' nub y
~. hook hok

We can apply nub to the rows of a table:

t nub t
ab

cd
ef

ab
ab

cd

ef

18.1.4 Nub Sieve

The verb "nub sieve" (~:) gives a boolean vector which is true only at the nub.

y b =: ~: y b # y nub y
hook 1 1 0 1 hok hok

18.1.5 Functions for Sets

The customary functions on sets, such as set-union, set-intersection or set-equality, are easily defined using the built-in functions available. For example two sets are equal if all members of one are members of the other, and vice versa.

   seteq =: *./ @: (e. , e.~)

1 2 3 seteq 3 1 2 1 1 2 3 seteq 1 2
1 0

18.2 The Table Adverb

Recall that the adverb / generates a verb; for example +/ is a verb which sums lists. More precisely, it is the monadic case of +/ which sums lists. The dyadic case of +/ generates a table. For example:

X =: 0 1 2 Y =: 3 4 5 6 Z =: X +/ Y
0 1 2 3 4 5 6 3 4 5 6

4 5 6 7

5 6 7 8

We see from this example that the verb + is applied to all possible combinations of an item of the left argument X with an item of the right argument Y.

Here is another example:

   'abc' =/ 'face'
0 1 0 0
0 0 0 0
0 0 1 0

The result shows, in the first row, the value of 'a' = 'face', in the second row the value of 'b' = 'face' and so on. If we are aiming (as in the examples above) to apply the verb to all

possible combinations of 

items from the left and items from the right, then there is a condition which must be satisfied.

The condition is this. First, recall from Chapter 07 that verbs have three ranks, monadic, left and right, shown by the expression verb b. 0 . The left rank of the verb here must be the same as the rank of an item of the left argument, and the right rank of the verb must be the same as the rank of an item of the right argument. This condition can be checked by a little function , an adverb:

   check =: 1 : 0
:
L =. 1 { u b. 0        NB. left  rank of verb u
R =. 2 { u b. 0        NB. right rank of verb u 
assert. L = # $ 0 { x  [ 'check rank of item of x'
assert. R = # $ 0 { y  [ 'check rank of item of y'
1  NB. OK
)
   

The condition is indeed satisfied in the examples above, because all the relevant ranks are zero:

   X  + check Y
1
   

Here now is an example where the condition is not satisfied. Suppose items of X and Y are scalars, and we aim to combine them by

forming  boxed pairs, for example with this verb:

fbp =: < @: , 3 fbp 4
<@:, +---+

|3 4|

+---+

A first attempt:

X Y X fbp / Y
0 1 2 3 4 5 6 +-------------+

|0 1 2 3 4 5 6|

+-------------+

This not the result we want, because, given X and Y, the left and right the ranks of fbp don't meet the condition, that is, they are not zero,

   fbp b. 0
_ _ _
   

and so the check will fail.

   X fbp check Y
|assertion failure
|   L=#$0{x['check rank of item of x'
   

The remedy is to specify the required ranks for the arguments of fbp .

   X fbp " 0 0 / Y
+---+---+---+---+
|0 3|0 4|0 5|0 6|
+---+---+---+---+
|1 3|1 4|1 5|1 6|
+---+---+---+---+
|2 3|2 4|2 5|2 6|
+---+---+---+---+

In summary, the general scheme is that if we have

             z =:  x f/ y

and f satisfies the rank condition (check above) , then z is a table such that

           (< i;j){z    =    (i{x) f (j{y)

That is, z contains all possible pairings of an item of x with an item of y.

18.3 Table Decorations

Since we have been considering tables, it might be useful here to offer one or two cookbook examples of labelling the

rows and columns of a table. 

Firstly, suppose the data is a table of numbers:

   ] data =: 3 3 $ i. 9
0 1 2
3 4 5
6 7 8
   

and that we aim to format the data with this function, which will give columns 8 digits wide:

   format =: 8j4 & ":
   

suppose that the labels required are:

   rowlabels =: 'first'; 'second' ; 'third'
   collabels =: 'small'; 'medium' ; 'large'
   

If we now compute these three values

   D =: < format data
   R =: < > rowlabels
   C =: < ; (_8 & {. ) (&. >) collabels   
   

(In the expression for C above, the value of_8 is for right-justifying the labels in a column-width of 8)

Now we can show the data with row-labels:

   R,D
+------+------------------------+
|first |  0.0000  1.0000  2.0000|
|second|  3.0000  4.0000  5.0000|
|third |  6.0000  7.0000  8.0000|
+------+------------------------+

or with column-labels:

   C ,: D
+------------------------+
|   small  medium   large|
+------------------------+
|  0.0000  1.0000  2.0000|
|  3.0000  4.0000  5.0000|
|  6.0000  7.0000  8.0000|
+------------------------+
   

or both

    2 2 $ a: , C , R ,D
+------+------------------------+
|      |   small  medium   large|
+------+------------------------+
|first |  0.0000  1.0000  2.0000|
|second|  3.0000  4.0000  5.0000|
|third |  6.0000  7.0000  8.0000|
+------+------------------------+
   

18.4 Classes

18.4.1 Self-Classify

Consider the problem of finding the counts of letters occurring in a string (the frequency-distribution of letters). Here is one approach.

We form a table testing each letter for equality with the nub.

y =: 'hook' nub y (nub y) =/ y
hook hok 1 0 0 0

0 1 1 0

0 0 0 1

The expression ((nub y) = / y) can be abbreviated as (= y). The monadic case of the built-in verb = is called "Self-classify").

y nub y (nub y) =/ y = y
hook hok 1 0 0 0

0 1 1 0

0 0 0 1
1 0 0 0

0 1 1 0

0 0 0 1

If we sum each row of = y we obtain the counts, in the order of the letters in the nub.

y = y +/ " 1 =y
hook 1 0 0 0

0 1 1 0

0 0 0 1
1 2 1

The counts can be paired with the letters of the nub:

y nub y (nub y) ;" 0 (+/ " 1 =y)
hook hok +-+-+

|h|1|
+-+-+
|o|2|
+-+-+
|k|1|

+-+-+

18.4.2 Classification Schemes

Gardeners classify soil-types as acid, neutral or alkaline, depending on the pH value. Suppose that a pH less than 6 is classed as acid, 6 to 7 is neutral, and more than 7 as alkaline. Here now is a verb to classify a pH value, returning A for acid, N for neutral and L for alkaline (or limy).

   classify =: ({ & 'ANL')  @: ((>: & 6) + (> & 7))

classify 6 classify 4.8 5.1 6 7 7.1 8
N AANNLL

The classify function we can regard as defining a classification scheme. The letters ANL, which are in effect names of classes, are called the keys of the scheme.

18.4.3 The Key Adverb

Given some data (a list, say), we can classify each item to produce a list of corresponding keys.

data =: 7 5 6 4 8 k =: classify data
7 5 6 4 8 NANAL

We can select and group together all the data in, say, class A (all the data with key A):

data k k = 'A' (k = 'A') # data
7 5 6 4 8 NANAL 0 1 0 1 0 5 4

Now suppose we wish to count the items in each class. That is, we aim to apply the monadic verb # separately to each group of items all of the same key. To do this we can use the built-in adverb /. (slash dot, called "Key").

data k =: classify data k # /. data
7 5 6 4 8 NANAL 2 2 1

For another example, instead of counting the members we could exhibit the members, by applying the box verb <.

data k =: classify data k < /. data
7 5 6 4 8 NANAL +---+---+-+

|7 6|5 4|8|

+---+---+-+

The verb we apply can discover for itself the class of each separate argument, by classifying the first member: Here the verb u produces a boxed list: the key and count:

   u =: (classify @: {.) ; #
   

data k =: classify data k u /. data
7 5 6 4 8 NANAL +-+-+

|N|2|
+-+-+
|A|2|
+-+-+
|L|1|

+-+-+

The general scheme for the "Key" adverb is as follows. In the expression x u /. y, we take y to be a list, and x is a list of keys of corresponding items of y according to some classification scheme, and u is the verb to be applied separately to each class. The scheme is:

              x u /. y    means   (= x) (u @ #) y

To illustrate:

   y =: 4 5 6 7 8 
   x =: classify y
   u =: <

y x = x (= x) (u @ #) y x u /. y
4 5 6 7 8 AANNL 1 1 0 0 0

0 0 1 1 0

0 0 0 0 1
+---+---+-+

|4 5|6 7|8|

+---+---+-+
+---+---+-+

|4 5|6 7|8|

+---+---+-+

We see that each row of =x selects items from y, and u is applied to this selection.

18.4.4 Letter-Counts Revisited

Recall the example of finding the counts of letters in a string.

y =: 'LETTUCE' = y (nub y) ; " 0 +/ "1 (= y)
LETTUCE 1 0 0 0 0 0 0

0 1 0 0 0 0 1
0 0 1 1 0 0 0
0 0 0 0 1 0 0

0 0 0 0 0 1 0
+-+-+

|L|1|
+-+-+
|E|2|
+-+-+
|T|2|
+-+-+
|U|1|
+-+-+
|C|1|

+-+-+

Here is a variation. We note that we have in effect a classification scheme where we have as many different classes as different letters: each letter is (the key of) its own class. Thus we can write an expression of the form y u /. y.

The applied verb u will see, each time, a list of letters, all the same. It counts them, with #, and takes the first, with {., to be a label for the class.

   u =: {. ; #

y = y y u /. y
LETTUCE 1 0 0 0 0 0 0

0 1 0 0 0 0 1
0 0 1 1 0 0 0
0 0 0 0 1 0 0

0 0 0 0 0 1 0
+-+-+

|L|1|
+-+-+
|E|2|
+-+-+
|T|2|
+-+-+
|U|1|
+-+-+
|C|1|

+-+-+

18.5 Relations

Suppose there are a number of publications, such as:

  • "Pigs" by Smith, on the subject of pigs
  • "Pets" by Brown, on cats and dogs
  • "Dogs" by Smith and James, on dogs

and we aim to catalog such publications. A suitable data structure for such a catalog might be a table relating authors to titles and another table relating titles to subjects. For example:

author title
Smith "Pigs"
Brown "Pets"
Smith "Dogs"
James "Dogs"
title subject
"Pigs" pigs
"Pets" dogs
"Pets" cats
"Dogs" dogs

Such tables we may call "relations". The order of the rows is not significant. Here,for the sake of simplicity, we will stick to relations with two columns.

Now we choose a representation for our relations. For a first approach, we choose tables of boxed strings. The authors-titles relation is:

   ]  AT  =: (". ;. _2) 0 : 0
'Smith'  ; 'Pigs'
'Brown'  ; 'Pets'
'Smith'  ; 'Dogs'
'James'  ; 'Dogs'
)
+-----+----+
|Smith|Pigs|
+-----+----+
|Brown|Pets|
+-----+----+
|Smith|Dogs|
+-----+----+
|James|Dogs|
+-----+----+

and the titles-subjects relation is:

   ] TS =: (". ;. _2) 0 : 0
'Pigs' ; 'pigs'
'Pets' ; 'cats'
'Pets' ; 'dogs'
'Dogs' ; 'dogs'
)
+----+----+
|Pigs|pigs|
+----+----+
|Pets|cats|
+----+----+
|Pets|dogs|
+----+----+
|Dogs|dogs|
+----+----+
   

18.5.1 Join of Relations

From the authors-titles relation AT and the titles-subjects relation TS we can compute an authors-subjects relation

showing which author has written a title on which subject.

We say that AT and TS are to be joined with respect to

titles, and we would expect the join to look like this:
+-----+----+
|Smith|pigs|
+-----+----+
|Brown|cats|
+-----+----+
|Brown|dogs|
+-----+----+
|Smith|dogs|
+-----+----+
|James|dogs|
+-----+----+
   

The plan for this section is to look at a function for computing joins, then at an improved version, and then at the advantage of representing relations as tables of symbols rather than boxed strings. Finally we look at some performance comparisons.

A method is as follows. We consider all possible pairs consisting of a row at from table AT and a row ts from table TS. Each pair at,ts is of the form:

          author; title; title; subject

If title matches title, that is, item 1 matches item 2, then we extract author and subject, that is, items 0 and 3. Verbs for testing and extracting from at,ts pairs can be written as:

   test =: 1&{ =  2&{
   extr =: 0 3 & {

and these verbs can be plugged into a suitable conjunction to do the pairing. In writing this conjunction, we aim to avoid requiring the whole set of possible pairs to be present at the same time, since this set may be large. We also aim to avoid any duplicates in the result. Here is a first attempt.

   PAIR =: 2 : 0
:
z =.  0 0 $ ''
for_at. x do.
   for_ts.  y do.
     if. u at,ts do. z =. z, v at,ts  end.
   end.
end.
~. z
)

The join verb can now be written as:

   join =: test PAIR extr

and we see:

AT TS AT join TS
+-----+----+

|Smith|Pigs|
+-----+----+
|Brown|Pets|
+-----+----+
|Smith|Dogs|
+-----+----+
|James|Dogs|

+-----+----+
+----+----+

|Pigs|pigs|
+----+----+
|Pets|cats|
+----+----+
|Pets|dogs|
+----+----+
|Dogs|dogs|

+----+----+
+-----+----+

|Smith|pigs|
+-----+----+
|Brown|cats|
+-----+----+
|Brown|dogs|
+-----+----+
|Smith|dogs|
+-----+----+
|James|dogs|

+-----+----+

The join verb as defined above is slow, because the test and extr verbs are applied to a single x,y pair at a time - they are scalar computations. Performance will be better if we can give these verbs as much data as possible to work on at one time. (This is a universal rule in J). Vector or array arguments are better. Here is a revised vector-oriented version of PAIR and join, which still avoids building the entire set of pairs.

   VPAIR =: 2 : 0
:
z =.  0 0 $ ''
for_at. x do.
      z =. z , |: v (#~"1  u) |: at , "1 y
end.
~. z
)
   
   vjoin  =: test VPAIR extr 
   

giving the same result as before:

AT join TS AT vjoin TS
+-----+----+

|Smith|pigs|
+-----+----+
|Brown|cats|
+-----+----+
|Brown|dogs|
+-----+----+
|Smith|dogs|
+-----+----+
|James|dogs|

+-----+----+
+-----+----+

|Smith|pigs|
+-----+----+
|Brown|cats|
+-----+----+
|Brown|dogs|
+-----+----+
|Smith|dogs|
+-----+----+
|James|dogs|

+-----+----+

Representing relations as tables of boxed strings, as above, is less than efficient. For a repeated value, the entire string is repeated. Values are compared by comparing entire strings.

Now we look at another possibility. Rather than boxed strings, a relation can be represented by a table of symbols.

18.5.2 What are Symbols?

Symbols are for efficient computation with string data. Symbols are a distinct data-type, in the same way that characters, boxes and numbers are distinct data-types. A symbol is a scalar which identifies, or refers to, a string.

A symbol can be created by applying the built-in verb s: (lowercase s colon) to a boxed string.

   a  =: s: <'hello'

Now the variable a has a value of type symbol. We inspect this value in the usual way:

   a
`hello

and see that the value is displayed as the original string preceded by a left-quote. Even though a looks like a string when displayed, it is a scalar.

a $ a # $ a
`hello   0

The original string is stored in a data-structure, maintained automatically by the J system, called the symbol-table. Strings are not duplicated within the symbol-table. Hence if another symbol b is created from the same string as a, then b is equal to a.

a b =: s: <'hello' b = a
`hello `hello 1

Notice that the comparison is simple scalar equality, with no need to compare the original strings.

Our relations above can be converted to arrays of symbols, and joined as before.

sAT =: s: AT sTS =: s: TS sAT vjoin sTS
`Smith `Pigs

`Brown `Pets
`Smith `Dogs

`James `Dogs
`Pigs `pigs

`Pets `cats
`Pets `dogs

`Dogs `dogs
`Smith `pigs

`Brown `cats
`Brown `dogs
`Smith `dogs

`James `dogs

Symbols are lexicographically ordered to reflect the ordering of the original strings. Hence tables of symbols can be sorted:

sAT /:~ sAT
`Smith `Pigs

`Brown `Pets
`Smith `Dogs

`James `Dogs
`Brown `Pets

`James `Dogs
`Smith `Dogs

`Smith `Pigs

18.5.3 Measurements Compared

Here is a utility verb giving time in seconds to evaluate an expression, averaged over say 4 executions.

   time =: (8j5 & ":) @: (4 & (6!:2))

The examples of relations above are too small for meaningful performance measurements,

so we make larger relations

by replicating each say 100 times.

   AT  =: 100 $ AT
   TS  =: 100 $ TS
   sAT =: 100 $ sAT
   sTS =: 100 $ sTS

There are 4 cases to compare:

   t1 =: time 'AT  join  TS'   NB. scalar method, boxed strings
   t2 =: time 'sAT join  sTS'  NB. scalar method, symbols
   t3 =: time 'AT  vjoin TS'   NB. vector method, boxed strings
   t4 =: time 'sAT vjoin sTS'  NB. vector method, symbols

and we see:

   3 3 $ ' '; 'strings'; 'symbols';'scalar';t1;t2; 'vector';t3;t4
+------+--------+--------+
|      |strings |symbols |
+------+--------+--------+
|scalar| 1.06800| 0.02214|
+------+--------+--------+
|vector| 0.01449| 0.00124|
+------+--------+--------+

In Chapter 31 we will return to the topic of performance in computing join of relations.

18.5.4 Saving and Restoring the Symbol Table

Suppose that data is an array of symbols.

   ] data =: s: 2 2 $ 'hello'; 'blah';'blah';'goodbye'
`hello `blah   
`blah  `goodbye

For a symbol in data its original string ('hello' for example) is stored only in the symbol table, not in data itself. The original string is needed to display the value of the symbol.

Suppose that we write data to a file, aiming to read it back in a new session. At the beginning of a new session, the symbol table is empty. Thus we must save the symbol table from the earlier session, and reinstate it at the beginning of the new session.

First, here are two utility functions to save a value to a file and retrieve it. (See Chapter 27 and Chapter 28 for more about data in files.)

   save =: 4 : '(3!:1 x ) 1!:2 < y '
   retr =: 3 : '3!:2 (1!:1 < y )'

Save the data to a file named, say, data.xyz

   data save 'data.xyz'

The symbol table is not itself a variable, but the expression 0 s: 10 gives a value for it. We save this value to a file named, say, symtab.xyz

   (0 s: 10) save 'symtab.xyz'

Start a new J session. The symbol table is initially empty, so begin by reinstating it from the file saved in the earlier session:

   10 s: (retr 'symtab.xyz')
1
   

Now, with the correct symbol table in place, we can retrieve the array of symbols data from its file:

   DATA =: retr 'data.xyz'

and we see that the symbols are correctly interpreted:

   DATA
`hello `blah   
`blah  `goodbye

This is the end of Chapter 18


NEXT
Table of Contents
Index


The examples in this chapter were executed using J version 802 beta. This chapter last updated 12 Sep 2014
Copyright © Roger Stokes 2014. This material may be freely reproduced, provided that acknowledgement is made.


>> << Pri JfC LJ Phr Dic Voc !: Rel NuVoc wd Help Learning J