Fifty Shades of J/Chapter 29

Just say it in J - ANOVA

Principal Topics

# (copy), /. (key), ; (raze), ANOVA, between/within-groups sums of squares, main effects, treatment/residual sums of squares, Latin square, orthogonality, interactions.

An early claim for APL was that the exactness demanded by expressing mathematics in an executable language could lead to significantly greater understanding of underlying algebra and algorithms than was achievable by conventional mathematical notation. Subsequent experience diluted this claim somewhat since the obscurity which even a few lines of code could generate outweighed the inherent discipline of using a programming language as a learning tool. Nevertheless there are some topics in which the expository power of APL, and now even more strongly J, trounces traditional methods. One of the most conspicuous examples of this is the manner in which sums of squares are partitioned as the first stage of the statistical technique known as Analysis of Variance.

The acronym ANOVA, which is at least as well known by its abbreviation as in its full form, is one of the curiosities of statistical terminology, since it is not a single technique, it is not primarily about variance, and it cannot properly be said to be a process of analysis. I surmise that many generations of students have tackled ANOVA as an arcane piece of algebra involving mysterious ‘correction factors’, seemingly arbitrary rules concerning denominators, and mantras about ‘degrees of freedom’, ‘interactions’ and ‘error sums of squares’. Yet if ANOVA is seen as a systematic set of transformations of data, all concerned with partitioning sums of squares, then its rationale and the algebra itself become plain. I am not trying to understate the vast scope of the intellectually demanding science of Design of Experiments which ANOVA leads up to, but rather to suggest that its foundations might be more clearly introduced through the medium of J.

The insights to be obtained from using J assume competence in that language, so as a starting point the following basic verbs can be taken for granted:

mean=: +/ % #     NB. Mean of a list
mdev=: - mean     NB. Deviations from mean
ssq=: +/@:*:      NB. Sum of squares of a list
ssqm=: ssq@mdev   NB. Sum of squares about the mean
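
By way of confirmation, applying them to a simple list:

   mean 1 2 3 6 8
4
   ssqm 1 2 3 6 8
34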

One-way Analysis of Variance

One-way ANOVA is relevant where there are groups of numbers, not necessarily of equal size, but of sufficient general similarity for it to be interesting to speculate on whether the groups are distinguishably different, or whether any variability present is scattered across all the groups. In order to allow the calculations to be checked by mental arithmetic the values used in the following examples are integers. A set of groups of numbers is modelled by a list of lists such as

   data1=: 1 2 3 6 8;4 8 3;3 4 9 5
   mean every data1
4 5 5.25

A common sense way of assessing whether this result shows real differences between groups is to hypothesise no variability within each group, and smooth out any such variability by replacing every value by its group mean. J obliges with the primitive verb copy which, operating as the centre tine of a fork, makes as many copies of the numbers as are necessary:

   ((# every) # (mean every)) data1
4 4 4 4 4 5 5 5 5.25 5.25 5.25 5.25
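
The fork can be taken apart to see the copying explicitly: the left tine counts each group, the right tine averages it, and the centre tine # makes the copies:

   # every data1
5 3 4
   mean every data1
4 5 5.25
   5 3 4 # 4 5 5.25
4 4 4 4 4 5 5 5 5.25 5.25 5.25 5.25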

(There are redundant parentheses in the forked expression in order to make its structure stand out.) The sum of squares of these values about the overall mean is totally free of any within-group variation and so, rationally, is a sound measure of the between-group variation:

   bss=: ssqm@(# every # mean every)  NB. between-group sum of squares
   bss data1
3.91667

For variability within groups, the procedure is to sum the individual sums of squares about the mean within each group:

   wss=: +/@:(ssqm every)             NB. within-group sum of squares
   wss data1
68.75

Given the above results, any subsequent analysis of this particular data is clear-cut, viz. the within-group variability overwhelms the between-group variability. As a happy consequence of algebra, which can be investigated separately, it happens that the sum of bss and wss necessarily equals the sum of squares about the mean which would have arisen with the same set of numbers if there had been no group divisions in the first place. Using raze, this is:

   tss=: ssqm@;                       NB. Total sum of squares
   tss data1
72.6667
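
In conventional notation this is the standard partition identity. Writing $x_{ij}$ for the $j$-th value in group $i$, $n_i$ for the size of that group, $\bar{x}_i$ for its mean and $\bar{x}$ for the overall mean,

$$\sum_i\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2 \;=\; \sum_i n_i\left(\bar{x}_i-\bar{x}\right)^2 \;+\; \sum_i\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2$$

where the left side is tss and the two right-hand terms are bss and wss respectively.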

Hence

   aov1=: bss , wss , tss             NB. One-way ANOVA
   aov1 data1
3.91667 68.75 72.6667

Here is a consolidation of the verbs which describe this first stage in understanding ANOVA:

bss=: ssqm@(# every # mean every)     NB. Between-group sum of squares
wss=: +/@:(ssqm every)                NB. Within-group sum of squares
tss=: ssqm@;                          NB. Total sum of squares
aov1=: bss,wss,tss                    NB. One-way ANOVA

Two-way Analysis of Variance

When data is classified in more than one way, e.g. by the rows and columns of a rectangular matrix, the rows can be boxed separately to give the between-rows sum of squares, and the same process applied to the transposed matrix gives the between-columns sum of squares.

   ]data2=: 6 2 3 , 2 8 5 ,: 5 6 8
6 2 3
2 8 5
5 6 8
   rowssq=: bss@:(<"1)    NB. row sums of squares
   (rowssq data2),(rowssq |:data2)
10.6667 2

Now successively subtract row and column means from the matrix and sum the squares of the result:

   rowdevs=: mdev every@(<"1)
   ssq ,rowdevs |:rowdevs data2
29.3333

This value is the sum of squares remaining after any row and column effects have been removed. If an interaction effect between rows and columns is present, that is, if the response in a particular cell depends on its particular row and column combination over and above the separate row and column effects, then 29.3333 is the sum of squares attributable to this factor. If an interaction assumption is not reasonable, 29.3333 must be accounted for as random variation and is called the residual sum of squares. Since the interaction assumption is the stronger one, define

   iss=: monad : 'ssq,rowdevs |:rowdevs y'
   aov2=: (rowssq every @; |:) , iss , tss
   aov2 data2
10.6667 2 29.3333 42

As a check, the sum of the first three items in the above must equal the last.
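
This is easily confirmed in a session:

   (+/ 3 {. aov2 data2) = {: aov2 data2
1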

For square matrices it is possible to superimpose a further classification, often called treatments, based on a so-called Latin square design, in which each treatment occurs exactly once in each row and once in each column, such as

0 1 2
2 0 1
1 2 0

Use the ravel of this matrix as the left argument of box key (</.) to obtain a between-treatments sum of squares:

   x=: 0 1 2 2 0 1 1 2 0
   x </. ,data2
┌─────┬─────┬─────┐
│6 8 8│2 5 5│3 2 6│
└─────┴─────┴─────┘
   bss x </. ,data2
24.6667

which accounts for a further 24.6667 of the previous 29.3333 sum of squares (assuming it had been considered as residual), leaving 4.6667 as the residual for the Latin square design.

The property that the row, column and treatment sums of squares, together with a residual, partition the total sum of squares depends on orthogonality in the design. For example an arbitrary arrangement of three 0s, three 1s, and three 2s such as

   x1=: 1 0 2 2 0 1 0 1 2

does not possess this property.
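
To see why, arrange x1 as a matrix and compute its between-treatments sum of squares with the verbs already defined:

   3 3 $ x1
1 0 2
2 0 1
0 1 2
   bss x1 </. ,data2
2.66667

Treatment 0, for example, now appears twice in the middle column, so treatment comparisons are entangled with column effects, and row, column and treatment sums of squares no longer partition the total cleanly.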

Three-way Analysis of Variance

Where a 2-way experiment is replicated either by simple repetition, or by applying different treatments to each replicate, the data is best presented as a 3-dimensional array, for example data3, in which a 3-row by 3-column experiment is replicated 4 times:

   data3=: 4 3 3 $ ; <@".;._2 noun define
130  34  20
150 136  25
138 174  96

155  40  70
188 122  70
110 120 104

 74  80  82
159 106  58
168 150  82

180  75  58
126 115  45
160 139  60
)

Consolidate the data by +/data3, apply aov2, and divide by 4 to take account of the fact that each new cell is the sum of 4 original ones. The resulting total sum of squares is less than that of the full data because +/ has eliminated the variation due purely to repetition.
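
The consolidation itself is simply:

   +/data3
539 229 230
623 479 198
576 583 342

A full ANOVA for data3 is then given by adapting aov2 to aov2r (ANOVA with repetition)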

   aov2r=: monad define
u=. }: (aov2 +/y) % {. $y      NB. row, column and interaction SS of the consolidated table
u , (t - +/ ,u) , t=. tss y    NB. append residual and total
)
   aov2r data3
10683.7 39118.7 9613.78 18230.8 77647

The five values in this list therefore stand for sums of squares for rows, columns, interaction, residual and total, and as before the sum of the first four must equal the fifth.

Now suppose that the four values in each cell result from four different treatments, so that there are now three main effects (rows, columns and treatments) with the possibility of three two-way interactions (RC, RT and CT). Summing out each factor in turn is accomplished by +/ applied at each of the three possible ranks, with division by the appropriate shape element:

   (aov2 +/ data3) % 4
10683.7 39118.7 9613.78 59416.2
   (aov2 +/"1 data3) % 3
354.972 10683.7 5358.94 16397.6
   (aov2 +/"2 data3) % 3
354.972 39118.7 3678.61 43152.3

All the values required are present but clearly there is some redundancy. The last-but-one value in each of the above lists represents the sum of squares due to an interaction, the main effects each appear twice over, and the residual sum of squares (or equivalently the 3-way interaction effect) is most easily found by subtracting all the main-effect and 2-way interaction sums of squares from the total sum of squares. All of this is sorted out in

   aov3=: monad define
t=. }:"1 (aov2 every (+/y) ; (+/"2 y) ; +/"1 y) % $y   NB. aov2 on each consolidation, scaled; totals dropped
t=. (0 1 3 { ,t) , {:"1 t                              NB. the three main effects, then RC, CT and RT
t , (u - +/t) , u=. tss,y                              NB. append residual and total
)
   aov3 data3
10683.7 39118.7 354.972 9613.78 3678.61 5358.94 8838.22 77647

which gives, in order, the row, column and treatment main effects, the RC, CT and RT interactions, and the residual and total sums of squares. If required the residual sum of squares could be broken down further into three-way interactions by extending the existing technique, although from a substantive point of view such interactions are usually difficult to conceive.
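
As a finishing touch, the components can be displayed under labels (the names below are chosen merely for brevity) by laminating boxed names onto the boxed values:

   (;:'row col trt RC CT RT resid total') ,: <"0 aov3 data3
┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬─────┐
│row    │col    │trt    │RC     │CT     │RT     │resid  │total│
├───────┼───────┼───────┼───────┼───────┼───────┼───────┼─────┤
│10683.7│39118.7│354.972│9613.78│3678.61│5358.94│8838.22│77647│
└───────┴───────┴───────┴───────┴───────┴───────┴───────┴─────┘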

The above development shows how the main J primitives match operations which are logically necessary steps for understanding the process of calculating the values in ANOVA tables. Each of the values in an ANOVA list bar the last provides a means of estimating the overall variance present in the data assuming that none of the effects are significant. If these estimates vary by too much (where ‘too much’ is determined by the so-called F-statistic) then the corresponding effect is deemed to be significant. This is the point at which raw calculation stops and statistical reasoning takes over.

Code Summary

Basic Statistical verbs

mean=: +/ % #                                     NB. Mean of list
mdev=: - mean                                     NB. Deviations from mean
ssq=: +/@:*:                                      NB. Sum of squares of a list
ssqm=: ssq@mdev                                   NB. Sum of squares about the mean

One-way Analysis of Variance (data is list of lists)

aov1=: bss , wss , tss                            NB. one-way ANOVA
bss=: ssqm@(# every # mean every)                 NB. between-group sum of squares
wss=: +/@:(ssqm every)                            NB. within-group sum of squares
tss=: ssqm@;                                      NB. total sum of squares

Two-way Analysis of Variance (data is a numerical matrix)

aov2=: monad define                               NB. two-way ANOVA
u=. ,rowssq every y ; |:y                         NB. row and column sums of squares
u , (t - +/ ,u) , t=. tss y                       NB. append interaction (or residual) and total
)                                                 
rowssq=: bss@:(<"1)                               NB. row sums of squares
rowdevs=: mdev every@(<"1)                        NB. remove row&col means
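
Two-way Analysis of Variance with repetition (data is a 3-D numeric array)

aov2r=: monad define                              NB. two-way ANOVA with repetition
u=. }: (aov2 +/y) % {. $y                         NB. row, column and interaction SS
u , (t - +/ ,u) , t=. tss y                       NB. append residual and total
)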

Three-way Analysis of Variance (data is a 3-D numeric array)

aov3=: monad define
t=. }:"1 (aov2 every (+/y) ; (+/"2 y) ; +/"1 y) % $y   NB. aov2 on each consolidation, scaled; totals dropped
t=. (0 1 3 { ,t) , {:"1 t                              NB. main effects, then RC, CT and RT interactions
t,(u - +/t) , u=. tss,y                                NB. append residual and total
)

Script

File:Fsojc29.ijs