Fifty Shades of J/Chapter 29
Table of Contents ... Glossary ... Previous Chapter ... Next Chapter
Principal Topics
- # (copy) /. (key) ; (raze) ANOVA, between/within groups sums of squares, main effects, treatment/residual sums of squares, Latin square, orthogonality, interactions.
An early claim for APL was that the exactness demanded by expressing mathematics in an executable language could lead to significantly greater understanding of underlying algebra and algorithms than was achievable by conventional mathematical notation. Subsequent experience diluted this claim somewhat since the obscurity which even a few lines of code could generate outweighed the inherent discipline of using a programming language as a learning tool. Nevertheless there are some topics in which the expository power of APL, and now even more strongly J, trounces traditional methods. One of the most conspicuous examples of this is the manner in which sums of squares are partitioned as the first stage of the statistical technique known as Analysis of Variance.
The acronym ANOVA, which is at least as well known by its abbreviation as in its full form, is one of the curiosities of statistical terminology, since it is not a single technique, it is not primarily about variance, and it cannot properly be said to be a process of analysis. I surmise that many generations of students have tackled ANOVA as an arcane piece of algebra involving mysterious ‘correction factors’, seemingly arbitrary rules concerning denominators, and mantras about ‘degrees of freedom’, ‘interactions’ and ‘error sums of squares’. Yet if ANOVA is seen as systematic set of transformations of data which all concerned portioning of sums of squares, then its rationale and the algebra itself become plain. I am not trying to underestimate the vast scope of the intellectually demanding science of Design of Experiments which ANOVA leads up to, but rather to suggest that its foundations might be more clearly introduced through the medium of J.
The insight to be obtained from using J assume competence in that language, so that as a starting point the following basic verbs can be taken for granted :
mean=: +/ % # NB. Mean of a list mdev=: - mean NB. Deviations from mean ssq=: +/@:*: NB. Sum of squares of a list ssqm=: ssq@mdev NB. Sum of squares about the mean
One-way Analysis of Variance
One-way ANOVA is relevant where there are groups of numbers, not necessarily of equal size, but of sufficient general similarity for it to be interesting to speculate on whether the groups are distinguishably different, or whether any variability present is scattered across all the groups. In order to allow the calculations to be checked by mental arithmetic the values used in the following examples are integers. A set of groups of numbers is modelled by a list of lists such as
data1=: 1 2 3 6 8;4 8 3;3 4 9 5 mean every data1 4 5 5.25
A common sense way of assessing whether this result shows real differences between groups is to hypothesise no variability within each group, and smooth out any such variability by replacing every value by its group mean. J obliges with the primitive verb copy which, operating as the centre tine of a fork, makes as many copies of the numbers as are necessary:
((# every) # (mean every)) data1 4 4 4 4 4 5 5 5 5.25 5.25 5.25 5.25
(There are redundant parentheses in the above expression in order to make the fork structure stand out.) The sum of squares of these values about the overall mean is totally free of any within-group variation and so, rationally, is a sound measure of the between-group variation:
bss=: ssqm@(# every # mean every) NB. between-group sum of squares bss data1 3.91667
For variability within groups, the procedure is to sum the individual sums of squares about the mean within each group:
wss=: +/@:(ssqm every) NB. within-group sum of squares wss data1 68.75
Given the above results, any subsequent analysis of this particular data is clear cut, viz. the within-group variability overwhelms the between-group variability. As a happy consequence of algebra, which can be investigated separately, it happens that the sum of bss and wss necessarily equals the sum of squares about the mean which would have arisen with the same set of numbers if there had been no group divisions in the first place. Using raze, this is:
tss=: ssqm@; NB. Total sum of squares tss data1 72.6667
Hence
aov1=: bss , wss , tss NB. One-way ANOVA aov1 data1 3.91667 68.75 72.6667
Here is a consolidation of the verbs which describe this first stage in understanding ANOVA :
bss=: ssqm@(# every # mean every) NB. Between-group sum of squares wss=: +/@:(ssqm every) NB. Within-group sum of squares tss=: ssqm@; NB. Total sum of squares aov1=: bss,wss,tss NB. One-way ANOVA
Two-way Analysis of Variance
When data is classified in more than one way, e.g. by the rows and columns of a rectangular matrix, the rows can be boxed separately to give the between rows sums of squares of the rows, and the same process for the transposed matrix gives the between column sums of squares.
]data2=: 6 2 3 , 2 8 5 ,: 5 6 8 6 2 3 2 8 5 5 6 8 rowssq=: bss@:(<"1) NB. row sums of squares (rowssq data2),(rowssq |:data2) 10.6667 2
Now successively subtract row and column means from the matrix and sum the squares of the result :
rowdevs=: mdev every@(<"1) ssq ,rowdevs |:rowdevs data2 29.3333
This value is the sum of the squares after any row and column effects have been removed. If an interaction effect between rows and columns is present, that is the occurrence in a particular cell of one of the experimental units is also dependent on its row and column combination, then 29.3333 is the sum of squares attributable to this factor. If an interaction assumption is not reasonable, 29.3333 must be accounted for as random variation and is called the residual sum of squares. Since the interaction assumption is stronger define
iss=: monad : 'ssq,rowdevs |:rowdevs y' aov2=: (rowssq every @; |:) , iss , tss aov2 data2 10.6667 2 29.3333 42
As a check the first three items in the above must be equal to the last.
For square matrices it is possible to superimpose a further classification, often called treatments, based on a so-called Latin square design such as
0 1 2 2 0 1 1 2 0
Using the ravel of this matrix as the basis of box key to obtain a between-treatments sum of squares :
x=: 0 1 2 2 0 1 1 2 0 x </. ,data2 ┌─────┬─────┬─────┐ │6 8 8│2 5 5│3 2 6│ └─────┴─────┴─────┘ bss x </. ,data2 24.6667
which accounts for a further 24.6667 of the previous 29.3333 sum of squares, assuming these were considered as residual.
The property that the sums of row, column and treatment sums of squares is the total sum of squares depends on orthogonality in the design. For example an arbitrary arrangement of three 0s, three 1s, and three 2s such as
x1=: 1 0 2 2 0 1 0 1 2
does not possess this property.
Three-way Analysis of Variance
Where a 2-way experiment is replicated either by simple repetition, or by applying different treatments to each replicate, the data is best presented as a 3-dimensional matrix, for example data3 which has 4 rows and 5 columns, and within each of the resulting 20 cells each of two treatments are applied :
data3=: 5 3 3 $ ; <@".;._2 noun define 130 34 20 150 136 25 138 174 96 155 40 70 188 122 70 110 120 104 74 80 82 159 106 58 168 150 82 180 75 58 126 115 45 160 139 60 )
Consolidate the data by +/data3, apply aov2 and divide by 4 to take account of the fact that each new cell is a sum of 4 original ones. The resulting sum of squares is less than that in the full data because +/ has eliminated the variation due purely to repetition. A full ANOVA for data3 is thus given by aov2 adapted to aov2r (ANOVA with repetition)
aov2r=: monad define u=. }: (aov2 +/y) % {. $y u , (t - +/ ,u) , t=. tss y ) aov2r data3 10683.7 39118.7 9613.78 18230.8 77647
The five values in this list therefore stand for sums of squares for rows, columns, interaction, residual and total, and as before the sum of the first four must equal the fifth.
Now suppose that the several vales in each row/cell result from four different treatments, so that there are now three main effects (rows, columns and treatments) with the possibility of three interactions (RC, RT and CT). Averaging out the reduced effect is accomplished by +/ qualified by the three possible ranks, and divided by the appropriate shape element
(aov2 +/ data3) % 4 10683.7 39118.7 9613.78 59416.2 (aov2 +/"1 data3) % 3 354.972 10683.7 5358.94 16397.6 (aov2 +/"2 data3) % 3 354.972 39118.7 3678.61 43152.3
All the values required are present but clearly there is some redundancy. The last but one values in the above lists represent the sums of squares due to interactions, the main effects appear twice over, and the residual sum of squares (or equivalently the 3-way interaction effect) is most easily found by subtracting all the main and 2-way interactions from the total sum of squares. All of this is sorted out in
aov3=: monad define t=. }:"1 (aov2 every (+/y) ; (+/"2 y) ; +/"1 y) % $y t=. (0 1 3 { ,t) , {:"1 t t , (u - +/t) , u=. tss,y ) aov3 data3 10683.7 39118.7 354.972 9613.78 3678.61 5358.94 8838.22 77647
which give in order row, column, treatment main effects, RC, CT and RT interactions, residual and total sums of squares. If required the residual sum of squares could be broken down further into three-way interactions by extending the existing technique, although from a substantive point of view such interactions are usually difficult to conceive.
The above development shows how the main J primitives match operations which are logically necessary steps for understanding the process of calculating values infor ANOVA tables. Each of the values in an ANOVA list bar the last provides a means of estimating the overall variance present in the data assuming that none of the effects are significant. If these estimates vary by too much (where ‘too much’ is determined by the so-called F-statistic) then the corresponding effect is deemed to be significant. This is the point at which raw calculation stops and statistical reasoning takes over.
Code Summary
Basic Statistical verbs
mean=: +/ % # NB. Mean of list mdev=: - mean NB. Deviations from mean ssq=: +/@:*: NB. Sum of squares of a list ssqm=: ssq@mdev NB. Sum of squares about the mean
One-way Analysis of Variance (data is list of lists)
aov1=: bss , wss , tss NB. one-way ANOVA bss=: ssqm@(# every # mean every) NB. between-group sum of squares wss=: +/@:(ssqm every) NB. within-group sum of squares tss=: ssqm@; NB. total sum of squares
Two-way Analysis of Variance (data is a numerical matrix)
aov2=: monad define NB. two-way ANOVA u=. ,rowssq every y ; |:y r=. u , (t - +/ ,u) , t=. tss y ) rowssq=: bss@:(<"1) NB. row sums of squares rowdevs=: mdev every@(<"1) NB. remove row&col means
Three-way Analysis of Variance (data is a 3-D numeric array)
aov3=: monad define t=. }:"1 (aov2 every (+/y) ; (+/"2 y) ; +/"1 y) % $y t=. (0 1 3 { ,t) , {:"1 t t,(u - +/t) , u=. tss,y )