Factorial ANOVA Designs

Permutation tests have been around statistical analysis for nearly 75 years, and a wide variety of tests have been constructed. You would probably think that there are all sorts of algorithms to set up permutation procedures for factorial designs, but you would be wrong. Although many people have discussed them, Anderson (2001) and Manly (2007) being two of the best discussions, when a factorial design includes a significant interaction things become sticky. Edgington and Onghena (2007) point out that an exact test for interaction can not be created. However there are good approximate tests that we can use. This document will cover proposals by a number of authorities.

As I indicated above, Anderson (2001) has one of the best discussions of this kind of test. She is certainly not the first, and her work derives from work by Edgington, Manly, and others, but she has pulled much of it together. I started by writing this page around her work, but found that it was a bit easier to write it around Manly's book, although you come out at the same place in the end. I am creating this page mainly for myself. I have written the necessary code in R to implement many of Manly's alternatives, but I feel a need to pull that stuff together and make sure that I understand what I am doing. So be patient!

I will start with an example taken directly from Manly (p. 144). It concerns the number of ants consumed by two sizes of lizards over each of four months. (Well, I was getting tired of examples from psychology, and who couldn't get excited by prying open the mouths of lizards and counting the number of ants they have just eaten.)

Small Large June 13 242 105 182 21 7 July 8 59 20 24 312 68 August 515 488 88 460 1223 990 September 18 44 21 140 40 27

The cell means are

Small Large June 120.00000 70.0000 July 29.00000 134.6667 August 363.66667 891.0000 September 27.66667 69.0000

and it looks as if we have the possibility of an interaction (In June the Small guys actually ate more ants that the Large guys, but the reverse happened in other months.) It also looks as if we have might have an effect for Size, with Large guys eating more ants than Small guys. (Who would have guessed?) In fact, the analysis of variance summary table for these data confirms that. The interaction is very borderline, as is the Size effect, but the Months effect is substantial.

The means for Size and Months follow, along with the ANOVA summary table.

Small Large [1,] 135.0833 291.1667

June July August September [1,] 95 81.83333 627.3333 48.33333

The standard ANOVA for these data follows: df Sum Sq Mean Sq Fvalue Pr(>F) size 1 146172 146172 4.4699 0.05055 months 3 1379495 459832 14.0615 9.487e-05 *** size:months 3 294009 98003 2.9969 0.06171 Residuals 16 523222 32701

I am assuming that the reader will know about permutation tests and the general procedures lying behind them. If you don't, I recommend going to www.uvm.edu/~dhowell/StatPages/index.html.

The basic idea with a *one-way*
Anova design is that if the null hypothesis is true, an observation in
Group 1 could just as easily have fallen in Group 2, and vice versa. So
we simply permute the observations across groups, calculate an *F*,
repeat this perhaps 5000 times, and find the percentage of repetitions in
which the calculated values of *F* exceeded the *Fs* obtained
from the original data. This is the *p*-value under the null hypothesis.

The above description applies easily to the case of a one-way Anova or a *t*
test, where it is obvious how permutations should be done. However things get messy when it
comes to a factorial Anova because it is not always obvious what should be permuted and
over what cells the permutation should take place. I will develop that idea in a
moment. But first I need to discuss the assumptions behind our tests.

If we have a true experiment, where observations (subjects) are
assigned to the conditions at random, then life is simple. We only
consider the case of the null hypothesis in forming our sampling
distribution of outcomes, and under the null hypothesis random
assignment will insure that the test is valid. However when we come to
observational studies things are a bit messier. In our example here, it
is quite clear that we cannot randomly assign subjects to the Small
versus Large sizes. In this case we must assume that the observations
are exchangeable, meaning that it is reasonable for a score to fall in
either group under the null. If the observations are independent, and
if errors come from the same distribution (i.e., they are identically
and independently distributed --iid) then observations are
exchangeable. You will notice that this is a much milder assumption
than we make with traditional parametric tests. (The data do **not**
have to be normal (and these data certainly are not) or have any other
specific distribution. But
notice that requiring the errors of observations to be iid brings in
the traditional assumption of homogeneity of variance.

There are a number of different approaches to analyzing factorial designs, but the first problem concerns the interaction. In the case of an interaction there is no exact test that we can use. (An exact test is one that has a probability of a Type I error of exactly .05, for example, under repeated sampling when the null is true.) But we can develop tests which are approximately exact, and that is very likely to be more than adequate for our needs except in extreme cases.

We need to run our test so that we test our interaction in such a way as to
control for main effects. One way to control for main effects is to
restrict how we do our permutations. For example, if we want to control
for a main effect of Size, we could restrict our permutations so as to
only permute among months within each Size group. Then any Size effects,
if present, will not influence the results of resampling. For example, we could
permute observations between Months for Small lizards and then do the
same thing for Large lizards, but we could not permute a score from one
lizard size into another. This would help to control for effects of
Months, but we still have potential effects of Size. If we further
restrict permutations to stay within each Size x Month combination, we
would obtain the same *F* for all allowable permutations, and that
wouldn't get us far. So we need to try something else.

First we need to read in the data. In *R* the commands would be

ants <- c(13, 242, 105, 182, 21, 7, 8, 59, 20, 24, 312, 68, 515, 488, 88, 460, 1223, 990, 18, 44, 21, 140, 40, 27) #These next two commands merely set up #the factors of size and months as 1s and 2s or 1, 2, 3, 4. size <- as.factor(rep(1:2, each = 3, times = 4)) months <- as.factor(rep(1:4, each = 6))

One way to permute our data would be to randomize them over all
cells in the experiment. For example, we could take our 24 values,
shake them up in a hat, and write them down in whatever order they came
out of the hat next to the columns that contain information on Size and Months.
We would calculate the *F* for our three effects, store those values away,
and then repeat this procedure another 4999 times. This would leave us with 5000
values of *F*_{SM}, *F*_{S}, and *F*_{M} that would reasonably occur under the null hypothesis.
We then compare our obtained *F* against that distribution and calculate
the percentage of replications under the null hypothesis when the resampled *Fs*
exceed the obtained *F.* We would do the same thing with the two main effects.
This is the procedure recommended by Manly (2007), and is perhaps the easiest to
carry out. The R code and the resulting probability values follow.

#################################################################### Standard Anova on these data mod1 <- lm(ants ~ size + months + size:months) ANOVA <- summary(aov(mod1)) cat( " The standard ANOVA for these data follows ","\n") Fsize <- ANOVA[[1]]$"F value"[1] # SavingFvalues for future use Fmonths <- ANOVA[[1]]$"F value"[2] Finteract <- ANOVA[[1]]$"F value"[3] print(ANOVA, "\n") cat( "\n") cat( "\n") print( "Resampling as in Manly with unrestricted sampling of observations. ") # Now start resampling nreps <- 5000 FS <- numeric(nreps) #Set up space to storeFvalues as calculated. FM <- numeric(nreps) FSM <- numeric(nreps) FS[1] <- Fsize # The firstFof our 5000 FM[1] <- Fmonths FSM[1] <- Finteract for (i in 2:nreps) { newants <- sample(ants, 24) mod2 <- lm(newants ~ size + months + size:months) b <- summary(aov(mod2)) FS[i] <- b[[1]]$"F value"[1] FM[i] <- b[[1]]$"F value"[2] FSM[i] <- b[[1]]$"F value"[3] } probS <- length(FS[FS >= Fsize + .Machine$double.eps ^0.5])/nreps probM <- length(FM[FM >= Fmonths+ .Machine$double.eps ^0.5])/nreps probSM <- length(FSM[FSM >= Finteract + .Machine$double.eps ^0.5])/nreps ### The addition of "+ .Machine$double.eps" is an aid against two numbers that differ only by ### floating point computer calculations at the extreme. cat(" The probability value for the interaction is ",probSM, "\n") cat(" The probability value for Size is ", probS, "\n") cat(" The probability value for Months is ", probM, "\n")Manly's Approach--unrestricted permutation of observations.The probability value for the interaction is 0.048 The probability value for Size is 0.0458 The probability value for Months is 2e-04

Edgington (2007) goes at the problem from a different direction. He maintains
that there is no exact test for interactions, but suggests that you can
get an "indication" of the presence of interactions by testing the
interaction in the same way that Manly did. But he only tests the
interaction that way. In other words Edgington permutes all
observations across all cells 5000 times, computes the *F* for
the interaction for each one, and then calculates the percentage of cases
where the *F* on the permuted data exceeded our obtained *F*.

But what about the main effects? Well, there are two ways to deal with them. If the interaction is significant, you probably don't care about the main effects. I have been a holdout arguing that it is OK to look at the main effects if you have a good reason, but I am coming to the view that you really seldom have a good reason to want to deal with the main effects in the face of an interaction. I am now joining the rest of the statistical community. Look at simple effects if you want to, but skip the main effects.

But if the interaction effect is not significant you would probably want to
go ahead and deal with main effects. One way to do this is the same way
that Manly does. But Edgington would argue that if you don't have an
interaction, your best model is
Y_{ijk} = μ + S_{i.} + M_{.j} + ε_{ijk},
which is an additive model--it does not have an interaction. (This is the approach
that SPSS calls Type II SS, although SPSS does not make it conditional on a
nonsignificant interaction.) Edgington reasons that if this is now your model
you don't have to adjust for an interaction. Therefore you can test Size by
shuffling the data for each month separately between size categories, and then test Months by
shuffling the data for each size among the month categories. In each of
these steps you form the distribution of the randomized *Fs* and compare
your obtained *Fs* against those distributions. The code for this, and
the resulting output, is shown below.

################################################################################ Now moving to Edgington # Edgington takes the interaction to be the same as Manly above. # To test Size and Month he uses restricted randomization and the main effect Fs. cat( "\n") cat( "\n") print( "Resampling as in Edgington with restricted sampling for main effects. ") cat("Edgington's test for interaction has probability = ", probSM, "\n which is the same as Manly's.") # I am now going to work with an additive model. # Main effect of Size FS <- numeric(nreps) FS[1] <- Fsize for (i in 2:nreps) { M1 <- ants[1:6] # Combining across size within month M2 <- ants[7:12] M3 <- ants[13:18] M4 <- ants[19:24] temp1 <- sample(M1,6, replace = FALSE) temp2 <- sample(M2,6, replace = FALSE) temp3 <- sample(M3,6, replace = FALSE) temp4 <- sample(M4,6, replace = FALSE) newants <- c(temp1, temp2,temp3, temp4) mod5 <- summary(aov(newants ~ size + months )) FS[i] <- mod5[[1]]$"F value"[1] } probS <- length(FS[FS >= Fsize + .Machine$double.eps ^0.5])/nreps cat("The probability for the effect of Size is ", probS,"\n") # Main effect of Months FM <- numeric(nreps) FM[1] <- Fmonths for (i in 2:nreps){ S1 <- ants[size == 1] S2 <- ants[size == 2] temp1 <- sample(S1, 12, replace = FALSE) temp2 <- sample(S2, 12, replace = FALSE) newants <- c(temp1,temp2) mod6<- summary(aov(newants ~ size + months)) FM[i] <- mod6[[1]]$"F value"[2] } probM <- length(FM[FM >= Fmonths + .Machine$double.eps ^0.5])/nreps cat("The probability for the effect of Months is ", probM, "\n") [1] "Resampling as in Edgington with restricted sampling for main effects. " Edgington's test for interaction has probability = 0.048 The probability for the effect of Size is 0.04 The probability for the effect of Months is 2e-04

Still and White (1981) follow Edgington in using restricted randomization
for main effects, but they control for main effects by using residuals. In this
case the residuals are what is left over after you remove row and column effects.
If you look at all of your data combined, there are potential main
effects of Size and Months included in them. But if we subtract out
those effects and use the residuals that result, we can test the
**interaction** without worrying about main effects. Because we have
removed any possible row and column effects, the residuals are
exchangeable under the null hypothesis. One simple way to do this is to
compute r_{ijk} = Y_{ijk} - Ybar_{i.} - Ybar_{.j} + Ybar_{..}.
In other words for each observation in Row_{1} and Column_{2},
we subtract out the Row_{1} mean and the Column_{2}
mean and add back in the grand mean. We do the analogous thing to
observations in the other cells. If you now ran a complete analysis of
variance on the residuals you would find that the sums of squares for
Size and Months were exactly 0, and the *F* for the interaction
would be a fair test of that interaction uncontaminated by main effects.

The procedure just described is the easiest way to describe what Still and White are doing in computing their interaction, but I can accomplish the same thing by fitting an additive model (a model with no interaction term) to the data and asking my program to calculate residuals. In other words, if you were using SPSS, for example, you would just tell it to run a two-way Anova but specify that the model does not contain the interaction term. (Just click on the "model" button.) Then ask it to save the residuals and use those residuals as the dependent variable in a two-way with the interaction included. You will get the same result either way you do it, but I think that the second is easier to code.

But we only do this for the interaction. The main effects will not be calculated on these residuals because they would come out to be zero. When it comes to the main effects, Still and White follow Edgington's lead and will obtain the same results (within sampling error). Thus we don't have to recalculate those results because we will just steal them from Edgington. The code and the output follow.

# Now moving to Still and White (1981) # They test the interaction using residuals of the form # Y[ij] - Ybar[i,] - Ybar[.j] + Ybar[..] # The easiest way to do this is to get the residuals from an additive model. mod2 <- lm(ants ~ months + size) res <- mod2$residuals # These are the residuals from the additive model # Now do the resampling to get the interaction F. SHint <- numeric(nreps) SHint[1] <- Finteract for (i in 2:nreps) { newants <- sample(res, 24, replace = FALSE) mod7 <- summary(aov(newants ~ size + months + size:months)) SHint[i] <- mod7[[1]]$"F value"[3] } probInt <- length(SHint[SHint >= Finteract + .Machine$double.eps ^0.5])/nreps cat( "\n") cat("\n") print("Resampling as in Still and White with unrestricted sampling. ") cat("The probability for the effect of Interaction is ", probInt, "\n") # Still and White have the same main effects as Edgington if there is no interaction. cat("The probability for the effect of Size is ", probS, "\n") cat("The probability for the effect of Months is ", probM, "\n") ########################################################################### "Resampling as in Still and White with restricted sampling for main effects " and randomization of residuals for the interaction." The probability for the effect of Interaction is 0.0512 # Still and White have the same main effects as Edgington The probability for the effect of Size is 0.04 The probability for the effect of Months is 2e-04

ter Braak has done a great deal of work with randomization
procedures and he advocates an approach similar to the Still and White
approach except that in calculating the interaction the residuals are taken
over the whole design rather
than just the additive model. This amounts to removing the cell mean
from each observation, to create a set of residuals, and then permute
those residuals across all cells of the design. This is very much like
the Still and White approach but they only subtracted row and column
means, not individual cell means. Again this amounts to fitting a
complete model (*including the interaction*) and computing the residuals.
The code and the results of this analysis follow.

# Ter Braak creates residuals from cell means and then permutes across # all cells # This can be accomplished by taking residuals from the full model mod9 <- lm(ants ~ months + size + months:size) res2 <- mod9$residuals TBint <- numeric(nreps) # We don't have to calculate the effects of Month and Size, because # these come from the calculations for Edgington. TBint[1] <- Finteract for (i in 2:nreps) { newants <- sample(res2, 24, replace = FALSE) mod10 <- summary(aov(newants ~ size + months + size:months)) TBint[i] <- mod10[[1]]$"F value"[3] } probInt <- length(TBint[TBint >= Finteract + .Machine$double.eps ^0.5])/nreps cat( "\n") cat( "\n") print( "Resampling as in ter Braak with unrestricted sampling of cell residuals. ") cat("The probability for the effect of Interaction is ", probInt, "\n") cat("The probability for the effect of Size is ", probS, "\n") cat("The probability for the effect of Months is ", probM, "\n") "Resampling as in ter Braak with unrestricted sampling of cell residuals. " The probability for the effect of Interaction is 0.0488 The probability for the effect of Size is 0.04 The probability for the effect of Months is 2e-04

The solutions given here are not the only possibilities, but they cover the important ones. The paper by Anderson (2001) is excellent and offers a number of recommendation based on power and error rates coming from on a related study by Anderson and ter Braak (unpublished). She basically suggests using a test such as Manly's for the interaction, where you use unrestricted permutation of raw observations over all cells. This is particularly helpful with small sample sizes. When it comes to testing main effects, Anderson suggests that permutation of residuals under the reduced model. This amounts to calculating residuals from an additive model and then randomizing them the same way that Edgington would randomize observations--i.e., restricting the reandomizations to individual levels of the other variable.

We can summarize the results of these different analyses by setting up a table showing the obtained *p*-values.
This table follows. You will notice two things. First, it does not make
a huge difference which approach you take, although there are
differences. Second, some of the probabilities are exactly equal because for
those terms the different approaches do the same thing. For a good
overview of this whole field, going beyond ANOVA into partial
regression, I recommend Anderson's paper.

Summary of results inp-values

Effect Parametric F Manly Edgington Still-White ter Braak Size .0506 .0485 .0400 .0400 .0480 Months .0000 .0000 .0000 .0000 .0000 SxM .0617 .0480 .0480 .0512 .0488

(This should be a separate page, but I haven't gotten around to that yet.) So far I have only spoken about standard factorial designs, often called crossed designs. In these cases every level of the row variable is paired with every level of the column variable. But we sometimes have what are called "nested designs," in which we don't have the same kind of pairing. We need a way to deal with those designs as well.

Suppose that we want to take 5 male therapists and 5 female therapists and have each of them provided a numerical rating of each of 10 clients. You don't have the same type of crossing of therapists with gender because Mary cannot serve as a male therapist and John can't serve as a female therapist. So we can't calculate an interaction of therapist x gender. Instead, we will have a term for Gender, a term for Therapist(Gender), read "therapist within gender," and a term for within cell error. The Therapist(Gender) term will actually be the sum of the simple effect of Therapist at each level of Gender.

The following data can be used to illustrate the
problem.

Therapist12345 Male 98681046577 79666116387 111386141113131011 1211161192312101911 1019145101114151111 Therapist678910 Female 8646765797 1078104710677 14111814132217161211 20161615181620221419 21191715221622221821

It is difficult to produce the standard analysis of
variance summary table for a nested design in *R*. (At least it is difficult for
me.) But you can get the appropriate result (sort of) if you use the command

a <- summary(aov(dv ~ gen + th/gen )).

The output is shown below

Df Sum Sq Mean Sq F value Pr(>F) gender 1 240.25 240.250 29.936 3.981e-07 *** therapist 8 1705.24 213.155 26.559 < 2.2e-16 *** Residuals 90 722.30 8.026 ---

BUT that is not quite what you want. Gender should not be tested against the residuals. You can compute your *F* for Gender by dividing by MS(Therapist(Gender)). You can get the *F* for Therapist(Gender) if you divide Therapist by Error: within (also shown as Residuals). The correct result is given
below.

Source df SS MS F Gender 1 240.25 240.25 1.12711 Error1 8 1705.24 213.155 Therapist(Gender) 8 1705.24 213.155 26.5595 Error2 90 722.3 8.02556

As you can see from the summary table, there is clearly no effect due to Gender, but there is a significant Therapist within Gender effect. This is not particularly exciting because all that it really says is that therapists are not all alike, which finding will not likely win us the Psychologist of the Year award.

It is relatively easy to create the necessary code for a permutation test on these data. What is important here is to recognize that the permissible randomizations depend on the effect being tested. For example, to test differences between therapists nested within genders [Therapist(Gender)] it does not make sense to redistribute scores from a male therapist with scores from a female therapist because therapists are nested within treatment. So what we will do is to restrict our randomization of observations to stay within the same category of Gender. But to test the effect of Gender, we will randomly allocate "whole" therapists. By that I mean that we will take the 10 scores for each therapist and move them as a whole from one Gender to another. To quote Anderson, with suitable changes in variable names,

"If the null hypothesis were true, the variability due to Therapists within Genders would be similar to the residual variation. That is, variability due to Therapist is exchangeable with residual variation under the null hypothesis. If SS_{total}and SS_{Gender}remain constant under permutation, then the shuffling strategy is 'mixing' variability only between SS_{T(G)}and SS_{residual}, which gives an exact test for differences among genders."

# Resampling Nested Designs dv <- c(9,8,6,8,10,4,6,5,7,7, 7,9,6,6,6,11,6,3,8,7, 11,13,8,6,14,11,13,13,10,11, 12,11,16,11,9,23,12,10,19,11, 10,19,14,5,10,11,14,15,11,11, 8,6,4,6,7,6,5,7,9,7, 10,7,8,10,4,7,10,6,7,7, 14,11,18,14,13,22,17,16,12,11, 20,16,16,15,18,16,20,22,14,19, 21,19,17,15,22,16,22,22,18,21) ther <- as.factor(rep(1:10, each = 10)) gender <- as.factor(rep(1:2, each = 50)) n <- 10 N <- length(dv) g <- nlevels(gender) tt <- nlevels(ther) # F values from Nested.R mod1 <- summary(aov(dv ~ gender*ther)) Fthwingen1 <- mod1[[1]]$"F value"[2] Fgen1 <- mod1[[1]]$"Mean Sq"[1]/mod1[[1]]$"Mean Sq"[2] Fthwingen <- numeric(nreps) Fgen <- numeric(nreps) # First we will compute the Gender effect # Randomly shuffle Therapists across Gender # Make a matrix containing therapists on columns dvmat <- matrix(dv, nrow = tt, byrow = TRUE) for (i in 1:nreps) { worms <- dvmat[sample(1:10,10),] #worms??? newdv <- as.vector(t(worms)) a <- summary(aov(newdv~gender*ther)) # This gives us the appropriate terms because it removes # the 1 df that would go to the interaction from th, giving ther(gender) # But if you enter the terms in the other order it won't work. MSgen <- a[[1]]$"Mean Sq"[1] MSth <- a[[1]]$"Mean Sq"[2] MSerrorwin <- a[[1]]$"Mean Sq"[3] Fgen[i] <- MSgen/MSth S1 <- dv[gender == 1] S2 <- dv[gender == 2] temp1 <- sample(S1, length(S1), replace = FALSE) temp2 <- sample(S2, length(S2), replace = FALSE) newdv2 <- c(temp1, temp2) mod6<- summary(aov(newdv2 ~ gender * ther)) Fthwingen[i] <- mod6[[1]]$"F value"[2] #Fgen[i] <- MSgen/ mod6[[1]]$"Mean Sq"[2] } probgen <- length(Fgen[Fgen >= Fgen1])/nreps probth <- length(Fthwingen[Fthwingen >= Fthwingen1])/nreps par(mfrow = c(2,2)) hist(Fthwingen, breaks = 50) hist(Fgen, breaks = 50) # This works, but the question is what is the denominator to test gender. # It should be MS(ther(gender)), but that is calculated with a different randomization. # I have used MSther(gender) from first analysis.

Edgington, E. S. and Onghena, P. (2007)

Manly, B. F. J. (2007)

ter Braak, C. J. F. (1992) Permutation versus bootstrap significance testss in multiple regression and ANOVA. In

dch: