Factorial Anova--Power and Unequal Sample Sizes

2/5/2002

Announcements

Hand back assignments

I probably won't get through all of this today. I am going back and talking about part of last Tuesday's class first.

I have decided that the following approach just doesn't work. The material is technically correct, but the forest gets lost for the trees. So I have prepared another page. This page short circuits much of the structural model stuff below and jumps right to power. I prefer the approach that I use below, because it ties power to the model, but I recognize that it is more difficult to follow. So try the new page first.

Structural Models

What is a model?

A model is our belief in the underlying structure of our data.

When we say that we expect that memory will be affected by differences in level of cognitive processing, and that it might be dependent on the age of the participant or on the interaction of processing level and age, then we are saying that those variables lie behind our data.

Our recent colloquium speakers have all spoken in terms of models, although in their case the models were either structural equation models or growth models. But the specific form of the model is not important here. In each case we are speaking about a structure that is assumed to lie behind the data.

In Anova we use models that include all of the independent variables and their interactions. (In regression, we will do it differently.)

What does a model look like?

For the one-way, our model was simply

Xij = µ + τj + eij

where τj = µj - µ

and eij = Xij - µj

For the two-way, we need to include two independent variables α and β, and their interaction αβ.

Thus

Xijk = µ + αi + βj + αβij + eijk

where

αi = µi. - µ  (This is called the treatment effect for level i of variable A, and simply measures how far its mean deviates from the grand mean.)

βj = µ.j - µ  (This is analogous to the treatment effect in the previous line.)

αβij = µij - µi. - µ.j + µ  (This term refers to how much a cell mean deviates from what we would have expected if everything were determined by row and column means, and there were no interaction.)

 

Cell Means

(The following table does not give cell effects (interaction effects), but they could be calculated as αβij = µij - µi. - µ.j + µ. We will see these calculated in the next example.)

 

For Eysenck's data on Cognitive processing and Age, we have:

               Count   Rhyme     Adj   Image  Intent    Mean   Effect (αi)
Older           7.00    6.90   11.00   13.40   12.00   10.06   -1.55
Younger         6.50    7.60   14.80   17.60   19.30   13.16    1.55
Mean            6.75    7.25   12.90   15.50   15.65   11.61
Effect (βj)    -4.86   -4.36    1.29    3.89    4.04

To estimate the interaction effect for the Older Count cell, we have

αβij = µij - µi. - µ.j + µ

αβ11 = 7.00 - 10.06 - 6.75 + 11.61 = 1.80
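The same arithmetic can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the function name `factorial_effects` is made up here, and it assumes equal cell sizes, so that row and column means are simple averages of the cell means.

```python
# Cell means for Eysenck's data (rows: Older, Younger; columns: the five tasks).
cell_means = [
    [7.00, 6.90, 11.00, 13.40, 12.00],   # Older
    [6.50, 7.60, 14.80, 17.60, 19.30],   # Younger
]

def factorial_effects(means):
    """Return grand mean, row effects, column effects, and cell (interaction)
    effects for a two-way table of cell means, assuming equal n per cell."""
    n_rows, n_cols = len(means), len(means[0])
    row_means = [sum(row) / n_cols for row in means]
    col_means = [sum(means[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    grand = sum(row_means) / n_rows
    row_eff = [m - grand for m in row_means]            # row mean - grand mean
    col_eff = [m - grand for m in col_means]            # column mean - grand mean
    cell_eff = [[means[i][j] - row_means[i] - col_means[j] + grand
                 for j in range(n_cols)] for i in range(n_rows)]
    return grand, row_eff, col_eff, cell_eff

grand, row_eff, col_eff, cell_eff = factorial_effects(cell_means)
print("grand mean:", round(grand, 2))
print("row effects:", [round(e, 2) for e in row_eff])
print("column effects:", [round(e, 2) for e in col_eff])
print("Older/Count cell effect:", round(cell_eff[0][0], 2))
```

Within each row and each column the cell effects sum to zero, which is a handy check on hand calculations.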

 

  We need these effects to calculate power.

Power

Calculation of Power for Spilich Data Given Sample Means as Parameters

This material has been cut and pasted from the stuff that I didn't cover in last Tuesday's class. I have tried not to make this material too redundant, but I wanted to talk both about power for the Spilich data and power for a different data set.

The following table gives the cell means, the row and column means, and the treatment effects. (The treatment effects in each cell are in parentheses.)

It is important that students understand this, because they are almost certain to have to carry out power calculations before they finish their degree. (At least I hope they do.) I may very well put something like this on an exam.

 

               Nonsmoker         Delayed Smoker    Active Smoker     Mean     Effect
Pattern Rec    9.40 (1.948)      9.60 (-0.563)     9.93 (-1.385)     9.64     -8.615
Recall         28.87 (-7.718)    39.93 (0.637)     47.53 (7.081)     38.78    20.518
Driving        9.93 (5.770)      6.80 (-0.074)     2.33 (-5.696)     6.36     -11.904
Mean           16.07             18.78             19.93             18.26
Effect         -2.193            0.519             1.674

The summary table for this factorial design follows. We will want to have it available later, so I stuck it in here.

Note how the "effects" are calculated. I have done that immediately below. They are the deviations of row or column means from the grand mean, and then, for the interaction, deviations of cell means from the grand mean with row and column means removed.

Students need to understand how to calculate them. (There is an error in the first cell effect, because 9.94 should read 9.40. The answer is correct. It would take me too long to correct this, because it is one large graphic.)

[Figure: effects.gif, the calculation of the row, column, and cell effects]

MSerror is found in the summary table to be 107.835.

Power in a factorial is a direct extension of the way we calculated power with a one-way. There we calculated

[Equation image: factor1.gif]

Here we will simply extend that to rows, columns, and interactions.

In what follows I have replaced terms like [equation image: factor2.gif] with Σαj².

I use the symbols a and b to refer to the number of rows and columns. 

Cohen uses the letter f where I use φ', but that is just a difference in notation. I have history on my side, whereas Cohen has Cohen's prestige on his side.
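As a rough check on the effect sizes used below, here is a small Python sketch. The formula images did not survive in these notes, so the sketch assumes the usual form: the square root of the mean squared effect divided by MSerror. The name `phi_prime` is mine, and the inputs are the rounded effects from the table above, so small differences from the tabled values are possible.

```python
import math

MS_ERROR = 107.835  # MSerror from the summary table

# Effects copied from the Spilich table above.
task_effects = [-8.615, 20.518, -11.904]
smkgrp_effects = [-2.193, 0.519, 1.674]
interaction_effects = [1.948, -0.563, -1.385,
                       -7.718, 0.637, 7.081,
                       5.770, -0.074, -5.696]

def phi_prime(effects, ms_error):
    """Assumed formula: sqrt( (sum of squared effects / number of effects) / MSerror )."""
    mean_squared_effect = sum(e ** 2 for e in effects) / len(effects)
    return math.sqrt(mean_squared_effect / ms_error)

for name, effects in [("Task", task_effects),
                      ("SmkGrp", smkgrp_effects),
                      ("Interaction", interaction_effects)]:
    print(name, round(phi_prime(effects, MS_ERROR), 2))
```

With these inputs the sketch gives roughly 1.40 for Task, 0.16 for SmkGrp, and 0.43 for the interaction.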

We can now incorporate the sample sizes and then look that up in tables of the non-central F distribution. I cover this in the book.

[Equation image: phi.gif]

We would go to the tables of the non-central F distribution with these values of φ.

But today I am doing things differently, using Cohen's tables, and I don't need these statistics for that. I will work directly with the φ' values.

There is a problem when we come to specifying the sample size for Cohen's tables. We are going to have to enter the table with the relevant φ' and the sample sizes. We have already calculated φ'. For reasons I won't go into, Cohen defines the adjusted sample size as

n' = dferror/(dfeffect +1) + 1

For our main effects this becomes 126/3 + 1 = 43

and for the interaction this is 126/5 + 1 = 26.2
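The adjustment is easy to script. A minimal sketch follows, with a hypothetical function name `adjusted_n`; the degrees of freedom are the ones used in the two lines above (126 for error, 2 for each main effect, 4 for the interaction).

```python
def adjusted_n(df_error, df_effect):
    """Cohen's adjusted sample size: n' = df_error / (df_effect + 1) + 1."""
    return df_error / (df_effect + 1) + 1

n_main = adjusted_n(126, 2)         # Task and SmkGrp each have 2 df
n_interaction = adjusted_n(126, 4)  # the interaction has 4 df

print("main effects n':", round(n_main, 1))
print("interaction n':", round(n_interaction, 1))
```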

Using the tables from Cohen, with dfe = 30 and phi-prime rounded, or a program called G*Power, I calculate power as

Effect         n'    Phi-prime (f)    Phi     Power
Task           43    1.42             9.55    .99
SmkGrp         43    0.16             1.07    .32
Interaction    26    0.43             1.68    .98

Notice:  I have shown both phi-prime and f. They are used interchangeably, depending on what book you are reading at the time. Cohen was the one who popularized f. (My table also includes phi, but we won't need that here.)

Cohen's Tables

Cohen (1988) Statistical Power Analysis for the Behavioral Sciences (2nd ed.) calculates a statistic called f. This is equivalent to φ', and not to φ.

Cohen's tables are very extensive, and I cannot present all of them here. As an example, I have scanned in the one for a test at the .05 level with 2 df for treatments. That appears below. (In this table n is the sample size, or, in the case of factorials, it is the n' that we calculated above.) These tables have been scanned directly from Cohen's book.

Table 8.3.13

Power of F test at α = .05, u = 2

f

n Fc .05 .10 .15 .20 .25 .30 .35 .40 .50 .60 .70 .80
2 9.552 05 05 06 06 07 07 08 08 10 12 15 18
3 5.143 05 05 06 07 08 09 10 12 17 22 29 37
4 4.256 05 06 06 08 09 11 14 17 24 33 44 54
5 3.885 05 06 07 09 11 14 17 22 32 44 56 69
6 3.682 05 06 07 10 13 16 21 26 39 53 67 79
7 3.555 05 06 08 11 14 19 25 31 46 62 76 87
8 3.467 05 06 08 12 16 22 28 36 53 69 83 92
9 3.403 05 07 09 13 18 24 32 40 59 75 88 95
10 3.354 05 07 10 14 20 27 35 45 64 81 91 97
11 3.316 05 07 10 15 21 30 39 49 69 85 94 98
12 3.285 06 07 11 16 23 32 42 53 74 88 96 99
13 3.260 06 08 11 17 25 35 46 57 77 91 97 99
14 3.238 06 08 12 18 27 38 49 61 81 93 98  
15 3.220 06 08 13 20 29 40 52 64 84 95 99  
16 3.205 06 08 13 21 31 43 55 67 86 96 99  
17 3.191 06 09 14 22 33 45 58 70 89 97 99  
18 3.179 06 09 14 23 34 48 61 73 90 98    
19 3.168 06 09 15 24 36 50 64 76 92 99    
20 3.159 06 09 16 26 38 52 66 78 93 99    
21 3.150 06 09 16 27 40 54 69 80 95 99    
22 3.143 06 10 17 28 42 57 71 82 96 99    
23 3.136 06 10 18 29 43 59 73 84 96      
24 3.130 06 10 18 30 45 61 75 86 97      
25 3.124 06 10 19 32 47 63 77 87 98      
26 3.119 06 11 20 33 48 65 79 89 98      
27 3.114 06 11 20 34 50 66 80 90 98
28 3.110 06 11 21 35 52 68 82 91 99      
29 3.105 06 12 22 36 53 70 83 92 99      
30 3.102 06 12 22 37 55 71 85 93 99      
31 3.098 07 12 23 39 56 73 86 94 99      
32 3.095 07 12 24 40 58 75 87 94 99      
33 3.091 07 13 24 41 59 76 88 95        
34 3.088 07 13 25 42 61 77 89 96        
35 3.086 07 13 26 43 62 79 90 96
36 3.083 07 13 26 44 63 80 91 97
37 3.081 07 14 27 45 65 81 92 97        
38 3.078 07 14 28 46 66 82 92 97        
39 3.076 07 14 28 47 67 83 93 98        

 

Table 8.3.13 (continued)

f

n Fc .05 .10 .15 .20 .25 .30 .35 .40 .50 .60 .70
40 3.074 07 15 29 48 68 84 94 98      
42 3.070 07 15 30 51 71 86 95 98      
44 3.066 07 16 32 53 73 88 96 99      
46 3.063 07 16 33 55 75 89 96 99      
48 3.060 08 17 34 57 77 90 97 99      
50 3.058 08 18 36 58 79 92 98 99      
52 3.055 08 18 37 60 80 93 98        
54 3.053 08 19 38 62 82 94 98        
56 3.051 08 19 40 64 83 94 99        
58 3.049 08 20 41 65 85 95 99        
60 3.047 08 21 42 67 86 96 99        
64 3.044 08 22 45 70 88 97 99        
68 3.041 09 23 47 73 90 98          
72 3.039 09 24 49 75 92 98          
76 3.036 09 25 52 78 93 99          
80 3.034 09 27 54 80 94 99          
84 3.032 10 28 56 82 95 99          
88 3.031 10 29 58 84 96 99          
92 3.029 10 30 60 85 97            
96 3.028 10 31 62 87 97            
100 3.026 11 32 64 88 98            
120 3.021 12 38 73 94              
140 3.018 14 44 79 97              
160 3.015 15 49 85 98              
180 3.013 16 54 89 99              
200 3.011 18 59 92                
250 3.008 22 69 97                
300 3.006 25 78 99                
350 3.004 29 84                  
400 3.003 33 89                  
450 3.002 36 92                  
500 3.002 40 95                  
600 3.001 47 98                  
700 3.000 53 99                  
800 3.000 59                    
900 2.999 65                    
1000 2.999 70                    

 

For our example we would look up the two main effects with φ' = 1.42 and 0.16, respectively, and an adjusted sample size of n' = 43. For the interaction we would enter with φ' = .43, but with n' equal to only 26.2 (see calculations above). We can't use this specific table for the power of the interaction, because that effect is on 4 df, not 2 df. I will give that answer in a minute.

The yellow sections in the table show the cells that are involved in the interpolation of power for SmokeGrp (f between .15 and .20, n between 42 and 44). For the interaction, Cohen's table would have us interpolate between .96 (for f = .40) and 1.00 (for f = .50). The resulting answer is .98 (approx).
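For anyone who would rather not interpolate by eye, the following Python sketch does it for the SmokeGrp effect. It assumes simple linear interpolation in both directions (my assumption, not a rule from Cohen), and uses the four entries of Table 8.3.13 for n = 42 and 44 at f = .15 and .20. The helper names are made up for this illustration.

```python
# Four entries copied from Table 8.3.13 (u = 2, alpha = .05), as proportions.
table = {
    42: {0.15: 0.30, 0.20: 0.51},
    44: {0.15: 0.32, 0.20: 0.53},
}

def lerp(x, x0, x1, y0, y1):
    """Linear interpolation between the points (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

def interpolate_power(n, f, n_lo=42, n_hi=44, f_lo=0.15, f_hi=0.20):
    # First interpolate across sample size within each f column ...
    power_at_f_lo = lerp(n, n_lo, n_hi, table[n_lo][f_lo], table[n_hi][f_lo])
    power_at_f_hi = lerp(n, n_lo, n_hi, table[n_lo][f_hi], table[n_hi][f_hi])
    # ... then interpolate across f.
    return lerp(f, f_lo, f_hi, power_at_f_lo, power_at_f_hi)

power = interpolate_power(n=43, f=0.16)
print(round(power, 2))
```

Entering with n' = 43 and f = .16 gives a value of about .35, which is in the same neighborhood as the SmokeGrp values reported in these notes.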

Cohen has argued that if you really don’t have any idea what values of µ to expect, or cannot calculate φ' for some other reason, then you can define

small effect φ' = .10

medium effect φ' = .25

large effect φ' = .40

These correspond to η² = .01, .06, and .14, respectively.
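The link between Cohen's f and the proportion of variance accounted for is η² = f² / (1 + f²). A short Python sketch (the function name is mine) shows where .01, .06, and .14 come from. Note that .14 corresponds to an f of about .40.

```python
def eta_squared_from_f(f):
    """Convert Cohen's f to eta-squared: f^2 / (1 + f^2)."""
    return f ** 2 / (1 + f ** 2)

for label, f in [("small", 0.10), ("medium", 0.25), ("large", 0.40)]:
    print(label, f, round(eta_squared_from_f(f), 3))
```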

G*Power

A great source for power is a program called G*Power. It is available at

http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/how_to_use_gpower.html

Even if you don't want the program itself, they have an excellent manual that covers lots of stuff about power.

Mark Gorman gave me another link that may be of interest. It is http://www.math.yorku.ca/SCS/Demos/power/ The only problem is that it is not super user-friendly.

I use G*Power below to illustrate how sample size affects power.

        Power as a function of sample size

Using G*Power

The G*Power manual is fairly straightforward if you think a bit. (There is one place where I am not 100% sure that I am exactly right, but I am close enough.)

I used G*Power to calculate the values of f, and they are reasonably close to the ones above.

First we have the smokegrp main effect.

Notice that power = .36, which is pretty close to my calculation of .34.

Then I got the interaction. Notice that I tell it that there are 9 groups, because I am really telling it how many cells. I also give it the df for the interaction, which helps it sort out the cells. The effect size can either be calculated as I did in class, or it can be calculated using the Calc Effectsize button.

I used the effect-size button, which gave me a slightly larger value of f. Notice power is high. (If I put in f = 0.43, I get power = .9864.)

Notice also that for these two analyses I selected Post Hoc and Special.

Then I plotted power for the interaction as a function of total sample size.

 

Unequal Sample Sizes

This is a BIG problem, and one that statisticians have grappled with for years. One well-known statistician (I think it was Anderson) suggested that the only successful way to deal with unequal sample sizes is not to have any.

I’m going to start with an extreme example to make a point. Then I’ll move to a less extreme example, and you’ll see that the point is still there.

Suppose that we had the following data from a study on Intrusions and Avoidance. (The dependent variable is stress, and subjects are classified as being high or low in intrusive thoughts and in avoidance.)

 

               High Intrusions    Low Intrusions    Row Mean
High Avoid     56.96 (n = 28)     .                 56.96
Low Avoid      .                  45.08 (n = 28)    45.08
Column Mean    56.96              45.08             51.02

What do we have here? Is it an effect due to Intrusions, an effect due to Avoidance, or some combination? (It is obviously impossible to tell.)

Now we’ll take another extreme example, but not quite as extreme.

 

               High Intrusions    Low Intrusions    Row Mean
High Avoid     56.96 (n = 28)     56.96 (n = 2)     56.96
Low Avoid      50.73 (n = 2)      50.73 (n = 28)    50.73
Column Mean    56.54              51.14

 

We don't really have an Intrusion effect as far as most people would be concerned, because for each level of Avoidance the high and low Intrusion means are exactly equal. But here the column means make it look as if we have a difference due to Intrusions. What we really have is a difference due to Avoidance, which gets converted to an Intrusion effect because of the unequal sample sizes.

Now we’ll take another extreme example, but even less extreme. 

 

               High Intrusions    Low Intrusions    Row Mean
High Avoid     56.96 (n = 28)     51.69 (n = 2)     56.61
Low Avoid      50.73 (n = 2)      45.08 (n = 28)    45.46
Column Mean    56.54              45.52

 

 

What kind of an effect is this? Whereas the first two had some obvious problems that no one could miss, these problems are a bit subtler. Whereas in Row 1 the High Intrusion group is about 5 points above the Low Intrusion group, and the same in Row 2, the column means are about 11 points apart. Do we have a 5 point difference, or an 11 point difference? The same holds for Avoidance differences. (Notice that I didn't answer that question, because the answer is not clear. I guess I would side with the 5 point difference.)

Now let’s go to a still less extreme example, but one which is perfectly reasonable.

 

               High Intrusions    Low Intrusions    Row Mean
High Avoid     56.96 (n = 28)     51.69 (n = 16)    55.04
Low Avoid      50.73 (n = 14)     45.08 (n = 28)    46.96
Column Mean    54.88              47.48             51.10

What do we make of these data? It looks as if we have a difference between high and low avoiders, a difference between high and low intrusions, and not much of an interaction. And that may be what we have, BUT

The High Intrusion mean is somewhat inflated by the large number of high/high subjects, and the Low Intrusion mean is somewhat deflated by the large number of Low/Low subjects. In other words, the sample sizes are pulling these means in their own direction. What was happening was very clear in the first two examples, and sort of clear in the third example, but I would hate to try to explain in a paper exactly what is happening here.

An alternative table would weight each of these cell means equally. Thus, the mean for the High Avoidance row would be (56.96 + 51.69)/2 = 54.32

 

               High Intrusions    Low Intrusions    Row Mean
High Avoid     56.96 (n = 28)     51.69 (n = 16)    54.32
Low Avoid      50.73 (n = 14)     45.08 (n = 28)    47.90
Column Mean    53.84              48.38             51.10

Notice that looking at the data this way has pulled the means somewhat closer together.

The table that I just presented contains what is known as "unweighted means" or "equally weighted means." I have given each cell mean the same weight in determining the row and column means. Another way of saying this is to say that I have "controlled" for sample sizes. An even better way of saying it is to say that in looking at the Avoidance means I have controlled for any differences due to Intrusions, and in looking at Intrusion mean differences I have controlled for any differences due to Avoidance.
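The difference between the two kinds of marginal means is easy to see in code. The Python sketch below is illustrative only, and the function and variable names are mine. It takes the cell means and cell sizes from the last example and computes each row and column mean twice: once weighting each cell by its n, and once weighting all cells equally.

```python
# (row level, column level) -> (cell mean, cell n), from the last example above.
cells = {
    ("High Avoid", "High Intrusions"): (56.96, 28),
    ("High Avoid", "Low Intrusions"): (51.69, 16),
    ("Low Avoid", "High Intrusions"): (50.73, 14),
    ("Low Avoid", "Low Intrusions"): (45.08, 28),
}

def marginal_means(cells, axis):
    """Return (weighted, unweighted) marginal means.
    axis=0 gives row (Avoidance) means; axis=1 gives column (Intrusions) means."""
    weighted, unweighted = {}, {}
    for level in sorted({key[axis] for key in cells}):
        members = [(mean, n) for key, (mean, n) in cells.items()
                   if key[axis] == level]
        total_n = sum(n for _, n in members)
        # Weighted: each cell mean counts in proportion to its sample size.
        weighted[level] = sum(mean * n for mean, n in members) / total_n
        # Unweighted: each cell mean counts equally, whatever its n.
        unweighted[level] = sum(mean for mean, _ in members) / len(members)
    return weighted, unweighted

w_rows, u_rows = marginal_means(cells, axis=0)
w_cols, u_cols = marginal_means(cells, axis=1)

for label, means in [("weighted rows", w_rows), ("unweighted rows", u_rows),
                     ("weighted columns", w_cols), ("unweighted columns", u_cols)]:
    print(label, {k: round(v, 2) for k, v in means.items()})
```

When every cell has the same n the two sets of means are identical, which is one reason equal sample sizes make life so much simpler.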

Last year a student raised the question of which means I really want to look at. Do I want to let sample size influence my results? If not, why not? If so, why and when? As far as I am concerned we rarely want the sample size to influence our conclusions, especially when sample size represents some sort of random variability around an equal sample size design. If the sample sizes actually reflect incidence of something in the population, perhaps I would agree to look at things differently.

But, how do I carry out an analysis on unweighted means?

In the text I talk about the "unweighted means" solution as you would carry it out with a calculator. That’s a good way to see what is actually going on, but no one would really do that anymore.

SPSS uses what is called "unique sums of squares" to accomplish this. "Unique" is the default, and there is probably never going to be a reason why you would change it from that.

 

If you used one of the other methods for running an Anova with unequal n’s, you would come up with different answers. Some of these answers would depend upon whether you clicked on Intrusions before or after Avoidance.

 

What have we learned about factorial designs?

Factorial designs involve 2 or more independent variables, where each level of each variable is crossed (paired) with each level of the other variable(s).

We like to see equal numbers of subjects in the cells.

Interaction effects are important, and, in fact, may often be more important than main effects. (We saw a good example of this in last Tuesday's class, where the effect of smoking depended entirely on the task you were looking at.)

In the presence of interaction effects, it is necessary to look at simple effects.

Each effect in a factorial Anova has its own probability of a Type I error, and its own level of power. Often the power of the interaction effect is much lower than the power for main effects.

If you want to study independent variables that could be spread out over several levels, you are usually smarter, from the point of view of power, to choose only a few levels. (I don't defend this statement in the text, but McClelland and Judd have shown quite persuasively that it is probably the logical thing to do.)

The treatment of unequal sample sizes hinges on the comparison of unweighted means, and the default in SPSS will make this comparison. (In SAS, use Type III sums of squares.)

With unequal n’s, it is impossible to truly get away from the contaminating effects of one independent variable on another. The two effects are correlated, and thus confounded. We’ll see an example of this correlation later.

The power of a factorial is not all that hard to calculate. You really just need to make some guesses as to population means, and that is a lot easier than you might suppose. The power of most of the experiments that we run is really much lower than we’d like to believe.

I have not talked about multiple comparisons, but the solution is pretty much the same as it was in a one-way. You can run any of the procedures on the row or column means. If you want to break up the interaction, I would suggest running multiple comparisons on simple effects, rather than trying to compare Cell 12 with Cell 31, for example. That way leads to confusion and confounding.

Similarly, I skipped the magnitude of effect measures, but they are covered in the text. They are really just extensions of the solutions for one-way designs.

Higher-order designs are really just an extension of what you know about the two-way design. But there are "simple interaction effects," which are just the interaction of two variables at one level of a third variable.

Last revised: 02/06/02