
Comment on exam
I plan to finish up my discussion of repeated measures designs today.
First, I want to say something more about a point that I made last Tuesday: repeated measures designs can have problems with power when testing the between-subjects components, even though the design is more powerful for within-subject components.
We typically think of a repeated measures design as more powerful than a comparable between-subjects design. This is not always true, and it can create problems for those who run multiple trials in hopes of cashing in on the supposed increased power of repeated measures. I commented on Drake Bradley's work last week. I'll just put it here again, because I want to explain why things come out the way they do.
What follows is largely based on the work of Drake Bradley, at Bates. [Bradley, D. R., & Russell, R. L. (1998). Some cautions regarding statistical power in split-plot designs. Behavior Research Methods, Instruments, and Computers; Bradley, D. R., & Orfaly, R. A. (1999). Statistical power of completely randomized and split-plot factorial designs. Behavioral Neuroscience.]
Suppose that we compare two treatments across 5 levels of some other between-subject variable. This is a standard 2 X 5 ANOVA, with both variables representing between-subject differences, and the expected mean square for the error term is given below.
Now suppose that we compare those same two treatments across 5 trials, where trials is a within-subject variable. Then there are two error terms, the first testing between-subjects effects and the second testing within-subject effects. The expected mean squares for error are now those shown below.
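For reference, the usual textbook form of these expected mean squares can be sketched as follows. This assumes compound symmetry (every trial has the same variance, and every pair of trials has the same correlation r). The symbol sigma_e^2 and the subscript labels are assumed notation, not necessarily the notation of the original display; t is the number of levels of the repeated measure.

```latex
\begin{align*}
\text{Between-subjects design:}\quad
  & E(MS_{\text{error}}) = \sigma_e^2 \\[4pt]
\text{Repeated measures, between-subjects error:}\quad
  & E(MS_{\text{subjects within groups}}) = \sigma_e^2\,[\,1 + (t-1)\,r\,] \\[4pt]
\text{Repeated measures, within-subject error:}\quad
  & E(MS_{\text{trials} \times \text{subjects within groups}}) = \sigma_e^2\,(1 - r)
\end{align*}
```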
Note that the error term for the within-subject effect will shrink as r, the correlation between trials, increases (assuming that r is positive, which should be a good bet).
Note, however, that the error term for the between effects will be larger (for larger values of r) and will increase as t increases, where t is the number of levels of the repeated measure (usually trials).
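To put numbers on these two notes, here is a minimal sketch. It assumes compound symmetry, under which the between-subjects error is sigma2 * [1 + (t - 1) * r] and the within-subject error is sigma2 * (1 - r). The function name and the values of sigma2, r, and t are made up for illustration.

```python
def error_variances(sigma2, r, t):
    """Expected error mean squares under compound symmetry (an assumption).

    sigma2 : variance of a single observation (hypothetical value)
    r      : common correlation between any two trials
    t      : number of levels of the repeated measure
    """
    between = sigma2 * (1 + (t - 1) * r)  # error term for between-subjects effects
    within = sigma2 * (1 - r)             # error term for within-subject effects
    return between, within


if __name__ == "__main__":
    sigma2 = 1.0  # hypothetical; a purely between-subjects design has error sigma2
    for r in (0.2, 0.5, 0.8):
        for t in (2, 5, 10):
            between, within = error_variances(sigma2, r, t)
            print(f"r={r:.1f}  t={t:2d}  between={between:.2f}  within={within:.2f}")
```

Reading down the printed table, the within-subject error shrinks as r grows, while the between-subjects error grows with both r and t.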
These problems hold regardless of the assumption of sphericity. But the problem with sphericity, as we will emphasize in a minute, is that violations of sphericity make it necessary for you to apply corrections and alternative procedures that reduce the degrees of freedom for error, thus reducing power even more.
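One common correction of this kind is the Greenhouse-Geisser epsilon, which multiplies both the numerator and the error degrees of freedom. The sketch below computes it from a covariance matrix of the repeated measures using the standard double-centering formula. The function names and the example matrices are hypothetical, and the degrees-of-freedom helper assumes a single group of n subjects.

```python
def greenhouse_geisser_epsilon(cov):
    """Greenhouse-Geisser epsilon from a k x k covariance matrix (list of lists)."""
    k = len(cov)
    row_means = [sum(row) / k for row in cov]
    col_means = [sum(cov[i][j] for i in range(k)) / k for j in range(k)]
    grand_mean = sum(row_means) / k
    # Double-center the matrix: remove row and column means, add back the grand mean.
    centered = [
        [cov[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(k)]
        for i in range(k)
    ]
    trace = sum(centered[i][i] for i in range(k))
    sum_of_squares = sum(x * x for row in centered for x in row)
    return trace ** 2 / ((k - 1) * sum_of_squares)


def corrected_df(epsilon, k, n):
    """Corrected (numerator, error) df for one group of n subjects and k trials."""
    return epsilon * (k - 1), epsilon * (k - 1) * (n - 1)


if __name__ == "__main__":
    # Hypothetical covariance matrices for 3 trials.
    spherical = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
    unequal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 10.0]]
    for name, cov in (("compound symmetry", spherical), ("unequal variances", unequal)):
        eps = greenhouse_geisser_epsilon(cov)
        print(name, round(eps, 3), corrected_df(eps, k=3, n=20))
```

When sphericity holds, epsilon is 1 and nothing is lost. The further it falls toward its lower bound of 1/(k - 1), the fewer degrees of freedom remain for error.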
Does this mean that you should not run repeated-measures designs? No! But it means that you ought to think about what you are doing. You shouldn't just throw in a bunch of extra trials just for the hell of it. Nor should you collect post-test and follow-up measures every week, unless there is a very good reason to do so. Doing so just increases t.
Overall (1996) [Overall, J.E. (1996) How many repeated measurements are useful? Journal of Clinical Psychology, 52, 243-252.] found a similar pattern, with power decreasing, or at least not increasing, as he increased the number of trials.
This is an interesting topic because there is so little said about it in any textbook, and there is even less in the journals.
The first way to look at this topic is through standard trend analysis.
I'll start with the Evans et al. study of airport noise, because we know about that study, and because it is reasonably typical of the kinds of studies we do.
The overall analysis of variance is given below, along with the plots.
We can see clear effects for Time, Location, and Time X Location.
This is clear in the plot.
It looks to me as if there is no particular trend for the control group, and a linear and quadratic trend for the Airport group.
Although the linear and quadratic parts of the interaction suggest that there is something interesting going on, I don't think that this analysis is particularly helpful because it collapses across the two groups, which are actually doing different things. I think a better way to see this is to run separate analyses at each location. We don't really care how the locations average out; we want to know if kids get worse near the airport, and what happens to kids in the control group.
Emphasize the point that the two main effects here are of no interest.
Near Airport Condition
Control Condition
The Near Airport condition is pretty clear, and it answers the question we wanted to ask.
The Control condition is much less clear. There is no linear trend, but there are quadratic and cubic trends. I think that these are just noise in the system, associated with the fact that we have more power than we need. I don't have anything intelligent to say about them.
The point of this discussion was to suggest that when you have a repeated measure, especially when it is something like time or trials, you are probably better off looking at trends than you are looking at standard multiple comparison procedures such as Tukey. That is because you expect certain kinds of trends, and would be disappointed if you didn't see them.
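For readers who want to see what a trend test does mechanically, here is a minimal sketch for a single group with equally spaced levels. It builds orthogonal polynomial coefficients, turns each subject's scores into one contrast score per trend, and tests the mean of those scores against zero, so each trend gets its own error term. The function names and the scores are hypothetical; they are not the Evans et al. data.

```python
import math
import statistics


def orthogonal_polynomials(k):
    """Coefficients for linear, quadratic, ... trends over k equally spaced levels.

    Returns k - 1 coefficient lists, each scaled to unit length.
    """
    x = [i - (k - 1) / 2 for i in range(k)]  # centered level codes
    basis = [[1.0] * k]                      # constant term, used only to orthogonalize
    for degree in range(1, k):
        v = [xi ** degree for xi in x]
        for b in basis:                      # Gram-Schmidt: remove lower-order components
            proj = sum(vi * bi for vi, bi in zip(v, b)) / sum(bi * bi for bi in b)
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    return basis[1:]


def trend_t(scores, coefs):
    """One-sample t (and df) on per-subject contrast scores."""
    contrast = [sum(c * y for c, y in zip(coefs, subject)) for subject in scores]
    n = len(contrast)
    se = statistics.stdev(contrast) / math.sqrt(n)
    return statistics.mean(contrast) / se, n - 1


if __name__ == "__main__":
    # Hypothetical scores: 5 subjects (rows) measured at 4 times (columns).
    scores = [
        [10, 12, 15, 17],
        [8, 11, 12, 16],
        [11, 12, 16, 18],
        [9, 13, 14, 15],
        [12, 13, 17, 20],
    ]
    for name, coefs in zip(("linear", "quadratic", "cubic"), orthogonal_polynomials(4)):
        t, df = trend_t(scores, coefs)
        print(f"{name:9s} t({df}) = {t:.2f}")
```

Running this separately within each group corresponds to the separate-by-location analyses described above. Testing whether a trend differs between groups would need a comparison of the two groups' contrast scores, which this sketch does not do.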
Non-trend approaches
There has been a recent tendency to move away from omnibus tests, where we focus on all the levels of a specific variable, to the idea that we can gain a great deal by making very specific, usually pairwise, comparisons. When I write the 6th edition of this book I will move more in that direction, particularly for between-group comparisons. This doesn't mean that trend analyses aren't important, because they are. When you do a trend analysis, the whole purpose is to focus on the pattern over multiple trials, and that is exactly what trend analysis does. But when you have different groups, or when you have something like Pre, Post, Follow-up, your interest may lie more in specific pairwise comparisons--and I don't mean in all pairwise comparisons, as you would get from Tukey.
I don't have a great example of a variable which is repeated, but is not trials. (I really don't want to fall back on Foa again, much as I like that study.) So I will fall back on a "trials" design, but analyze it differently. In this case, I think that this is actually a good example of a case where a few pairwise comparisons are of interest.
Habituation in Mice
The data are taken from Psych 110 in 1999. This was a doubly within-subject study, with no between-subject variable. Mice were placed in a chamber and presented with a startle tone for 50 trials, and startle response was recorded. Then they were given a 30-minute break and run again. The trials data were broken down into 5 blocks of 10 trials each for each session, and it is the blocks that we will analyze.
The data are available in habituation.sav.
From this graph it looks as if there is a linear trend in both sessions, and perhaps a quadratic in the first session, but I am not going to focus on that.
Instead, I want to know if there was a significant drop from block 1 to block 5 in session 1, which would be habituation. Then I want to look at recovery from block 5 of session 1 to block 1 of session 2. Then I want to know whether there is habituation in session 2. Think of this as really being 10 blocks of 10 trials each. In other words, stretch it out in a line.
Since these are all repeated, the test is simple.
t tests
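Each of the three comparisons is a paired t test on two block means from the same mice. A minimal sketch, using made-up scores for six mice rather than the values in habituation.sav (the variable names are also hypothetical):

```python
import math
import statistics


def paired_t(first, second):
    """Paired t statistic and df for first - second, measured on the same subjects."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1


if __name__ == "__main__":
    # Hypothetical startle scores, one value per mouse.
    s1_block1 = [52, 47, 60, 55, 49, 58]
    s1_block5 = [40, 41, 45, 47, 38, 50]
    s2_block1 = [48, 45, 55, 50, 46, 57]
    s2_block5 = [39, 38, 44, 45, 37, 47]

    comparisons = {
        "habituation, session 1": (s1_block1, s1_block5),
        "recovery after break": (s2_block1, s1_block5),
        "habituation, session 2": (s2_block1, s2_block5),
    }
    for label, (first, second) in comparisons.items():
        t, df = paired_t(first, second)
        print(f"{label}: t({df}) = {t:.2f}")
```

Because each test uses only the two blocks involved, each has its own error term based on the differences for that pair.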
Interpret these results.
The previous analysis used individual t tests (with their own special error terms). An alternative approach would be to use some sort of contrasts, though I don't think that's a great idea in this example. The only reason that I am discussing contrasts here is to make students aware of them. When statisticians speak about a "contrast," all they are usually talking about is comparing one group (or set of groups) against another group (or set of groups). We saw this several weeks ago when we looked at contrast coefficients. But you have a contrast whether or not you specifically use coefficients--just so long as you are comparing one thing against another.
There are many different contrasts that SPSS allows me to specify. These aren't something that students should memorize; they are things to be looked up when you need them.
- Deviation
  - Each level of the factor except one is compared to the grand mean.
- Polynomial
  - The first degree of freedom contains the linear effect across the levels of the factor, the second contains the quadratic effect, and so on.
  - You can specify spacing if you don't have equally spaced intervals.
- Difference
  - Each level except the first is compared to the mean of the preceding levels.
- Helmert
  - Each level except the last is compared to the mean of succeeding levels.
- Simple
  - Each level except the last is compared to the last (or to the first).
  - With syntax you can specify a different reference level.
- Repeated
  - Each level except the first is compared to the previous level.
- Special
  - Uses user-supplied contrasts.
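To make the verbal definitions in this list concrete, here is a minimal sketch that writes out one set of coefficients for several of these contrast types, one row per comparison. The function name is hypothetical, and the signs, scaling, and choice of the last level as the default reference are assumptions for illustration; SPSS may scale or order its coefficients differently. Polynomial and Special contrasts are left out.

```python
def contrast_rows(kind, k):
    """Contrast coefficients (k - 1 rows of length k) following the verbal definitions."""
    rows = []
    if kind == "deviation":      # each level except the last vs. the grand mean
        for i in range(k - 1):
            rows.append([(1 - 1 / k) if j == i else -1 / k for j in range(k)])
    elif kind == "simple":       # each level except the last vs. the last level
        for i in range(k - 1):
            rows.append([1 if j == i else (-1 if j == k - 1 else 0) for j in range(k)])
    elif kind == "difference":   # each level except the first vs. mean of preceding levels
        for i in range(1, k):
            rows.append([-1 / i if j < i else (1 if j == i else 0) for j in range(k)])
    elif kind == "helmert":      # each level except the last vs. mean of succeeding levels
        for i in range(k - 1):
            m = k - 1 - i        # number of succeeding levels
            rows.append([1 if j == i else (-1 / m if j > i else 0) for j in range(k)])
    elif kind == "repeated":     # each level except the first vs. the previous level
        for i in range(1, k):
            rows.append([1 if j == i else (-1 if j == i - 1 else 0) for j in range(k)])
    else:
        raise ValueError(f"unknown contrast type: {kind}")
    return rows


if __name__ == "__main__":
    for kind in ("deviation", "simple", "difference", "helmert", "repeated"):
        print(kind)
        for row in contrast_rows(kind, 4):
            print("   ", [round(c, 3) for c in row])
```

Every row sums to zero, which is what makes it a contrast: it compares one level (or set of levels) against another.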
For this example, it is not clear exactly what I want. As an illustration, I will run things separately for the first session and make simple contrasts against the first block. But I want students to think about whether this is really asking a question to which we would like the answer. The important thing about any contrasts is to ask a few very focused questions, and not to get carried away with testing everything against everything else.
I'm not sure that this has really told me anything that I'm very excited about. I think that the three t tests that I ran earlier were far more meaningful.
Even in Version 10.0, SPSS doesn't allow post hoc analyses on within-subject factors.
But, do we really want them?
I think that our complaints that it won't give us these post-hoc analyses are most often just a complaint that we can't have everything, more than a complaint that we can't have something that we need.
Ask students for an example where we would find that useful.
On Thursday I will spend the time reviewing for the exam the following Thursday. Be prepared with questions that get at what you don't understand. Before then go back over the whole semester and see what you know and what you don't know.
Last revised: 2/21/02