
I am going to continue to use the example that I used before Thanksgiving, based on a study by Foa, Rothbaum, Riggs, and Murdock (1991) in the Journal of Consulting and Clinical Psychology. The subjects were 45 rape victims who were randomly assigned to one of four groups. The four groups were 1) Stress Inoculation Therapy (SIT), in which subjects were taught a variety of coping skills; 2) Prolonged Exposure (PE), in which subjects went over the rape in their mind repeatedly for seven sessions; 3) Supportive Counseling (SC), which was a standard therapy control group; and 4) a Waiting List (WL) control.
In the actual study, pre- and post-treatment measures were taken on a number of variables. For our purposes we will look only at post-treatment data on PTSD Severity, which was the total number of symptoms endorsed by the subject.
The descriptive statistics and the summary table for the analysis of variance follow.



This is what we saw last time. Obviously there are significant differences, but we don't know where they lie. My personal guess would be that the two control groups are different from the experimental groups, but I don't know whether the latter differ from each other or not.
Error rates
There are two kinds of error rates that we care about:
Error rate per comparison
This is the probability that any particular comparison will yield a Type I error. We don’t care about any other comparisons when we are talking about this, but only about the comparison in question.
If we ran a bunch of t tests at α = .05, then the per comparison error rate would be .05.
Error rate familywise
This is the probability that a particular set of comparisons will contain at least one Type I error. (It could contain 8 Type I errors for all we care, just so long as it contained at least 1.)
It should be apparent that the more tests we run, the more opportunity we will have to make an error, unless we somehow adjust our test to prevent this from happening.
If we ran several tests, each at α, the probability of at least one error is no greater than cα, where c is the number of comparisons, or tests.
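To put numbers on this, here is a minimal Python sketch. It computes the upper bound cα and, for comparison, the familywise rate 1 - (1 - α)^c that would hold if the c tests were independent. The independence assumption and the function names are mine, used only for illustration.

```python
def familywise_bound(alpha, c):
    """Upper bound on the familywise error rate: c * alpha, capped at 1."""
    return min(1.0, c * alpha)


def familywise_if_independent(alpha, c):
    """Familywise rate if the c tests were independent (an assumption)."""
    return 1.0 - (1.0 - alpha) ** c


for c in (1, 3, 6, 10):
    bound = familywise_bound(0.05, c)
    indep = familywise_if_independent(0.05, c)
    print(f"c = {c:2d}  bound = {bound:.3f}  if independent = {indep:.3f}")
```

The rate under independence never exceeds the bound, which is why cα is a safe, if conservative, ceiling.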
In general, multiple comparison procedures are established to control the familywise error rate in some way. Different procedures do this in different ways.
When I looked at what I had planned to say in this class, I realized that I had lost the forest for the trees. The overall point is that all of the procedures that I will talk about are based on very similar statistics. The way they differ is in the way that they interpret those statistics.
To be more precise, all of these tests could calculate the same t value, though they often go about their work in what looks like a different way. The important difference between the tests is in how they evaluate the significance of that t. Basically, they look in different tables, which have been adjusted to keep the error rate within certain bounds.
The tests differ on the bounds within which they keep that error rate. Thus they really differ in terms of the tables they use (or, more precisely, the formulae they use to calculate the probability of t).
Most of the tests that we will use depend in some way on the basic t test that we discussed last semester. Just for review:
We are going to use this equation, or a direct variation on it, for many of our statistical tests. Notice that this is the test with pooled error variances, though MSerror is the error over any number of groups. (We will worry about the case where we shouldn't have been pooling error terms later.)
Instead of solving for t, many procedures solve for the critical difference between the two means. Taking the last t equation, just because it is convenient, we can calculate:
Then, if any difference between the means is greater than the critical difference, we can declare that to be significant.
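As a rough sketch of that idea, assuming the t in question is the usual pooled-variance form, the critical difference is the critical t multiplied by the standard error, sqrt(MSerror(1/n_i + 1/n_j)). The function name and every number below are hypothetical (they are not the Foa et al. results), and the critical t would come from whichever table the procedure in question uses.

```python
import math


def critical_difference(t_crit, ms_error, n_i, n_j):
    """Smallest mean difference that reaches significance for a given critical t."""
    return t_crit * math.sqrt(ms_error * (1.0 / n_i + 1.0 / n_j))


# Hypothetical inputs, for illustration only (not the Foa et al. results).
t_crit = 2.02      # assumed two-tailed .05 critical t for about 40 df
ms_error = 50.0    # assumed MSerror from the overall analysis of variance
cd = critical_difference(t_crit, ms_error, n_i=10, n_j=10)
print(f"critical difference = {cd:.2f}")

# A pair of (made-up) means farther apart than cd would be declared significant.
print(abs(12.0 - 19.5) > cd)
```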
In SPSS this step looks like:
Multiple Range Tests: Tukey-HSD test with significance level .05
The difference between two means is significant if
MEAN(J)-MEAN(I) >= 6.6862 * RANGE * SQRT(1/N(I) + 1/N(J))
with the following value(s) for RANGE: 3.71
Students might reasonably think that 6.6862 was the square root of MSerror, but it isn't. Don't worry about the difference right now, just keep the general idea in mind.
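The printed rule can be evaluated directly. Here is a small sketch that uses the constant and the RANGE value from the printout above. The group sizes and means are hypothetical, chosen only to show the arithmetic, and the function name is mine.

```python
import math


def spss_tukey_cutoff(constant, range_value, n_i, n_j):
    """Right-hand side of the printed rule: constant * RANGE * sqrt(1/N(I) + 1/N(J))."""
    return constant * range_value * math.sqrt(1.0 / n_i + 1.0 / n_j)


# 6.6862 and 3.71 come from the printout; the n's and means below are made up.
cutoff = spss_tukey_cutoff(6.6862, 3.71, n_i=20, n_j=20)
mean_i, mean_j = 10.0, 19.0
print(f"cutoff = {cutoff:.2f}")
print("significant" if mean_j - mean_i >= cutoff else "not significant")
```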
One word of caution: Soon we will see a formula for q instead of t. They are just linear functions of each other (q = t√2), and everything I have said so far about t will also be true about q.
I’m jumping the gun a bit here, but I want to give a better sense of what kind of problem we are up against. This is actually an exercise that we will do on Thursday, but I want to give you a peek at the issues. I took the data from a study by Laura Solomon and others, set N at 80, made the null false so that the populations had the means that Solomon found, and replicated the experiment 10 times using Tukey’s test. I won’t explain yet just what Tukey’s test is, other than to say that it compares every mean against every other mean, keeping the familywise error rate at a maximum of .05.
I ran these 10 trials and obtained the following results.
First, only 8 of the 10 replications found a significant overall F. SPSS could still go ahead and calculate the multiple comparisons for the null cases, but in doing so, nothing was significant. That left 8 experiments that found something.
The following table shows how often specified pairs of means were different:
        IV     II     III    I
IV
II      2
III     4
I       8      1      2
You can see that one comparison (I vs IV) was always significant when the overall F was, whereas one of them (I vs II) was only significant 1 time. Obviously we have a lot more power for some contrasts than for others.
I’ll talk first about the a priori procedures, which are planned out before the data are collected and almost never involve very many of the possible comparisons.
1. Pairwise t tests among all pairs of means
This is a bad bad idea in almost all cases.
I have already discussed this in passing.
The error rate is almost always controlled per comparison, and the familywise rate just sort of floats.
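For the four groups in the Foa et al. study, a quick sketch lists every pairwise comparison and shows how far the familywise rate can float when each test is run at α = .05. The figure based on independent tests is an assumption used only for illustration, since pairwise tests that share means are not independent.

```python
from itertools import combinations

groups = ["SIT", "PE", "SC", "WL"]
alpha = 0.05

pairs = list(combinations(groups, 2))   # every pair of means
c = len(pairs)                          # k(k - 1)/2 comparisons

for a, b in pairs:
    print(f"{a} vs {b}")

print(f"number of comparisons: {c}")
print(f"upper bound (c * alpha): {c * alpha:.2f}")
# Assumes independent tests, which pairwise tests on shared means are not.
print(f"rate if tests were independent: {1 - (1 - alpha) ** c:.3f}")
```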
Make some predictions from what you know about the Foa et al. study in terms of which groups will be different from which other groups.
I would predict that the SIT and PE groups would differ from the WL group, but not from each other. (Notice that this prediction ignores the SC group.) Suppose that we want to test the SIT versus PE difference.
One way to compare groups following an analysis of variance is to run a simple t test between means, as I said before. We can compare those groups using
Running this test to compare the SIT and PE groups, and doing the calculations by hand, we get.

Alternatively, I can calculate the t test for the difference between those two means (Groups 1 and 2) using the independent t test in SPSS. The t that you get will differ from the one above.

The difference arises because this t test looked only at the data from these two groups, so its error term is really just the weighted average of s₁² and s₂², rather than the MSerror pooled over all four groups.
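To see the two error terms side by side, here is a minimal sketch with made-up scores for four groups (hypothetical data, not the Foa et al. scores). It computes t for Groups 1 and 2 twice, assuming the usual pooled-variance t formula: once with the error term pooled over all four groups and once with the variance pooled over just those two groups. The helper names are mine.

```python
import math
from statistics import mean, variance

# Hypothetical scores for four groups; NOT the Foa et al. data.
groups = {
    "SIT": [8, 12, 10, 14, 11],
    "PE":  [15, 13, 18, 16, 14],
    "SC":  [20, 9, 25, 14, 22],
    "WL":  [19, 28, 12, 24, 17],
}


def pooled_variance(samples):
    """Weighted average of the group variances, with weights n - 1."""
    num = sum((len(s) - 1) * variance(s) for s in samples)
    den = sum(len(s) - 1 for s in samples)
    return num / den


def t_statistic(x, y, error_variance):
    """t for the difference between two means with a given error term."""
    se = math.sqrt(error_variance * (1 / len(x) + 1 / len(y)))
    return (mean(x) - mean(y)) / se


sit, pe = groups["SIT"], groups["PE"]
ms_error = pooled_variance(list(groups.values()))   # all four groups
s2_pooled = pooled_variance([sit, pe])              # groups 1 and 2 only

print(f"t using MSerror (all groups): {t_statistic(sit, pe, ms_error):.3f}")
print(f"t using two-group pooled s^2: {t_statistic(sit, pe, s2_pooled):.3f}")
```

The mean difference is the same in both lines; only the denominator changes, and with it the degrees of freedom against which t would be evaluated.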
Now we will redo that analysis using Statistics/Compare Means/One-way Anova. This gets a bit tricky. Set the Anova up to use all 4 groups, and then click on the contrast button. The groups that you want to compare are groups 1 and 2, and the groups that you want to ignore are groups 3 and 4. The coefficients for groups 3 and 4 will be 0, to get rid of them. The coefficients for groups 1 and 2 will be 1 and -1, respectively. So, enter 1 in the coefficients box, and click on Add. Then put -1 in the coefficients box and click on Add. Do the same for the other 2 coefficients, and then click on Continue and then OK. Note that you will get two different values of t. What is the difference between these values? (show this using SPSS in class.)
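The coefficients do the same job in a formula as they do in the SPSS dialog. As a sketch, assuming the standard contrast test with a pooled error term, t is the sum of (coefficient × mean) divided by sqrt(MSerror × sum of coefficient²/n). The means, group sizes, and MSerror below are hypothetical, and this covers only the version of t that uses the pooled error term.

```python
import math


def contrast_t(coefs, means, ns, ms_error):
    """t for a linear contrast using the pooled error term (MSerror)."""
    if abs(sum(coefs)) > 1e-12:
        raise ValueError("contrast coefficients should sum to zero")
    psi = sum(c * m for c, m in zip(coefs, means))
    se = math.sqrt(ms_error * sum(c * c / n for c, n in zip(coefs, ns)))
    return psi / se


# Hypothetical summary values in the order SIT, PE, SC, WL.
means = [11.0, 15.0, 18.0, 20.0]
ns = [10, 10, 10, 10]
ms_error = 50.0

# 1 and -1 compare groups 1 and 2; the zeros drop groups 3 and 4.
t = contrast_t([1, -1, 0, 0], means, ns, ms_error)
print(f"t = {t:.3f}")
```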

Finally, go back to the last step, turn off any contrasts that are still on, and select post hoc tests, and click on the boxes for LSD and Bonferroni. Explain the printout.
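As an aid to reading that printout: LSD evaluates each pair at the unadjusted α, whereas the Bonferroni correction, as it is usually reported, multiplies each p value by the number of comparisons (capped at 1). A minimal sketch, with hypothetical unadjusted p values standing in for the LSD output and a helper name of my own:

```python
from itertools import combinations


def bonferroni_adjust(p, c):
    """Bonferroni-adjusted p value: p times the number of comparisons, capped at 1."""
    return min(1.0, p * c)


groups = ["SIT", "PE", "SC", "WL"]
pairs = list(combinations(groups, 2))
c = len(pairs)
alpha = 0.05

# Hypothetical unadjusted (LSD-style) p values, one per pair; made up for illustration.
lsd_p = dict(zip(pairs, [0.20, 0.03, 0.004, 0.40, 0.15, 0.60]))

for (a, b), p in lsd_p.items():
    p_bonf = bonferroni_adjust(p, c)
    lsd_flag = "sig" if p < alpha else "ns"
    bonf_flag = "sig" if p_bonf < alpha else "ns"
    print(f"{a} vs {b}: LSD p = {p:.3f} ({lsd_flag}), Bonferroni p = {p_bonf:.3f} ({bonf_flag})")
```

With these made-up values, a pair can be significant by LSD and yet fail the Bonferroni test, which is the price of holding the familywise rate at α.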

In earlier versions of SPSS you could not easily run the multiple comparison analyses, such as LSD or Bonferroni using the GLM procedures. You now can. Rerun the last two analyses using GLM/Factorial. You should get the same answers, though you will have to do a tiny bit of extra work to figure out how to set up the analyses.

The only trick here is that you have to specify the independent variable on which you are doing the multiple comparisons. Click on Group and then you can select LSD, etc.
Last revised: 11/26/01