This example is based on Klesges, R. C., et al. (1998) The prospective relationship between smoking and weight in a young, biracial cohort: The coronary artery risk development in young adults study. Journal of Consulting and Clinical Psychology, 66, 987-993.
The study looked at weight changes over a seven year period in subjects who did, and did not, stop smoking. The authors broke down subjects by smoking condition, race, and sex, but the way they presented their data I was not able to include sex as a variable in my example.
One reason for choosing this example is that it involved very large samples. We usually use small samples for examples, and I thought it would be useful to look at a case with thousands of subjects.
At baseline, data were collected on weight, smoking behavior (Never, Former, and Smoker), and other variables for over 5000 subjects. Seven years later data were obtained from 3868 subjects on their smoking status (Never, Former, Quitter, Intermittent, Initiator, and Continuous), their weight, and their weight gain. Data were also collected on alcohol use and caloric intake.
For this example I am going to run one-way analyses for smoking behavior on the pretest and the posttest data separately. I will ignore race and sex. I may come back to those variables later.
The data are found in Klesges.sav for 3868 subjects. The variables are Race, Basesmoke, Endsmoke, Alcohol, Basewt, Endweight, WtChange, and Fatpercn, in a different order. I generated these myself based on their data. Weight is given in kilograms.
First we'll look at differences in weight of the three smoking conditions at the beginning of the study. What would students predict?
Notice that the overall Anova is significant, but when we run the multiple comparisons the only significant difference is the 2.58 kilogram difference between Nonsmokers and smokers. Interestingly, the ex-smokers fall in the middle and don't differ from either group. So it is not true that quitting smoking led to weight gain--at least over the long term.
Some basic formulae:
Multiple Comparison Procedures
There is a whole set of procedures for making comparisons between groups subsequent to the overall analysis of variance. I cover many more in the text than I can cover here, but the basic ideas are very simple.
In the text I distinguish between two kinds of error rates. One is the probability that any particular comparison will be a Type I error, and the other is the probability that any set of comparisons will contain at least one Type I error.
Error rate per comparison (PC)
This is the probability that any given test will be significant if the null hypothesis is true. If we just ran a simple t test between two means at alpha ($\alpha$) = .05, then the probability that a Type I error would occur is .05.
Familywise error rate (FW)
This is the probability that a whole set of comparisons will contain at least one Type I error. It should be apparent that the more tests you run, the greater the likelihood that you will make a Type I error someplace. [The more you have sex, the more likely you are to get pregnant.]
Suppose that we have a situation where we somehow know that $\mu_1 = \mu_2$, $\mu_3 = \mu_4$, and $\mu_5 = \mu_6$. Suppose that we ran 3 independent t tests, each at $\alpha = .05$. Then the probability that the first comparison will be significant is .05, the probability that the second will be significant is .05, and the probability that the third will be significant is .05. The probability that at least one of these will be significant is approximately 3(.05) = .15. [In fact, for independent tests it is really $1 - (1 - \alpha)^c$, where c = number of comparisons. This is $1 - .95^3 = .1426$.]
If you have sex once/night for a week, the question is the likelihood that you will be pregnant at the end of the week, and we aren't concerned about which night.
The major point behind almost all multiple comparison procedures is to reduce FW to something reasonable, such as .05, and we do that by reducing the significance level for any particular comparison to a small value. [In this case, if we ran each test at $\alpha = .016667$, the familywise error rate would come out to about $1 - (1 - .016667)^3 = .05$ (approx).]
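The arithmetic in this section can be checked with a few lines of Python. This is only a sketch: the function name `familywise_rate` is my own label, and the formula assumes that the tests are independent and that every null hypothesis is true.

```python
def familywise_rate(alpha: float, c: int) -> float:
    """Probability of at least one Type I error in c independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1 - (1 - alpha) ** c

# Three independent tests, each at alpha = .05
print(round(familywise_rate(0.05, 3), 4))      # 0.1426
# The additive approximation c * alpha
print(round(3 * 0.05, 2))                      # 0.15
# Running each test at alpha / 3 pulls the familywise rate back to about .05
print(round(familywise_rate(0.05 / 3, 3), 4))  # 0.0492
```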
A Priori and Post-hoc Procedures
I hate to discuss this issue because it seems to get people all twisted around. The basic idea is rather simple. If you plan out your comparisons before you run your experiment, you get to use more liberal procedures. If you plan your comparisons after you have looked at the data, even if you run just a few tests, it is as if you were running all possible comparisons among the means. This will require much more conservative procedures.
In actual practice, the vast majority of the situations that I see involve post-hoc tests. People don't really plan out everything ahead of time. They wait until they get to their data, and then they decide what they want to test.
I'm not sure that this distinction, while completely defensible on theoretical grounds, is the best one to make here. I prefer to think of it a bit differently.
If you want to make just a few comparisons that were decided on before you looked at the data (or that at least flow logically from the theory), then you probably want what I call a priori procedures.
If you want to make lots of pairwise comparisons, regardless of when you thought of them, then you probably want post hoc procedures.
A Priori procedures
In the case of truly a priori tests, I recommend that you just run simple t tests between the means that you want to compare. The one difference is that I would use MSerror from the overall Anova in place of the individual group variances, unless you have good reason to believe that the variances are heterogeneous.
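If you have equal sample sizes, just use

$$t = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{\dfrac{2\,MS_{error}}{n}}}$$

and if you have unequal sample sizes, use

$$t = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{MS_{error}\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}}$$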
In each case, the df are the same as the df for error from the overall anova.
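The a priori t test described above, with MSerror from the overall Anova standing in for the individual group variances, can be sketched in Python as follows. The function names and the numbers in the example call are hypothetical; they are chosen only to make the arithmetic easy to follow and are not taken from the Klesges data.

```python
from math import sqrt

def t_with_ms_error(mean_i, mean_j, n_i, n_j, ms_error):
    """t for one a priori comparison, using MSerror from the overall Anova.

    With n_i == n_j == n the denominator reduces to sqrt(2 * MSerror / n).
    """
    return (mean_i - mean_j) / sqrt(ms_error * (1 / n_i + 1 / n_j))

def df_error(total_n, k):
    """Error df from the overall Anova: total N minus the number of groups."""
    return total_n - k

# Hypothetical means, sample sizes, and MSerror
t = t_with_ms_error(10.0, 8.0, 20, 20, ms_error=10.0)
print(round(t, 3))        # 2.0
print(df_error(3868, 3))  # 3865, the error df used in these notes
```

Writing only the unequal-n version is a deliberate choice: it gives the equal-n result automatically when the two sample sizes match.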
From what I know about the folklore of smoking, I might be led to believe that people who have quit smoking will be expected to gain weight. Therefore I predict that they will weigh more than people who have not quit. I would like to test that hypothesis using my data.
You might have a different hypothesis you want to test, but this is my example, so I have the ball. I say that because what is important is what the experimenter predicted before seeing the data, and that is what I would have predicted.
The means are given above as 71.65 and 69.52 for the 333 ex-smokers and the 1450 Smokers, respectively. The within-group variances are all very similar, so I'll use the error term from the Anova.
The critical value of t on 3865 df is approximately 1.96, and the obtained t of 1.66 falls short of it, so we cannot reject the null hypothesis. My belief in the effect of quitting smoking seems to be wrong. (The actual two-tailed probability is .097.)
Contrasts are another way of doing exactly what we have just done. These are covered in the text.
Show how to apply the above contrast using SPSS.
Bonferroni Procedure
We have already discussed the Bonferroni procedure, so there isn't a lot to say about it here. The basic idea is that you divide alpha by the number of comparisons you are going to run, and then run each individual comparison at that level.
Suppose that I had planned to run two comparisons. The first was Smokers against Ex-smokers, and the second was Smokers against Non-smokers. The first test I have already run, and it gave me a t = 1.66, with a probability of .097. The second, which I don't show here, would give a t = 3.58, with a probability = .000.
To be significant, each test would have to have a probability value less than .05/2 = .025. The first one obviously does not, but the second one does. So we will conclude that there is a significant difference in the weight of Smokers and Non-smokers. Here our familywise error rate is at most .05.
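The Bonferroni decision for these two planned comparisons can be sketched in Python. The t values (1.66 and 3.58) come from the text. The p values are computed here with a normal approximation, which is my assumption; it is reasonable only because the error df (3865) is so large. The function name is hypothetical.

```python
from statistics import NormalDist

def bonferroni_decisions(t_values, alpha=0.05):
    """Test each obtained t at alpha / (number of comparisons).

    Two-tailed p values use the normal approximation to t (large df only).
    Returns the per-test alpha and a list of (t, p, reject) tuples.
    """
    per_test_alpha = alpha / len(t_values)
    results = []
    for t in t_values:
        p = 2 * (1 - NormalDist().cdf(abs(t)))
        results.append((t, p, p < per_test_alpha))
    return per_test_alpha, results

per_test_alpha, results = bonferroni_decisions([1.66, 3.58])
print(per_test_alpha)  # 0.025
for t, p, reject in results:
    print(f"t = {t:.2f}, p = {p:.3f}, reject = {reject}")
```

The first comparison comes out at p = .097 and is not rejected; the second has a p well below .025 and is rejected, which matches the conclusion in the text.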
Notice that, as presented here, the Bonferroni is an a priori test. I would only use this test if I wanted to make a small set of comparisons from a larger set of possible comparisons. If I did not have a priori tests, I would be much better off using one of the other procedures, because the Bonferroni will come out to be too conservative.
The Dunn-Sidak test is a very similar test that is slightly more powerful. The Bonferroni is based on the idea that with three independent tests, the probability of at least one being significant is approximately 3*.05 = .15, while the Dunn-Sidak is based on the idea that this probability is actually $1 - .95^3 = .1426$ if the comparisons are independent. (Not much of a difference!)
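To see how small the difference is, here is a sketch of the per-comparison alpha under each approach. The Dunn-Sidak level is obtained by solving $1 - (1 - \alpha')^c = \alpha$ for $\alpha'$; the function names are my own.

```python
def bonferroni_alpha(alpha: float, c: int) -> float:
    """Per-comparison level from the additive bound: alpha / c."""
    return alpha / c

def dunn_sidak_alpha(alpha: float, c: int) -> float:
    """Per-comparison level from the exact rate for independent tests:
    solve 1 - (1 - a)**c = alpha for a."""
    return 1 - (1 - alpha) ** (1 / c)

print(round(bonferroni_alpha(0.05, 3), 5))  # 0.01667
print(round(dunn_sidak_alpha(0.05, 3), 5))  # 0.01695
```

The Dunn-Sidak level is always slightly larger than the Bonferroni level, which is where the small gain in power comes from.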
I cover multistage tests in the text, and I recommend them. I won't go over them here because they don't really fit with this example. The basic idea behind them is that if you have a lot of tests, it is not very likely that the null will really be true for all of them. The Bonferroni itself penalizes you as if that were the case.
Fisher's Least Significant Difference Test (LSD)
We have talked about this test before. Fisher argued that if the overall Anova is significant, you can go ahead and run multiple t tests between any and all groups.
Notice the requirement of a significant F.
This is the most liberal of the multiple comparison tests, and it only keeps the familywise error rate at alpha if the complete (omnibus) null is true--i.e. if all populations have equal means.
This is the only test that requires a significant overall F before proceeding!!!
I have been pushing this test for years for the situation in which you have only 3 groups, but people don't like it. Finally I came across a paper by Levin, Serlin, and Seaman (1994, Psychological Bulletin, 115, 153-159) that says the same thing.
Studentized Range Statistic
Many procedures use what is called the Studentized Range Statistic. It was originally designed as a statistic to compare the two extreme means in a set of means. If there are a lot of means, the extremes are likely to be more different than if there are just a few means. But that means that it is more likely to come up with a "significant" difference when testing those means. So the test was designed to adjust the critical value, making it larger when there are more means to choose from.
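For some reason that I have never seen explained, they came up with a slightly different test statistic than the normal t statistic. There is no reason why they had to do so; the t would do as well. But the statistic is

$$q = \frac{\bar{X}_l - \bar{X}_s}{\sqrt{\dfrac{MS_{error}}{n}}}$$

where $\bar{X}_l$ and $\bar{X}_s$ are the largest and smallest means in the set.
Note the relationship of this to t with equal n's:

$$t = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{\dfrac{2\,MS_{error}}{n}}}, \qquad q = t\sqrt{2}$$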
Notice that q is just the same as t except that the "2" is missing from in front of MSerror. This isn't a problem, because the critical values are altered in the same way.
This testing approach is used in many of the tests which follow, which is why I discussed it in the first place.
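The relationship between q and t for equal n's is easy to confirm numerically. In this sketch the function names and the input values are hypothetical.

```python
from math import sqrt, isclose

def t_equal_n(mean_i, mean_j, n, ms_error):
    """Ordinary t with equal n's: the denominator has 2 * MSerror / n."""
    return (mean_i - mean_j) / sqrt(2 * ms_error / n)

def q_statistic(mean_large, mean_small, n, ms_error):
    """Studentized range statistic: the same form, without the 2."""
    return (mean_large - mean_small) / sqrt(ms_error / n)

# Hypothetical means, n, and MSerror
t = t_equal_n(10.0, 8.0, 20, 10.0)
q = q_statistic(10.0, 8.0, 20, 10.0)
print(isclose(q, t * sqrt(2)))  # True
```

Because q is just t multiplied by the square root of 2, the tables of critical values for q are scaled in the same way, and the two statistics lead to the same decision for two means.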
Newman-Keuls Test (Student-Newman-Keuls SNK)
I happen to like this test, but lots of people complain about it. I have laid out the reasons in the text, but I'll simplify them here.
- When there are three means, the Newman-Keuls holds the familywise error rate at .05, just as we would like.
- When there are four or five means, the error rate is held at (approx) .10.
- When there are six or seven means, the max FW is .15, etc.
- It is rare to have more than five groups in an experiment, and when we do it is also very likely that at least some null hypothesis is not true. It is hard to imagine an experiment where we really believe that all five means are equal.
- Thus I think the arguments against the Newman-Keuls are not really fair.
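The error rates in the list above follow a simple pattern: alpha multiplied by the integer part of k/2, where k is the number of means. A short sketch of that pattern, with a function name of my own choosing:

```python
def newman_keuls_max_fw(k: int, alpha: float = 0.05) -> float:
    """Approximate maximum familywise error rate for the Newman-Keuls
    with k means: alpha times the integer part of k / 2."""
    return alpha * (k // 2)

for k in range(3, 8):
    print(k, round(newman_keuls_max_fw(k), 2))
# 3 means -> 0.05, 4 or 5 means -> 0.1, 6 or 7 means -> 0.15
```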
I go over how to apply this test by hand in the text, but people don't often do that, and I will probably cut that back drastically in the next edition.
In SPSS this test has a somewhat different printout than we have seen. I'm not sure why they do that. Basically they show you those groups that are homogeneous. The first example is the same set of data as the examples above.
None of these groups are different from any others. I don't know what the sig = .053 means, although it may be the significance level of the most extreme comparisons.
This is a good example, because the overall F was significant, but the test does not find any differences.
If we jump ahead to looking at weight change over 7 years as a function of smoking groups we get
Here you can see that 5 of the groups are homogeneous, but the sixth group (Quitter) is different from the other five.
What I can't display (because of the data I have) is the very common situation in which two homogeneous sets of groups have some overlap.
Unequal Sample Sizes and Heterogeneous Variances
The formula that I gave above assumes equal sample sizes. When you have unequal sample sizes, you can take the harmonic mean of the n's and use that for all cases. You can see from the printout above that this is what SPSS has done.
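The harmonic mean of the n's is simple to compute. In this sketch the sizes 333 and 1450 are the ex-smoker and Smoker groups mentioned earlier in these notes; the third size (2085) is my inference, obtained by subtracting those two from the 3868 subjects in the file, so treat it as an assumption.

```python
from statistics import harmonic_mean

# Baseline group sizes: 333 and 1450 are given in the notes;
# 2085 is inferred as 3868 - 333 - 1450 (an assumption).
group_sizes = [2085, 333, 1450]

n_h = harmonic_mean(group_sizes)  # k / sum(1 / n_i)
print(round(n_h))                 # roughly 719
```

Notice how far the harmonic mean falls below the arithmetic mean of about 1289. The small group dominates, which is worth remembering when the sample sizes are as unequal as they are here.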
When the variances are unequal, you can use the Games and Howell (1976) procedure. (Unfortunately, a different Howell.) SPSS will implement this procedure.
Tukey's Test
Tukey's test is a very close relative of the Newman-Keuls test. The difference is that all comparisons are done as if the groups were maximally far apart. In other words, with 6 groups, two means that are adjacent in an ordered series are still tested as if they were the largest and smallest of 6 means.
This test holds the familywise error rate at alpha regardless of which null hypothesis(es) are true.
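The difference between the two tests comes down to the number of means, r, that is used to look up the critical value of q. A sketch, with a hypothetical function name, for means ranked from smallest (1) to largest (k):

```python
def range_sizes(k: int) -> dict:
    """For every pair of ranked means (i, j) with i < j, return the number
    of means r used to find the critical q: (Newman-Keuls r, Tukey r).

    Newman-Keuls counts the means spanned by the pair; Tukey always uses k.
    """
    sizes = {}
    for i in range(1, k + 1):
        for j in range(i + 1, k + 1):
            sizes[(i, j)] = (j - i + 1, k)
    return sizes

sizes = range_sizes(6)
print(sizes[(3, 4)])  # (2, 6): adjacent means
print(sizes[(1, 6)])  # (6, 6): the two extreme means
```

The two tests agree only for the extreme pair. For every other pair Tukey uses a larger r, and so a larger critical value, which is why it is the more conservative test.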
The following very curious printout comes from an analysis of the three original groups at baseline.
This shows that Nonsmokers and Smokers are different.
But now look at the next part of the printout.
Notice that there are no significant differences using this test on these data.
I don't know why we get the difference between the two tables.
This shows that the Tukey is a somewhat more conservative test than the Newman-Keuls. I think this test is a bit too conservative, but lots of people like it.
Ryan Procedure (REGWQ)
The abbreviation stands for the Ryan, Einot, Gabriel, Welsch q test.
The logic is sort of like the Bonferroni, except that each test is run at $\alpha/(k/r) = r\alpha/k$, where k = number of means in the experiment, and r is the number of means in the ordered set of which these two are the largest and smallest. Einot, Gabriel, and Welsch fiddled with this just a little bit, but the basic idea is still right.
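Ryan's basic adjustment, which runs a comparison spanning r of the k ordered means at $\alpha/(k/r) = r\alpha/k$, can be sketched as follows. The function name is hypothetical, and this shows only Ryan's original rule, not the later adjustments by Einot, Gabriel, and Welsch.

```python
def ryan_alpha(alpha: float, r: int, k: int) -> float:
    """Significance level for a comparison spanning r of k ordered means
    under Ryan's rule: alpha / (k / r), i.e. r * alpha / k."""
    return alpha * r / k

# Six groups, as in the endpoint data
for r in range(2, 7):
    print(r, round(ryan_alpha(0.05, r, 6), 4))
```

Adjacent means (r = 2) are tested at about .0167, while the two extreme means (r = 6) are tested at the full .05, so the penalty shrinks as the means get further apart in the ordering.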
This test keeps FW at alpha regardless of the true null, but is less conservative than Tukey's test.
SPSS will run this test. For our example the printout is shown in the following tables for the baseline and the endpoint data.
If I apply this test to our three-group example I get
This is a good example of overlapping homogeneous groups.
Scheffé Test
This test is the most conservative of the lot, and I do not recommend it. Only the purists like it.
I think that the Bonferroni is not a good test as a post-hoc test. I would only use it as an a priori test. Explain why.
The following is the LSD output:
The next is the Bonferroni output:
Finally, for the REGWQ we get
Again, I got conflicting results with the Tukey. Students can do that on their own.
The assignment is to take the means, etc. from what we have here, sit down with a pencil and paper and my book, and see what is going on with the conflicting Tukey results. It may have to do with different ways of treating unequal sample sizes.
Hint: You can find an exact probability of a t, for example, by COMPUTING a new variable named tprob. From the menu choose cdf(q,df). Put the actual t value in where the "q" is. (I don't know why they don't call it "t", but they don't.) Put the df for error in place of df. The result will be the one-tailed probability value for a t > the obtained t. (I know that it is annoying that it will calculate that value for every case, but I don't know a way around it.) If you wanted to know the value of t that cut off the lowest 2.5%, you could use the same compute statement except substitute idf(p,df), where p is the lower tail probability (e.g. .025) and df is the dferror. The result will be the critical value of t, and if you drop the sign it will be the two-tailed value. I think that this will help you solve the problem, but unfortunately you can't get a probability for q in the same way.
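If you would rather do this outside of SPSS, the same two lookups can be sketched in Python. This version uses only the standard library and therefore substitutes the normal distribution for t, which is my assumption; it is close enough with 3865 df but should not be used with small samples. Note that the Python cdf returns the lower-tail area, so the upper tail is 1 minus that value.

```python
from statistics import NormalDist

z = NormalDist()  # stands in for t when df is very large

# One-tailed probability of exceeding the obtained t of 1.66
upper_tail = 1 - z.cdf(1.66)
print(round(upper_tail, 4))  # 0.0485

# Value that cuts off the lowest 2.5% (drop the sign for the two-tailed value)
critical = z.inv_cdf(0.025)
print(round(critical, 2))    # -1.96
```

Doubling the one-tailed value gives the two-tailed probability of .097 reported for the Smoker versus ex-smoker comparison.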
Last revised: 01/17/01