
Last week I talked about Effect Size for analysis of variance designs. Everything that I said there was correct, but on Friday I read an interesting discussion by Corina and ?? (19??) on effect size measures. They made the point that when you have an ANOVA design, you most often still want to talk about the effect size of a difference between two groups (or comparisons of several sets of two groups). I think this makes sense. This next example, though not intended as an example of Effect Size, makes the point nicely. I most probably just want to tell my readers about the size of the difference between Antabuse and non-Antabuse treatments.
This is the study by Carroll (1998) that we had on the exam in December. It looks at the effect of Antabuse as an adjunct to drug abuse treatment.
She had Cognitive Behavior Therapy (CBT) and a 12-Step Facilitation Therapy (TSF). She crossed these with Antabuse/no Antabuse, and tossed in an additional group (Clinical Management) that only had the Antabuse component.
The dependent variable was the number of weeks abstinent.
| Group | TSF | CBT | CM/Disulf | TSF/Disulf | CBT/Disulf |
|---|---|---|---|---|---|
| Mean | 2.22 | 1.83 | 2.59 | 3.76 | 4.54 |
| St. dev. | 3.02 | 2.03 | 3.74 | 3.76 | 4.54 |
| n | 23 | 18 | 27 | 25 | 24 |

Notice that the overall F is not significant. That won't stop us, but it ought to make us think.
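As a rough check on that statement, the overall F can be rebuilt from the summary statistics alone. This is only a sketch: it assumes the means, standard deviations, and ns are exactly as printed in the table, and it pools the group variances to get the error term. The variable names are mine.

```python
# Summary statistics copied from the table, as printed there.
means = [2.22, 1.83, 2.59, 3.76, 4.54]
sds = [3.02, 2.03, 3.74, 3.76, 4.54]
ns = [23, 18, 27, 25, 24]

k = len(means)
N = sum(ns)

# Weighted grand mean, because the sample sizes are unequal.
grand_mean = sum(n * m for n, m in zip(ns, means)) / N

# Between-groups and within-groups sums of squares.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))

df_between = k - 1
df_within = N - k

ms_between = ss_between / df_between
ms_error = ss_within / df_within
F = ms_between / ms_error

print(f"F({df_between}, {df_within}) = {F:.2f}")
```

With 4 and 112 df the tabled .05 critical value is roughly 2.45, so an F a little above 2 falls short of significance.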
Multiple comparison procedures can be roughly divided into Planned and Unplanned, or a priori and post hoc, tests. In some ways the issue is more a question of how many comparisons you want to run, rather than when they were planned, though it isn't fair to look at the data and decide that only two contrasts are now of interest to you. If you haven't planned in advance, it is equivalent to running all possible comparisons.
Among the planned (a priori) tests I would put
Standard contrasts
which are really just another way of doing t tests. The advantage of contrasts is that they allow you to make some interesting comparisons that involve several groups.
Trend analysis
which I will talk about later, but which is a variant of the standard contrasts. (We saw an example on Thursday.)
Fisher's LSD test
Some would put this off with the unplanned, but the point is that you can make whatever tests you want, and I normally think of it as making a limited set of comparisons.
Bonferroni tests
These are not so much a way of doing the arithmetic as a way of thinking about how to adjust to keep error rates in line.
I'll take each of these in turn, but go quickly because there isn't a lot to say.
Standard Contrasts
I didn't say much about these on Thursday, but to run them we define a set of coefficients that make a contrast between groups.
For example, suppose that we want to compare the Antabuse groups with the others.
Groups: TSF CBT CM/Ant TSF/Ant CBT/Ant
Coeff: 1/2 1/2 -1/3 -1/3 -1/3
Means: 2.22 1.83 2.59 3.76 4.54
If we (deliberately) ignore the issue of unequal sample sizes, then
1/2(2.22) + 1/2(1.83) - 1/3(2.59) - 1/3(3.76) - 1/3(4.54)
= (2.22 + 1.83)/2 - (2.59 + 3.76 + 4.54)/3 = 2.025 - 3.63 = -1.605
This result (-1.605) is the difference between the (unweighted) averages of the two sets of groups.
The harmonic mean of the sample sizes is 22.963, and I'll use that. The answer will be approximate, but pretty close.
Here we can declare that those subjects who take antibuse outperform (on average) those who do not.
In the text I do this in the context of an F instead of a t, but the difference is merely that $F = t^2$.
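The arithmetic for this contrast can be sketched in a few lines. The contrast value and the harmonic mean of the ns come straight from the numbers above. The error term is my assumption: I pool the variances from the table to get MS_error, so the resulting t is approximate.

```python
import math

# Summary statistics from the table, as printed there.
means = [2.22, 1.83, 2.59, 3.76, 4.54]
sds = [3.02, 2.03, 3.74, 3.76, 4.54]
ns = [23, 18, 27, 25, 24]

# Coefficients: the two non-Antabuse groups versus the three Antabuse groups.
coeffs = [1 / 2, 1 / 2, -1 / 3, -1 / 3, -1 / 3]

# Value of the contrast (difference between the unweighted averages).
psi = sum(a * m for a, m in zip(coeffs, means))

# Harmonic mean of the sample sizes, used in place of a common n.
n_h = len(ns) / sum(1 / n for n in ns)

# ASSUMPTION: MS_error is the pooled within-group variance from the table.
df_error = sum(n - 1 for n in ns)
ms_error = sum((n - 1) * s ** 2 for n, s in zip(ns, sds)) / df_error

# t for the contrast, and the equivalent F on 1 and df_error degrees of freedom.
se = math.sqrt(ms_error * sum(a ** 2 for a in coeffs) / n_h)
t = psi / se
F = t ** 2

print(f"psi = {psi:.3f}, n_h = {n_h:.3f}, t({df_error}) = {t:.2f}, F = {F:.2f}")
```

The sign of t only reflects which set of groups got the positive coefficients.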
This would be a place to talk about Effect Size, but I'll skip it here.
In the book I talk about orthogonal contrasts, but that's not a big deal. Don't worry about orthogonality.
The important thing about contrasts is that they appear all over the place, and are really nothing but t tests, or F tests, using some set of coefficients to weight the groups so as to lead to the comparisons we want.
Trend Analysis
I cover trend analysis in the text at the end of the chapter. It is worth your time to read that over so that you understand what trend analysis is and when it would be used. It is probably not all that important to give another example here. I did create an example of a cubic trend, and that is available if you have trouble understanding what a cubic trend is.
The use of trend analyses seems to change over time. At the moment it seems on the ascent. It is very useful when you want to pull some specific things out of the data.
Fisher's Least Significant Difference test
I think that we have beaten this to death, but there is one point that caused confusion on last year's exam.
This is the only test that requires a significant overall F before continuing. The others do not, and it would, in fact, make them conservative to impose this requirement.
If the complete null hypothesis is true, and all population means are equal, Fisher has argued that you can hold the familywise error rate at alpha = .05 just by requiring that the overall F be significant before you make any multiple comparisons.
If there are only three means, this test always holds alpha at .05. If there are 4 or 5 means, alpha can be as high as about .10.
Explain why this is true. Well, probably skip it to save time.
Lots of people don't like this test because they think that it is too liberal. I don't agree with them, but some journal editors probably will.
The Bonferroni tests
Here we don't really care what kind of tests you run. In fact, you could probably use just about any test, including a chi-square if you could figure out how to sneak it in.
The point is that when you have run your several tests (supposing that there are c of them), you divide alpha by c and require that each test be significant at that level.
This has the advantage of holding the maximum familywise error rate at .05, though it is a bit conservative.
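The rule is simple enough to sketch directly. The p values here are hypothetical, made up only to show the mechanics. They could come from any kind of test.

```python
alpha = 0.05

# HYPOTHETICAL p values from c tests of any kind.
p_values = [0.003, 0.020, 0.041]
c = len(p_values)

# Each test must be significant at alpha / c.
per_test_alpha = alpha / c
significant = [p <= per_test_alpha for p in p_values]

print(f"per-test alpha = {per_test_alpha:.4f}")
print(significant)
```

Two of these three p values are below .05 but not below .05/3, which is the conservatism mentioned above.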
Modified Bonferroni tests
These tests are run in ways similar to the Bonferroni discussed above, but here we keep redefining c as we go along. We start with c equal to the number of tests we hope to run. Then after we have run one of them, if it is significant, we reduce c by one before going on to the next.
The idea here is that c stands for the number of null hypotheses that could be true. Once we have rejected one, we have concluded that it is not true, and that reduces the number of remaining nulls that could be true.
We stop whenever we find a difference that is not significant, no matter where that is.
These tests are among the most powerful multiple comparison tests we have, and I recommend them. They can be used in any setting where we have a bunch of tests, with known p values, and want to keep from making too many Type I errors.
BUT I strongly recommend against using the Bonferroni when you are making all possible pairwise comparisons among means. It is much too conservative in that case. It is best used when you only want to make a few comparisons.
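Here is a minimal sketch of the sequential idea. I am assuming the usual ordering for this kind of procedure (Holm's version), in which the smallest p value is tested first. The function name and the p values are hypothetical.

```python
def sequential_bonferroni(p_values, alpha=0.05):
    """Step through the tests from smallest p to largest, shrinking c after each rejection."""
    # ASSUMPTION: tests are taken in order of increasing p (Holm's ordering).
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    c = len(p_values)  # number of nulls that could still be true
    for i in order:
        if p_values[i] <= alpha / c:
            reject[i] = True
            c -= 1  # one fewer null could be true
        else:
            break  # stop at the first nonsignificant result
    return reject


# HYPOTHETICAL p values, in the order the tests were listed.
p_values = [0.015, 0.003, 0.041, 0.30]

print(sequential_bonferroni(p_values))
print([p <= 0.05 / len(p_values) for p in p_values])  # plain Bonferroni for comparison
```

With these values the sequential version rejects two nulls where the plain Bonferroni rejects only one, which is where the extra power comes from.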
Dunn-Sidák Test
This test is very much like the Bonferroni. The Bonferroni is based on an inequality that says that the maximum possible familywise error rate, when each of c comparisons is run at a per-comparison rate of $\alpha$, is $c\alpha$. Sidák used a more precise inequality that says that, for independent tests, the maximum possible familywise error rate is $p(FW) = 1 - (1 - \alpha)^c$. So Sidák just runs a standard t, but instead of evaluating it at $\alpha' = \alpha/c$, he evaluates it at
$\alpha' = 1 - (1 - \alpha)^{1/c}$. This just makes the test a tiny bit more powerful.
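To see how tiny "a tiny bit" is, compare the two per-comparison levels for a given number of tests. The choice of c = 5 is just for illustration.

```python
alpha = 0.05
c = 5  # illustrative number of comparisons

# Per-comparison alpha under each approach.
bonferroni_alpha = alpha / c
sidak_alpha = 1 - (1 - alpha) ** (1 / c)

print(f"Bonferroni: {bonferroni_alpha:.6f}")
print(f"Dunn-Sidak: {sidak_alpha:.6f}")
```

The Sidák level is slightly larger, so a p value has marginally more room to be declared significant.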
These tests are used when you have not decided on your comparisons before you collect your data, and you are, in effect if not in practice, running all pairwise comparisons among means.
The Newman-Keuls test compares every mean with every other mean using a Studentized Range Statistic.
In the book I call it a controversial test, and it is. I still like it, but I'm in a minority. The reason for leaving it in is that it is a good introduction to the tests that follow. But in the next version I will probably take it out because we don't do the calculations that way any more, and therefore lose the "introductory" value.
You can think of the Studentized Range as a standard t test where you adjust the critical value of t on the basis of how many means are in the set you are looking at.
I talked about the Studentized Range Statistic last semester. The Studentized Range Test was thought up long before Newman and Keuls came around, and it was designed to replace the standard ANOVA with a test that only compares the largest and the smallest mean in an experiment.
Just as an illustration, I created purely random data for an experiment with two groups, and for an experiment with 6 groups. The means for the two-group experiment were 45.61 and 47.63. The largest and smallest means of the 6 groups were 45.61 and 51.87. Obviously we are more likely to find (an erroneous) difference if we compare the 45.61 vs 51.87 groups, than if we compare the means from the two-group experiment.
The Studentized Range Statistic adjusts the probability values to take that into account.
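A small simulation makes the same point. This is not the data I generated for the illustration above. The population mean, standard deviation, group size, and number of replications are all arbitrary choices of mine.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable


def average_range_of_means(k, n=10, reps=2000):
    """Average (largest mean - smallest mean) over many null experiments with k groups."""
    total = 0.0
    for _ in range(reps):
        # Every group is drawn from the same population, so the null is true.
        group_means = [
            statistics.fmean([random.gauss(50, 10) for _ in range(n)])
            for _ in range(k)
        ]
        total += max(group_means) - min(group_means)
    return total / reps


range_2 = average_range_of_means(2)
range_6 = average_range_of_means(6)

print(f"average range, 2 groups: {range_2:.2f}")
print(f"average range, 6 groups: {range_6:.2f}")
```

Even though no real differences exist, the gap between the extreme means is much larger on average with six groups than with two.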
The test is basically just a t test between the largest and smallest means. But instead of using t, they use a variant called q. The formula is almost the same, except for a missing square root of 2 in the denominator. But the tables take this into account.
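Assuming equal sample sizes, we have

$$q = \frac{\bar{X}_{\text{largest}} - \bar{X}_{\text{smallest}}}{\sqrt{MS_{\text{error}}/n}}$$

which is the usual t without the $\sqrt{2}$ in the denominator, so that $q = t\sqrt{2}$.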
Now back to the Newman-Keuls:
To run the test you put the means in an ordered series and then compare the highest and lowest means in that set. The number of means in the set (the number of means of which these two are the highest and lowest) is denoted as r. You then get your critical value by taking into account the value of r.
If a difference is significant, you then set one of those means aside and compare the largest and the smallest of those that remain, adjusting r accordingly.
This test holds alpha at .05 familywise against the complete null hypothesis. But if the complete null is not true, it allows alpha to float up, which is classed as "a bad thing."
With three means, familywise alpha will equal .05. With 4 or 5 means it will equal nearly .10, and we don't worry about more than 5 means because we don't do those kinds of experiments.
Example from Carroll's study:
Here we don't find any significant differences, which surprises me.
When we compare this to the contrast we ran earlier, we see the relative power of a priori and post hoc tests.
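The first step of the test, comparing the largest and smallest of all five means, can be sketched from the summary statistics. As before, this is approximate: I am assuming the pooled variance from the table as MS_error and the harmonic mean of the ns in place of a common n.

```python
import math

# Summary statistics from the table, as printed there.
means = [2.22, 1.83, 2.59, 3.76, 4.54]
sds = [3.02, 2.03, 3.74, 3.76, 4.54]
ns = [23, 18, 27, 25, 24]

# ASSUMPTION: pooled within-group variance stands in for MS_error.
df_error = sum(n - 1 for n in ns)
ms_error = sum((n - 1) * s ** 2 for n, s in zip(ns, sds)) / df_error

# Harmonic mean of the ns, since the groups are not the same size.
n_h = len(ns) / sum(1 / n for n in ns)

ordered = sorted(means)
r = len(ordered)  # the extreme means span all five groups

# Studentized range statistic for the largest versus the smallest mean.
q = (ordered[-1] - ordered[0]) / math.sqrt(ms_error / n_h)

print(f"r = {r}, q = {q:.2f} on {df_error} df")
```

The tabled .05 critical value of q for r = 5 and about 112 df is roughly 3.9, so the widest comparison falls short and the procedure stops there.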
Tukey's test is run exactly like the Newman-Keuls except that r is always kept at k, where k is the number of groups in the experiment.
The test is more conservative than the Newman-Keuls, but it keeps alpha at .05 for all pairwise comparisons, regardless of the number of means in the study.
This is one of the most popular tests for this purpose.
There is no point in running it here, because it is more conservative than the Newman-Keuls (SNK), and SNK did not find any differences.
The Scheffé test is recommended only when you are running a bunch of complex contrasts. Scheffé himself recommends against its use for a bunch of pairwise tests. It is very conservative.
The Ryan procedure (REGWQ) holds the error rate at alpha (familywise) while allowing r to shrink slightly as the test proceeds. As such, it is a compromise between Newman-Keuls and Tukey. I recommend this test when you have it available.
It, too, does not produce significant differences here.
There are a number of tests whose point is to handle the case where we cannot seriously accept the assumption of homogeneity of variance and where the ns are quite different.
I won't go over them here, but I have a handout showing the Games-Howell procedure. You don't need to be able to run one off the top of your head, but you do need to know that they exist and what they are used for. There may come a time when you are glad you have this to look at, but please don't spend too much time on it now.
If you have any further questions on these topics, I would ask that you write them out very specifically and I will attempt to answer them. If you see errors, or vagueness, in this document, please point them out to me.
Last revised: 01/19/02