Parametric and Resampling Statistics (cont.):

The Null Hypothesis

David C. Howell

University of Vermont

On top of not worrying about assumptions, the randomization/permutation folks don't even set up their null hypotheses the same way the parametric folks do. In fact, the first reference to a null hypothesis in Edgington's (1995) index, though certainly not in the text, is on page 347. Lunneborg (2000) has the first reference on page 209, and Good (2000), while bringing the null hypothesis in on page 25, references it by writing "Under the null hypothesis of no differences among the various experimental or survey groups, ... ." What kind of a self-respecting null hypothesis is that? How could he have forgotten to mention the bit about "μ1 = μ2 = μ3 = ... = μk"? Well, he left that out because that isn't his null hypothesis. His null hypothesis is pretty much what he said: "the different treatments had the same effect," and not "the different populations had the same mean."

Let me expand on the previous paragraph by quoting a very important passage from Edgington (1986).

"Just as the reference set (read as "sampling distribution" for now) of data permutations is independent of the test statistics, so is the null hypothesis. A difference between means may be used as a test statistic, but the null hypothesis does not refer to a difference between means. The null hypothesis, no matter what test statistic is used, is that there is no differential effect of the treatments for any of the subjects. ... Thus the alternative hypothesis is that the measurement of at least one subject would have been different under one of the other treatment conditions. Inferences about means must be based on nonstatistical considerations; the randomization test does not justify them." (p. 531)

This idea is so important for resampling statistics that I want to follow with another quotation from Edgington (1995, p. 141).

"As a consequence there is a tendency to think of the null hypothesis tested as being reflected in the test statistic employed, which can be quite misleading in the case of randomization tests.

In Chapters 4 and 5, the use of randomization test statistics involving differences between means did not imply a test of a null hypothesis of no difference in mean treatment effects. The null hypothesis was no difference in the treatment effect for any subject, and that is the same null hypothesis tested if we expect a difference in variability of treatment effects and use a test statistic sensitive to that property. Whether we employ a test statistic sensitive to mean differences, differences in variability, or differences in skewness of treatment effects depends on our expectation of the nature of the effect that may exist, but the choice does not alter the null hypothesis that is tested, which is that of no treatment effect ... ."

Well, I don't think that he could be any clearer than that or much more emphatic.

It should be obvious that we are talking about quite a different null hypothesis with randomization tests than we are with the more traditional parametric tests. This is why people can speak of the nonparametric test's freedom from assumptions. If we only assume that the treatment had no effect, then homogeneity of variance is not one of the things that we assume; it is one of the things that we test. And if we run a randomization test and reject the null hypothesis, we may have rejected it because one treatment led to a larger mean, or greater variability, or a more skewed distribution than did the other. The randomization tests that we normally apply are more sensitive to one kind of difference (usually a difference in location) than to other kinds of differences, but that is due to the test statistic we select. That is one of the strengths of randomization tests: they allow us to choose a test statistic that addresses specific kinds of differences, rather than forcing one on us.
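To make the point about choosing a test statistic concrete, here is a minimal sketch in Python. The function names (`mean_difference`, `variance_difference`, `randomization_p`) and the scores are hypothetical, invented for illustration. The same pooling-and-reshuffling procedure tests the same null hypothesis of no treatment effect in both calls; only the statistic passed in changes, and with it the kind of difference the test is sensitive to.

```python
import random
from statistics import mean, pvariance

def mean_difference(a, b):
    # Statistic sensitive to a difference in location.
    return abs(mean(a) - mean(b))

def variance_difference(a, b):
    # Statistic sensitive to a difference in variability.
    return abs(pvariance(a) - pvariance(b))

def randomization_p(group1, group2, statistic, n_shuffles=2000, seed=1):
    """Approximate p-value under the null of no treatment effect.

    The null hypothesis is the same whatever `statistic` is supplied.
    """
    rng = random.Random(seed)
    observed = statistic(group1, group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign scores to groups at random
        if statistic(pooled[:n1], pooled[n1:]) >= observed:
            hits += 1
    # Count the observed arrangement itself as one member of the reference set.
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical scores: identical means, very different spread.
tight = [48, 49, 50, 50, 50, 51, 52]
wide = [20, 35, 45, 50, 55, 65, 80]

p_mean = randomization_p(tight, wide, mean_difference)
p_var = randomization_p(tight, wide, variance_difference)
print(p_mean, p_var)
```

With these made-up scores the mean-difference statistic sees nothing unusual, while the variance-difference statistic does, even though both calls test the same null hypothesis.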

This takes us back to our underlying assumptions. For a randomization test, the primary assumption is "exchangeability." This basically comes down to the assumption that if the null hypothesis is true, the labels assigning subjects to groups are interchangeable. In other words, if the treatment had no effect, a person would have the same score, no matter which group he or she was assigned to. Thus, even after the data have been collected, the mean of what we have called Group One would have the same expectation after we shuffled subjects among groups. This is why we will create our reference distribution (sampling distribution) by taking the data at hand, shuffling them, and reassigning them to groups. When we do this repeatedly, the results we obtain will be the reference distribution for the null hypothesis.
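As a rough sketch of what building that reference distribution looks like, the Python below takes a very small two-group data set and lists every possible reassignment of scores to groups instead of shuffling at random. Complete enumeration is an assumption made here to keep the example exact; it is the limiting case of shuffling repeatedly. The scores and the function name `exact_reference_distribution` are hypothetical.

```python
from itertools import combinations
from statistics import mean

def exact_reference_distribution(group1, group2):
    """Mean differences for every way of reassigning the pooled scores."""
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    indices = range(len(pooled))
    diffs = []
    for chosen in combinations(indices, n1):
        chosen_set = set(chosen)
        new1 = [pooled[i] for i in chosen]
        new2 = [pooled[i] for i in indices if i not in chosen_set]
        diffs.append(mean(new1) - mean(new2))
    return diffs

# Hypothetical scores, for illustration only.
group_one = [12, 15, 17, 20]
group_two = [8, 9, 11, 13]

reference = exact_reference_distribution(group_one, group_two)
observed = mean(group_one) - mean(group_two)

# Two-sided p-value: share of reassignments at least as extreme as observed.
p_value = sum(abs(d) >= abs(observed) for d in reference) / len(reference)
print(len(reference), observed, p_value)
```

With larger samples the number of reassignments grows too quickly to list, which is why in practice one draws a large number of random shuffles instead.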

Notice one other thing about Edgington's definition. He said "there is no differential effect for any of the subjects." That is not a phrase that you would see in a null hypothesis written with regard to a standard parametric test. In parametric tests the individual subject does not play quite as central a role. If most subjects in Treatment 1 had a higher score than most subjects in Treatment 2, but some did not, that wouldn't worry anyone or affect the parametric test unless it led to violations of one or more assumptions. But the randomization test bases its logic on what would happen with individual subjects, and this becomes important. The point becomes even more apparent when we speak about randomization tests for interactions in factorial designs.

As an aside, to illustrate just what we mean by exchangeability, and how it influences the way we run our tests, assume that we had only two subjects, each measured under three different treatments. Assume that the data were as follows:

Subject Treatment 1 Treatment 2 Treatment 3

If treatments really had no effect, the 5 in the first row was equally likely to have fallen to Treatments 1, 2, or 3, and we could shuffle the three numbers in that row without hurting anything. The same goes for our ability to shuffle the three numbers in row 2, because under the null they were equally likely to have landed in any treatment. But our null hypothesis does not say anything about subjects or participants. Even if treatments don't have any effect, it is unreasonable to think that the 15 in column 2 could have just as easily been a score assigned to the first subject. If these scores are measures of depression, then the second subject certainly looks to be much more depressed overall than the first, and it would not make sense to shuffle scores vertically, though it does make sense to shuffle scores horizontally. The point to be made here is that our null hypothesis tells us what kinds of exchangeability are acceptable, as well as telling us, by omission, what kinds of exchangeability are not reasonable. This means that the null hypothesis will govern the way we actually carry out the test.
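The restriction to horizontal shuffling can be sketched in Python as follows. The scores are hypothetical placeholders, not the author's data, chosen only so that the second subject scores higher overall; the function name `shuffle_within_subjects` is likewise invented for illustration. Each subject's scores are permuted across treatments, and no score ever moves from one subject to the other.

```python
import random
from itertools import permutations, product

def shuffle_within_subjects(scores, rng):
    """Return a copy in which each row (subject) is shuffled across treatments."""
    shuffled = []
    for row in scores:
        new_row = list(row)
        rng.shuffle(new_row)  # horizontal shuffle only
        shuffled.append(new_row)
    return shuffled

# Hypothetical scores: rows are subjects, columns are Treatments 1 to 3.
scores = [
    [5, 7, 6],
    [13, 15, 14],
]

rng = random.Random(42)
permuted = shuffle_within_subjects(scores, rng)
print(permuted)

# The full reference set under this null: every within-row ordering
# for subject 1 combined with every within-row ordering for subject 2.
all_arrangements = list(product(*(permutations(row) for row in scores)))
print(len(all_arrangements))
```

Because only within-row orderings are allowed, two subjects with three treatments each give 3! x 3! = 36 arrangements, far fewer than if all six scores were shuffled freely.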