Randomization tests can be thought of as another way to examine data, and they do not make restrictive assumptions about populations. As a very quick example, suppose that you have two groups of scores. One came from subjects who were presented with a particular treatment, and the other came from subjects who did not receive the treatment. The question is, "Can we draw a conclusion about the effectiveness of the treatment by looking at those two sets of scores?" We won't make any assumptions about the distribution of scores, though we will assume that subjects were assigned to groups at random.
Let's start by assuming that the treatment had absolutely no effect. And let's assume that one participant had a score of 27. If the treatment had no effect, then that 27 would be equally likely to have come from the treatment group as from the control group. The same holds for all of the other scores. So let's set out by taking all of our data, tossing it in the air, and letting half of it fall in one group and the other half in the other group. That is an example of what we would expect if the treatment had no effect. Now let's calculate the mean, or perhaps the median, of each group, and then calculate the difference in the medians (or means). Record that number, and then toss the data in the air again, separate the scores at random into two groups, and again calculate the difference in the medians. Now keep doing this a great many times, say 10,000, each time recording the median difference. Those 10,000 differences are 10,000 examples of what you would expect if there were no treatment effect.
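The tossing-in-the-air procedure just described can be sketched in a few lines of R. This is only an illustration, and the two score vectors (treat and control) are invented data, not from any real study.

```r
# Invented scores for illustration only
set.seed(42)
treat   <- c(27, 31, 24, 35, 29, 33)
control <- c(22, 25, 20, 28, 23, 26)

combined <- c(treat, control)
n1 <- length(treat)
obtained <- median(treat) - median(control)   # the difference we actually found

nreps <- 10000
diffs <- numeric(nreps)
for (i in 1:nreps) {
  shuffled <- sample(combined)                # "toss the data in the air"
  # First n1 scores fall into one group, the rest into the other
  diffs[i] <- median(shuffled[1:n1]) - median(shuffled[-(1:n1)])
}

# Proportion of no-effect differences at least as extreme as the one obtained
p <- mean(abs(diffs) >= abs(obtained))
```

The vector diffs holds the 10,000 median differences you would expect if the treatment had no effect, and p tells you how unusual your obtained difference is among them.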
Now consider the data that you actually obtained. If there was no treatment effect, your obtained median difference would look like all of the other differences. But suppose that your median difference was quite large--totally unlike the kind of differences you found with random assignment of scores to groups. You would then conclude that the difference you actually found is totally unlike the differences you find when there is no effect, and you would likely conclude that the treatment actually worked--it made a difference in how the scores were distributed. I am going to be a bit sloppy here and say that this result would lead us to reject the "null hypothesis." Notice that I have said nothing about normality and nothing about homogeneity of variance. In fact I have said nothing at all about population parameters. That is part of the nature of randomization, or "permutation," tests.
Now, of course, I do not expect you to sit around tossing your data in the air 10,000 times. That would be absurd. But there is nothing to prevent you from letting your computer do roughly that, and you would probably be amazed at how fast it can do so. Without a computer, randomization tests would be totally impractical, which is why they did not come into common use years ago. But with a computer, such tests are entirely practical and have several advantages over the usual parametric tests that fill most of our textbooks.
I have worked with randomization tests for a number of years. At one time I wrote a set of programs in Visual Basic that ran such tests and printed out the results in a very neat and attractive way. But no one uses Visual Basic any more. You can do the same thing with other kinds of software, however, and, in particular, with R. Much of what follows is based on a set of R programs, but even if you do not want to play with R, you can learn a great deal just by reading the text. If you want to run such tests using SPSS, look at
Hayes, A. F. (1998). SPSS procedures for approximate randomization tests. Behavior Research Methods, Instruments & Computers, 30(3), 536-543.
If you want to use SAS, look at
Chen, R. S., & Dunlap, W. P. (1993). SAS procedures for approximate randomization tests. Behavior Research Methods, Instruments, & Computers, 25(3), 406-409.
Before discussing specific procedures, I need to say something, actually quite a lot, about the characteristics of randomization tests.
I need to say something about exchangeability. Exchangeability applies to the null-hypothesis sampling distribution--in other words, the data must be exchangeable under the null. Phil Good, for example, is a big fan of this term. He would argue that if the scores in one group have a higher variance than those in the other, then the data are not exchangeable and the test is not valid. BUT, if the hypothesis being tested is that the treatment has no effect whatsoever on scores, then under that null hypothesis why would one set of scores have a higher variance other than by chance? The problem is that we have to select the test statistic with care. We normally test means, or their equivalent, but a treatment could also change the variances, for example, because that is another way in which the treatment groups could differ. If we want to draw conclusions specifically about means, then we have to assume that the groups are exchangeable in other respects, including their variances. In short, we need to be specific about the null hypothesis we are testing. So much for that little hobby horse of mine.
Cliff Lunneborg wrote an excellent discussion of randomization tests. (Unfortunately, Cliff Lunneborg has died, and the paper is no longer available at his web site. By some good fortune, I happen to have copies of those pages.) You can download the files at Paper-One and Paper Two. I consider these required reading if you want to fully understand the underlying issues behind randomization tests. Lunneborg writes extremely well, but (and?) he chooses his words very carefully. Don't read this when you are too tired to do anything else--you have to be alert.
The basic approach to randomization tests is straightforward. I'll use the two independent group example, but any other example would do about as well.
This approach can be taken with any randomization test. We simply need to modify it to shuffle the appropriate values and calculate the appropriate test statistic. For example, with multiple conditions we will shuffle the data, assign the first n1 cases to treatment 1, the next n2 cases to treatment 2, and so on, calculate an F statistic on the data, consider whether or not to increment the counter, reshuffle the data, calculate F, and so on. In some cases, such as factorial analysis of variance designs, the hardest question to answer is "What should be shuffled?"
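The multiple-condition version just described can be sketched as follows. Again the three score vectors (g1, g2, g3) are invented data, purely for illustration, and I use R's built-in aov() to get the F statistic rather than computing it by hand.

```r
# Invented scores for three conditions, for illustration only
set.seed(1)
g1 <- c(14, 18, 16, 21)
g2 <- c(19, 23, 22, 25)
g3 <- c(24, 27, 26, 30)

scores <- c(g1, g2, g3)
group  <- factor(rep(1:3, times = c(length(g1), length(g2), length(g3))))

# F statistic for the data as actually obtained
obtainedF <- summary(aov(scores ~ group))[[1]]$`F value`[1]

nreps <- 1000
counter <- 0
for (i in 1:nreps) {
  shuffled <- sample(scores)               # reshuffle the data
  # Reassigning shuffled scores to groups of the original sizes
  Fstat <- summary(aov(shuffled ~ group))[[1]]$`F value`[1]
  if (Fstat >= obtainedF) counter <- counter + 1   # increment the counter
}
p <- counter / nreps
```

The logic is identical to the two-group case; only the statistic (F instead of a difference in medians) and the number of groups have changed.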
In a previous section I suggested that the phrase "permutation test" implies that we take all possible permutations of the data (or at least all possible combinations). That is often quite impossible. Suppose that we have three groups with 20 observations per group. There are 60!/(20!*20!*20!) possible different combinations of those observations into three groups, and that means 5.7783 × 10^26 combinations, and even the fastest supercomputer is not up to drawing all of those samples. (That is more than the estimated number of stars in the universe.) I suppose that it could be done if we really had to do it, but it certainly wouldn't be worth waiting around all that time for the answer.
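You can check that count yourself in R. The factorials involved overflow ordinary arithmetic, so the sketch below works on the log scale with the base function lfactorial().

```r
# Number of ways to divide 60 observations into three groups of 20:
# 60! / (20! * 20! * 20!), computed on the log scale to avoid overflow
log.count <- lfactorial(60) - 3 * lfactorial(20)
count <- exp(log.count)
count          # roughly 5.7783e+26
```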
The solution is that we take a random sample of all possible combinations. That random sample won't produce an exact answer, but it will be so close that it won't make any difference. (This is why these tests are sometimes referred to as "approximate" tests, though in reality they are no more "approximate" than standard parametric tests, which rely on all sorts of "iffy" approximations.) The results of 5,000 or 10,000 samples will certainly be close enough to the exact answer to satisfy any reasonable person. (The difference will typically show up in the third decimal place or beyond.) (There is even a sense in which that approach can be claimed to be exact, but we won't go there.)
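If you want a feel for how close "close enough" is, note that the sampled p value is just a binomial proportion, so its standard error is sqrt(p(1 - p)/nreps). A quick sketch, taking p = .05 and 10,000 samples as an example:

```r
# Standard error of a resampled p value near .05 based on 10,000 samples
p <- 0.05
nreps <- 10000
se <- sqrt(p * (1 - p) / nreps)
se    # roughly 0.0022
```

So with 10,000 samples the estimated p value will usually land within a couple of units in the third decimal place of the exact answer, and drawing more samples shrinks that error further.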
When we draw random samples to estimate the result of drawing all possible samples, we often refer to that process as Monte Carlo sampling. I could use that term to distinguish between cases where the number of combinations is small enough to draw all of them, and cases where that is not practical. However there is little to be gained by adding another term, and I will simply use the phrase "randomization tests" for both approaches.
For the pages that follow I will base the calculations on a computing language called R. (Earlier in this page I gave links to discussions of using SPSS and SAS to carry out resampling computations.) The R programming language has become very popular in the last few years, and it is something that you can download for free and run without very much difficulty--well, sort of. For every example I will provide the complete code for running the analysis. The code that I provide will deliberately be "wordy," meaning that I will skip many shortcuts and let you see exactly what I am doing. Others have written packages that will do what I do more easily, but you won't learn much about R that way. So please don't complain about all of the separate steps and the comments that have been added. They are intended to make your learning easier. I have put together a set of pages related to writing code in R. They leave a great deal to be desired, but you can get to the first of them at Introducing R. The point of those pages is not really to teach you to be an R programmer, but to give you some idea of what R code looks like and to address a few very basic issues. You can learn quite a bit about writing R programs from those links. You will see many more pages of code before we're through.
I have broken these many pages down into three sections. The first deals with the pages that are basically designed to explain the logic and structure of resampling tests. I have labeled these "Background Material" for obvious reasons. I then move to what I have called "Randomization Tests." These are pages that deal with specific tests, such as comparing two group means. In R I have deliberately pulled coding steps apart, even when I could have shortened the code by using more complex commands. I did this to make it easier for you to follow what I am doing. After the randomization tests I provide a section on bootstrapping. That section is a bit shorter simply because there are fewer bootstrapping topics and they are more easily covered.
David C. Howell
University of Vermont