Overview of Randomization Tests

Randomization tests can be thought of as another way to examine data, one that does not require restrictive assumptions about populations. As a very quick example, suppose that you have two groups of scores. One came from subjects who were presented with a particular treatment, and the other came from subjects who did not receive the treatment. The question is, "Can we draw a conclusion about the effectiveness of the treatment by looking at those two sets of scores?" We won't make any assumptions about the distribution of scores, though we will assume that subjects were assigned to groups at random.

Let's start by assuming that the treatment had absolutely no effect. And let's assume that one participant had a score of 27. If the treatment had no effect, then that 27 would be equally likely to have come from the treatment group as it is to come from the control group. The same holds for all of the other scores. So let's set out by taking all of our data, tossing it in the air, and letting half of it fall in one group and the other half in the other group. That is an example of what we would expect if the treatment had no effect. Now let's calculate the mean, or perhaps the median, of each group, and then calculate the difference in the medians (or means). Record that number, and then toss the data in the air again, separate them at random into two groups, and again calculate the difference in the medians. Now keep doing this a great many times, say 10,000, each time recording the median difference. Those 10,000 differences are 10,000 examples of what you would expect if there were no treatment effect.

Now consider the data that you actually obtained. If there was no treatment effect, your obtained median difference would look like all of the other differences. But suppose that your median difference was quite large--totally unlike the kind of differences you found with random assignment of scores to groups. You would then conclude that the difference you actually found is totally unlike the differences you find when there is no effect, and you would likely conclude that the treatment actually worked--it made a difference in how the scores were distributed. I am going to be a bit sloppy here and say that this result would lead us to reject the "null hypothesis." Notice that I have said nothing about normality and nothing about homogeneity of variance. In fact I have said nothing at all about population parameters. That is part of the nature of randomization, or "permutation," tests.

Now, of course, I do not expect you to sit around tossing your data in the air 10,000 times. That would be absurd. But there is nothing to prevent you from letting your computer do roughly that, and you would probably be amazed at how fast it can do so. Without a computer, randomization tests would be totally impractical, which is why they did not come into common use years ago. But with a computer, such tests are entirely practical and have several advantages over the usual parametric tests that fill most of our textbooks.
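As a rough illustration of what letting the computer do the tossing might look like, here is a minimal sketch in Python (the pages that follow use R; Python is used here only for illustration). The scores are made up, and the function names median_difference and shuffled_differences are hypothetical, not part of any package.

```python
import random
import statistics


def median_difference(group_a, group_b):
    """Difference between the two group medians."""
    return statistics.median(group_a) - statistics.median(group_b)


def shuffled_differences(treatment, control, nreps=10_000, seed=1):
    """Median differences obtained by repeatedly re-dividing the pooled scores at random."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    diffs = []
    for _ in range(nreps):
        rng.shuffle(pooled)            # "toss the data in the air"
        diffs.append(median_difference(pooled[:n_treat], pooled[n_treat:]))
    return diffs


# Made-up scores, for illustration only.
treatment = [27, 31, 35, 38, 40, 42, 45, 47]
control = [18, 20, 22, 24, 25, 27, 29, 30]

observed = median_difference(treatment, control)
diffs = shuffled_differences(treatment, control)

# Proportion of shuffles giving a difference at least as large (in absolute value) as the one obtained.
p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)
print("observed median difference:", observed)
print("proportion of shuffles at least as extreme:", p_value)
```

If that proportion is very small, the obtained difference is unlike the differences produced when group membership is arbitrary, which is exactly the argument made above.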

I have worked with randomization tests for a number of years. At one time I wrote a set of programs in Visual Basic that ran such tests and printed out the results in a very neat and attractive way. But no one uses Visual Basic any more. Fortunately, you can do the same thing with other kinds of software, and, in particular, with R. Much of what follows is based on a set of R programs, but even if you do not want to play with R, you can learn a great deal just by reading the text. If you want to run such tests using SPSS, look at

Hayes, A. F. (1998). SPSS procedures for approximate randomization tests. Behavior Research Methods, Instruments, & Computers, 30(3), 536-543.

If you want to use SAS, look at

Chen, R. S., & Dunlap, W. P. (1993). SAS procedures for approximate randomization tests. Behavior Research Methods, Instruments, & Computers, 25(3), 406-409.

An Elaboration

Before discussing specific procedures, I need to say something, actually quite a lot, on the characteristics of randomization tests.

Randomization tests differ from parametric tests in almost every respect.

There is no requirement that we have random samples from one or more populations—in fact we usually have not sampled randomly.
Why do we worry about random samples with parametric tests? Because we use the mean and variance to estimate the population parameters, and we need random samples from some population in order to have legitimate estimators. Randomization tests don't estimate parameters.
Parametric tests need to assume normality so that our test statistic, such as t, follows a t distribution.

For resampling tests we rarely think in terms of the populations from which the data came, and there is no need to assume anything about normality or homoscedasticity.
Our null hypothesis has nothing to do with parameters, but is phrased rather vaguely, as, for example, the hypothesis that the treatment has no effect on how participants perform. That is why I earlier put "null hypothesis" in quotation marks.

This is an important distinction. The alternative hypothesis is simply that different treatments have an effect. But, note that we haven't specified whether the difference will reveal itself in terms of means, or variances, or some other statistic. We leave that up to the statistic we calculate in running the test.
The null hypothesis might be phrased a bit more precisely by saying that, under the null, the score that is associated with a participant is independent of the treatment that person received.

Because we are not concerned with populations, we are not concerned with estimating (or even testing) characteristics of those populations.

We do calculate some sort of test statistic, but we do not compare that statistic to tabled distributions.
Instead, we compare it to the results we obtain when we repeatedly randomize the data across the groups, and calculate the corresponding statistic for each randomization.

Even more than parametric tests, randomization tests emphasize the importance of random assignment of participants to treatments.

This is very important because we make statements of the form "If treatments had no effect, that particular score could just as easily have ended up in the second group instead of the first." You need random assignment to do that.
I need to hedge a bit here. If the groups are males and females, you obviously cannot randomly assign subjects to groups. But you need to assume that, conditional on gender, there are no other systematic differences in group assignment.

We aren't even sure what to call these tests. I refer to them as "randomization" tests, and that is probably the most common name. Others sometimes refer to them as "permutation" tests, but that is not strictly accurate, because we look at different "combinations" of scores, not different "permutations." I mention this because it may make your life easier if you are looking through an index.

Exchangeability

I need to say something about exchangeability. It applies to the null hypothesis sampling distribution--in other words, data are exchangeable under the null. Phil Good, for example, is a big fan of this term. He would argue that if the scores in one group have a higher variance than the other, then the data are not exchangeable and the test is not valid. BUT, if the hypothesis being tested is that treatments have no effect on scores, then under the null hypothesis why would one set of scores have a higher variance other than by chance? The problem is that we have to select the statistic to test with care. We normally test means, or their equivalent, but we also need to consider variances, for example, because that is another way in which the treatment groups could differ. If we are focussing on means, then we have to assume exchangeability including variance. But we need to be specific. So much for that little hobby horse of mine.

Cliff Lunneborg has written an excellent discussion of randomization tests. (Unfortunately, Cliff Lunneborg has died, and the paper is no longer available at his web site. By some good fortune, I happen to have copies of those pages. You can download the files as Paper One and Paper Two.) I consider these required reading to fully understand the underlying issues behind randomization tests. Lunneborg writes extremely well, but (and?) he chooses his words very carefully. Don't read this when you are too tired to do anything else--you have to be alert.

The Null Hypothesis

Philosophy of Resampling Procedures

Basic Approach

The basic approach to randomization tests is straightforward. I'll use the two independent group example, but any other example would do about as well.

Decide on a metric to measure the effect in question.

For this example I will use the t statistic, though several others are possible and equivalent, including the difference between the means or the mean of the first group. (Most discussions of this specific test would focus on the difference between means, but I will stick with the traditional Student's t test because that makes for a better parallel between randomization and parametric tests.)

Calculate that test statistic on the data (here denoted t_obt).
Repeat the following nreps times, where nreps is the number of desired replications and is usually a number greater than 1000:

Shuffle the data
Assign the first n₁ observations to the first condition, and the remaining n₂ observations to the second condition.
Calculate the test statistic (here denoted t_i*) for the reshuffled data.
If t_i* is greater than t_obt, increment a counter by 1.

I would normally use absolute values, because I want a two-tailed test.

Continue this procedure nreps times.

Divide the value in the counter by nreps, to get the proportion of times the t on the randomized data exceeded the t_obt on the data we actually obtained.
This is the probability of such an extreme result under the null.
Reject or retain the null on the basis of this probability.
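The steps above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in these pages; the data are made up, and the names student_t and randomization_t_test are hypothetical. It uses absolute values for a two-tailed test, and it counts ties as extreme (>= rather than a strict >), which is a common convention and an assumption on my part.

```python
import math
import random
import statistics


def student_t(x, y):
    """Pooled-variance (Student's) t for two independent groups."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * statistics.variance(x)
                  + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(x) - statistics.mean(y)) / std_error


def randomization_t_test(group1, group2, nreps=5000, seed=42):
    """Return (t_obt, proportion of shuffles with |t*| >= |t_obt|)."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    n1 = len(group1)
    pooled = list(group1) + list(group2)
    t_obt = student_t(group1, group2)       # statistic on the data as obtained
    counter = 0
    for _ in range(nreps):
        rng.shuffle(pooled)                 # shuffle the data
        # first n1 observations go to condition 1, the rest to condition 2
        t_star = student_t(pooled[:n1], pooled[n1:])
        if abs(t_star) >= abs(t_obt):       # two-tailed comparison
            counter += 1
    return t_obt, counter / nreps


# Made-up scores, for illustration only.
group1 = [27, 31, 35, 38, 40, 42, 45, 47]
group2 = [18, 20, 22, 24, 25, 27, 29, 30]

t_obt, p_value = randomization_t_test(group1, group2)
print("t_obt =", round(t_obt, 3))
print("proportion at least as extreme =", p_value)
```

Replacing student_t with a function that returns the difference between the means should lead to the same decision, because the two statistics order the shuffled data sets in the same way.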

This approach can be taken with any randomization test. We simply need to modify it to shuffle the appropriate values and calculate the appropriate test statistic. For example, with multiple conditions we will shuffle the data, assign the first n₁ cases to treatment 1, the next n₂ cases to treatment 2, and so on, calculate an F statistic on the data, consider whether or not to increment the counter, reshuffle the data, calculate F, and so on. In some cases, for example factorial analysis of variance designs, the hardest question to answer is "What should be shuffled?"
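For the multiple-condition case, a sketch along the same lines might look like this. Again it is illustrative Python with made-up data, not the R code from these pages, and the names f_statistic, split_by_sizes, and randomization_f_test are hypothetical. The F computed here is the ordinary one-way analysis of variance F (mean square between over mean square within).

```python
import random


def mean(values):
    return sum(values) / len(values)


def f_statistic(groups):
    """One-way ANOVA F: mean square between groups / mean square within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def split_by_sizes(values, sizes):
    """First n1 values to treatment 1, the next n2 to treatment 2, and so on."""
    groups, start = [], 0
    for n in sizes:
        groups.append(values[start:start + n])
        start += n
    return groups


def randomization_f_test(groups, nreps=5000, seed=7):
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    f_obt = f_statistic(groups)
    counter = 0
    for _ in range(nreps):
        rng.shuffle(pooled)
        if f_statistic(split_by_sizes(pooled, sizes)) >= f_obt:
            counter += 1
    return f_obt, counter / nreps


# Made-up scores for three treatments, for illustration only.
groups = [[12, 18, 14, 20, 13],
          [18, 21, 15, 22, 20],
          [25, 19, 27, 22, 28]]

f_obt, p_value = randomization_f_test(groups)
print("F_obt =", round(f_obt, 2))
print("proportion of shuffles with F >= F_obt =", p_value)
```

Because F cannot be negative, there is no need for absolute values here; large values of F are the only ones that count against the null.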

The Monte Carlo approach

In a previous section I suggested that the phrase "permutation test" implies that we take all possible permutations of the data (or at least all possible combinations). That is often quite impossible. Suppose that we have three groups with 20 observations per group. There are 60!/(20! × 20! × 20!) possible different combinations of those observations into three groups, and that means 5.7783 × 10²⁶ combinations, and even the fastest supercomputer is not up to drawing all of those samples. (That is more than the estimated number of stars in the universe.) I suppose that it could be done if we really had to do it, but it certainly wouldn't be worth waiting around all that time for the answer.
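The count quoted above is easy to check. The following short Python sketch (illustrative only; the variable names are my own) computes it exactly with integer arithmetic, and for comparison also counts the ways of dividing 16 scores into two groups of 8.

```python
import math

# Ways to divide 60 observations into three labeled groups of 20 each:
# 60! / (20! * 20! * 20!)
three_groups = math.factorial(60) // (math.factorial(20) ** 3)
print(f"three groups of 20: {three_groups:.4e}")

# For comparison, dividing 16 scores into two groups of 8 is a far smaller problem.
two_groups = math.comb(16, 8)
print("two groups of 8:", two_groups)
```

The second number is small enough to enumerate completely, which is the distinction the next paragraphs draw.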

The solution is that we take a random sample of all possible combinations. That random sample won't produce an exact answer, but it will be so close that it won't make any difference. (This is why these tests are sometimes referred to as "approximate" tests, though in reality they are no more "approximate" than standard parametric tests, which rely on all sorts of "iffy" approximations.) The results of 5,000 or 10,000 samples will certainly be close enough to the exact answer to satisfy any reasonable person. (With 10,000 samples the standard error of an estimated probability near .05 is about .002, so the difference will come in the third decimal place.) (There is even a sense in which that approach can be claimed to be exact, but we won't go there.)

When we draw random samples to estimate the result of drawing all possible samples, we often refer to that process as Monte Carlo sampling. I could use that term to distinguish between cases where the number of combinations is small enough to draw all of them, and cases where that is not practical. However, there is little to be gained by adding another term, and I will simply use the phrase "randomization tests" for both approaches.
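To see how close a random sample of combinations comes to the full set, here is a small Python sketch (illustrative only, with made-up data and hypothetical function names) for a case where complete enumeration is practical: two groups of 6, which gives 924 possible divisions. The statistic is the absolute difference between the group means.

```python
import itertools
import random


def mean(values):
    return sum(values) / len(values)


def exact_p_value(group1, group2):
    """Enumerate every division of the pooled scores into groups of the original sizes."""
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    observed = abs(mean(group1) - mean(group2))
    indices = range(len(pooled))
    count = total = 0
    for chosen in itertools.combinations(indices, n1):
        chosen_set = set(chosen)
        first = [pooled[i] for i in chosen]
        second = [pooled[i] for i in indices if i not in chosen_set]
        if abs(mean(first) - mean(second)) >= observed:
            count += 1
        total += 1
    return count / total


def monte_carlo_p_value(group1, group2, nreps=10_000, seed=3):
    """Estimate the same proportion from a random sample of divisions."""
    rng = random.Random(seed)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    observed = abs(mean(group1) - mean(group2))
    count = 0
    for _ in range(nreps):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n1]) - mean(pooled[n1:])) >= observed:
            count += 1
    return count / nreps


# Made-up scores, for illustration only.
group1 = [27, 31, 35, 38, 40, 42]
group2 = [18, 22, 25, 29, 30, 36]

exact_p = exact_p_value(group1, group2)
approx_p = monte_carlo_p_value(group1, group2)
print("all 924 divisions:", round(exact_p, 4))
print("10,000 random divisions:", round(approx_p, 4))
```

The two values should agree closely, and the random-sampling version is the only one that remains feasible when the number of combinations explodes.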

Computing With R

For the pages that follow I will base the calculations on a computing language called R. (Earlier in this page I gave references to discussions of using SPSS and SAS to carry out resampling computations.) The R programming language has become very popular in the last few years, and it is something that you can download for free and run without very much difficulty--well, sort of. For every example I will provide the complete code for running the analysis. The code that I provide will deliberately be "wordy," meaning that I will skip many shortcuts and let you see exactly what I am doing. Others have written packages that will do what I do more easily, but you won't learn much about R that way. So please don't complain about all of the separate steps and the comments that have been added. They are intended to make your learning easier. I have put together a set of pages related to writing code in R. They leave a great deal to be desired, but you can get to the first of them at Introducing R. The point of those pages is not really to teach you to be an R programmer, but to give you some idea of what R code looks like and to address a few very basic issues. You can learn quite a bit about writing R programs from those links. You will see many more pages of code before we're through.

Structure of these pages

I have broken these many pages down into three sections. The first deals with the pages that are basically designed to explain the logic and structure of resampling tests. I have labeled these "Background Material" for obvious reasons. I then move to what I have called "Randomization Tests." These are pages that deal with specific tests, such as comparing two group means. In R I have deliberately pulled coding steps apart, when I could have shortened the code by using more complex commands. I did this to make it easier for you to follow what I am doing. After the randomization tests I provide a section on bootstrapping. This is a bit shorter simply because topics on bootstrapping are more easily covered and are fewer.

Background Material

Randomization Tests

Bootstrapping Procedures

References

Edgington, E., & Onghena, P. (2007). Randomization tests. New York: Chapman & Hall.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Lunneborg, C. E. (2000). Random assignment of available cases: Let the inference fit the design. http://faculty.washington.edu/lunnebor/Australia/randomiz.pdf

David C. Howell
University of Vermont
David.Howell@uvm.edu