Randomization Tests

The preceding pages have dealt with bootstrapping estimates of parameters. When we speak of bootstrapping, we are generally speaking about techniques for estimating population parameters, such as the population mean or median, or the difference between two medians. Although we can use such estimates to test hypotheses, those who developed the bootstrapping procedures were concentrating on parameter estimation.

Randomization tests, on the other hand, are aimed primarily at standard hypothesis testing. In general, we draw many repeated samples under the condition that the null hypothesis is true, and then reject the null if we find that our obtained statistic is not like the statistics we find with a true null.

With randomization tests, we do not assume that the raw data we collected represent the actual shape of the parent population. We don't care what the parent populations looked like. In this sense, we are making even fewer assumptions. We aren't even concerned with whether we have sampled randomly from a population, and frequently we have not. Random sampling is not an issue with randomization tests. We simply take the sample data as given, and ask the question "Given these data, what are the possible ways that they could have come up if the null hypothesis were true?"

Suppose that we had two groups--a control condition with scores of 17, 21, and 23 on some dependent variable, and a treatment condition with scores of 22, 25, 25, and 26. If the treatment had no effect on scores, the first number that we sampled (17) could just as easily have been found in the second group as in the first. With 3 observations in the Control condition and 4 observations in the Treatment condition, if the null hypothesis is true, any 3 of those 7 observations could equally well have landed in the Control condition, with the remainder landing in the Treatment condition. The data are "exchangeable" between conditions. After enumerating all of the possible arrangements of the 7 observations into one group of 3 and another group of 4 (there are 35 of them), we calculate the relevant test statistic for each arrangement, and compare our obtained statistic to that sampling distribution. We then reject or retain the null.

There are those who argue that randomization tests are not even concerned with populations, because it doesn't make sense to talk about a population if you don't have random sampling. On the other hand, what I have said elsewhere about Efron and Tibshirani's statement of the null (in their section on permutation tests) is definitely in conflict with that. (See link to null hypothesis in the next section.)

There is an excellent discussion of randomization tests by Lunneborg that is available over the web at http://faculty.washington.edu/lunnebor/Australia/randomiz.pdf. I consider this required reading to understand the underlying issues behind randomization tests. Lunneborg writes extremely well, but (and?) he chooses his words very carefully. Don't read this when you are too tired to do anything else--you have to be alert.
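
The small control/treatment example above can be enumerated completely. What follows is a minimal Python sketch, not part of the original page. It assumes the difference between group means as the test statistic (the text leaves the choice open) and a two-tailed comparison that counts arrangements at least as extreme as the one observed. The names mean_diff and exact_randomization_p are invented for illustration.

```python
from itertools import combinations

control = [17, 21, 23]
treatment = [22, 25, 25, 26]

def mean_diff(group_a, group_b):
    """Difference between the two group means (an assumed test statistic)."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

def exact_randomization_p(group_a, group_b):
    """Enumerate every split of the pooled data.

    Returns (number of arrangements, two-tailed p value).
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(mean_diff(group_a, group_b))
    n_arrangements = 0
    n_extreme = 0
    # Choose by position, not by value, so the two 25s count as
    # distinct observations.
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        new_a = [pooled[i] for i in idx]
        new_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        n_arrangements += 1
        # Small tolerance so the observed arrangement itself is counted
        # even if floating-point rounding intervenes.
        if abs(mean_diff(new_a, new_b)) >= observed - 1e-12:
            n_extreme += 1
    return n_arrangements, n_extreme / n_arrangements

n_arrangements, p_value = exact_randomization_p(control, treatment)
print(n_arrangements, round(p_value, 4))
```

With only 7 observations the enumeration is instantaneous. The same loop becomes impractical very quickly as the groups grow.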

At this point it might be useful to skim the first part of the page on a randomization test between the means of two samples. If you took my advice and looked at the randomization test for the difference between two means, you saw that we had data from two conditions in which we recorded the amount of time that it took someone to leave their parking place after they had arrived at their car. In one condition another driver was waiting for the space, and in the other condition there was no one waiting. The question concerns whether the presence of someone who wants the parking space affects the time that it takes the parked driver to leave. Although what follows refers to randomization tests in general, I need to focus on an example, and the example I'm focusing on involves differences between two independent groups.

The Null Hypothesis

As I said when talking about bootstrapping, the issue of the null hypothesis is a bit murkier with nonparametric tests in general than it is with parametric tests. At the very least we replace specific terms like "mean" with looser generic terms like "location." And we generally substitute some vague statement, such as "having someone waiting will affect the time it takes to back out," for precise statements like "μ₁ = μ₂." This vagueness gains us some flexibility, but it also makes the interpretation of the test more difficult.

Elsewhere I have spoken about the null hypothesis in conjunction with the bootstrapping approach to a one-way analysis of variance. I put it there because that is where I was when I thought of it, and I can't quite think of a better place to put it. It is an important concept, and I suggest that you look at it.

Basic Approach

The basic approach to randomization tests is straightforward. I'll use the two independent group example, but any other example would do about as well.

Decide on a metric to measure the effect in question.
- For this example I will use the t statistic, though several others are possible.
Calculate that test statistic on the data (here denoted t_obt).
Repeat the following N times, where N is a number greater than 1000:
- Shuffle the data
- Assign the first n₁ observations to the first condition, and the remaining n₂ observations to the second condition.
- Calculate the test statistic (here denoted t₁*) for the reshuffled data.
- If t₁* is greater than t_obt, increment a counter by 1.
  - I would normally use absolute values, because I want a two-tailed test.
Divide the value in the counter by N, to get the proportion of times the t on the randomized data exceeded the t on the data we actually obtained.
This is the probability of such an extreme result under the null.
Reject or retain the null on the basis of this probability.
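
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the program used for these pages. It uses the pooled-variance t for two independent groups, takes absolute values for a two-tailed test, and borrows the small control/treatment scores from earlier on this page as data. The function names and the seed argument are hypothetical.

```python
import random
from statistics import mean, variance

def t_statistic(x, y):
    """Pooled-variance t for two independent groups.

    Assumes the pooled variance is not zero.
    """
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5

def randomization_test(group1, group2, n_shuffles=5000, seed=0):
    """Proportion of shuffles whose |t| exceeds the obtained |t|."""
    rng = random.Random(seed)           # seeded only so the run is repeatable
    t_obt = abs(t_statistic(group1, group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    counter = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)             # shuffle the data
        # First n1 observations go to condition 1, the rest to condition 2.
        t_star = abs(t_statistic(pooled[:n1], pooled[n1:]))
        if t_star > t_obt:              # strict "greater than", as in the steps
            counter += 1
    return counter / n_shuffles

control = [17, 21, 23]
treatment = [22, 25, 25, 26]
p = randomization_test(control, treatment)
print(p)
```

Because the comparison is strict, shuffles that merely reproduce the obtained arrangement do not increment the counter. Using "greater than or equal to" instead is a common alternative and gives a slightly larger probability.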

This approach can be taken with any randomization test. We simply need to modify it to shuffle the appropriate values and calculate the appropriate test statistic. For example, with multiple conditions we will shuffle the data, assign them to the appropriate number of conditions, calculate an F statistic on the data, consider whether or not to increment the counter, reshuffle the data, calculate F, and so on. In some cases, the hardest question to answer is "What should be shuffled?"

Terminology

The tests that I am discussing seem to go under a variety of names, which doesn't make things any easier. The general phrase "resampling tests" applies to any situation in which the test is based on resampling scores from some pool of data. Bootstrapping and randomization tests are both examples of resampling tests.

A name that has been around for some time is "permutation tests." It refers to the fact that with randomization tests we permute the data into all sorts of different orders, and then calculate our test statistic on each permutation. The only problem with this name, as I see it, is that we aren't really taking permutations--we are taking different combinations. Take an example of two groups with scores 3, 6, 7 and 5, 8, 9. We want to examine all possible ways of assigning 3 of those 6 values to group one, and the rest to group two. But we don't distinguish the case where group one had 3, 8, 9 from the case where it had 8, 9, 3. These are the same combination, and will give the same mean, median, variance, etc. So it is the different combinations, not permutations, that we care about. I could call them "combination tests," but I would be the only one who did, and I'd look pretty funny hanging out there all by myself.
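
To make the distinction concrete, here is a short Python sketch (my illustration, using only the standard library) that counts both quantities for the six scores above. The variable names are arbitrary.

```python
from itertools import permutations
from math import comb, factorial

scores = [3, 6, 7, 5, 8, 9]
group_one_size = 3

# Every ordering of the six scores.
n_orderings = factorial(len(scores))

# Every way of choosing which three scores form group one.
n_combinations = comb(len(scores), group_one_size)

# Walk through all orderings, keeping only the *set* of scores that lands
# in group one. Orderings such as (3, 8, 9, ...) and (8, 9, 3, ...) collapse
# into the same entry. (This works because all six scores are distinct.)
distinct_group_ones = {frozenset(p[:group_one_size]) for p in permutations(scores)}

print(n_orderings, n_combinations, len(distinct_group_ones))
```

The many orderings reduce to a much smaller number of distinct assignments, and only those distinct assignments can produce different values of the test statistic.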

"Permutation (or combination) tests" has another implication that we need to worry about. It implies, without stating it, that we take all possible permutations. That is often practically impossible, as we will see in a minute.

The phrase "randomization test" is a nice compromise, because it avoids the awkwardness of "permutation," and doesn't suggest anything about the number of samples. It is also very descriptive. We randomize (i.e., randomly order) our data, and then calculate our statistic on those randomized data. I will try to restrict myself to that label.

The Monte Carlo approach

In the previous section I suggested that the phrase "permutation test" implies that we take all possible permutations of the data (or at least all possible combinations). That is often quite impossible. Suppose that we have three groups with 20 observations per group. There are 60!/(20!*20!*20!) possible different combinations of those observations into three groups, which comes to about 5.7783 × 10²⁶ combinations, and even the fastest supercomputer is not up to drawing all of those samples. I suppose that it could be done if we really had to do it, but it certainly wouldn't be worth waiting around all that time for the answer.
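
The count above is easy to check with exact integer arithmetic. The sketch below is only an illustration; count_arrangements is a made-up helper name, and the rate of one billion arrangements per second is a purely hypothetical figure chosen to give a feel for the waiting time.

```python
from math import factorial

def count_arrangements(*group_sizes):
    """Number of ways to split sum(group_sizes) observations into
    groups of the given sizes: N! / (n1! * n2! * ... * nk!)."""
    total = factorial(sum(group_sizes))
    for size in group_sizes:
        total //= factorial(size)   # each division is exact
    return total

three_groups = count_arrangements(20, 20, 20)
print(f"{three_groups:.4e}")

# Hypothetical speed: one billion arrangements evaluated per second.
rate_per_second = 1e9
seconds_per_year = 60 * 60 * 24 * 365.25
years_needed = three_groups / rate_per_second / seconds_per_year
print(f"{years_needed:.2e} years")
```

The same helper gives 35 for the earlier example with groups of 3 and 4, which is why that case can be enumerated in full.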

The solution is that we take a random sample of all possible combinations. That random sample won't produce an exact answer, but it will be so close that it won't make any difference. The results of 5000 samples will certainly be close enough to the exact answer to satisfy any reasonable person. (The difference will come in the 3rd decimal place or beyond.) (There is even a sense in which that approach can be claimed to be exact, but we won't go there.)
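
For a data set small enough to enumerate, the exact answer and the sampled answer can be put side by side. The Python sketch below is an illustration only. It reuses the 3-versus-4 example from earlier on this page, assumes the absolute difference in means as the statistic, and counts arrangements at least as extreme as the observed one. The seed is arbitrary and the helper names are hypothetical.

```python
import random
from itertools import combinations

pooled = [17, 21, 23, 22, 25, 25, 26]   # first 3 = control, last 4 = treatment
n_control = 3
positions = range(len(pooled))

def abs_mean_diff(control_idx):
    """|mean(control) - mean(treatment)| when these positions form control."""
    control_sum = sum(pooled[i] for i in control_idx)
    treatment_sum = sum(pooled) - control_sum
    return abs(control_sum / n_control
               - treatment_sum / (len(pooled) - n_control))

observed = abs_mean_diff(range(n_control))

def is_extreme(control_idx):
    # Tolerance keeps the observed arrangement in the count despite rounding.
    return abs_mean_diff(control_idx) >= observed - 1e-12

# Exact answer: every possible choice of control positions.
all_splits = list(combinations(positions, n_control))
exact_p = sum(is_extreme(s) for s in all_splits) / len(all_splits)

# Sampled answer: 5000 randomly chosen arrangements.
rng = random.Random(2003)
n_samples = 5000
hits = sum(is_extreme(rng.sample(positions, n_control)) for _ in range(n_samples))
monte_carlo_p = hits / n_samples

print(exact_p, monte_carlo_p)
```

Rerunning with different seeds gives a sense of how much the sampled value wanders around the exact one for a given number of samples.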

When we draw random samples to estimate the result of drawing all possible samples, we often refer to that process as Monte Carlo sampling. I could use that term to distinguish between cases where the number of combinations is small enough to draw all of them, and cases where that is not practical. However, there is little to be gained by adding another term, and I will simply use the phrase "randomization tests" for both approaches.

References

Efron, B. & Tibshirani, R. J. (1993) An introduction to the bootstrap. New York: Chapman and Hall.

Lunneborg, C. E. (2000) Random assignment of available cases: Let the inference fit the design. http://faculty.washington.edu/lunnebor/Australia/randomiz.pdf

Last revised: 07/09/2003