The preceding pages have dealt with bootstrapping estimates of parameters. When we speak of bootstrapping we are generally speaking about techniques for estimating population parameters, such as the population mean or median, or the difference between two medians. Although we can use such estimates to test hypotheses, those who have developed the bootstrapping procedures were concentrating on parameter estimation.

Randomization tests, on the other hand, are aimed primarily at standard hypothesis testing. In general, we draw many repeated samples under the condition that the null hypothesis is true, and then reject the null if we find that our obtained statistic is not like the statistics we find with a true null.

With randomization tests, we do not even assume that the raw data we collected represent the actual shape of the parent population. We don't care what the parent populations looked like. In this sense, we are making even fewer assumptions. We aren't concerned with whether we have sampled randomly from a population, and frequently we have not. Random sampling is not an issue with randomization tests. We simply take the sample data as given, and ask the question "Given these data, what are the possible ways that they could have come up if the null hypothesis were true?"

Suppose that we had two groups: a control condition with scores of 17, 21, and 23 on some dependent variable, and a treatment condition with scores of 22, 25, 25, and 26. *If the treatment had no effect on scores*, the first number that we sampled (17) could just as easily have been found in the second group as in the first. With 3 observations in the Control condition and 4 observations in the Treatment condition, *and if the null hypothesis is true*, any 3 of those 7 observations could equally well have landed in the Control condition, with the remainder landing in the Treatment condition. The data are "exchangeable" between conditions. After calculating all of the possible arrangements of the 7 observations into one group of 3 and another group of 4 (there are 35 of them), we calculate the relevant test statistic for each arrangement, and compare our obtained statistic to that sampling distribution. We then reject or retain the null.

There are those who argue that randomization tests are not even concerned with populations, because it doesn't make sense to talk about a population if you don't have random sampling. On the other hand, what I have said elsewhere about Efron and Tibshirani's statement of the null (in their section on permutation tests) is definitely in conflict with that. (See link to null hypothesis in the next section.) There is an excellent discussion of randomization tests by Lunneborg that is available over the web at http://faculty.washington.edu/lunnebor/Australia/randomiz.pdf. I consider this required reading to understand the underlying issues behind randomization tests. Lunneborg writes extremely well, but (and?) he chooses his words very carefully. Don't read this when you are too tired to do anything else; you have to be alert.
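The three-versus-four example above is small enough to enumerate completely. What follows is only a minimal sketch in Python, not code from the original pages. It assumes the difference between group means as the test statistic and a two-tailed comparison on absolute values (both are choices made here for illustration), and the names `mean_difference`, `pooled`, and `p_two_tailed` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Scores from the example in the text.
control = [17, 21, 23]
treatment = [22, 25, 25, 26]
pooled = control + treatment


def mean_difference(group_a, group_b):
    """Mean of group_b minus mean of group_a, kept exact with Fraction."""
    return Fraction(sum(group_b), len(group_b)) - Fraction(sum(group_a), len(group_a))


# Assumption: the test statistic is the difference between means.
observed = mean_difference(control, treatment)

# Enumerate every way of choosing which 3 of the 7 scores land in Control;
# the remaining 4 go to Treatment. Indices are used so that the two 25s
# are treated as distinct observations.
differences = []
for control_idx in combinations(range(len(pooled)), len(control)):
    new_control = [pooled[i] for i in control_idx]
    new_treatment = [pooled[i] for i in range(len(pooled)) if i not in control_idx]
    differences.append(mean_difference(new_control, new_treatment))

n_arrangements = len(differences)
# Two-tailed: count arrangements at least as extreme as the one obtained.
n_extreme = sum(1 for d in differences if abs(d) >= abs(observed))
p_two_tailed = Fraction(n_extreme, n_arrangements)

print(n_arrangements, n_extreme, float(p_two_tailed))
```

Exact fractions are used so that ties with the obtained statistic are not lost to floating-point rounding. With only 35 arrangements, the smallest two-tailed probability this design can produce is fairly coarse, which is worth keeping in mind with very small samples.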

At this point it might be useful to skim the first part of the page on a randomization test between the means of two samples. If you took my advice and looked at the randomization test for the difference between two means, you saw that we had data from two conditions in which we recorded the amount of time that it took someone to leave their parking place after they had arrived at their car. In one condition another driver was waiting for the space, and in the other condition there was no one waiting. The question concerns whether the presence of someone who wants the parking space affects the time that it takes the parked driver to leave. Although what follows refers to randomization tests in general, I need to focus on an example, and the example I'm focusing on involves differences between two independent groups.

As I said when talking about bootstrapping, the issue of the null hypothesis
is a bit murkier with nonparametric tests in general than it is with parametric
tests. At the very least we replace specific terms like "mean" with loose
generic terms like "location." And we generally substitute some
vague statement, such as "having someone waiting will not affect the time it
takes to back out," for precise statements like "μ_{1}
= μ_{2}." This vagueness gains us some
flexibility, but it also makes the interpretation of the test more difficult.

Elsewhere I have spoken about the null hypothesis in conjunction with the bootstrapping approach to a one-way analysis of variance. I put it there because that is where I was when I thought of it, and I can't quite think of a better place to put it. It is an important concept, and I suggest that you look at it.

The basic approach to randomization tests is straightforward. I'll use the example of two independent groups, but any other example would do about as well.

- Decide on a metric to measure the effect in question.
  - For this example I will use the *t* statistic, though several others are possible.
- Calculate that test statistic on the data (here denoted *t*_{obt}).
- Repeat the following *N* times, where *N* is a number greater than 1000
  - Shuffle the data
  - Assign the first *n*_{1} observations to the first condition, and the remaining *n*_{2} observations to the second condition.
  - Calculate the test statistic (here denoted *t*_{1}*) for the reshuffled data.
  - If *t*_{1}* is greater than *t*_{obt} increment a counter by 1.
    - I would normally use absolute values, because I want a two-tailed test.
- Divide the value in the counter by *N*, to get the proportion of times the *t* on the randomized data exceeded the *t* on the data we actually obtained.
  - This is the probability of such an extreme result under the null.
- Reject or retain the null on the basis of this probability.
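The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than the code behind these pages: it uses the pooled-variance *t* for two independent groups, compares absolute values to get a two-tailed test, and counts ties with `>=` (the list above says "greater than"). The function names `t_statistic` and `randomization_test`, and the parameters `n_resamples` and `seed`, are hypothetical.

```python
import random
from statistics import mean, variance


def t_statistic(group1, group2):
    """Pooled-variance t for two independent groups (assumed statistic)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    # Note: this divides by zero if every score within each group is identical.
    return (mean(group1) - mean(group2)) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5


def randomization_test(group1, group2, n_resamples=5000, seed=None):
    """Return (t_obt, proportion of shuffles with |t| at least as large as |t_obt|)."""
    rng = random.Random(seed)
    t_obt = t_statistic(group1, group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    counter = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                                # shuffle the data
        t_star = t_statistic(pooled[:n1], pooled[n1:])     # first n1 vs remaining n2
        if abs(t_star) >= abs(t_obt):                      # two-tailed comparison
            counter += 1
    return t_obt, counter / n_resamples


# Small example using the control and treatment scores given earlier in the text.
control = [17, 21, 23]
treatment = [22, 25, 25, 26]
t_obt, p_value = randomization_test(control, treatment, n_resamples=5000, seed=1)
print(round(t_obt, 3), p_value)
```

Fixing `seed` only makes the sketch reproducible; in practice a different seed will give a slightly different proportion, which is the Monte Carlo variability discussed later on this page.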

This approach can be taken with any randomization test. We simply need to
modify it to shuffle the appropriate values and calculate the appropriate test
statistic. For example, with multiple conditions we will shuffle the data,
assign them to the appropriate number of conditions, calculate an *F*
statistic on the data, consider whether or not to increment the counter,
reshuffle the data, calculate *F*, and so on. In some cases, the hardest
question to answer is "What should be shuffled?"
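For the multiple-condition case just described, only the assignment step and the statistic change. The following is a hedged sketch, assuming a one-way design with the usual *F* ratio of between-group to within-group mean squares; the scores in `groups` are made up purely for illustration, and `f_statistic` and `randomization_f_test` are hypothetical names.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way F: between-groups mean square over within-groups mean square."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def randomization_f_test(groups, n_resamples=5000, seed=None):
    """Shuffle all scores, deal them back into groups of the original sizes,
    and count how often the reshuffled F is at least as large as the obtained F."""
    rng = random.Random(seed)
    f_obt = f_statistic(groups)
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    counter = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        reshuffled, start = [], 0
        for size in sizes:
            reshuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(reshuffled) >= f_obt:
            counter += 1
    return f_obt, counter / n_resamples


# Hypothetical scores for three conditions (not data from the text).
groups = [[10, 12, 11, 14], [15, 17, 16, 18], [22, 21, 24, 23]]
f_obt, p_value = randomization_f_test(groups, n_resamples=2000, seed=2)
print(round(f_obt, 2), p_value)
```

Because *F* is already one-directional (large values count against the null), no absolute value is needed here, unlike the two-group *t* case.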

The tests that I am discussing seem to go under a variety of names, which doesn't make things any easier. The general phrase "resampling tests" applies to any situation in which the test is based on resampling scores from some pool of data. Bootstrapping and randomization tests are both examples of resampling tests.

A name that has been around for some time is "permutation tests."
It refers to the fact that with randomization tests we permute the data into all
sorts of different orders, and then calculate our test statistic on each
permutation. The only problem with this name, as I see it, is that we aren't
really taking permutations--we are taking different *combinations*. Take an
example of two groups with scores 3, 6, 7 and 5, 8, 9. We want to examine all
possible ways of assigning 3 of those six values to group one, and the rest to
group two. But we don't distinguish the case where group one had 3, 8, 9 from
the case where it had 8, 9, 3. These are the same *combination*, and will
give the same mean, median, variance, etc. So it is the different combinations,
not permutations, that we care about. I could call them "combination
tests," but I would be the only one who did, and I'd look pretty funny
hanging out there all by myself.
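The difference between counting permutations and counting combinations is easy to check for the six scores in that example. This is just a small illustrative sketch using Python's standard `itertools`; the variable names are hypothetical.

```python
from itertools import combinations, permutations

# The six scores from the two groups in the example.
scores = [3, 6, 7, 5, 8, 9]

# Every ordering of the six scores.
n_orderings = sum(1 for _ in permutations(scores))

# Every distinct set of three scores that could make up group one.
group_one_choices = list(combinations(scores, 3))
n_combinations = len(group_one_choices)

# (3, 8, 9) and (8, 9, 3) are different orderings of the same combination.
same_combination = sorted((3, 8, 9)) == sorted((8, 9, 3))

# Each combination is shared by 3! * 3! orderings (within-group order is irrelevant).
orderings_per_combination = n_orderings // n_combinations

print(n_orderings, n_combinations, orderings_per_combination, same_combination)
```

So 720 orderings collapse to 20 distinct assignments, and only those 20 can give different values of a mean, median, or variance.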

"Permutation (or combination) tests" has another implication that we need to worry about. It implies, without stating it, that we take all possible permutations. That is often practically impossible, as we will see in a minute.

The phrase "randomization test" is a nice compromise, because it avoids the awkwardness of "permutation," and doesn't suggest anything about the number of samples. It is also very descriptive. We randomize (i.e., randomly order) our data, and then calculate our statistic on those randomized data. I will try to restrict myself to that label.

In the previous section I suggested that the phrase "permutation
test" implies that we take all possible permutations of the data (or at
least all possible combinations). That is often quite impossible. Suppose that
we have three groups with 20 observations per group. There are 60!/(20! × 20! × 20!)
possible different combinations of those observations into three groups, and
that comes to about 5.7783 × 10^{26} combinations, and even the fastest
supercomputer is not up to drawing all of those samples. I suppose that it could
be done if we really had to do it, but it certainly wouldn't be worth waiting
around all that time for the answer.
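That count can be verified with exact integer arithmetic. In the sketch below, the rate of one billion arrangements per second is purely a hypothetical figure chosen to give a sense of scale; it is not a claim about any particular machine.

```python
from math import factorial

# Number of ways to split 60 observations into three groups of 20.
n_arrangements = factorial(60) // (factorial(20) ** 3)
print(f"{n_arrangements:.4e}")

# Assumption for illustration only: a machine evaluating 1e9 arrangements per second.
ARRANGEMENTS_PER_SECOND = 1e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600
years_needed = n_arrangements / ARRANGEMENTS_PER_SECOND / SECONDS_PER_YEAR
print(f"{years_needed:.1e} years")
```

Even at that assumed rate, complete enumeration would take on the order of ten billion years, which is why sampling from the arrangements is the practical route.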

The solution is that we take a random sample of all possible combinations.
That random sample won't produce an *exact* answer, but it will be so close
that it won't make any difference. The results of 5000 samples will certainly be
close enough to the exact answer to satisfy any reasonable person. (The
difference will come in the 3rd decimal place or beyond.) (There is even a sense
in which that approach can be claimed to be exact, but we won't go there.)
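One rough way to see why a few thousand samples are enough is to treat each random arrangement as an independent trial, so that the estimated proportion has the usual binomial standard error, sqrt(p(1 - p)/N). That framing is a standard result brought in here, not something stated on this page, and the value p = .05 and the function name `monte_carlo_se` are assumptions for illustration.

```python
from math import sqrt


def monte_carlo_se(p, n_resamples):
    """Binomial standard error of a proportion estimated from n_resamples shuffles."""
    return sqrt(p * (1 - p) / n_resamples)


# Assumed true probability of .05, with several numbers of random arrangements.
for n in (1000, 5000, 100000):
    print(n, round(monte_carlo_se(0.05, n), 4))
```

With 5000 samples the standard error is about .003, so the sampling error shows up around the third decimal place, in line with the statement above.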

When we draw random samples to estimate the result of drawing all possible samples, we often refer to that process as Monte Carlo sampling. I could use that term to distinguish between cases where the number of combinations is small enough to draw all of them, and cases where that is not practical. However, there is little to be gained by adding another term, and I will simply use the phrase "randomization tests" for both approaches.

Efron, B., & Tibshirani, R. J. (1993). *An introduction to the bootstrap*. New York: Chapman and Hall.

Lunneborg, C. E. (2000). Random assignment of available cases: Let the inference fit the design. http://faculty.washington.edu/lunnebor/Australia/randomiz.pdf

Last revised: 07/09/2003