Resampling Statistics:

Randomization and the Bootstrap

This is the second set of web pages that I have built on resampling statistics. The first (Version I) was based on a Visual Basic program that I wrote quite a few years ago. This set (Version II) is based on the R programming environment, which is playing an increasingly important role in statistical analysis. I use a good bit of the material from the first set of pages, adapting it as needed.

When these pages are finished, all of the example programs will be available for download, along with sample data. The download is a zipped file, which can be decompressed with WinZip or any other file compression package. The necessary structure of the data files is described in the source code.

I should begin by addressing the different names that are given to these tests. My title calls them "resampling statistics." This comes from the underlying idea that we draw repeated samples from some set of scores and base our conclusions on the results of that resampling. Randomization tests, which we will discuss first, take the set of scores, randomize their ordering, and compute statistics from the results. Permutation tests do the same thing, but I reserve that label for tests in which we take all possible permutations of the data, rather than a subset of rearrangements. Bootstrapping resamples with replacement from a set of data and computes a statistic (such as the mean or median) on each resampled set. Bootstrapping is used primarily for parameter estimation, as we will see.

Theoretical Underpinnings of Resampling Tests

The theory of resampling tests is actually quite simple. It boils down to a statement such as "If there is no difference between two treatments, a particular score is just as likely to end up in one group as in the other." From that statement we can draw a large number of random samples from the combined data, randomly assigning half of the scores to group 1 and the other half to group 2, calculate some test statistic, such as a mean difference, and ask whether the mean difference we actually obtained looks like it came from the distribution of mean differences we obtain when there is no difference between the groups. If, for example, 95% of the mean differences when the groups are treated as equal fall between 115 and 125, and we have a mean difference in our experiment of 98, that certainly doesn't look like something we would find if the groups were comparable. We will conclude that the idea of no group differences is wrong, and our treatments did make a difference.
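The example programs on these pages are written in R, but the shuffle-and-compare logic just described can be sketched in a few lines of Python. The scores below are invented purely for illustration:

```python
import random

# Hypothetical scores for two groups (invented data, for illustration only)
group1 = [12, 15, 14, 18, 20, 16, 19, 13]
group2 = [22, 25, 21, 27, 24, 26, 23, 28]

# The test statistic: the observed difference in group means
obs_diff = sum(group1) / len(group1) - sum(group2) / len(group2)

combined = group1 + group2
n1 = len(group1)
n_resamples = 10000
count_extreme = 0
random.seed(42)

for _ in range(n_resamples):
    random.shuffle(combined)        # reorder all of the scores at random
    new1 = combined[:n1]            # first n1 shuffled scores become "group 1"
    new2 = combined[n1:]            # the rest become "group 2"
    diff = sum(new1) / len(new1) - sum(new2) / len(new2)
    if abs(diff) >= abs(obs_diff):  # at least as extreme as what we observed
        count_extreme += 1

# Proportion of shuffles at least as extreme as the observed difference
p_value = count_extreme / n_resamples
print(p_value)
```

Because the two invented groups barely overlap, almost none of the shuffled differences are as extreme as the observed one, and the resulting p value is tiny; we would reject the idea that group membership made no difference.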

That last paragraph would be written somewhat differently if we were comparing multiple groups, comparing two repeated measures on the same people, computing a correlation, or a number of other tests that we could run, but the underlying logic is just as I have put it above. We merely have to adjust for the kinds of experimental data we have.

The idea behind randomization tests is actually quite an old one in statistics. As Lunneborg (2005) pointed out, Fisher discussed the one-sample test in his very influential 1935 text, Design of Experiments. Indeed, Fisher wrote that although "the statistician does not carry out this very tedious process, his conclusions have no justification beyond the fact they could have been arrived at by this very elementary process." Pitman (1937) expanded the idea to cover two-sample, multisample, and correlational tests. Both recognized the practical limits at that time of generating all possible samples from the combined data, and viewed Student's t distribution as giving an approximate answer to the p value. (As Edgington and Onghena (2007) pointed out, Fisher still thought of the tests as being applied in the situation where there was random sampling from some population(s) and that the test was testing the identity of population parameters. Pitman was the first to show that randomization tests did not require random sampling from populations, and did not test a null hypothesis about population parameters.)

In other words, Fisher thought of the t distribution as an approximation to the true value of p given by the randomization test. Pitman thought of the test as referring to the data at hand, judging whether such data were likely to arise if the treatments had no effect (or similar conclusions for other forms of testing). Had Fisher and Pitman had the modern computing resources that we now have, tests of hypotheses might well have gone in an entirely different direction.

Resampling procedures fall into a number of different categories, but the discussion here will be limited to Randomization and Bootstrap procedures. Bootstrap procedures take the combined samples as a representation of the population from which the data came, and create 1000 or more bootstrapped samples by drawing, with replacement, from that pseudo-population. Randomization procedures also start with the original data, but, instead of drawing samples with replacement, these procedures systematically or randomly reorder (shuffle) the data 1000 or more times, and calculate the appropriate test statistic on each reordering. Since shuffling data amounts to sampling without replacement, the issue of replacement is one distinction between the two approaches. There are other distinctions, including the fundamental purpose.

Aside from the issue of sampling with replacement, the two approaches differ in a very fundamental way. Bootstrapping is primarily focused on estimating population parameters, and it attempts to draw inferences about the population(s) from which the data came. Randomization approaches, on the other hand, are not particularly concerned about populations and/or their parameters. Instead, randomization procedures focus on the underlying mechanism that led to the data being distributed between groups in the way that they are. Consider, for example, two groups of participants who have been randomly assigned to viewing a stimulus either monocularly or binocularly, and estimating its distance. The bootstrap approach would focus primarily on estimating the population difference in distance perception between the two conditions, and its standard error, and would probably result in a confidence interval on the mean or median difference in estimated distance. A randomization test, on the other hand, would ask if it is likely that we would obtain a difference as large as the one we obtained if the monocular/binocular condition had no effect on the apparent distance. Notice that the randomization approach is not concerned with what the estimated distances (or differences in mean distance) were, nor is it even particularly concerned about population parameters. The bootstrap approach, on the other hand, is primarily concerned with parameter estimation. It turns out that these differences have very important implications.
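To make the bootstrap side of that contrast concrete, here is a Python sketch of a simple percentile bootstrap confidence interval for the monocular/binocular example. The distance estimates are invented for illustration, and the percentile method shown is only one of several ways to form a bootstrap interval:

```python
import random
import statistics

# Hypothetical distance estimates (invented data, for illustration only)
monocular = [4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.5, 5.0]
binocular = [3.1, 3.6, 3.4, 3.9, 3.2, 3.8, 3.5, 3.3]

random.seed(1)
n_boot = 5000
boot_diffs = []
for _ in range(n_boot):
    # Resample each group WITH replacement, keeping the original group sizes
    m = random.choices(monocular, k=len(monocular))
    b = random.choices(binocular, k=len(binocular))
    boot_diffs.append(statistics.mean(m) - statistics.mean(b))

# Percentile method: take the 2.5th and 97.5th percentiles of the
# bootstrapped mean differences as a 95% confidence interval
boot_diffs.sort()
lower = boot_diffs[int(0.025 * n_boot)]
upper = boot_diffs[int(0.975 * n_boot)]
print(lower, upper)
```

Note how the focus here is entirely on estimating the size of the difference, not on asking whether a difference this large could have arisen by chance; that is exactly the division of labor described above.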

I will begin with several pages on randomization tests, discussing the underlying logic and the various tests at our disposal. This discussion begins with Randomization procedures. After discussing those tests at some length, I will move to bootstrapping with Bootstrapping procedures. For most of these tests, others have written more efficient R functions than the ones that I give. That is deliberate. The code that I give is intended to illustrate the underlying logic and procedures, whereas using someone else's function would give you the answer without showing you where that answer came from. Once you understand the underlying logic, by all means search out more efficient ways to do things.

Available Software

In the past, a major difficulty with both bootstrapping and randomization procedures concerned the availability of computer software. Although the techniques are conceptually very simple, you must have computer software to do the resampling, and that software is not routinely available in programs like SPSS, SAS, and Minitab. One of the first people to write about such tests and to supply the software was Edgington (1980). A number of other programs have been written, and one of my favorites was Resampling Stats by Simon and Bruce. I still like it, but it is no longer free, and I am cheap. (In addition, current versions operate through Excel, and I have a strange aversion to that approach, although it is a perfectly legitimate one.) Of course my favorites are the Visual Basic program that I wrote (it draws really nice graphics) and the R programs associated with these web pages. Other procedures are available within R, but I like my own.

Bryan Manly, author of Randomization, Bootstrap, and Monte Carlo Methods in Biology (2nd edition, 1997, London: Chapman & Hall), has written a program called RT (for randomization testing). An examination copy can be downloaded from http://www.west-, along with the Fortran code that lies behind it. Lunneborg (2000) has an excellent text that illustrates many of the computations using Resampling Stats, S-Plus, or SC. S-Plus is very similar to R, and R is free. If these are not enough choices, just go to Google and you will find more.


I am running out of time to incorporate all of the necessary references to good sources for resampling statistics. A surprisingly good list can be found at the Wikipedia site under "resampling." The article itself is rather brief, but the list of references is excellent. The books by Lunneborg, Efron and Tibshirani, Manly, Edgington, and Sprent are all very good. Lunneborg is more concerned with bootstrapping than randomization, but you will get a lot from it if you read carefully, with a pencil in hand. Edgington has been writing on randomization tests for many years and is a very readable source. He provides some Fortran programs that you can translate into other languages, but they are so tightly written to improve efficiency that they can drive you crazy when you try to figure out what they are doing. (That's not his fault -- that's what he needed to do in the days before gigahertz computers.) Good's two books on this topic are good, though he occasionally skips over the details that you were looking for.

For bootstrapping, a nice introduction is a Sage Publications 1993 monograph by Mooney and Duval. It is a good place to start. Another good place to start is to go to the Resampling Stats website and look at some of their references.

David C. Howell
University of Vermont