Alternative Tests for Contingency Tables Using Randomization Tests
David C. Howell

This page was designed to go with an entry that I wrote for
the International Encyclopedia of Statistical Science, Lovric
(2011). In that paper I discussed four different designs
under which one could derive a contingency table, and pointed out
that there are randomization tests that can replace
Pearson's chi-square for at least three of them. (I
haven't figured out a sampling scheme for the fourth.) The
main purpose of this page is to present those randomization tests
as functions (or programs) in R, a freely available and very popular software package.
R can be downloaded from the R Project at http://www.r-project.org/.
Pearson's chi-square was originally developed in papers
published in 1900 and 1904. It has had a somewhat
controversial past, beginning with the fact that Pearson got the
degrees of freedom wrong and was most unhappy when Fisher pointed
that out. (He must have been especially annoyed because Fisher used data from Pearson's own son, Egon.) Essentially, Pearson, with Fisher's correction, showed that the statistic
is distributed approximately as the chi-square
distribution. For an r × c table it is distributed on
(r-1)(c-1) df. And when the sample sizes are reasonably
large, with an expected value of at least 5 in each cell, the
approximation is quite good. Certainly good enough for our
purposes.
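For reference, the statistic in question is computed over all cells of the table as

χ² = Σ (O - E)² / E

where O is the observed frequency in a cell and E is the expected frequency under independence (row total × column total / N).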
The problem arises when the expected frequency in one or more
cells is too small. When that happens, the chi-square statistic
can take on only a limited number of different values, and hence
its distribution cannot be well approximated by the chi-square
distribution. For example, when we have a 2 × 2 table with marginal
frequencies of 4 in each row and column, there are only 3
possible values of chi-square (0, 2, and 8). Certainly we cannot
expect a continuous distribution to be a good fit under such
extreme conditions.
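A quick way to verify that claim is to enumerate the five tables that preserve those marginals. This is just a sketch for checking the arithmetic; chisq.test() complains about the small expected frequencies, so the warnings are suppressed.

# With every marginal total fixed at 4, the table is completely determined
# by the count in cell 1,1, which can only be 0, 1, 2, 3, or 4.
for (a in 0:4) {
   tab <- matrix(c(a, 4 - a, 4 - a, a), nrow = 2)
   x2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
   cat("cell 1,1 =", a, "  chi-square =", x2, "\n")
}
# The statistic takes only the values 8, 2, 0, 2, 8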
In an attempt to adjust for this problem, Yates (1934) proposed a correction
that simply involved reducing the absolute value of each
numerator by 0.5 before squaring. This correction actually worked
reasonably well, but it was still an approximation. Yates'
correction has been falling out of favor for many years, but most
software still produces it, sometimes without notice. In many
situations its place has been taken by Fisher's Exact Test
(Fisher, 1935), sometimes known as the Fisher-Irwin
Test. This test does not depend on the chi-square distribution
and is, under certain conditions, exact.
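As an illustration of the correction, R's chisq.test() applies Yates' correction to 2 × 2 tables by default; here is what it does to the tea-tasting data that appear below. (A small sketch; the values follow from the formula above.)

Bristol <- matrix(c(3, 1, 1, 3), nrow = 2)
chisq.test(Bristol, correct = FALSE)   # uncorrected: X-squared = 2.0
chisq.test(Bristol)                    # Yates' correction (the default): X-squared = 0.5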
Alternative Data Collection Designs
All Marginals Fixed
Fisher's tea-tasting experiment is a
classic example of a data collection procedure in which the
marginal totals are fixed. Muriel Bristol, a scientist at the
Rothamsted Experimental Station, claimed that she could tell the
difference between a cup of tea in which the milk had been poured
before the tea and one in which the tea had been poured before
the milk. Fisher devised an experiment in which he made 8 cups of
tea. Four of them had the milk poured first and 4 had the tea
poured first. This fixes the row marginals at 4, because every
replication of this experiment would have row totals of 4. Ms
Bristol was asked to taste each cup and assign 4 cups to each
order of pouring. This also fixes the column totals, because they
will always be 4 and 4.
As an aside, let me say something about Muriel Bristol, because
others have suggested that this story was just made up. According to
Fisher's daughter, Joan Box, it is a true story. By the way, Muriel was no slouch.
She was a Ph.D. scientist, back in the days when women were rarely Ph.D. scientists,
and in 1919 she was working at the Rothamsted Experimental Station. This was the place that
Fisher was later to make famous. I think you should know what Muriel looked like, so
here she is, taken without permission from the Web.
[Photograph of Muriel Bristol]
If you apply the standard (two-sided) chi-square test to these data you will have:
### Traditional chi-square test
Bristol <- matrix(c(3,1,1,3), nrow = 2,
dimnames = list(Row = c("1","2"), Col = c("1","2")))
cat("Pearson's (two-sided) chi-square test without Yates' correction","\n")
chisq <- chisq.test(Bristol, correct = FALSE)
print(chisq)
#_________________________________________________________________________
Pearson's Chi-squared test
data: Bristol
X-squared = 2, df = 1, p-value = 0.1573
# =============================================================
Instead of applying Pearson's chi-square
test to these data, Fisher employed the hypergeometric
distribution. Essentially the test computes the probability of 0
to 4 cups in cell 1,1 when all of the marginals are 4.
This is an exact probability given fixed marginals, and Fisher
proposed it as the optimal solution to the problem. The
probabilities of each possible outcome are shown in the table
below.
Observed # in Cell 1,1     Probability
         0                    .014
         1                    .229
         2                    .514
         3                    .229
         4                    .014
Ideally, Muriel should have placed 4 cups in cell 1,1
and 4 cups in cell 2,2. The probability that she would
do that well by chance is only .014, and we would likely conclude
that she had an unusual ability with respect to tea tasting. But, unfortunately, she had cell
totals of 3/1/1/3, which would have a cumulative probability of
.014 + .229 = .243, which would not lead to rejection of the null
hypothesis. (Others have claimed that she guessed correctly on all trials.)
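Those probabilities come straight from the hypergeometric distribution, and you can reproduce the table above with R's dhyper() function. A quick check:

# P(0 to 4 milk-first cups correctly identified) when 4 cups are chosen
# from m = 4 milk-first and n = 4 tea-first cups
probs <- dhyper(0:4, m = 4, n = 4, k = 4)
round(probs, 3)    # .014 .229 .514 .229 .014
sum(probs[4:5])    # P(3 or 4 correct) = .243, the one-tailed p value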
In this particular example, and in all of the other
examples that my limited imagination can think of, we will only
get excited if the judge does very well. We won't
particularly care if she gets it all wrong. So what we really
have here is a one-tailed test. We sum the probabilities of 4 and
3 correct responses, but not those of the less successful outcomes. A two-tailed test does not make sense here.
I should point out that although this is a real example, which is partly why I used it, it is not a very good example from the point of view of understanding tea tasting. With only 4 trials, only one outcome (all correct) could lead to a statistically significant result. Fisher should have made up many more cups of tea to give Muriel a fair shot.
A function already exists in R that will calculate
Fisher's exact probability. It is named
"fisher.test" and is in the "stats" package,
which is distributed with base R and does not need to be
installed separately. The code to compute Fisher's Exact Test
follows.
Bristol <- matrix(c(3,1,1,3), nrow = 2,
dimnames = list(Row = c("1","2"), Col = c("1","2")))
cat("Fisher's exact test--one-sided","\n")
ft <- fisher.test(Bristol, alternative = "greater", conf.int = TRUE)
#For a two-tailed test change to alternative = "two.sided"
#two.sided and conf.int apply only to 2x2
print(ft)
# ________________________________________________________________
Fisher's Exact Test for Count Data
data: Bristol
p-value = 0.2429
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
0.3135693 Inf
sample estimates:
odds ratio
6.408309
# =================================================================
In this case we
are only concerned with whether Ms Bristol scored better than chance,
so I have specified a one-tailed test by adding 'alternative =
"greater"'. The results follow the code, and, again, they are not statistically significant.
One Set of Marginals Fixed
Fisher's Exact Test is an exact test only if both sets of
marginals are fixed. In that case the reference distribution
consists of probability values associated with all possible
arrangements of data preserving those marginal totals. But what
if only the row (or column) marginals are fixed? An example of
such a situation is taken from my encyclopedia entry referred to above.
In 2000 the Vermont legislature approved a bill authorizing
civil unions. The results of that vote, broken down by the gender
of the legislator, are shown below.
                 Vote
          For   Against   Total
Women      35        9       44
Men        60       41      101
Total      95       50      145
This was an important vote, and all legislators were there. So
if the vote were to be repeated over and over again, there would
always be 44 women and 101 men. In other words, the row totals are
fixed. But people can change their minds, so the column totals would not be known in advance;
they are random. If we apply the standard Pearson chi-square test to
these data to test the hypothesis that there is a gender difference
in the voting pattern, we have χ² = 5.50 on 1 df, with an
associated probability of .019.
From a slightly different perspective, this test reduces to
a test on the binomial distribution: it is simply a test that the
proportion of "For" votes is the same in the two rows. You can solve this
problem in R by using the prop.test() function, as shown below. You will obtain the same result from prop.test() and chisq.test() so long as you either apply, or do not apply, Yates' correction in both tests.
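For instance, treating the "For" votes as successes out of each row total (a quick check; the matrix is the same one analyzed later in this section):

civilUnion <- matrix(c(35, 9, 60, 41), nrow = 2, byrow = TRUE)
chisq.test(civilUnion, correct = FALSE)    # X-squared = 5.50, df = 1, p = .019
prop.test(x = c(35, 60), n = c(44, 101),
          correct = FALSE)                 # same X-squared and p value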
If we wish to create a randomization test for this design, the
appropriate reference distribution would be the probabilities
associated with all possible outcomes having those specific row
totals. This is no longer the hypergeometric distribution, because
the column totals are no longer fixed. The R code for such a
sampling design is shown below. The first thing
that we do is to calculate the obtained chi-square for the data
(here, 5.50). To create the random samples, we first calculate the
marginal column probabilities, which for this example are .655 and
.345. We then draw 44 uniformly distributed random numbers between
0 and 1 for the first row. If a number
is less than .655, that case is assigned to the "For"
category; otherwise it is assigned to the "Against"
category. This process is repeated for the 101 cases in row 2. We
then calculate a chi-square statistic on the resulting table.
(Several other measures are possible, such as the frequency of
cell 1,1, but chi-square is a good choice because it
does not depend on the dimensionality of the table. It is important
to point out that although we are calculating a chi-square statistic, we are using it as a measure of deviation from expected, NOT as a standard chi-square test to be compared to the chi-square distribution.) This process
is repeated a large number of times (here 10,000), and the
chi-square value for each random table is recorded. These are the chi-square values expected under a true null hypothesis. Finally,
the "exact" probability is computed as the proportion
of the 10,000 tables whose χ² values equal or exceed the obtained χ² for the original data table.
The probability in this case, and in the next, is not truly
"exact" because we generate a random set of tables
rather than the full set of all possible tables with those row
totals. However, with 10,000 samples the probability will be very
close to exact. And because we have large cell frequencies in our
example, it should be close to the probability given by
Pearson's chi-square.
The R code for this analysis is shown below. Notice that I have included several different examples from which you can draw, most of which can be found in my book.
# Chi-square with only one set of marginal totals fixed.
# Alternative data sets follow. You must set the working directory and specify which data set becomes "data."
# These data sets vary in terms of which variable you can assume to be fixed. Sometimes none seem reasonable.
### Remove the "###" before both lines for one set of data.
### Visintainer <- read.table("Visintainer.dat", nrows = 2, header = TRUE) #2 x 3 example
### data <- Visintainer
### Jankowski <- matrix(c(512, 54, 227, 37, 59, 15, 18, 12), nrow = 4, byrow = TRUE)
### data <- Jankowski
### UnahBoger <- matrix(c(33, 251, 33, 508), nrow = 2, byrow = TRUE)
### data <- UnahBoger
### Hout <- matrix(c(7, 7, 2, 3, 2, 8, 3, 7, 1, 5, 4, 9, 2, 8, 9, 14), byrow = TRUE, nrow = 4)
### data <- Hout
civilUnion <- matrix(c(35,9,60,41), nrow = 2, byrow = TRUE) #Civil Union example from Ex. 6.11
data <- civilUnion
result <- chisq.test(data, correct = FALSE)
obtchisq <- result$statistic
dfs <- result$parameter
p <- result$p.value
N <- sum(data)
cat("The original data","\n")
print(data)
cat("\n")
print(result)
rowmarg <- rowSums(data)
nrow <- length(rowmarg)
colmarg <- colSums(data)
ncol <- length(colmarg)
colprob <- colmarg/N
nreps <- 10000
chisquare <- numeric(nreps)
# Get cumulative proportions over columns
# We might find that 60% of the obs are in col1, 25% in col2, and 15% in col3
# Then draw as many numbers as the row marginal, and assign to col based on
# those probabilities. That does not mean that we will always have 60, 25, and
# 15%, because assignment depends on whether the random number exceeds some value.
cum <- numeric()
cum[1] <- colprob[1]
for (j in 2:ncol) {cum[j] <- cum[j-1] + colprob[j]}
# Now resample nreps times
count <- numeric(ncol)
countx <- numeric(ncol)
for (i in 1:nreps) {
   randomtable <- c()
   for (j in 1:nrow) {
      randnum <- runif(rowmarg[j], 0, 1)
      # Count how many of this row's cases fall at or below each cumulative cut
      for (m in 1:ncol) {count[m] <- length(randnum[randnum <= cum[m]])}
      # Difference the cumulative counts to get the count in each column
      countx[1] <- count[1]
      for (k in 2:ncol) {countx[k] <- count[k] - count[k-1]}
      randomtable <- rbind(randomtable, countx)
   }
   chisquare[i] <- chisq.test(randomtable, correct = FALSE)$statistic
}
print( "Ignore warnings if present!")
signif <- length(chisquare[chisquare >= obtchisq]) /nreps
cat("The corresponding p value is ", signif)
### Highlight the above code, run it repeatedly, and compare to standard chi-square
###
par(mfrow=c(2,2))
hist(chisquare, breaks = 50, main = "Distribution of chi-square under null")
library(car)
qqPlot(chisquare, dist = "chisq", df = dfs, id = FALSE, grid = TRUE)
cat("Pearson's chi-square statistic is ",obtchisq,"\n")
cat("That chi-square has a p value of ",p, "\n")
cat("\n")
cat("The proportion of chi-square values greater than the obtained chi-square is ", signif, "\n")
### This is the desired probability.
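As a side note on the design, the inner loop above simply draws a multinomial sample for each row, so the same random table could be generated more compactly with R's rmultinom(). This is an equivalent sketch, assuming rowmarg and colprob as defined in the code above:

# Each row of the random table is one multinomial draw of size rowmarg[j]
# using the pooled column probabilities
randomtable <- t(sapply(rowmarg, function(n) rmultinom(1, n, colprob)))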
No Marginals Fixed
Suppose that instead of asking the Vermont legislators to take
a vote, we went out and drew a random sample of 145 Vermont
residents and asked their opinion. In this case neither the row
nor the column marginals would be fixed, because we do not know in
advance how many men and women will be in our sample, nor how
many will vote "For" and "Against" civil unions. This
type of design is quite common.
I have struggled with the solution to this example for a very long time, but I am not at all satisfied with my answer. Perhaps it makes sense, but when I come back tomorrow I will probably think about what an idiot I am. I think that I have just replaced "fixed marginal totals" with "fixed marginal proportions." (That leaves some slack in the resulting marginal totals after resampling, but not a lot.) But let me continue with my original solution, in hopes that someone can write to me and prove that I am correct--or suggest an alternative approach.
The appropriate reference distribution of random tables is
different in this case. What I did was to compute the proportions
in each row and the proportions in each column. If Vote is
independent of Gender, the probability of an observation falling
in each cell is the product of its row and column proportions. I
then sampled N uniformly distributed random numbers between
0 and 1 and made cell assignments based on those cell probabilities.
For example, if in a 2 × 3 table 40% of observations fell in
row 1 and 30% of observations fell in column 3, then I would
expect 40% × 30% = 12% to fall in cell 1,3 if rows
and columns are independent. Assume that the corresponding percentages
were 20% for cell 1,1 and 5% for cell 1,2. Then
a random number less than .20 would be assigned to
cell 1,1, a number between .20 and .25 would be
assigned to cell 1,2, a number between .25 and .37
would be assigned to cell 1,3, and so on. The cell
frequencies that result would be a random sample of cell
frequencies having a total sample size of N. The rest of
the process is the same as above. Out of 10,000 tables and their
associated chi-squares, the probability under the null would be
the proportion of them that equaled or exceeded the chi-square
for the obtained data. The R code to carry out this analysis is
shown below.
####################### No Fixed Marginals #######################
# Analysis of contingency tables conditional only on the total sample size
# I have changed this to make the cell frequencies under H0 equal (on average) to the
# expected frequencies with those marginal totals.
# It is the only way that I can think of to set up a sampling scheme, BUT I am
# not restricting the sampling to those marginal totals.
# The qq plot fails if you change nreps away from 10000 because of the
# way I create the length of variables. I should fix that some day.
library(car)   # needed below for qqPlot()
# Read in the data
civilUnion <- matrix(c(35, 9, 60, 41), nrow = 2, byrow = TRUE)
#data <- matrix(c(3,1,1,3), nrow = 2)
#data <- read.table("DeathModified.dat", header = TRUE, nrows = 2)
#data <- read.table("Visintainer.dat", header = TRUE)
#data <- read.table("Darley.dat", header = TRUE)
#data <- matrix(c(12,2,1,3,2,4,3,4,187,17,11,9,16,7,8,6), nrow = 2, byrow = TRUE)
#data <- matrix(c(13,36,14,30), nrow = 2, byrow = TRUE)
#rownames(data) <- c("Drug","Placebo")
#colnames(data) <- c("Relapse","Success")
# Establish sample size and number of rows and columns (rs and cs)
data <- civilUnion
N <- sum(data)
nr <- nrow(data)
nc <- ncol(data)
ncells <- nr*nc
# Establish row and column marginal totals
rowtots <- rowSums(data)
coltots <- colSums(data)
# Calculate expected cell probabilities based on original marginal totals
# and convert to vector
expected <- matrix(data = NA, nrow = nr, ncol = nc, byrow = TRUE)
for (i in 1:nr) {
for (j in 1:nc) {
expected[i,j] <- rowtots[i] * coltots[j] /N^2
}
}
# Convert to vector
expectedvector <- as.vector(t(expected))
# This can now be sampled
# Calculate chi-square on original data
result <- chisq.test(data, correct = FALSE)
print(result)
obtchisq <- result$statistic
nreps <- 10000
xx <- numeric(N)
results <- numeric(nreps)
greater <- 0
cells <- 1:ncells
# Now resample nreps times
for (i in 1:nreps) {
xx <- sample(cells,N,replace = TRUE, prob = expectedvector)
vv <- matrix(0,1,ncells)
for (j in 1:N) {
b <- xx[j]
vv[b] = vv[b] + 1
}
x <- matrix(vv,nrow = nr, byrow = TRUE)
# x is now the contingency table for ith random sample
results[i]<- chisq.test(x, correct = FALSE)$statistic
greater <- ifelse(results[i] >= obtchisq, greater + 1, greater + 0)
}
probGreater <- greater/nreps
cat("With no fixed marginals p(chi-square) = ",probGreater)
dfs <- (nr - 1)*(nc - 1)
cat("For a chi-square distribution the mean should be = ", dfs, ". It is actually ",mean(results, na.rm = TRUE),"\n")
cat("\n")
cat("For a chi-square distribution the var. should be = ", 2*dfs,". It is actually ",var(results, na.rm = "TRUE"), "\n")
cat("\n")
cat(" Look at the graphic for the distribution and a qq plot", "\n")
cat("\n")
par(mfrow=c(2,2))
hist(results, xlab = "Chi-Square", ylab = "Frequency", breaks = 50, main = "Chi-square distribution")
qqPlot(results, dist = "chisq", df = dfs)
#Using Monte-Carlo simulation
cat("The following is the result of letting R do a simulation for chisq.test","\n")
cat("\n")
print(chisq.test(data, simulate.p.value = TRUE, B = 10000))
#
cat("The following is the result of Fisher's exact test ","\n")
cat("\n")
print(fisher.test(data, conf.level = 0.95))
cat("\n")
cat("Do not be alarmed by warnings if you have some small marginals. ", "\n")
cat("They appear because I used the chi-square statistic as my indicator of disparity.", "\n")
Nothing Fixed
The logical next step would be to consider a study where we
went into classrooms and asked students to vote. In this case we
not only do not know how many men and women there will be, or
the number of For and Against votes, but we don't even know
the total sample size.
It may be possible to generate random samples with this
design, but I have not figured out how to do it. I would first
have to draw a random sample size, then repeat the calculations
in the previous section, then draw another random sample size,
and so on. I seriously doubt that it is worth the computational
time needed to carry out this process. I suppose that it could be
done, but I'm not going to do it.
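For what it is worth, here is a minimal sketch of the scheme just described. The choice of a Poisson distribution for the random total is purely my assumption for illustration; nothing in the design dictates it, and the objects it uses (expectedvector, nr, obtchisq) are the ones created in the previous section.

# Hypothetical sketch: let the total sample size itself be random
# (Poisson with mean equal to the observed N), then assign cases to
# cells exactly as in the previous section.
nreps <- 10000
results <- numeric(nreps)
for (i in 1:nreps) {
   Nrand <- rpois(1, lambda = 145)              # random total sample size
   vv <- rmultinom(1, Nrand, expectedvector)    # assign the cases to cells
   x  <- matrix(vv, nrow = nr, byrow = TRUE)
   results[i] <- suppressWarnings(chisq.test(x, correct = FALSE)$statistic)
}
mean(results >= obtchisq, na.rm = TRUE)         # approximate p value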
Conclusion
In many situations the above procedures might be considered
overkill. When we have large expected frequencies in all of the
cells, the Pearson chi-square test is quite appropriate, and the
difference in probability values between that statistic and the
ones given here will be small. However, when we have one or more
small expected frequencies, things start to fall apart. Campbell
(2007), in an extensive study of 2 × 2 tables, concluded that
Fisher's Exact Test works well enough when we have small
expected frequencies. He also concluded that a modified
Pearson chi-square [χ²(N-1)/N]
is satisfactory for 2 × 2 tables when the smallest expected frequency is greater
than 1. (I believe that this adjustment can be attributed to Karl Pearson's son Egon, who was also a noted statistician.) However, there is no reason why we
need to be satisfied with "well enough" when we have a
more appropriate solution.
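If you want to try Campbell's adjustment, it is a one-line modification of the Pearson statistic. A sketch using the civil-union data; the (N-1)/N scaling is my reading of Campbell (2007):

data <- matrix(c(35, 9, 60, 41), nrow = 2, byrow = TRUE)
N <- sum(data)
X2 <- chisq.test(data, correct = FALSE)$statistic
X2adj <- X2 * (N - 1)/N                          # Campbell's 'N - 1' chi-square
pchisq(X2adj, df = 1, lower.tail = FALSE)        # its p value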
References
Campbell, I. (2007). Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine, 26, 3661-3675.
Fisher, R. A. (1935). The Design of Experiments. Edinburgh: Oliver and Boyd.
Howell, D. C. (2011). Chi-square test: Analysis of contingency tables. In M. Lovric (Ed.), International Encyclopedia of Statistical Science. Berlin: Springer-Verlag.
Yates, F. (1934). Contingency tables involving small numbers and the χ² test. Journal of the Royal Statistical Society Supplement, 1, 217-235.
Last revised 6/25/2015