Alternative Tests for Contingency Tables

Using R with Randomization Tests

David C. Howell

This page was designed to go with an entry that I wrote entitled chi-square-alternatives.html for the International Encyclopedia of Statistical Sciences, Lovric (2010). In that paper I discussed four different designs under which one could derive a contingency table, and pointed out that there were randomization tests that could replace Pearson's chi-square for at least three of them. (I haven't figured out a sampling scheme for the fourth.) The main purpose of this page is to present those randomization tests as functions (or programs) in R, which is freely available software that is becoming more and more important in most fields of statistics. R can be downloaded from the R Project at http://www.r-project.org/.

Pearson's chi-square was originally developed in papers that Pearson published in 1900 and 1904. It has had a somewhat controversial past, beginning with the fact that Pearson got the degrees of freedom wrong and was most unhappy when Fisher pointed that out. (He must have been especially annoyed because Fisher used data from Pearson's own son, Egon, to make his point.) Essentially, Pearson, with Fisher's correction, showed that the statistic
χ² = Σ (O − E)² / E
was distributed approximately as the chi-square distribution. In this formula "O" represents the individual obtained cell frequencies, and "E" represents the individual cell frequencies expected under the null hypothesis. For an r × c table the statistic is distributed as chi-square on (r-1)(c-1) df. And when the sample sizes are reasonably large, with an expected value of at least 5 in each cell, the approximation is quite good. Certainly it is good enough for our purposes. (There has been a long debate about what the minimum sample size should be, but I'm going to skip over that debate. It has never come to any resolution.)

The problem arises when the expected frequency in one or more cells is too small. When that happens, the chi-square statistic can only take on a limited number of different values, and hence its distribution cannot be well approximated by the chi-square distribution. For example, when we have a 2 × 2 table with marginal frequencies of 4 in each row and column, there are only 3 possible values of chi-square (0, 2, and 8). Certainly we cannot expect a continuous distribution to be a good fit under such extreme conditions.
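To see this concretely, here is a small sketch (my own illustration, not part of the original code on this page) that enumerates the five tables that are possible when all marginals equal 4 and prints the chi-square value for each. Only the values 0, 2, and 8 appear.

# My own illustration: with all marginals fixed at 4, cell11 can only be
# 0, 1, 2, 3, or 4, and the chi-square statistic takes just three values.
for (a in 0:4) {
  tab <- matrix(c(a, 4 - a, 4 - a, a), nrow = 2)
  stat <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  cat("cell11 =", a, "  chi-square =", stat, "\n")
}
# Prints the chi-square values 8, 2, 0, 2, 8.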

In an attempt to adjust for this problem, Yates (1934) proposed a correction that simply involved reducing the absolute value of each numerator by 0.5 before squaring. This correction actually worked reasonably well, but it was still an approximation. Yates' correction has been falling out of favor for many years, but most software still produces it, sometimes without notice. In many situations its place has been taken by Fisher's Exact Test (Fisher, 1935), sometimes known as the Fisher-Irwin test. This test is not dependent on the chi-square distribution and is, under certain conditions, exact.
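As a quick sketch of what the correction does (my own illustration, using the 3/1/1/3 table that appears below), each |O − E| is reduced by 0.5 before being squared and divided by E; for this table the result agrees with chisq.test(..., correct = TRUE).

# My own illustration of Yates' correction: shrink each |O - E| by 0.5
# before squaring, then compare with R's built-in corrected test.
obs <- matrix(c(3, 1, 1, 3), nrow = 2)
expfreq <- outer(rowSums(obs), colSums(obs)) / sum(obs)       # expected frequencies
sum((abs(obs - expfreq) - 0.5)^2 / expfreq)                   # 0.5
suppressWarnings(chisq.test(obs, correct = TRUE)$statistic)   # also 0.5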

Alternative Data Collection Designs

All Marginals Fixed

Fisher's famous tea-tasting experiment is a classic example of a data collection procedure in which the marginal totals are fixed. Muriel Bristol, a scientist at the Rothamstead Experiment Station, claimed that she could tell the difference between a cup of tea in which the milk had been poured before the tea and one in which the tea had been poured before the milk. Fisher devised an experiment in which he made 8 cups of tea. Four of them had the milk poured first and 4 had the tea poured first. This fixes the row marginals at 4, because every replication of this experiment would have row totals of 4. Ms Bristol was asked to taste each cup and assign 4 cups to each order of pouring. This also fixes the column totals, because they will always be 4 and 4.

As an aside, let me say something about Muriel Bristol, because others have suggested that this story was just made up. According to Fisher's daughter, Joan, it is a true story. By the way, Muriel was no slouch. She was a Ph.D. scientist, back in the days when women rarely were Ph.D. scientists, and she joined the Rothamstead Experiment Station in 1919. This was the place that Fisher was later to make famous.

Instead of applying Pearson's chi-square test to these data, Fisher employed the hypergeometric distribution. Essentially the test computes the probability of 0 to 4 cups in cell11 when all of the marginals are 4. This is an exact probability given fixed marginals, and Fisher proposed it as the optimal solution to the problem. The probabilities of each possible outcome are shown in the table below. (Fisher had to worry only about cell11 of a 2 × 2 table because the number in that cell determined the values in the other three cells.)

Observed # in Cell 11     Probability
        0                    .014
        1                    .229
        2                    .514
        3                    .229
        4                    .014

Ideally, Muriel should have placed 4 cups in cell11 and 4 cups in cell22. The probability that she would do that well by chance is only .014, and we would likely conclude that she had an unusual ability with respect to tea tasting. But, unfortunately, she had cell totals of 3/1/1/3, which would have a cumulative probability of .014 + .229 = .243, and that would not lead to rejection of the null hypothesis. (Other sources have claimed that she guessed correctly on all trials.) In this particular example, and in all of the other examples that my limited imagination can think of, we will only get excited if the judge does very well. We won't particularly care if she gets it all wrong. So what we really have here is a one-tailed test. We sum the probabilities of 4 and 3 correct responses, but not 1 and 2 correct. A two-tailed test does not make sense here.
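The probabilities in the table above, and the one-tailed p-value of .229 + .014 = .243, can be reproduced directly from the hypergeometric distribution with dhyper(). This short check is my own addition, not part of the original code.

# My own check using the hypergeometric distribution:
# 4 milk-first cups, 4 tea-first cups, 4 cups chosen as "milk first".
probs <- dhyper(0:4, m = 4, n = 4, k = 4)
round(probs, 3)      # .014 .229 .514 .229 .014
sum(probs[4:5])      # one-tailed p for 3 or 4 correct = .243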

I should point out that although this is a real example, which is partly why I used it, it is not a very good example from the point of view of understanding the preparation of tea. With only 4 cups of each kind, only one outcome could lead to a statistically significant result. Fisher should have made up many more cups of tea to give Muriel a fair shot.

We can compute the standard chi-square test in R using the function chisq.test(), specifying whether or not you want Yates' correction. (I recommend not using it.) A function already exists in R that will calculate Fisher's exact probability. It is named fisher.test() and is in the "stats" package, which is part of base R and does not need to be installed separately. The code to compute Fisher's Exact Test, as well as the corresponding chi-square tests with and without Yates' correction, can be found in the file named R-Code.R and is given below in a slightly modified form. In this example we are only concerned with whether Ms Bristol scored better than chance, so I have specified a one-tailed test by adding alternative = "greater". The standard chi-square test is always two-sided, so to draw a comparison you would want to divide those probabilities by 2. The results follow the program. The warnings simply refer to the fact that expected values are less than 5.



Fisher's Exact Test

# First using the two standard chi-square tests.

Bristol <- matrix(c(3,1,1,3), nrow = 2,
         dimnames = list(Row = c("1","2"), Col = c("1","2")))
# Remember that "matrix" reads down columns 
cat("The data matrix","\n")
print(Bristol)
cat("\n")
cat("Pearson's (two-sided) chi-square test without
 Yates' correction","\n")
pchisq <- chisq.test(Bristol, correct = FALSE)
print(pchisq)
cat("Pearson's (two-sided) chi-square test with Yates'
 correction","\n")
pchisq <- chisq.test(Bristol, correct = TRUE)
print(pchisq)
cat("Notice that these  two tests are two-sided while Fisher's test",
 "\n","is set up as one-sided.","\n")


# Now to Fisher's Exact Test for 2 X 2
# Optimal for designs with fixed marginals
# See "?fisher.test" for more information.


cat("Fisher's exact test--one-sided","\n") 
ft <- fisher.test(Bristol, alternative = "greater", conf.int = TRUE)  
# For a two-tailed test change to alternative = "two.sided"
# Note: two.sided and conf.int apply only to 2x2
print(ft)

 
Results

        Pearson's Chi-squared test
data:  Bristol
X-squared = 2, df = 1, p-value = 0.1573

        Fisher's Exact Test for Count Data -- One-Tailed
data:  Bristol
p-value = 0.2429
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.3135693       Inf
sample estimates:
odds ratio 
  6.408309 

        Fisher's Exact Test for Count Data -- Two-Tailed
data:  Bristol
p-value = 0.4857
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   0.2117329 621.9337505
sample estimates:
odds ratio 
  6.408309

As originally conceived, this test applied only to 2 × 2 tables. However, the test can be extended to larger tables (within reasonable limits). See the paper by Howell and Gordon (1976) for a discussion of this test for larger tables.

(Howell, D. C., & Gordon, L. R. (1976). Computing the exact probability of an R × C contingency table with fixed marginal totals. Behavior Research Methods and Instrumentation, 8, 317.) Also see the help file (?fisher.test) for a fuller explanation. The test is not always exact for larger tables, but it is very close.
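For a larger table one can also approximate the fixed-both-margins reference distribution by simulation. The sketch below is my own addition, using base R's r2dtable(), which generates random tables with the given margins; this mirrors what chisq.test(..., simulate.p.value = TRUE) does for a contingency table. I illustrate it with the small tea-tasting table so that the answer can be checked against the exact values above.

# My own sketch: a Monte Carlo version of the fixed-both-margins reference
# distribution, using r2dtable() to generate random tables with the same margins.
Bristol <- matrix(c(3, 1, 1, 3), nrow = 2)
obt    <- suppressWarnings(chisq.test(Bristol, correct = FALSE)$statistic)
tables <- r2dtable(10000, rowSums(Bristol), colSums(Bristol))
stats  <- sapply(tables,
                 function(t) suppressWarnings(chisq.test(t, correct = FALSE)$statistic))
mean(stats >= obt)     # two-sided Monte Carlo p, close to .486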

One Set of Marginals Fixed

Fisher's Exact Test is an exact test only if both sets of marginals are fixed. In that case the reference distribution consists of the probabilities associated with all possible arrangements of the data that preserve those marginal totals. But what if only the row (or column) marginals are fixed?

In 2000 the Vermont legislature approved a bill authorizing civil unions. The results of that vote, broken down by the gender of the legislator, are shown below.

                 Vote
           For    Against    Total
Women       35        9        44
Men         60       41       101
Total       95       50       145

This was an important vote, and all legislators were there. So if the vote were to be repeated over and over again, there would always be 44 women and 101 men. In other words, the row totals are fixed. The column totals would not be known in advance (one or more legislators could change their minds), so they are random. If we apply the standard Pearson chi-square test to these data to test the hypothesis that there is a gender difference in the voting pattern, we have χ² = 5.50 on 1 df, with an associated probability of .019. Fisher's exact test does not apply because one set of marginals is not fixed.

For the randomization test for this design (shown below), the appropriate reference distribution is based on the probabilities associated with all possible outcomes having those specific row totals. This is no longer the hypergeometric distribution, because the column totals are no longer fixed. A good example was brought to my attention by Steve Byers, who was analyzing skeletal data from four archaeological sites in Louisiana that span the Woodland period (800 B.C. to A.D. 1200). I am going to stay with a 2 × 2 table, but the approach would be the same with more than 2 columns. (The code to run his analysis is available at SteveByers.R.)

If I were doing this test from scratch, the first thing that I would do is calculate the obtained chi-square for the data (here, 5.50). I compute this so that I can compare the results of the randomization test against it, not so that I can calculate a probability against the chi-square distribution. To create the random samples, we next calculate the marginal column probabilities. For this example they are .655 and .345. We then draw 44 cases for the first row and assign cases to cell11 with probability = .655. To do this we draw 44 uniformly distributed random numbers between 0 and 1. If a number is less than .655, that case is assigned to the "For" category. Otherwise it is assigned to the "Against" category. This process is repeated for the 101 cases in row 2. (Notice that if the null hypothesis is true, we have just generated data for each row where the probability of "For" is the same for males and females.)

We then calculate a chi-square statistic for the resulting table. Several other measures are possible, such as the frequency in cell11, but chi-square is a good choice because it does not depend on the dimensionality of the table. (It is important to point out that although we are calculating a chi-square statistic, we are calculating it as a measure of deviation from expected, NOT as a standard chi-square test that would be compared to the chi-square distribution.) This process is repeated a large number of times (here 10,000) and the chi-square value for each random table is recorded. These are obtained values of chi-square when the null hypothesis is true. Finally, the "exact" probability is computed as the proportion of the 10,000 tables whose χ² values equal or exceed the χ² for the original data table.

The probability in this case, and in the next, is not completely "exact" because we generate a random set of tables rather than the full set of all possible tables with those row totals. However, with 10,000 samples the probability will be very close to exact. And because we have large cell frequencies in our example, it should also be close to the probability given by Pearson's chi-square.
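How close is "very close"? A simulated p-value based on B random tables behaves like a binomial proportion, so its standard error is roughly sqrt(p(1 - p)/B). The line below is my own back-of-the-envelope arithmetic, not part of the original discussion.

# My own check: the Monte Carlo standard error of a simulated p-value
# based on 10,000 tables is sqrt(p * (1 - p) / 10000).
sqrt(.02 * .98 / 10000)    # about .0014 when the true p is near .02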

The R code for a standard chi-square test, with and without Yates' correction, Fisher's exact test, and a randomization test is given below. The latter is the one that I care most about.


#First using two standard chi-square tests.
cat("Pearson's (two-sided) chi-square test without Yates' correction","\n")
data <- matrix(c(35, 9, 60, 41), ncol = 2, byrow = TRUE, 
               dimnames = list(c("Women", "Men"), c("For", "Against")))
chisq.test(data, correct = FALSE)

# Pearson's Chi-squared test without Yates' correction
data:  data
X-squared = 5.5023, df = 1, p-value = 0.01899


chisq.test(data, correct = TRUE)
# Pearson's Chi-squared test with Yates' continuity correction
data:  data
X-squared = 4.647, df = 1, p-value = 0.03111
# Note that here the correction does affect the conclusion that we would draw.

# Fisher's Exact Test
fisher.test(data)
	Fisher's Exact Test for Count Data

data:  data
p-value = 0.02262
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.095834 6.934692
sample estimates:
odds ratio 
    2.6405 

###########################################################


# Randomization test using chisq.test() as our measure,
# but requesting random sampling

chisq.test(x = data, correct = FALSE, simulate.p.value = TRUE, B = 10000)
#  Pearson's Chi-squared test with simulated p-value (based on 10,000 replicates)

X-squared = 5.5023, df = NA, p-value = 0.0238

# Now the randomization test done the long way.


###   Row Marginals Fixed

origdata <- matrix(c(35, 9, 60, 41), nrow = 2, byrow = TRUE)
         # Another possible set of data -- ignore warning messages
         # origdata <- matrix(c(12,2,1,3,2,4,3,4,187,17,11,9,16,7,8,6),
         #                    nrow = 2, byrow = TRUE)
# Establish sample size and number of rows and columns  (rs and cs)
N <- sum(origdata)
nr <- nrow(origdata)
nc <- ncol(origdata)
ncells <- nr*nc
# Establish row and column marginal totals
rowtots <- rowSums(origdata)
coltots <- colSums(origdata)
# Calculate expected cell probabilities based on original marginal totals
# and convert to vector
expected <- matrix(data = NA, nrow = nr, ncol = nc, byrow = TRUE)
for (i in 1:nr) {
  for (j in 1:nc) {
    expected[i,j] <- rowtots[i] * coltots[j] /N^2
  }
}
# Convert to vector  
expectedvector <- as.vector(t(expected))
# This can now be sampled

# Calculate chi-square on original data
obtchisq <- chisq.test(origdata, correct = F)$statistic    

nreps <- 10000                    
xx <- numeric(N)
results <- numeric(nreps)
greater <- 0                
cells <- numeric(ncells)
cells <- c(1:ncells) 
# Now resample nreps times

for (i in 1:nreps) {
  xx <- sample(cells,N,replace = TRUE, prob = expectedvector)
  vv <- matrix(0,1,ncells)
  for (j in 1:N) {
    b <- xx[j]
    vv[b] = vv[b] + 1
  }
  
  x <- matrix(vv,nrow = nr, byrow = TRUE)
  # x is now the contingency table for ith random sample
  
  results[i]<- chisq.test(x, correct = FALSE)$statistic
  greater <- ifelse(results[i] >= obtchisq, greater + 1, greater + 0)  
}

dfs <- (nr - 1)*(nc - 1)
cat("For a chi-square distribution the mean should be = ", dfs, ". 
#It is actually ",mean(results, na.rm = TRUE),"\n")
cat("\n")
cat("For a chi-square distribution the var. should be = ", 2*dfs,". 
#It is actually ",var(results, na.rm = "TRUE"), "\n")
cat("\n")

cat(" Look at the graphic for the distribution and a qq plot", "\n")
cat("\n")
par(mfrow=c(1,2))
hist(results, xlab = "Chi-Square", ylab = "Frequency", breaks = 50,
     main = "Chi-square distribution")
# Use the car library for the qq plot against a chi-square distribution
library(car)
qqPlot(results, dist = "chisq", df = dfs)
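
As a cross-check on the sampling scheme described in the text, where each row's total is held at its observed value and its cases are assigned to columns with the pooled column proportions, here is a minimal sketch of that approach written as a function. The function name and its internals are my own and are offered only as an illustration.

# My own sketch: resampling with the row totals held fixed, assigning
# each row's cases to columns with the pooled column proportions.
resample.rows.fixed <- function(origdata, nreps = 10000) {
  rowtots  <- rowSums(origdata)
  colprobs <- colSums(origdata) / sum(origdata)        # e.g. .655 and .345
  obtchisq <- chisq.test(origdata, correct = FALSE)$statistic
  results  <- numeric(nreps)
  for (i in 1:nreps) {
    # Each row is a separate multinomial draw of size rowtots[r]
    x <- t(sapply(rowtots, function(n) rmultinom(1, size = n, prob = colprobs)))
    results[i] <- suppressWarnings(chisq.test(x, correct = FALSE)$statistic)
  }
  mean(results >= obtchisq, na.rm = TRUE)   # proportion at least as extreme
}
# Example:  resample.rows.fixed(matrix(c(35, 9, 60, 41), nrow = 2, byrow = TRUE))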

No Marginals Fixed

Suppose that instead of asking the Vermont legislators to take a vote, we went out and drew a random sample of 145 Vermont residents and asked their opinion. In this case neither the row nor the column marginals would be fixed, because we do not know in advance how many men and women will be in our sample, nor how many will vote "For" and "Against" civil unions. This type of design is quite common.

I have struggled with the solution to this example for a very long time, and I am not at all satisfied with my answer. Perhaps it makes sense, but when I come back tomorrow I will probably think about what an idiot I am. I think that I have just replaced "fixed marginal totals" with "fixed marginal proportions." That leaves some slack in the resulting marginal totals after resampling, but not a lot. But let me continue with my original solution, in hopes that someone can write to me and prove that I am correct, or else suggest an alternative approach.

The appropriate reference distribution of random tables is different in this case. What I did was to compute the proportions in each row and the proportions in each column. If Vote is independent of Gender, the probability of an observation falling in a given cell is the product of its row and column proportions. I then sampled N uniformly distributed random numbers between 0 and 1, and made cell assignments based on those cell probabilities. For example, if in a 2 × 3 table 40% of observations fell in row 1 and 30% of observations fell in column 3, then I would expect 40% × 30% = 12% to fall in cell13 if rows and columns are independent. Assume that the corresponding cell probabilities in row 1 were 20% and 5% for cell11 and cell12. Then a random number between 0 and .20 would be assigned to cell11, a number between .20 and .25 would be assigned to cell12, a number between .25 and .37 would be assigned to cell13, and so on. The cell frequencies that result would be a random sample of cell frequencies having a total sample size of N. The rest of the process is the same as above. Out of 10,000 random tables and their associated chi-squares, the probability under the null would be the proportion of them that equaled or exceeded the chi-square for the obtained data. The R code to carry out this analysis is presented below.


obtchisq <- chisq.test(origdata, correct = F)$statistic    

nreps <- 10000                    
xx <- numeric(N)
results <- numeric(nreps)
greater <- 0                
cells <- numeric(ncells)
cells <- c(1:ncells) 
# Now resample nreps times

for (i in 1:nreps) {
   xx <- sample(cells,N,replace = TRUE, prob = expectedvector)
   vv <- matrix(0,1,ncells)
   for (j in 1:N) {
      b <- xx[j]
      vv[b] = vv[b] + 1
   }
   x <- matrix(vv,nrow = nr, byrow = TRUE)
   # x is now the contingency table for ith random sample
   
   results[i]<- chisq.test(x, correct = FALSE)$statistic
   greater <- ifelse(results[i] >= obtchisq, greater + 1, greater + 0)  
}

dfs <- (nr - 1)*(nc - 1)
cat("For a chi-square distribution the mean should be = ", dfs, ". It is actually
 ",mean(results, na.rm = TRUE),"\n")
cat("\n")
cat("For a chi-square distribution the var. should be = ", 2*dfs,". It is actually
 ",var(results, na.rm = "TRUE"), "\n")
cat("\n")

cat(" Look at the graphic for the distribution and a qq plot", "\n")
cat("\n")
par(mfrow=c(2,2))
hist(results, xlab = "Chi-Square", ylab = "Frequency", breaks = 50,
 main = "Chi-square distribution")
#
#
#
library(car)
qqPlot(results, dist = "chisq", df = dfs)
# An alternative, hand-made qq plot:
# pvalues <- c(1:9999)/10000               # generate a set of probabilities
# qquant <- qchisq(p = pvalues, df = 1)    # convert those to quantiles of the chi-sq distrib.
# temp <- sort(results, na.last = NA)      # put the obtained quantiles in order
# temp <- temp[-10000]                     # make them the same length
# plot(qquant, temp, main = "QQ Plot for Chi-Square", xlab = "Obtained Chi-Square",
#      ylab = "Expected Chi-Square")       # make a qqplot
# abline(0,1)

moreextreme <- results[results >= obtchisq]
moreextreme <- length(moreextreme)/nreps
cat("The proportion of cases with results at least as extreme as our data is ",
    moreextreme, "\n")
cat("\n") 
#Using Pearson's chi-square
cat("The following is the result of using Pearson's basic chi-square.","\n")
cat("\n")
print(chisq.test(origdata, correct = FALSE))
#
#Using monte-carlo simulation
cat("The following is the result of letting R do a simulation for chisq.test",
   "\n")
cat           ("\n")
print(chisq.test(origdata, simulate.p.value = TRUE, B = 10000))
#
cat("The following is the result of Fisher's exact test ","\n")
cat("\n")
print(fisher.test(origdata, conf.level = 0.95))
cat("\n")
cat("Do not be alarmed by warnings if you have some small marginals. ", "\n")
cat("They come because I used the chi-square statistic as my indicator
 of disparity.", "\n")
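
The cell probabilities described above, the product of each row proportion and the corresponding column proportion, can also be built compactly with outer(). The short sketch below is my own condensation, not part of the original script; it draws a single random table of size N under independence.

# My own sketch: cell probabilities as products of marginal proportions,
# then one random table of N observations drawn under independence.
origdata  <- matrix(c(35, 9, 60, 41), nrow = 2, byrow = TRUE)
N         <- sum(origdata)
cellprobs <- outer(rowSums(origdata) / N, colSums(origdata) / N)
onetable  <- matrix(rmultinom(1, size = N, prob = as.vector(cellprobs)),
                    nrow = nrow(origdata))
onetable     # a random 2 x 2 table in which only N is fixed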

Nothing Fixed

The logical next step would be to consider a study where we went into classrooms and asked students to vote. In this case we not only do not know how many men and women there will be, or how many For and Against votes, but we do not even know the total sample size.

It may be possible to generate random samples with this design, but I have not figured out how to do it. I would first have to draw a random sample size, then repeat the calculations in the previous section, then draw another random sample size, and so on. I seriously doubt that it is worth the computational time needed to carry out this process. I suppose that it could be done, but I'm not going to do it.

Conclusion

In many situations the above procedures might be considered overkill. When we have large expected frequencies in all of the cells, the Pearson chi-square test is quite appropriate. The difference in probability values between that statistic and the ones given here will be small. However, when we have one or more small expected frequencies, things start to fall apart. Campbell (2007), in an extensive study of 2 × 2 tables, concluded that Fisher's Exact Test works well enough when we have small expected frequencies. He also concluded that a modified Pearson's chi-square, the "N - 1" chi-square [χ²(N-1)/N], is satisfactory for 2 × 2 tables when the smallest expected frequency is greater than 1. (I believe that this adjustment can be attributed to Karl Pearson's son Egon, who was also a noted statistician.) However, there is no reason why we need to be satisfied with "well enough" when we have a more appropriate solution via randomization tests.
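For the legislature data the adjustment is tiny; the few lines below are my own check of the arithmetic, not code from the original page.

# My own check of the "N - 1" adjustment applied to the legislature data.
origdata <- matrix(c(35, 9, 60, 41), nrow = 2, byrow = TRUE)
N <- sum(origdata)                                          # 145
chisq <- chisq.test(origdata, correct = FALSE)$statistic    # 5.5023
chisq * (N - 1) / N                                         # about 5.46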


Campbell, I. (2007). Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine, 26, 3661-3675.

Fisher, R. A. (1935). The Design of Experiments. Edinburgh, Oliver and Boyd.

Howell, D. C. (2010). Chi-square test: Analysis of contingency tables. In Lovric, M. (Ed.) (2011), International Encyclopedia of Statistical Science. Springer-Verlag, Berlin.

Yates, F. (1934). Contingency tables involving small numbers and the χ² test. Journal of the Royal Statistical Society, Supplement 1, 217-235.


Send mail to: David.Howell@uvm.edu

Last revised 7/5/2018