
R Code for Chapter 7

Chi-Square


As you know from the text, the shape of the chi-square distribution varies with the degrees of freedom. The first plot varies those df from 1 to 30. I have restricted the x axis to the range 1 - 30 simply to allow you to see the distributions on the same plot. However, you could take this code, use different degrees of freedom, and replot. Suppose that you used 30, 50, and 75 df. Then you could change the relevant line to something like chisq <- seq(from = 20, to = 70, by = .1). You would also have to change the limits on the x axis. From looking at those plots, can you guess what the mean of a chi-square distribution is? (Could it be the df?)


### Plotting the chi-square distribution
par(mfrow = c(2,3))
df <- c(1, 2, 4, 6, 10, 30)
x.axis.max <- 30
chisq <- seq(from = 1, to = x.axis.max, by = .1)
for (i in 1:6) {
   den <- dchisq(chisq, df = df[i])
   plot(den ~ chisq, type = "l", ylim = c(0, 0.30))
   legend(2, 0.20, paste("df = ",df[i]) , bty = "n")
}
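To check that guess about the mean, a quick simulation will do; this is just a sketch using R's built-in rchisq() to draw random chi-square values and compare their mean with the df:

```r
### Quick check: the mean of a chi-square distribution equals its df
set.seed(1)                          # make the simulation reproducible
for (d in c(1, 2, 4, 6, 10, 30)) {
   draws <- rchisq(100000, df = d)   # 100,000 random chi-square values
   cat("df =", d, "  mean of draws =", round(mean(draws), 2), "\n")
}
```

Each printed mean should come out very close to its df, which is exactly what the plots suggest.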

On the web page containing just the R code associated with this chapter (www.uvm.edu/~dhowell/R-Code/Chapter7.R), you will find similar code for finding the critical value of chi-square for 4 df and related things.

I don't plan to go through every example in the book, but I will start with the first one, on therapeutic touch. The code is below. I chose this example because I wanted to point out that the part of the code that says "correct = FALSE" prevents R from applying a correction for continuity. As I said in the text, this correction is often unwise.

Notice how I can control the number of digits that get printed. If you want to explore why I could refer to result$statistic, simply type "str(result)". You can play around with things like "result$observed" and other variants. What you see in the code as "result" is really an "object." It has a bunch of things hanging off of it, one of which is the chi-square value, another of which is the set of observed frequencies, and so on.


### Therapeutic Touch
outcome <- c(123, 157)
result <- chisq.test(x = outcome, correct = FALSE)
print(result)
prob <- 1-pchisq(q = result$statistic, df = 1)  # Computing the probability 
   # under the null hypothesis
options(digits = 3)
cat("The probability for that chi-square = ", prob, "\n")
__________________________________________________________________
	
	Chi-squared test for given probabilities

data:  outcome
X-squared = 4.1286, df = 1, p-value = 0.04216
The probability for that chi-square =  0.0422
 
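As a sketch of poking around inside that object, the lines below pull out some of its pieces; the component names follow the standard "htest" object that chisq.test() returns:

```r
### Exploring the object returned by chisq.test()
outcome <- c(123, 157)
result <- chisq.test(x = outcome, correct = FALSE)
str(result)           # list everything hanging off the object
result$observed       # the observed frequencies: 123 157
result$expected       # the expected frequencies: 140 140
result$statistic      # the chi-square value itself
result$p.value        # the same probability we computed with pchisq()
```

Notice that result$p.value gives you the probability directly, so the pchisq() line in the code above is really just there to show you where that number comes from.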

I will skip the example of Rock, Paper, Scissors, but the code is in the document on R coding. I want to go ahead and examine the data on death sentence and race from Unah and Boger (2001). This is basically a 2 X 2 table, and you can see that the data were read in as a matrix. Notice how that matrix was created: I simply entered the cell frequencies, specified the number of columns, and told R that the data were to be read into the matrix by rows. To make things neater I gave names to the rows and columns. If you look at the code in Chapter7.R you will see that I accomplished this labeling slightly differently, but with the same result. Although I am skipping ahead in the chapter, I wanted to bring in material on odds ratios and risk ratios, so I loaded the "epitools" library, which is needed to work with odds ratios and risk ratios. You will probably have to install it on your machine first. The command would be install.packages("epitools").

One problem I have with this analysis is that I am never sure which risk or odds ratio it is going to produce. Is it the ratio of death to no-death, or white to nonwhite, or the reverse of those? I suggest that you first solve a very simple problem by hand to see what the outcome should be, and then play with the argument labeled "rev." You can choose among rev = c("neither", "rows", "columns", "both"). For example, the risk of heart attack if you do not take aspirin is 189/11034 = .01713. If you do take aspirin, it is 104/11037 = .00942. This gives us a risk ratio of .01713/.00942 = 1.818. But the first output says that the risk ratio is .55, which is the inverse (i.e., 1/1.818 = .55). If we reverse the columns, however, the ratio comes out the way we intended. Once you have it producing what you want, you are all set. There has to be a better answer to that, but I haven't figured it out.

I have included Fisher's Exact Test in the second set of code given here. It will also print out the odds ratio, which is handy. I have also inserted the oddsratio and riskratio functions. Their output is a bit awkward to read, but fisher.test gave you the odds ratio, so you can look for that value in the printout and figure out what they are up to. I must say that the output for risk ratios and odds ratios is difficult to interpret at best. But even without a calculator, it is clear that a white person is about twice as likely as a nonwhite person to be sentenced to "no death." From an odds ratio of 2.02, it is clear that R is talking about the odds of not being sentenced to death, and that it is using whites as the reference group.


### First with the aspirin example

library(epitools)
contin.table <- matrix(c(104, 10933, 189, 10845), byrow = TRUE, ncol = 2)
### Notice how I supply labels.
rownames(contin.table) <- c("Aspirin","Placebo")
colnames(contin.table) <- c("Heart Attack","No Heart Attack")
print(contin.table)
riskratio(contin.table, correction = FALSE,  rev = "both")
riskratio(contin.table, correction = FALSE,  rev = "columns")
oddsratio(contin.table, correction = FALSE, rev = "rows")  

Risk.Aspirin <- contin.table[1,1]/sum(contin.table[1,])
Risk.Placebo <- contin.table[2,1]/sum(contin.table[2,])
RR <- Risk.Aspirin/Risk.Placebo
Odds.Aspirin <- contin.table[1,1]/contin.table[1,2]
Odds.Placebo <- contin.table[2,1]/contin.table[2,2]
OR <- Odds.Aspirin/Odds.Placebo
cat("The risk ratio for heart attack given aspirin = ", RR, "\n")
cat("The odds ratio for heart attack given aspirin = ", OR, "\n")
cat("The two ratios are almost the same because of the very low probability
 of a heart attack.\n")

_________________________________________________________________

        Heart Attack No Heart Attack
Aspirin          104           10933
Placebo          189           10845
> riskratio(contin.table, correction = FALSE,  rev = "both")
$data
        No Heart Attack Heart Attack Total
Placebo           10845          189 11034
Aspirin           10933          104 11037
Total             21778          293 22071

$measure
                        NA
risk ratio with 95% C.I. estimate lower upper
                 Placebo     1.00    NA    NA
                 Aspirin     0.55 0.434 0.698

$p.value
         NA
two-sided midp.exact fisher.exact chi.square
  Placebo         NA           NA         NA
  Aspirin   4.99e-07     5.03e-07   5.69e-07

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
> riskratio(contin.table, correction = FALSE,  rev = "columns")
$data
        No Heart Attack Heart Attack Total
Aspirin           10933          104 11037
Placebo           10845          189 11034
Total             21778          293 22071

$measure
                        NA
risk ratio with 95% C.I. estimate lower upper
                 Aspirin     1.00    NA    NA
                 Placebo     1.82  1.43  2.31

Next I move on to the Death Sentence data. That should be relatively clear.



### Death Sentence data of Unah and Boger (2001)
 
library(epitools)            # Needed for risk and odds
sentencing <- matrix(c(33, 251, 33, 508), ncol = 2, byrow = TRUE,
  dimnames = list(c("NonWhite", "White"), c("Death", "   No Death")))
cat("Data from Unah and Boger on death sentences\n")
print(sentencing)
chisq.test(sentencing, correct = FALSE)
fisher.test(sentencing)     # Just to be complete
oddsratio(sentencing, correction = FALSE, rev = "neither")  
riskratio(sentencing, correction = FALSE,  rev = "both")
_____________________________________________________________
 
         Death    No Death
NonWhite    33         251
White       33         508

	Pearson's Chi-squared test
data:  sentencing
X-squared = 7.7099, df = 1, p-value = 0.005492

	Fisher's Exact Test for Count Data

p-value = 0.006809

alternative hypothesis: true odds ratio is not equal to 1

95 percent confidence interval:
 1.179576 3.467215

sample estimates:
odds ratio 
  2.022094 


odds ratio with 95% C.I. estimate    lower    upper
                NonWhite 1.000000       NA       NA
                White    2.021961 1.214768 3.366206
$p.value

two-sided   midp.exact fisher.exact  chi.square
  NonWhite          NA           NA          NA
  White    0.006959876   0.00680891 0.005491987

risk ratio with 95% C.I. estimate    lower   upper
                White     1.00000       NA      NA
                NonWhite  1.90493 1.201782 3.01948
$p.value

two-sided   midp.exact fisher.exact  chi.square
  White             NA           NA          NA
  NonWhite 0.006959876   0.00680891 0.005491987
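When in doubt about which ratio R has produced, it helps to work the odds ratio out by hand. The sketch below computes it directly from the sentencing table and comes out at about 2.02, matching the fisher.test output above:

```r
### Computing the sentencing odds ratio by hand
sentencing <- matrix(c(33, 251, 33, 508), ncol = 2, byrow = TRUE,
  dimnames = list(c("NonWhite", "White"), c("Death", "No Death")))
Odds.NonWhite <- sentencing[1,1]/sentencing[1,2]   # odds of death for nonwhites: 33/251
Odds.White <- sentencing[2,1]/sentencing[2,2]      # odds of death for whites: 33/508
OR <- Odds.NonWhite/Odds.White                     # reduces to 508/251, about 2.02
cat("Odds ratio (death sentence, NonWhite relative to White) =", round(OR, 3), "\n")
```

Notice that because the odds ratio is symmetric, 2.02 is both the odds of death for nonwhites relative to whites and the odds of no-death for whites relative to nonwhites, which is why R's output can be read either way.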

The next analysis is of Jankowski's data on sexual abuse, which comes from Section 7.5 in the text. I included it, in part, because it illustrates the issue of which ratios R is computing. You will want to compare what this produces with what I had in the text. Notice the way in which I supplied the names for the different levels on each dimension. This makes it much easier to read the output.

I wanted you to see how I was able to plot these data. I plotted the risk ratios for abuse, relative to the "Never" category, against abuse category. The plot makes the effect that we are looking at much more apparent.


# Jankowski's study of sexual abuse

library(epitools)
Jankowski <- matrix(c(512, 54, 227, 37, 59, 15, 18, 12), byrow = TRUE, ncol = 2,
 dimnames = list(c("Never", "Once", "Twice", "3-4"), c("No Abuse", " Abuse")))
print(Jankowski)
chisq.test(Jankowski, correct = FALSE)
fisher.test(Jankowski)
oddsratio(Jankowski, correction = FALSE,rev = "neither")  
riskratio(Jankowski, correction = FALSE, rev = "neither")

### Plotting risk ratios for Sex Abuse
risk <- c(1, 1.47, 2.14, 4.21)
category <- c(0,1,2,3)
plot(risk~category, type = "b", xlab = "Sexual Abuse Category",
 ylab = "Risk Ratio", main = "Risk Ratio Relative to Category = 0")
 
 _______________________________________________________________________


The complete R code page for this chapter includes a number of examples that I have skipped over on this page. I don't want to cover many more of them here. One of the sections in those sets of code involves the likelihood ratio test, and I will elaborate on it here. I put that in just to show an example of creating our own function. You are probably not going to create functions like this, but you will sometimes come across them in other code, and I think you should have at least a vague idea of what is going on.

You know that if you want to calculate the mean of x, you just type "mean(x)". Someone once wrote a function to do that calculation and built it into the base package of R. You are calling that mean() function when you type "mean(x)". But I want to calculate a likelihood ratio chi-square. One thing that I could do is to write my own function. I start out with a line that looks like likelihood.test <- function(data) { ... }. The "likelihood.test" part is the name of the function. The "function" part tells R that this is going to be a function, and the "data" part says that I am going to pass something to the function, and within the function it will be known as "data." The "{" begins the actual function, and within the function "data" is the internal name of whatever it is sent; outside the function there may be nothing named data. When I call the function with something like "likelihood.test(sentencing)", R knows that "sentencing" is the thing that is to be called "data" within the function. Then I have a bunch of stuff between the two brackets. You may remember that I gave the following formula for the likelihood ratio chi-square: chi-square = 2 * sum(O * ln(O/E)), where O and E are the observed and expected frequencies.

The first thing that the function does is use the data to compute chi-square, calling the result "test." But "test" is an object. It contains the result of the test, but it also contains the expected values that were computed, the observed values that it was fed, and the degrees of freedom. Run test <- chisq.test(Jankowski) and then type str(test) to see what the object contains.


Here is the structure of the "test" object.

> str(test)
List of 9
 $ statistic: Named num 29.6
  ..- attr(*, "names")= chr "X-squared"
 $ parameter: Named int 3
  ..- attr(*, "names")= chr "df"
 $ p.value  : num 1.65e-06
 $ method   : chr "Pearson's Chi-squared test"
 $ data.name: chr "Jankowski"
 $ observed : num [1:4, 1:2] 512 227 59 18 54 37 15 12
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Never" "Once" "Twice" "3-4"
  .. ..$ : chr [1:2] "No Abuse" " Abuse"
 $ expected : num [1:4, 1:2] 494.5 230.6 64.7 26.2 71.5 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Never" "Once" "Twice" "3-4"
  .. ..$ : chr [1:2] "No Abuse" " Abuse"
 $ residuals: num [1:4, 1:2] 0.787 -0.24 -0.703 -1.604 -2.07 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Never" "Once" "Twice" "3-4"
  .. ..$ : chr [1:2] "No Abuse" " Abuse"
 $ stdres   : num [1:4, 1:2] 3.529 -0.798 -2.061 -4.586 -3.529 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Never" "Once" "Twice" "3-4"
  .. ..$ : chr [1:2] "No Abuse" " Abuse"
 - attr(*, "class")= chr "htest"

My function uses that object to get the information that it needs, i.e., the expected and observed values and the df. The function then computes something that it names LR, which equals 2 times the sum of the observed values times the log of the observed values divided by the expected values. Having calculated LR, the function returns it, and the calling code stores that value in "Likelihood." Then everything within the function (i.e., between { and }) disappears from memory. As I said, I don't expect that you will go out and start writing functions. But I hope that you now have at least a vague idea of what they do.


#  Create a function to calculate the likelihood ratio chi-square
likelihood.test <- function(data) {
  test <- chisq.test(data)   # run chisq.test to get the observed and expected values
  Obs <- test$observed
  Exp <- test$expected
  LR <- 2 * sum(Obs * log(Obs / Exp))
  return(LR)
}

sentencing <- matrix(c(33, 251, 33, 508) ,ncol = 2, byrow = TRUE,
  dimnames = list( c("NonWhite", "White"), c("Death", "No Death")))  
Likelihood <- likelihood.test(sentencing) # Call likelihood.test with the sentencing data
cat("The likelihood ratio chi-square = ", Likelihood, "\n")
df <- 1
prob <- 1 - pchisq(Likelihood, df)
cat("The probability under the null = ", prob, "\n")

The Mantel-Haenszel Test

As I said in the text, the Mantel-Haenszel test works with contingency tables in three dimensions. R has a function called mantelhaen.test, which takes a three-dimensional array as its input. The code below will apply that test to the data and produce the test statistic and probability. The test is straightforward, but I suggest that you work it first with a known example so that you are sure how to set up the array. (In R an "array" can have several dimensions. A vector is a one-dimensional array, and a matrix is a two-dimensional array. R does not recognize a three-dimensional "matrix"--it is just an "array.") As you can see, the printout is not particularly extensive, but it tells you what you need to know.


### Mantel-Haenszel Test Berkeley Data
Berkeley <- array(c(353, 207,
                    17, 8,
                    120, 205,
                    202, 391,
                    138, 279,
                    131, 244,
                    53, 138,
                    94, 299,
                    22, 351,
                    24, 317),
                    dim = c(2,2,5),
                    dimnames = list(
                    Decision = c("Admit", "Reject"),
                    Gender = c("Male", "Female"),
                    Major = c("B", "C","D", "E", "F" )))
print(Berkeley)
mantelhaen.test(Berkeley)

__________________________________________________________

Mantel-Haenszel chi-squared test with continuity correction

data:  Berkeley
Mantel-Haenszel X-squared = 0.096186, df = 1, p-value = 0.7565
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
   0.8701288 1.2216845
sample estimates:
   common odds ratio 
1.03103 
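To see why the stratified test matters here, it is worth collapsing the array over Major and running an ordinary chi-square on the pooled 2 x 2 table. The sketch below does that with apply(); the pooled table shows a large Gender effect that disappears once Major is held constant, which is the classic lesson of the Berkeley data:

```r
### Collapsing over Major changes the conclusion
Berkeley <- array(c(353, 207, 17, 8, 120, 205, 202, 391, 138, 279,
                    131, 244, 53, 138, 94, 299, 22, 351, 24, 317),
                  dim = c(2, 2, 5),
                  dimnames = list(Decision = c("Admit", "Reject"),
                                  Gender = c("Male", "Female"),
                                  Major = c("B", "C", "D", "E", "F")))
pooled <- apply(Berkeley, c(1, 2), sum)   # sum the five 2 x 2 tables
print(pooled)
chisq.test(pooled, correct = FALSE)       # pooled table: highly significant
mantelhaen.test(Berkeley)                 # stratified test: not significant
```

The pooled test treats all majors as one big table and finds a Gender difference, while the Mantel-Haenszel test, which compares men and women within each major, does not.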

Finally, I want to show you two pieces of code that will calculate phi and Cohen's Kappa. I'm not going to comment on these, but they should be self-explanatory. To get them you have to load the "psych" library.


### Phi
### install.packages("psych")
library(psych)
data <- matrix(c(104, 10933, 189, 10845), byrow = TRUE, ncol = 2)
phi(data, digits = 3)

### Cohen's kappa
# The first matrix holds the cell proportions from the text; kappa itself
# is computed below from the corresponding cell frequencies.
cohen <- matrix(c(.50, .033, 0,
  .0667, .1, .033, .1, .067, .1), byrow = FALSE, ncol = 3)
cohen.freq <- matrix(c(15, 2, 3, 1, 3, 2, 0, 1, 3), byrow = TRUE, ncol = 3)
cohen.kappa(cohen.freq)
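As a check on cohen.kappa, kappa can be computed by hand as (observed agreement - chance agreement)/(1 - chance agreement). A sketch using the same frequency matrix:

```r
### Computing kappa by hand: (P.obs - P.chance)/(1 - P.chance)
cohen.freq <- matrix(c(15, 2, 3, 1, 3, 2, 0, 1, 3), byrow = TRUE, ncol = 3)
N <- sum(cohen.freq)                              # 30 cases in all
P.obs <- sum(diag(cohen.freq))/N                  # agreement cells on the diagonal
P.chance <- sum(rowSums(cohen.freq) * colSums(cohen.freq))/N^2
kappa <- (P.obs - P.chance)/(1 - P.chance)
cat("kappa =", round(kappa, 3), "\n")             # about 0.47
```

The diagonal cells (15, 3, 3) are the agreements, so P.obs = 21/30 = .70, while the row and column totals give the agreement expected by chance.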

That is all for now.

dch: