As you know from the text, the shape of the chi-square distribution varies with the degrees of freedom. The first plot varies those df from 1 to 30. I have restricted the x axis to the range 1 to 30 simply to allow you to see the distributions on the same plot. However, you could take this code, use different degrees of freedom, and replot. Suppose that you used 30, 50, and 75 df. Then you could change the relevant line to something like chisq <- seq(from = 20, to = 70, by = .1). You would also have to change the limits on the x axis. From looking at those plots, can you guess what the mean of a chi-square distribution is? (Does it have anything to do with the df?)
### Plotting the chi-square distribution
par(mfrow = c(2,3))
df <- c(1, 2, 4, 6, 10, 30)
x.axis.max <- 30
chisq <- seq(from = 1, to = x.axis.max, by = .1)
for (i in 1:6) {
   den <- dchisq(chisq, df = df[i])
   plot(den ~ chisq, type = "l", ylim = c(0, 0.30))
   legend(2, 0.20, paste("df = ", df[i]), bty = "n")
}
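If you want to try the replotting suggestion above, here is a sketch of one way to do it for 30, 50, and 75 df. The x-axis range and panel layout below are my choices, not anything from the text, and the last line is a quick simulation that hints at the answer to the question about the mean.

### A sketch of replotting with larger df (my modification, not from Chapter7.R)
par(mfrow = c(1,3))
df <- c(30, 50, 75)
chisq <- seq(from = 20, to = 120, by = .1)   # a wider x axis, so that 75 df still fits
for (i in 1:3) {
   den <- dchisq(chisq, df = df[i])
   plot(den ~ chisq, type = "l")
   legend("topright", legend = paste("df = ", df[i]), bty = "n")
}
### A hint about the mean of a chi-square distribution
mean(rchisq(100000, df = 4))   # should come out very close to 4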
On the web page containing just the R code associated with this chapter (www.uvm.edu/~dhowell/R-Code/Chapter7.R), you will find similar code for finding the critical value of chi-square on 4 df, and related things.
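I won't copy that code here, but as a minimal sketch of the idea (using the conventional alpha of .05, which is my assumption, not necessarily the value used on that page):

### Finding a critical value of chi-square on 4 df (a sketch)
qchisq(p = .95, df = 4)                         # the critical value, approximately 9.49
pchisq(q = 9.49, df = 4, lower.tail = FALSE)    # the probability of exceeding that value, about .05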
I don't plan to go through every example in the book, but I will start with the first one on therapeutic touch. The code is below. I chose this because I wanted to point out that the part of the code that says "correct = FALSE" is intended to prevent R from including a correction for continuity. As I said in the text, this correction is often unwise.
Notice how I can control the number of digits that get printed. If you want to see why I can refer to result$statistic, simply type "str(result)". You can play around with things like "result$observed" and other variants. What you see in the code as "result" is really an "object." It has a bunch of things hanging off of it, one of which is the chi-square value, another of which is the set of observed frequencies, and so on.
### Therapeutic Touch
outcome <- c(123, 157)
result <- chisq.test(x = outcome, correct = FALSE)
print(result)
prob <- 1 - pchisq(q = result$statistic, df = 1)   # Computing the probability
                                                   # under the null hypothesis
options(digits = 3)
cat("The probability for that chi-square = ", prob)
__________________________________________________________________

	Chi-squared test for given probabilities

data:  outcome
X-squared = 4.1286, df = 1, p-value = 0.04216

The probability for that chi-square =  0.0422
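If you want to explore that "result" object a bit further, the following lines are a minimal sketch; str() and the "$" extractions are ordinary R, not anything special to this chapter.

### Poking around inside the "result" object (a sketch)
str(result)         # lists everything the object contains
result$statistic    # the chi-square value by itself
result$observed     # the observed frequencies
result$expected     # the expected frequencies under the null hypothesis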
I will skip the example of Rock, Paper, Scissors, but the code is in the document on R coding. I want to go ahead and examine the data on death sentence and race from Unah and Boger (2001). This is a basic 2 X 2 table, and you can see that the data were read in as a matrix. Notice how that matrix was created: I simply entered the cell frequencies, specified the number of columns, and told R that the data were to be read into the matrix by rows. To make things neater I gave names to the rows and columns. If you look at the code in Chapter7.R you will see that I accomplished this labeling slightly differently, but with the same result. Although I am skipping ahead in the chapter, I wanted to bring in material on odds ratios and risk ratios, so I loaded the "epitools" library, which is needed to work with them. You will probably have to install that package on your machine first. The command would be install.packages("epitools").
One problem I have with this analysis is that I am never sure which risk or odds ratio it is going to produce. Is it the ratio of death to no-death, or white to nonwhite, or the reverse of those? I suggest that you first solve a very simple problem by hand to see what the outcome should be, and then play with the argument labeled "rev." You can choose among rev = c("neither", "rows", "columns", "both"). For example, the risk of heart attack if you do not take aspirin is 189/11034 = .01713. If you do take aspirin, it is 104/11037 = .00942. This gives us a risk ratio of .01713/.00942 = 1.818. But the first output says that the risk ratio is .55, which is the inverse of 1.818 (i.e., 1/1.818 = .55). If we reverse the columns, however, the ratio comes out the way I computed it. Once you have it producing what you want, you are all set. There ought to be a better answer to that, but I haven't figured it out.
I have included Fisher's Exact Test in the second set of code given here. It will also print out the odds ratio, which is handy. I have also inserted the oddsratio and riskratio functions. Their output is a bit awkward to read, but fisher.test gave you the odds ratio, so you can look for that value in the printout and figure out what they are up to. I must say that the output for risk ratios and odds ratios is difficult to interpret at best. But even without a calculator, it is clear that a white defendant has about twice the odds of being sentenced to "no death" as a nonwhite defendant. From an odds ratio of 2.02, it is clear that R is talking about the odds of not being sentenced to death, and that it is using nonwhites as the reference group (the row whose estimate prints as 1.00).
### First with the aspirin example
library(epitools)
contin.table <- matrix(c(104, 10933, 189, 10845), byrow = TRUE, ncol = 2)
### Notice how I supply labels.
rownames(contin.table) <- c("Aspirin", "Placebo")
colnames(contin.table) <- c("Heart Attack", "No Heart Attack")
print(contin.table)
riskratio(contin.table, correction = FALSE, rev = "both")
riskratio(contin.table, correction = FALSE, rev = "col")
oddsratio(contin.table, correction = FALSE, rev = "rows")
Risk.Aspirin <- contin.table[1,1]/sum(contin.table[1,])
Risk.Placebo <- contin.table[2,1]/sum(contin.table[2,])   # needed before RR can be computed
RR <- Risk.Aspirin/Risk.Placebo
Odds.Placebo <- contin.table[2,1]/contin.table[2,2]
Odds.Aspirin <- contin.table[1,1]/contin.table[1,2]
OR <- Odds.Aspirin/Odds.Placebo
cat("The Risk Ratio for heart attack given Aspirin = ", RR, "\n")
cat("The Odds Ratio for heart attack given Aspirin = ", OR, "\n")
cat("The two ratios are almost the same because of the very low probability of a heart attack")
_________________________________________________________________
        Heart Attack No Heart Attack
Aspirin          104           10933
Placebo          189           10845

> riskratio(contin.table, correction = FALSE, rev = "both")
$data
        No Heart Attack Heart Attack Total
Placebo           10845          189 11034
Aspirin           10933          104 11037
Total             21778          293 22071

$measure
        risk ratio with 95% C.I.
         estimate lower upper
Placebo      1.00    NA    NA
Aspirin      0.55 0.434 0.698

$p.value
        two-sided
         midp.exact fisher.exact chi.square
Placebo          NA           NA         NA
Aspirin    4.99e-07     5.03e-07   5.69e-07

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

> riskratio(contin.table, correction = FALSE, rev = "col")
$data
        No Heart Attack Heart Attack Total
Aspirin           10933          104 11037
Placebo           10845          189 11034
Total             21778          293 22071

$measure
        risk ratio with 95% C.I.
         estimate lower upper
Aspirin      1.00    NA    NA
Placebo      1.82  1.43  2.31
Next I move on to the death sentence data. That should be relatively clear.
### Death Sentence data of Unah and Boger (2001)
library(epitools)    # Needed for risk and odds ratios
sentencing <- matrix(c(33, 251, 33, 508), ncol = 2, byrow = TRUE,
                     dimnames = list(c("NonWhite", "White"),
                                     c("Death", "No Death")))
cat("Data from Unah and Boger on death sentences \n")
print(sentencing)
chisq.test(sentencing, correct = FALSE)
fisher.test(sentencing)    # Just to be complete
oddsratio(sentencing, correction = FALSE, rev = "neither")
riskratio(sentencing, correction = FALSE, rev = "both")
_____________________________________________________________
         Death No Death
NonWhite    33      251
White       33      508

	Pearson's Chi-squared test

data:  sentencing
X-squared = 7.7099, df = 1, p-value = 0.005492

	Fisher's Exact Test for Count Data

p-value = 0.006809
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.179576 3.467215
sample estimates:
odds ratio 
  2.022094 

odds ratio with 95% C.I.
         estimate    lower    upper
NonWhite 1.000000       NA       NA
White    2.021961 1.214768 3.366206

$p.value
         two-sided
         midp.exact  fisher.exact  chi.square
NonWhite         NA            NA          NA
White    0.006959876   0.00680891 0.005491987

risk ratio with 95% C.I.
         estimate    lower   upper
White     1.00000       NA      NA
NonWhite  1.90493 1.201782 3.01948

$p.value
         two-sided
         midp.exact  fisher.exact  chi.square
White            NA            NA          NA
NonWhite 0.006959876   0.00680891 0.005491987
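To convince yourself which ratios R is reporting here, it helps to do the arithmetic by hand. The following sketch is my own check, not part of the chapter's code; the hand-computed values come out essentially equal to the 2.02 and 1.90 in the printout (any tiny discrepancy reflects the particular estimator the packaged functions use).

### Checking the printout by hand (a sketch)
odds.nodeath.white <- 508/33                  # odds of "No Death" for White defendants
odds.nodeath.nonwhite <- 251/33               # odds of "No Death" for NonWhite defendants
odds.nodeath.white/odds.nodeath.nonwhite      # about 2.02 -- the odds ratio above
risk.death.nonwhite <- 33/(33 + 251)          # risk of a death sentence, NonWhite
risk.death.white <- 33/(33 + 508)             # risk of a death sentence, White
risk.death.nonwhite/risk.death.white          # about 1.90 -- the risk ratio above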
The next analysis is of Jankowski's data on sexual abuse, which come from Section 7.5 in the text. I included it, in part, because it illustrates the issue of which ratios R is computing. You will want to compare what this code produces with what I had in the text. Notice the way in which I supplied the names for the different levels on each dimension. This makes the output much easier to read.
I also wanted you to see how I was able to plot these data. I plotted the risk ratios, relative to the "Never" category, against the abuse category. The plot makes the effect that we are looking at much more apparent.
# Jankowski's study of sexual abuse
library(epitools)
Jankowski <- matrix(c(512, 54, 227, 37, 59, 15, 18, 12), byrow = TRUE, ncol = 2,
                    dimnames = list(c("Never", "Once", "Twice", "3-4"),
                                    c("No Abuse", "Abuse")))
print(Jankowski)
chisq.test(Jankowski, correct = FALSE)
fisher.test(Jankowski)
oddsratio(Jankowski, correction = FALSE, rev = "neither")
riskratio(Jankowski, correction = FALSE, rev = "neither")

### Plotting risk ratios for Sexual Abuse
risk <- c(1, 1.47, 2.14, 4.21)
category <- c(0, 1, 2, 3)
plot(risk ~ category, type = "b", xlab = "Sexual Abuse Category",
     ylab = "Risk Ratio", main = "Risk Ratio Relative to Category = 0")
_______________________________________________________________________
[Figure: risk ratio, relative to the "Never" category, plotted against sexual abuse category]
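In case it is not obvious where the plotted values came from, the following sketch (mine, not part of the chapter's code) roughly reproduces them by hand: the risk of abuse within each category is divided by the risk within the "Never" category. The hand values come out at approximately 1, 1.47, 2.12, and 4.19, essentially the values plotted above; any small differences reflect the estimator that riskratio() reports.

### Where the plotted risk ratios come from (a sketch)
risk.by.category <- c(54/566, 37/264, 15/74, 12/30)   # risk of abuse within each category
risk.by.category/risk.by.category[1]                  # each risk relative to the "Never" category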
The complete R code page for this chapter includes a number of examples that I have skipped over on this page. I don't want to cover many more of them here. One of the sections in those sets of code involves the likelihood ratio test, and I will elaborate on it here. I put that in just to show an example of creating our own function. You are probably not going to create functions like this, but you will sometimes come across them in other code, and I think you should have at least a vague idea of what is going on.
You know that if you want to calculate the mean of x, you just type "mean(x)". Someone once wrote a function to do that calculation and built it into the base package of R; you are calling that function whenever you type "mean(x)". But suppose I want to calculate a likelihood ratio chi-square. One thing that I could do is to write my own function. I start out with a line that looks like likelihood.test <- function(data) { ... }. The "likelihood.test" part is the name of the function. The "function" part tells R that this is going to be a function, and the "data" part says that I am going to pass something to the function, which within the function will be known as "data." The "{" begins the body of the function. Within the function, "data" is the internal name of whatever it is sent, but outside the function there may be nothing at all named data. When I call the function with something like likelihood.test(sentencing), R knows that "sentencing" is the thing to be called "data" within the function. Then I have a bunch of stuff between the two braces. You may remember that I gave the following formula for the likelihood ratio chi-square, where the sum is taken over all of the cells:

chi-square = 2 * sum[ Observed * ln(Observed/Expected) ]
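Before turning to the real function, here is a trivial example of my own, just to illustrate the mechanics of an argument name. The chapter's likelihood.test function, discussed next, follows exactly the same pattern.

### A trivial function, only to illustrate how the argument works (my example, not from the text)
double.it <- function(data) {   # "data" is the name the argument goes by inside the function
   result <- 2*data
   return(result)               # whatever is returned is handed back to the caller
}
x <- c(1, 2, 3)
double.it(x)                    # outside the function the object is called "x"; inside it is "data"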
The first thing that the function does is to use the data to compute a standard Pearson chi-square, calling the result "test." But "test" is an object. It contains the result of the test, but it also contains the expected values that it computed, the observed values that it was fed, and the degrees of freedom. Type str(test) to see what the object contains.
Here is the structure of the "test" object.
> str(test)
List of 9
 $ statistic: Named num 29.6
  ..- attr(*, "names")= chr "X-squared"
 $ parameter: Named int 3
  ..- attr(*, "names")= chr "df"
 $ p.value  : num 1.65e-06
 $ method   : chr "Pearson's Chi-squared test"
 $ data.name: chr "Jankowski"
 $ observed : num [1:4, 1:2] 512 227 59 18 54 37 15 12
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Never" "Once" "Twice" "3-4"
  .. ..$ : chr [1:2] "No Abuse" "Abuse"
 $ expected : num [1:4, 1:2] 494.5 230.6 64.7 26.2 71.5 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Never" "Once" "Twice" "3-4"
  .. ..$ : chr [1:2] "No Abuse" "Abuse"
 $ residuals: num [1:4, 1:2] 0.787 -0.24 -0.703 -1.604 -2.07 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Never" "Once" "Twice" "3-4"
  .. ..$ : chr [1:2] "No Abuse" "Abuse"
 $ stdres   : num [1:4, 1:2] 3.529 -0.798 -2.061 -4.586 -3.529 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "Never" "Once" "Twice" "3-4"
  .. ..$ : chr [1:2] "No Abuse" "Abuse"
 - attr(*, "class")= chr "htest"
My function uses that object to get the information that it needs, i.e., the expected and observed values and the df. The function then goes on to compute something that it names LR, which equals 2 times the sum, over cells, of the observed value times the log of the observed value divided by the expected value. Having calculated LR, the function returns it, and in the calling code that returned value is assigned to "Likelihood." Then everything within the function (i.e., between { and }) disappears from memory. As I said, I don't expect that you will go out and start writing functions. But I hope that you now have at least a vague idea of what they do.
# Create a function to calculate the likelihood ratio chi-square
likelihood.test <- function(data) {
   test <- chisq.test(data)    # a standard Pearson chi-square, just to get Obs, Exp, and df
   Exp <- test$expected
   Obs <- test$observed
   df  <- test$parameter
   LR <- 2*sum(Obs*log(Obs/Exp))
   return(LR)
}
sentencing <- matrix(c(33, 251, 33, 508), ncol = 2, byrow = TRUE,
                     dimnames = list(c("NonWhite", "White"),
                                     c("Death", "No Death")))
Likelihood <- likelihood.test(sentencing)    # Call likelihood.test with the sentencing data
cat("The Likelihood ratio = \n", Likelihood)
df <- 1
prob <- 1 - pchisq(Likelihood, df)
cat("The probability under the null = \n", prob)
As I said in the text, the Mantel-Haenszel test works with contingency tables in three dimensions. R has a function called mantelhaen.test, which takes a three-dimensional array as its input. The code below applies that test to the data and produces the test statistic and probability. The test is straightforward, but I suggest that you work it first with a known example so that you are sure how to set up the array. (In R an "array" can have several dimensions: a vector is a one-dimensional array, and a matrix is a two-dimensional array. R does not recognize a three-dimensional "matrix"; it is just an "array.") As you can see, the printout is not particularly extensive, but it tells you what you need to know.
### Mantel-Haenszel Test -- Berkeley Data
Berkeley <- array(c(353, 207, 17, 8, 120, 205, 202, 391, 138, 279,
                    131, 244, 53, 138, 94, 299, 22, 351, 24, 317),
                  dim = c(2, 2, 5),
                  dimnames = list(Decision = c("Admit", "Reject"),
                                  Gender = c("Male", "Female"),
                                  Major = c("B", "C", "D", "E", "F")))
print(Berkeley)
mantelhaen.test(Berkeley)
__________________________________________________________

	Mantel-Haenszel chi-squared test with continuity correction

data:  Berkeley
Mantel-Haenszel X-squared = 0.096186, df = 1, p-value = 0.7565
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 0.8701288 1.2216845
sample estimates:
common odds ratio 
          1.03103 
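If you want to satisfy yourself that the array really is arranged the way you intended, a couple of lines like the following (my sketch, not part of the chapter's code) are useful.

### Quick checks on the structure of the array (a sketch)
Berkeley[ , , "B"]        # the 2 x 2 Decision-by-Gender table for Major B alone
apply(Berkeley, 3, sum)   # the number of applicants within each major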
Finally, I want to show you two pieces of code that will calculate phi and Cohen's Kappa. I'm not going to comment on these, but they should be self-explanatory. To get them you have to load the "psych" library.
### Phi and Cramer's V
### install.packages("psych")   # only needed the first time
library(psych)
data <- c(104, 10933, 189, 10845)    # the aspirin table, entered as a vector of cell frequencies
phi(data, digits = 3)

### Cohen's kappa
### "cohen" holds the same table expressed as proportions of N = 30; only the frequency version is used below
cohen <- matrix(c(.50, .033, 0, .0667, .1, .033, .1, .067, .1), byrow = FALSE, ncol = 3)
cohen.freq <- matrix(c(15, 2, 3, 1, 3, 2, 0, 1, 3), byrow = TRUE, ncol = 3)
cohen.kappa(cohen.freq)
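As a quick check on the phi coefficient, remember that for a 2 X 2 table phi is simply the square root of chi-square/N. The sketch below is mine, not part of the chapter's code; its result should match the magnitude of what phi() reports (phi() may attach a sign, depending on which diagonal of the table is larger).

### A hand check on phi (a sketch): phi = sqrt(chi-square/N) for a 2 x 2 table
aspirin.table <- matrix(c(104, 10933, 189, 10845), byrow = TRUE, ncol = 2)
chisq.value <- chisq.test(aspirin.table, correct = FALSE)$statistic
N <- sum(aspirin.table)
sqrt(chisq.value/N)       # should agree in magnitude with the value from phi() above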
That is all for now.