Chi-Square Analyses of Categorical Data



  • Hand back labs


  • Categorical (Qualitative) data
    • Point to differences from data on which we calculate means and standard deviations.
      • We have seen examples of each in lab, with Liddle's means and Siegel's frequencies.
    • Our individual observations can be classified into a few bins or cells
      • The basic result is a set of counts.
  • I'll start with the simplest possible models and move up in complexity.
    • Note that I am always asking whether a particular model will fit a particular set of data.
      • The model usually defines the null hypothesis.
    • A model is a statement about the process behind the data. For example:
      • "The data are normally distributed"
      • "The data depend only on row and column parameters."
      • "The data are equally distributed across a set of cells."
      • etc.

Goodness of fit tests

(Present this as a problem concerning therapeutic touch, rather than as a statistical technique looking for an example.)

  • These are the simplest models.
  • Responses are distributed along one dimension, as opposed to two or more for the next type of test.
  • Example
    • The following came from an excellent website named Chance News, and the data that I am going to talk about come from an article they cite in the New York Times.
    • Therapeutic touch.
    A child's paper poses a medical challenge. The New York Times, 1 April, 1998, A1  Gina Kolata 

    {The paper actually appeared in the 1998 volume of JAMA, the abstract of which is attached. dch. The full paper can be found at . In addition, a presentation on this issue for a Chance Course Seminar can be found at .}

    The practice of therapeutic touch is used in hospitals all over the world and is taught in medical and nursing schools. In this therapy, trained practitioners manipulate something that they call the 'human energy field'. This manipulation is carried out without actually touching the patient's body. Practitioners of this therapy claim that anyone can be trained to feel this energy field.

    Some researchers say that there exists no reliable evidence showing that this technique heals patients. Dr. Donal O'Mathuna, a professor of bioethics and chemistry at the Mount Carmel School of Nursing in Columbus, Ohio, has reviewed more than 100 papers and doctoral dissertations on this technique without finding any convincing data.

    James Randi, a magician who is a well-known skeptic of some types of alternative medicine, has been trying to test the practice of therapeutic touch for years. Only one person has agreed to submit to his test, but when she was tested she did no better than chance in detecting the energy field.

    Emily Rosa, an 11-year-old in Colorado, was able to recruit 21 practitioners of therapeutic touch in an experiment she conducted two years ago. Emily's mother, who is a nurse and a skeptic of this technique, believes that Emily was able to recruit this many subjects because they did not feel threatened by a 9-year-old girl working on a project for a science fair.

    Her test consisted of placing a screen between a subject's eyes and hands, and then holding her own hand over one hand or the other of the subject. The premise of this experiment is that if, in fact, the subject can feel Emily's energy field, then the subject should be able to determine over which hand Emily's hand is being held. Emily conducted 280 tests with 21 subjects, and they identified the correct location of her hand in 44% of the tests.

    The results of her study were reported this month in the Journal of the American Medical Association. Reaction was swift from proponents of the therapeutic touch technique. Meanwhile, Emily recently received a letter from the Guinness Book of World Records, saying she may be the youngest person ever to publish a paper in a major scientific journal.


    (1) One practitioner of therapeutic touch, in response to Emily's results, stated that people who use this technique rely on more than just touch to sense the energy field. They also use 'the sense of intuition and even a sense of sight'. Other users of this method claim that patients who are ill have hot or cold spots in their energy fields in some cases, or have areas that feel tingly. Can you design an experiment that allows the practitioner to use senses of touch, sight, and intuition, and that still tests whether the technique is a valid one?

    (2) How likely is so poor a showing as 123 or fewer correct responses out of 280 tests, given no real ability to detect the energy field?

Emily Rosa    


Emily giving her keynote speech at the 1998 Ig Nobel Award ceremonies.

Emily was also a 1998 invited speaker at the Ig Nobel prize ceremony in Boston, a highly contested prize awarded by a committee associated with the Annals of Improbable Research. The committee awarded an Ig Nobel prize, which is not a badge of honor, to Dolores Krieger for her paper on Therapeutic Touch.

Emily found a 44% accuracy rate out of 280 trials, which would mean that subjects were correct on 123 trials, and incorrect on 157 trials. We want to know if this result is more divergent from a 50/50 split than would be expected by chance.

I recognize that this is actually a worse than chance outcome for those who support TT, but we didn't know that before the experiment began. It still makes sense to test the null hypothesis that the probability of a correct response is .50.

  • We are going to test the hypothesis that there would be 123 or fewer correct responses out of 280 tests if subjects are not able to sense the presence of Emily's hand. (We will also turn this into a two-tailed test, which I much prefer.)
  • We are actually ignoring a problem that I would not ignore in "real life." Because the 280 responses came from only 21 subjects, the responses technically are not independent, although it is hard to believe that, if subjects cannot sense the presence of Emily's hand, the lack of independence would create difficulties.
    • Ask what difficulties it might create if there is such a thing as therapeutic touch.
  • There are several different ways to do this.
  • z test
    • We have already talked about this. (Well, I wrote about it, for those who have read the past notes.)
    • I'm going to elaborate on the z test because it parallels chi-square in the simple case.
    • If we repeat this experiment an infinite number of times, where the probability of a correct response is .50 and there are 280 trials, we'll have a distribution of outcomes.
      • Notice that there have been many occasions this semester when we actually did repeat an experiment many times. This is the first time that we actually see what would happen without going through all of those repetitions. That is really what statistical tests are all about.
    • These outcomes will be normally distributed, with a mean of Np and a variance of Npq, where N = 280, p = .50, and q = 1-.50 = .50
      • These values come from what statisticians know about the binomial distribution, which I mentioned about 2 classes back.
      • This is a mean of 140 and a variance of 70, for a standard deviation of 8.367

  • With a one-tailed test, the probability of a result as low as 123 correct (z = (123 − 140)/8.367 = −2.03) = .0212
  • For a two-tailed test (X ≤ 123 or X ≥ 157) this would be 2(.0212) = .0424
  • We would reject the null hypothesis that subjects are responding at random.
  • But notice that subjects were correct less often than the "therapeutic touch" model would predict. What do we do with this? (That's an interesting problem.)
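The z computation above can be checked in a few lines, using only the Python standard library (the normal CDF is built from math.erf):

```python
import math

# z test for 123 correct responses out of 280 trials under H0: p = .50
N, p = 280, 0.50
correct = 123

mean = N * p                        # Np = 140
sd = math.sqrt(N * p * (1 - p))     # sqrt(Npq) = 8.367

z = (correct - mean) / sd           # about -2.03

def norm_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p_one = norm_cdf(z)                 # one-tailed probability, about .021
p_two = 2 * p_one                   # two-tailed, about .042
print(round(z, 3), round(p_one, 4), round(p_two, 4))
```

The tiny differences from the .0212 and .0424 above are just rounding in the z table.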
  • Chi-square test.
    • An alternative test is to use the chi-square distribution.
    • We calculate the number of correct and incorrect responses we would expect if the null were true, and then we compare that with the number of obtained responses.
    • If H0 were true, we would expect 140 correct responses and 140 incorrect responses.
                 Correct   Incorrect   Total
      Observed     123        157       280
      Expected     140        140       280
    • We compute χ² = Σ(O − E)²/E = (123 − 140)²/140 + (157 − 140)²/140 = 4.13. This statistic follows a chi-square distribution, which depends on the degrees of freedom. For the goodness-of-fit test, the degrees of freedom = c − 1, where c is the number of categories. So we have 2 − 1 = 1 df for this case.
    • Such a distribution is shown below, though I had to draw in the curve for 1 df by hand. This one came from David Lane's Hyperstat site:
    • You could do much worse than wasting a few minutes looking at David Lane's site. It is pretty impressive.

    (My distribution for 1 df should be displaced a bit to the left and down.)

    • You can probably guess from this curve for 1 df that a chi-square of 4.13 or greater is not very likely.
      • In fact, the cutoff for the upper 5% of the distribution with 1 df = 3.84.
      • We will again reject our null hypothesis because 4.13 > 3.84.
      • We can conclude that the number of correct guesses departs from what we would expect under the null.
        • But keep in mind that we are rejecting the null because there are too few correct choices.
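The goodness-of-fit statistic is easy to verify in Python:

```python
# Goodness-of-fit chi-square: sum over categories of (O - E)^2 / E.
observed = [123, 157]            # correct, incorrect
expected = [140.0, 140.0]        # 280 * .50 in each category under H0

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 3))          # 4.129, vs. the 3.84 critical value at 1 df
```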
  • Chi-square and z
    • In the case where we have only two categories (right and wrong), the z test and the chi-square test turn out to be exactly equivalent, though the chi-square is by nature a two-tailed test.
      • The chi-square distribution for 1 df is just the square of the z distribution.
        • sqrt(4.129) = 2.032
      • Note that we had a z = -2.032. Squaring this gives 4.129, which, within rounding, is our chi-square.
      • Note also that the critical value of  z = 1.96, which, when squared, = 3.8416 = the critical value of chi-square.
    • Then why bother with chi-square?
      • The equality only holds if we have 1 df.
      • If we asked subjects to say "right", "left", or "middle", and if Emily chose to put her hand in those three positions 1/3 of the time, then we would have expected frequencies of 93.33, and would have 2 df for our chi-square. I'm not going to show that here, but it is just a simple extension of what we have done. The z test would no longer apply.
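A sketch of that three-category extension. The counts below are made up (the notes don't give any) just to show the mechanics of the df = 2 case:

```python
# Hypothetical right/left/middle counts (NOT Emily's data; invented here only
# to illustrate the extension). Each expected frequency is 280/3 = 93.33.
observed = [100, 95, 85]
expected = [280 / 3] * 3

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))   # 1.25: less than the 5.99 cutoff at 2 df, so do not reject
```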

Contingency Tables

Again, present this as a problem looking for a solution, rather than a solution looking for an example.

I'm going to start with the results of last Thursday's lab

  • We first generated data with the null hypothesis true for Siegel's study. 
  • This was the study in which rats were given injections of morphine in the same, or different, environment in which they had built up tolerance. 
  • Siegel's data are categorized on two dimensions (Group 1 or Group 2, and Survive or Die).
  • His actual results are shown below:

    (Siegel's 2 × 2 table: Group 1 vs. Group 2, crossed with Survive vs. Die. The cell counts appeared in a figure that is not reproduced in these notes.)
  • If context makes no difference, the probability of survival should be the same in both groups. 
  • We made that probability = .5333 because that was Siegel's overall survival rate.
    • Notice that we are replicating the results to be expected when the null is true.
  • We generated 15 chi-square values per student, for 135 values overall.
  • The results follow; they are pooled across the last three years:

Notice the shape of this distribution. It decreases at a negatively accelerated rate, and very few of the values are greater than about 4. In a true chi-square distribution with 1 df, only 5% of the values would exceed 3.84.

Notice how this distribution so closely matches the previous chi-square distribution with 1 df.

I'll come back to what this would look like when the null is false later.
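The lab's simulation can be reproduced along these lines. The group size here (30 rats per group) is an assumption for illustration, since the notes don't record Siegel's actual n's; survival is generated with p = .5333, so the null hypothesis is true:

```python
import random

random.seed(1)
P_SURVIVE = 0.5333       # Siegel's overall survival rate
N_PER_GROUP = 30         # assumed group size; the notes don't give Siegel's n's

def one_chi_square():
    """Build one random 2 x 2 table (Group x Survive/Die) with the null true,
    and return its Pearson chi-square."""
    counts = [[0, 0], [0, 0]]            # rows: groups; cols: survive, die
    for row in (0, 1):
        for _ in range(N_PER_GROUP):
            col = 0 if random.random() < P_SURVIVE else 1
            counts[row][col] += 1
    n = 2 * N_PER_GROUP
    chi_sq = 0.0
    for i in (0, 1):
        for j in (0, 1):
            e = sum(counts[i]) * (counts[0][j] + counts[1][j]) / n
            if e > 0:                    # guard against an empty column
                chi_sq += (counts[i][j] - e) ** 2 / e
    return chi_sq

values = [one_chi_square() for _ in range(10_000)]
# Proportion of simulated values beyond the 3.84 cutoff -- roughly .05 if the
# chi-square approximation is doing its job.
print(sum(v > 3.84 for v in values) / len(values))
```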

Another Example

  • Friedman, Katcher, Lynch, and Thomas (1980) did an interesting study on the effect of having a pet for people recovering from heart attacks. I don't recall whether they supplied a pet or just found people who did, and did not, have pets.
    • What difference would this make from a methodological perspective?
  • They found 92 people who had recently had a heart attack, and classified them in terms of whether or not they had a pet. They then determined whether these people were alive one year later.
    • Here we have two variables of classification:
      • Pet (yes/no)
      • Alive/Dead
      • Notice that in this case the row frequencies are not equal. That is not a problem, and, in fact, it's kind of nice that so few people died.
      • We want to test the null hypothesis that Pet and Survival are independent.
                         Pet
                   Yes    No    Total
          Alive     50    28      78
          Dead       3    11      14
          Total     53    39      92

    • The next task is to find the expected frequencies if subjects fall in cells at random, within the constraints of the row and column totals.
    • If rows and columns are independent, the multiplicative law of probability tells us that the probability of falling in row1 col1 = the product of the probability of row1 times the probability of col1.
    • p(row1) = freq(row1)/N = 78/92 = .848
    • p(col1) = freq(col1)/N = 53/92 = .576
    • Then p11 = .848 × .576 = .4885.
    • If there are 92 subjects overall, then .4885*92 = 44.94 would be expected to fall in cell11
    • We can put this into a formula:
      • E11=(Row1*Col1)/N
      • Or, in the general case, Eij=(Rowi*Colj)/N
    • For Row2Col1, E21 = 14 × 53/92 = 8.07
    • Filling in the rest of the cells, we get
    • Expected Frequencies
                          Pet
                   Yes       No      Total
          Alive   44.94     33.06      78
          Dead     8.07      5.93      14
          Total   53        39         92

    • We will use the same chi-square formula, χ² = Σ(O − E)²/E, but this time we will calculate it over the four cells of the table, which gives χ² = 8.85.

    • For a contingency table, df = (R − 1)(C − 1), which in this case is 1.
    • We already know that with 1 df the critical value of chi-square = 3.84.
    • Because 8.85 > 3.84, we will reject our null hypothesis and conclude that there is a relationship between having a pet and surviving for at least a year after a heart attack.
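The whole calculation, expected frequencies included, as a short Python sketch:

```python
# Pearson chi-square for the 2 x 2 pet-ownership table, building the expected
# frequencies from the marginal totals: E_ij = (Row_i * Col_j) / N.
observed = [[50, 28],    # alive: pet, no pet
            [3, 11]]     # dead:  pet, no pet

n = sum(sum(row) for row in observed)              # 92
row_tot = [sum(row) for row in observed]           # [78, 14]
col_tot = [sum(col) for col in zip(*observed)]     # [53, 39]

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / n
        chi_sq += (o - e) ** 2 / e

print(round(chi_sq, 2))   # 8.85, well beyond the 3.84 critical value at 1 df
```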

More Complex Contingency Tables

  • These are data from Jody Kamon (1998), but I don't recall where she got them.
  • The experiment involves the relationship between problem behavior in children and their parents.
  • Kamon (?) classified parents with respect to whether they exhibited Antisocial Personality Disorder (APD)
  • She also classified children with respect to whether or not they were diagnosed as Conduct Disorder (CD), Oppositional Defiant Disorder (ODD) or no problem (Control)
  • Observed Frequencies

                                Child's Diagnosis
        Parent's Diagnosis    CD    ODD    Control    Total
        APD                   27     16       3         46
        Non-APD               41     54      36        131
        Total                 68     70      39        177

  • We calculate the expected frequencies in exactly the same way we did above. These are given in the following table.
  • Expected Frequencies

                                Child's Diagnosis
        Parent's Diagnosis    CD       ODD      Control    Total
        APD                  17.67    18.19     10.14        46
        Non-APD              50.33    51.81     28.86       131
        Total                68       70        39          177

  • Here we have (2-1)(3-1) = 2 df.
    • The obtained χ² = 13.80, and the critical value of chi-square with 2 df = 5.99
    • Again we will reject the null hypothesis and conclude that the diagnosis of the child is not independent of the diagnosis of the parent.
      • If we look at the data we see that children are more likely to be diagnosed as  CD or ODD if their parents have a diagnosis of APD.
      • We could check this better if we combined the CD and ODD cells, which I'll do next with SPSS.
  • Notice that I slipped into a table larger than 2 X 2 without any problem.
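A small Python check of the 2 × 3 table, using a general chi-square function that works for a table of any size:

```python
def chi_square(observed):
    """Pearson chi-square for any R x C table, with E_ij = (Row_i * Col_j) / N."""
    n = sum(sum(row) for row in observed)
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n
            total += (o - e) ** 2 / e
    return total

# Rows: APD, Non-APD parents; columns: CD, ODD, Control children.
table = [[27, 16, 3],
         [41, 54, 36]]
print(round(chi_square(table), 2))   # 13.8, against a critical value of 5.99 at 2 df
```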

SPSS Analysis

  • First we need to create the data file without combining anything.
    • Enter a column for Child and a column for Parent.
    • You can enter CD, ODD, etc., instead of numbers, but you need to create all six cells.
    • Then create a column called Freq (or whatever) and enter the cell frequencies.
    • Go to the Data menu and select "Weight Cases." Tell it to weight cases by Freq.
    • Analyze/Descriptive Statistics/Crosstabs; put Child on columns and Parents on rows.
    • Be sure to click on Statistics and tell it to compute chi-square.
    • The data look as follows:

    • The printout would be as follows.

That chi-square is, within rounding, the same as we calculated above.

Then I recoded Child into NewChild, making CD and ODD into Problem and leaving Control as Control.

(Note: I had to specify that the new variable was a string variable.)

The results follow:

Interpret this result.


Measuring the size of an effect

 One of the most important recent developments in behavioral statistics is the emphasis on effect sizes in addition to (if not in place of) statistical hypothesis tests.

When it comes to contingency tables, perhaps the best measure is the odds ratio--especially for a 2 X 2 table.

Odds Ratios

  • For 2X2 tables this is one of my favorite topics.

  • Define Odds: (# positive outcomes)/(# of negative outcomes)

  • Using Jody's study with the data for kids reclassified to Problem and No Problem.

    • The odds of being a Problem given that you have a parent with APD are
      • 43/3 = 14.333
    • The odds of being Problem given that you do not have a parent who is APD are
      • 95/36 = 2.639
    • Notice that odds are conditional on something, just like conditional probabilities
    • Notice that odds are not a proportion or a probability, because the denominator is not the total, but the number in the other category.
    • Clearly the odds of being a Problem, given that you have a parent with APD, are higher than the odds of being a Problem given that your parent is not APD. But how much higher?
    • The odds ratio is just what it sounds like, a ratio of odds.
    • The odds ratio here is 14.333/2.639 = 5.43
    • This can be interpreted to mean that you are more than five times as likely to be a Problem if your parent was classed as APD. That's pretty impressive.
  • Going back to the Ritonavir example that we used about two weeks ago,



                     Died or worse    Survived    Total
        Ritonavir          71            472        543
        Placebo           148            399        547
        Total             219            871       1090
    • Chi-square = 33.18, which is clearly significant.
    • The odds of dying in the Ritonavir group = 71/472 = .15
    • The odds of dying in the Placebo group = 148/399 = .37
    • The odds ratio of the Placebo relative to the Ritonavir group = .37/.15 = 2.47 ≈ 2.5, meaning that an AIDS patient was about 2 1/2 times more likely to die in the Placebo group than in the Ritonavir group.
    • If we took the ratio the other way around, as .15/.37 = .405, we would say that your odds of surviving are only 40.5% as high in the Placebo group as in the Ritonavir group. That is simply 1/2.47.
    • Point out that you would be talking about the same odds, and the same odds ratios (in reverse) if you talked about surviving. 
      • Odds of surviving in the Ritonavir group = 472/71 = 6.648 = 1/.15.
      • Odds of surviving in the Placebo group = 399/148 = 2.696 = 1/.37.
  • Comment on odds ratios with larger contingency tables.
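The odds and odds ratio for the Ritonavir table, as a Python sketch:

```python
# Odds = (# in one outcome) / (# in the other); the odds ratio compares groups.
died     = {"ritonavir": 71,  "placebo": 148}
survived = {"ritonavir": 472, "placebo": 399}

odds_dying = {g: died[g] / survived[g] for g in died}         # .15 and .37
odds_ratio = odds_dying["placebo"] / odds_dying["ritonavir"]  # about 2.47

print(round(odds_dying["ritonavir"], 2),
      round(odds_dying["placebo"], 2),
      round(odds_ratio, 2))
```

Taking the reciprocal, 1/2.47 = .405, recovers the ratio in the other direction, exactly as in the notes.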

Likelihood ratio tests

  • This is an alternative way of calculating a χ² statistic as a test of the null hypothesis.

    • There is some evidence that it is a better statistic than Pearson's chi-square with small sample sizes, but I doubt that. The following is a quote from Agresti (1990, p. 49):
    • "When independence holds, the Pearson statistic χ² and the likelihood-ratio statistic G² have asymptotic chi-squared distributions with df = (I−1)(J−1). In fact, χ² and G² are asymptotically equivalent in that case: χ² − G² converges in probability to zero. The limiting results for multinomial sampling also apply to the other sampling schemes...

      "It is not simple to describe the sample size needed for the chi-squared distribution to approximate well the exact distributions of χ² and G². For a fixed number of cells, χ² usually converges more quickly than G². The chi-squared approximation is usually poor for G² when N/IJ < 5. When I or J is large, it can be decent for χ² for nij as small as 1, if the table does not contain both very small and moderately large expected frequencies...."

  • The likelihood-ratio chi-square is heavily used in log-linear models, which are like ANOVA for categorical data.

  • I give the formula in the text

Explain the formula: G² = 2 Σ O ln(O/E), where the sum is taken over all cells of the table.

Do this by hand on the AIDS data. The observed and expected frequencies (E = Row × Column/N) are:

                   Died or worse       Survived        Total
      Ritonavir    O = 71              O = 472          543
                   E = 109.10          E = 433.90
      Placebo      O = 148             O = 399          547
                   E = 109.90          E = 437.10
      Total        219                 871             1090
G² = 2[472 ln(472/433.90) + 71 ln(71/109.10) + 399 ln(399/437.10) + 148 ln(148/109.90)]

= 2(16.89) = 33.77

which is quite close to the Pearson chi-square of 33.18.
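Carrying full precision through the expected frequencies (rather than rounded O/E ratios) gives both statistics in a few lines of Python:

```python
import math

# Likelihood-ratio chi-square, G^2 = 2 * sum of O * ln(O / E), computed
# alongside the Pearson statistic for the Ritonavir table.
observed = [[71, 472],    # ritonavir: died or worse, survived
            [148, 399]]   # placebo:   died or worse, survived

n = sum(sum(row) for row in observed)
row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]

g_sq = pearson = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / n          # E_ij = (Row_i * Col_j) / N
        g_sq += 2 * o * math.log(o / e)
        pearson += (o - e) ** 2 / e

print(round(g_sq, 2), round(pearson, 2))   # about 33.77 and 33.18
```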

The SPSS output is

Last revised: 10/01/01