
This lab was one I used last year, but people had a lot of trouble with it--I went overboard. So I have cut it back this year, and think that I have gotten away from a lot of the nerdy stuff.
This exercise is designed to give you some hands-on understanding of the treatment of categorical variables within multiple regression, and then experience with how we can take standard grouping variables and use them in multiple regression to accomplish the analysis of variance. This treatment is known as the general linear model (GLM).
While it is important that you be able to deal with categorical variables in multiple regression, I am not concerned that anyone ever be prepared to sit down and run a simple analysis of variance in this awkward way. I want you to understand how the analysis of variance relates to GLM, so that you can better understand what the Analysis of Covariance is all about.
To quote from the descriptive document ( http://www-unix.oit.umass.edu/~statdata/stat-rmult.html ), where I got these data:
"Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman's behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight.
"The variables identified in the code sheet given in the table [you don't have that] have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to ascertain if these variables were important in the population being served by the medical center where the data were collected."
These are real data, not simulations, and can be found in btwtHosLem.sav. They were taken from Hosmer and Lemeshow's (1989) book on logistic regression, though we are not going to run a logistic regression. Hosmer and Lemeshow collected data from 189 births in 1986. They recorded such things as the birthweight of the infant, the mother's weight, the number of physician visits in the first trimester, whether the mother smoked or had other problems, and the race of the mother. You can see that these variables are continuous, dichotomous, and categorical, in turn.
The analysis of variance
You can find the answers to this lab at answers.html.
First run an analysis of variance on birthweight as a function of Race.
While you are at it, run a contrast on White vs. Black. The simplest way of doing this is to use the One-way procedure and use 1, -1, and 0 as the contrast coefficients.
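If you want to see the arithmetic behind a 1, -1, 0 contrast outside of SPSS, here is a minimal Python sketch. It assumes the pooled-variance form of the contrast test, and the birthweights in it are made-up numbers, not the Hosmer and Lemeshow data. The names `groups`, `coefficients`, `psi`, and `t_contrast` are my own labels, not anything SPSS produces.

```python
from math import sqrt

# Made-up birthweights (kg) for illustration only; NOT the lab data
groups = {
    "White": [3.2, 3.5, 2.9, 3.4],
    "Black": [2.7, 3.0, 2.6],
    "Other": [2.8, 3.1, 2.9, 3.3, 2.9],
}
# Contrast coefficients for White vs. Black, with Other left out of the comparison
coefficients = {"White": 1, "Black": -1, "Other": 0}

def mean(xs):
    return sum(xs) / len(xs)

# Error term from the overall one-way analysis of variance
n_cases = sum(len(s) for s in groups.values())
df_error = n_cases - len(groups)
ms_error = sum((v - mean(s)) ** 2 for s in groups.values() for v in s) / df_error

# Contrast value: weighted sum of the group means
psi = sum(coefficients[g] * mean(s) for g, s in groups.items())

# Pooled-variance standard error and t for the contrast
se_psi = sqrt(ms_error * sum(coefficients[g] ** 2 / len(s) for g, s in groups.items()))
t_contrast = psi / se_psi

print(f"psi = {psi:.4f}, t({df_error}) = {t_contrast:.4f}")
```

Notice that the Other group has a coefficient of 0 and so drops out of the contrast value, but its scores still contribute to the error term.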
Now create two dummy variables to carry the information about Race. Name one of them White and the other Black. (We do not need a third variable for the "Other" category, because if race isn't Black or White, it must be Other.) For each variable, code the case as a 1 if the subject falls in that race, a 0 if the subject falls in the other named race, and a -1 if the subject is in the Other category. (Strictly speaking, this 1, 0, -1 scheme is known as effect coding rather than dummy coding.) (You may find it easiest to do this using the Transform/Recode/Into Different command--twice.)
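The recoding can be sketched in a few lines of Python. This is only an illustration of the mapping; it assumes race is stored as the text labels "White", "Black", and "Other", which may not be how it is stored in the data file, and `code_race` is a hypothetical helper name.

```python
# (White, Black) codes for each race category under the 1 / 0 / -1 scheme
CODES = {
    "White": (1, 0),
    "Black": (0, 1),
    "Other": (-1, -1),   # Other is -1 on both variables
}

def code_race(race):
    """Return the (White, Black) pair for one case; raises KeyError on an unknown label."""
    return CODES[race]

for race in ["White", "Black", "Other"]:
    white, black = code_race(race)
    print(f"{race:5s}  White = {white:2d}  Black = {black:2d}")
```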
Now run a multiple regression using White and Black as the two predictors, and Birthweight as the dependent variable.
Examine the resulting statistics and compare them with statistics from the analysis of variance. Find as many overlaps as you can.
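If you would like to check this kind of comparison by hand, the following Python sketch runs both analyses on a small set of made-up scores (again, not the lab data). It assumes the 1, 0, -1 coding for White and Black, and it fits the regression by solving the normal equations directly, which is not necessarily how SPSS does the computation. The helper names `mean` and `solve` are my own.

```python
# Made-up birthweights (kg) for illustration only; NOT the lab data
groups = {
    "White": [3.2, 3.5, 2.9, 3.4],
    "Black": [2.7, 3.0, 2.6],
    "Other": [2.8, 3.1, 2.9, 3.3, 2.9],
}
codes = {"White": (1, 0), "Black": (0, 1), "Other": (-1, -1)}

def mean(xs):
    return sum(xs) / len(xs)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

# One design row [1, White, Black] and one score per case
X, y = [], []
for race, scores in groups.items():
    for score in scores:
        X.append([1, *codes[race]])
        y.append(score)

grand = mean(y)
n_cases, n_groups = len(y), len(groups)

# One-way analysis of variance
ss_between = sum(len(s) * (mean(s) - grand) ** 2 for s in groups.values())
ss_within = sum((v - mean(s)) ** 2 for s in groups.values() for v in s)
f_anova = (ss_between / (n_groups - 1)) / (ss_within / (n_cases - n_groups))

# Least-squares regression via the normal equations (X'X) b = X'y
xtx = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
xty = [sum(row[i] * v for row, v in zip(X, y)) for i in range(3)]
b = solve(xtx, xty)
y_hat = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
ss_regression = sum((p - grand) ** 2 for p in y_hat)
ss_residual = sum((v - p) ** 2 for v, p in zip(y, y_hat))
f_regression = (ss_regression / 2) / (ss_residual / (n_cases - 3))

print(f"ANOVA:      SS_between    = {ss_between:.4f}, F = {f_anova:.4f}")
print(f"Regression: SS_regression = {ss_regression:.4f}, F = {f_regression:.4f}")
```

Run it and put the two printed lines side by side, then look for the same correspondence in your SPSS output.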
Now select only the cases for which Race is not Other, and predict birthweight from White. How does that relate to the contrast in the analysis of variance? Note that in both cases we have excluded Other.
One of the questions that sometimes comes up concerns whether the intercept in the regression was the grand mean (weighted) of all of the data points, or whether it was an unweighted mean of the White, Black, and Other means. You are in a position to answer that question. What is the answer?
Now go back to the Anova and add in Smoking and mother's weight as covariates. Be sure to use the options button to ask for both descriptive statistics and estimated marginal means (they won't be the same here).
Now do the same thing using regression, with Smoking, Mother's Weight, and the dummy variables you have created. But this time do it in a hierarchical fashion, using Smoking and Mother's Weight in the first block, and the dummy variables in the second.
Calculate the difference in the SSregression between the two blocks, and compare that to the results of the analysis of covariance.
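Test the difference between R² for the full and reduced models (with and without the dummy variables). You can use the following formula, where f and r are the numbers of predictors in the full and reduced models and N is the number of cases.

$$F(f - r,\; N - f - 1) = \frac{(N - f - 1)\,(R^2_f - R^2_r)}{(f - r)\,(1 - R^2_f)}$$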
How does that answer compare to the F from the analysis of covariance?
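The full-versus-reduced comparison is usually tested with the standard F for a change in R², and the arithmetic is easy to sketch in Python. The function name `f_change` is hypothetical, and the two R² values below are invented purely to show the calculation; substitute the values from your own output. The sketch assumes all 189 births are used, with four predictors in the full model (Smoking, Mother's Weight, White, Black) and two in the reduced model.

```python
def f_change(r2_full, r2_reduced, n, f, r):
    """Return (F, df numerator, df denominator) for the gain in R-squared.

    f and r are the numbers of predictors in the full and reduced models;
    n is the number of cases.
    """
    df_num = f - r
    df_den = n - f - 1
    f_stat = ((r2_full - r2_reduced) / df_num) / ((1 - r2_full) / df_den)
    return f_stat, df_num, df_den

# Hypothetical R-squared values, chosen only to illustrate the arithmetic
f_stat, df1, df2 = f_change(r2_full=0.30, r2_reduced=0.20, n=189, f=4, r=2)
print(f"F({df1}, {df2}) = {f_stat:.3f}")
```

If any cases are dropped for missing data, n should be the number of cases actually used in the regression.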
Look at the estimated means. Why do they differ from the means that we have seen before?
One problem with the analysis of covariance concerns group differences on the covariate. Do our groups differ on either covariate? What effect would you suppose this would have?
Last revised: 04/15/02