The General Linear Model
There are three reasons for covering this material.
- This material provides an introduction to the use of "dummy" variables.
- These variables are very useful whenever you have a categorical variable, and are actually more useful in standard multiple regression.
- This material emphasizes the importance of models.
- It causes us to think about how we want to go about testing models, and the alternative ways that we can look at problems.
- It makes it much easier to talk about the analysis of covariance, and related techniques, and to talk about unequal sample sizes and how we want to test them.
This topic starts out as a more difficult way of doing what students already know how to do. But it then goes on to present other material far more simply than any other approach could.
I have tried to remove much of the stuff that doesn't focus on the three reasons that I gave above. I want students to understand the general concepts, and be able to see that they could be applicable in other settings. I am not trying to show people a harder way to run an analysis of variance.
The approach taken here is basically the approach that any statistical package takes, which may help explain some of the subtleties of those packages.
A major subtheme is to show that the analysis of variance, the analysis of covariance, the analysis of multiple regression, and a whole bunch of other things are just variations on a common theme.
I want students to understand the basic idea of coding (dummy) variables, but the specifics are not important.
I am going to use one of the smoking examples from Spilich that we have seen in other contexts. The data file (Spilich.sav) contains data on all groups, but we are only going to look at the group that was given a standard recall task--a cognitive task.
Three basic groups.
- Nonsmokers (people who never smoked)
- Delayed smokers (smokers who had not had a cigarette for several hours)
- Active smokers (smokers who smoked during the task)
The dependent variable was the number of errors made during the recall task.
Standard Analysis of Variance:
Grand mean = 38.778
Plot with error bars (bars represent 95% CI)
It is clear that there are significant differences between groups. I will even go ahead and compare the Non-smokers with the combined smoking groups, and then the two smoking groups with each other. This is for comparison purposes later.
I did this with the one-way procedure and standard contrasts.
Here we can see that Non-smokers differ from smokers, but that the two smoking groups do not differ between themselves.
The GLM approach
First we need to code the data to indicate Groups.
- We already have Groups as 1, 2, and 3, but we are going to do it differently.
- We have to do it differently because our coding is completely arbitrary. We could have coded them as 2, 1, and 3. Any regression against group membership would depend entirely on the order in which we coded, and that's a bad thing.
- We will set up dummy variables that tell us whether a subject is in Group 1 or not, and whether he/she is in Group 2 or not.
- I have called these new variables NonSmoke and Delayed, because they identify those who are in those two groups.
- We don't need to code for Group 3, because if you're not in 1 or 2, you must be in 3.
- The "filter" variable below just selected the Cognitive task, and ignored the other two tasks.
Task Group Errors distract filter NonSmoke Delayed
2.00 1.00 27.00 126.00 1 1.00 .00
2.00 1.00 34.00 154.00 1 1.00 .00
2.00 1.00 19.00 113.00 1 1.00 .00
2.00 2.00 48.00 113.00 1 .00 1.00
2.00 2.00 29.00 100.00 1 .00 1.00
2.00 2.00 34.00 114.00 1 .00 1.00
2.00 3.00 34.00 108.00 1 -1.00 -1.00
2.00 3.00 65.00 191.00 1 -1.00 -1.00
2.00 3.00 55.00 112.00 1 -1.00 -1.00
(Explain why I used -1 for each dummy variable for people in the last group.)
This makes the intercept come out to be the grand mean, and expresses the results in distance from the grand mean, rather than distance from the mean of some arbitrary group.
This idea is important, because if we aren't careful it is easy to get answers to tell us about deviations from some single group, and that usually isn't what we are after.
Here we come to the first important idea. I have taken a categorical variable with 3 (k) levels and turned it into 2 (k-1) new variables. These two variables carry all the information that the single variable did, and are more useful.
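The coding scheme can be written out as a small routine. This is only a sketch in Python (the analyses in these notes were run in SPSS); the function name effect_code is made up for illustration. It takes the 3-level Group variable and returns the two (k-1) coded values, with the last group coded -1 on both.

```python
def effect_code(group, levels):
    """Return k-1 effect-coded values for one observation.

    The last entry of `levels` is the reference level and is coded -1 on
    every column; any other level gets 1 on its own column and 0 elsewhere.
    """
    if group not in levels:
        raise ValueError(f"unknown group: {group!r}")
    *coded_levels, reference = levels
    if group == reference:
        return [-1] * len(coded_levels)
    return [1 if group == level else 0 for level in coded_levels]


levels = [1, 2, 3]                     # 1 = Nonsmokers, 2 = Delayed, 3 = Active
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3]   # Group column of the nine rows shown above
coded = [effect_code(g, levels) for g in groups]

for g, (nonsmoke, delayed) in zip(groups, coded):
    print(g, nonsmoke, delayed)
```

The printed columns should match the NonSmoke and Delayed columns in the data listing.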
Regression Approach using Dummy Variables
I will now simply predict Errors using Nonsmoke and Delayed as my predictor variables. This is a standard multiple regression.
Look first at the Anova test for the regression
F = 4.744, p = .014
This is exactly the same result we got when we ran the traditional Anova.
Explain why this should be.
Look next at the R2 value = .184. This is nothing but eta-squared
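The equivalence can be checked numerically. The following is a sketch in Python with NumPy (an assumption of this example, not the SPSS run described here), and it uses only the nine rows listed in the data table above, so its F and R2 will not equal the values from the full data set. The point is only that the regression and Anova routes give the same numbers.

```python
import numpy as np

# The nine rows listed above (three per group), NOT the full data set.
errors = np.array([27, 34, 19, 48, 29, 34, 34, 65, 55], dtype=float)
group = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
nonsmoke = np.array([1, 1, 1, 0, 0, 0, -1, -1, -1], dtype=float)
delayed = np.array([0, 0, 0, 1, 1, 1, -1, -1, -1], dtype=float)

# Regression route: predict errors from the two coded variables.
X = np.column_stack([np.ones_like(errors), nonsmoke, delayed])
coef, *_ = np.linalg.lstsq(X, errors, rcond=None)
fitted = X @ coef
ss_total = np.sum((errors - errors.mean()) ** 2)
ss_residual = np.sum((errors - fitted) ** 2)
r_squared = 1 - ss_residual / ss_total

# Anova route: between-groups sum of squares from the group means.
ss_between = sum(
    np.sum(group == g) * (errors[group == g].mean() - errors.mean()) ** 2
    for g in (1, 2, 3)
)
eta_squared = ss_between / ss_total

# F computed both ways, with k - 1 and N - k degrees of freedom.
k, n = 3, len(errors)
f_regression = (r_squared / (k - 1)) / ((1 - r_squared) / (n - k))
f_anova = (ss_between / (k - 1)) / ((ss_total - ss_between) / (n - k))

print(r_squared, eta_squared)
print(f_regression, f_anova)
```

Each printed pair should agree to within floating-point error.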
Go to the table headed Coefficients
The following (up to the next major heading) is material that I find important and helpful, but if it adds to information overload, set it aside for now.
Note that the Intercept = 38.778. This is exactly equal to the grand mean of all the groups.
Note that the slope for Nonsmoke = -9.911. This is exactly equal to the difference between the Nonsmoke mean and the grand mean.
Note that the slope for Delayed = 1.156. This is the difference between the Delayed mean and the grand mean.
Why not have a slope for Active???
It would be redundant--if we know the grand mean and the deviation of the other two groups, we can compute the deviation of the 3rd group.
The sum of the deviations from the mean = 0. So, the deviation of the third group is 0 - (-9.911) - 1.156 = 8.755
If I had coded for Active and Delayed, and left out NonSmoke, I would get an intercept of 38.778, slope for Active = 8.755, slope for Delayed = 1.156, and could compute slope for NonSmoke = -9.911. This illustrates that the choice is arbitrary and unimportant.
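These relationships between the coefficients and the means can be verified directly. The sketch below uses Python with NumPy (an assumption of this example) and, again, only the nine rows shown in the data listing, so the numbers differ from the SPSS output for the full data. With equal group sizes, the intercept should be the grand mean and each slope a deviation from it, whichever group is left out.

```python
import numpy as np

# The nine listed rows only: three Nonsmokers, three Delayed, three Active.
errors = np.array([27, 34, 19, 48, 29, 34, 34, 65, 55], dtype=float)
ones = np.ones_like(errors)


def fit(*predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([ones, *predictors])
    coef, *_ = np.linalg.lstsq(X, errors, rcond=None)
    return coef


grand_mean = errors.mean()
mean_ns, mean_ds, mean_as = errors[:3].mean(), errors[3:6].mean(), errors[6:].mean()

# Coding used in the notes: Active is the group coded -1 on both variables.
nonsmoke = np.array([1, 1, 1, 0, 0, 0, -1, -1, -1], dtype=float)
delayed = np.array([0, 0, 0, 1, 1, 1, -1, -1, -1], dtype=float)
intercept, b_nonsmoke, b_delayed = fit(nonsmoke, delayed)

# The omitted group's deviation follows because the deviations sum to zero.
b_active = 0 - b_nonsmoke - b_delayed

# Alternative coding: leave out NonSmoke instead (it is now coded -1, -1).
active_alt = np.array([-1, -1, -1, 0, 0, 0, 1, 1, 1], dtype=float)
delayed_alt = np.array([-1, -1, -1, 1, 1, 1, 0, 0, 0], dtype=float)
intercept_alt, b_active_alt, b_delayed_alt = fit(active_alt, delayed_alt)

print(intercept, grand_mean)
print(b_nonsmoke, mean_ns - grand_mean)
print(b_active, b_active_alt, mean_as - grand_mean)
```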
I forgot to do one additional thing, so I went back and did it. I asked SPSS to compute "deviation contrasts" when it ran the Anova.
Deviation contrasts are comparisons of each mean with the grand mean. (Again, it doesn't do all three--it leaves out one, which in this case was the last one.)
Note that the tests and the probabilities are exactly the same as the tests (and probabilities) on the regression equation.
Why should this be?
What is all of this about?
I want to show that Anova and Regression are basically the same procedure. The only difference here between this regression and standard multiple regression is the use of dummy variables.
There are a lot of important things here, but their importance doesn't show up until we move to more complex analyses.
Now things get interesting.
First, we will take the same example, but with all three tasks, and create dummy variables for the different tasks as well. (Again, we create dummy variables for only two of the tasks.)
Then we create interaction dummy variables by multiplying our dummies together to create 4 new variables.
Nonsmoke*Patrec, Nonsmoke*Cognit, Delayed*Patrec, and Delayed*Cognit
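A sketch of how one row of the full set of predictors could be built, in plain Python. The function name design_row and the label "Other" for the third (uncoded) task are placeholders for illustration only. Each interaction variable is simply the product of one group dummy and one task dummy, which gives 2 + 2 + 4 = 8 predictors in all.

```python
# Effect codes for each factor; the last level of each is coded -1 on both columns.
GROUP_CODES = {"NonSmoker": (1, 0), "Delayed": (0, 1), "Active": (-1, -1)}
TASK_CODES = {"Patrec": (1, 0), "Cognit": (0, 1), "Other": (-1, -1)}  # "Other" is a placeholder name


def design_row(group, task):
    """Return the 8 predictors for one subject: 2 group, 2 task, 4 interaction."""
    nonsmoke, delayed = GROUP_CODES[group]
    patrec, cognit = TASK_CODES[task]
    return {
        "Nonsmoke": nonsmoke,
        "Delayed": delayed,
        "Patrec": patrec,
        "Cognit": cognit,
        "Nonsmoke*Patrec": nonsmoke * patrec,
        "Nonsmoke*Cognit": nonsmoke * cognit,
        "Delayed*Patrec": delayed * patrec,
        "Delayed*Cognit": delayed * cognit,
    }


for group in GROUP_CODES:
    for task in TASK_CODES:
        print(group, task, list(design_row(group, task).values()))
```

Note that a subject in the last group and the last task gets 1 on all four interaction variables, because (-1)(-1) = 1.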
The overall Factorial Anova follows:
We will start with the complete multiple regression using all dummy variables as predictors. Here we are trying to explain variance in errors as a function of everything we know about groups, tasks, and their interactions.
Comment on SSregression as being equivalent to "Model" in regular Anova (Explain why 8 df.)
Comment on the error term.
This error term is all of the variance in "errors" that cannot be explained on the basis of groups, tasks, or their interactions. This is the standard error term in the factorial analysis of variance.
Now students should understand why SPSS presents the Anova summary table the way it does, even if that is a confusing way to have chosen to present it.
Removing the Interaction Terms gives:
The difference in the SSregression is 31744.726 - 29016.074 = 2728.652.
This is the SS for the interaction term in the Anova.
Removing the Task Terms (after replacing interaction) gives:
If we subtract this SSregression from the SSregression in full model, we get
31744.726 - 3083.200 = 28661.526
This is the effect of Task
Lastly, look at the model with dummy variables for Task and Interaction, but no dummy variables for Condition
Here the difference between the full and reduced models is
31744.726 - 31390.178 = 354.548
This is the effect of Condition.
Notice that each of these is basically what we called a hierarchical model earlier. The difference between the full model and a reduced model is what the extra variable(s) explain over and above (controlling for) the other variables.
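The full-versus-reduced logic can be sketched in Python with NumPy (an assumption of this example). The data here are simulated, not the Spilich data: a balanced 3 x 3 design with 5 made-up scores per cell. For each effect, the sum of squares is SSregression for the full model minus SSregression for the model with that effect's dummy variables removed.

```python
import numpy as np

rng = np.random.default_rng(0)
codes = np.array([[1, 0], [0, 1], [-1, -1]], dtype=float)  # effect codes for 3 levels

# Simulated balanced 3 x 3 design, 5 scores per cell (illustrative data only).
rows, scores = [], []
for g in range(3):
    for t in range(3):
        for _ in range(5):
            cond, task = codes[g], codes[t]
            inter = np.outer(cond, task).ravel()   # the four product variables
            rows.append(np.concatenate([cond, task, inter]))
            scores.append(30 + 5 * g + 10 * t + rng.normal(scale=4))
X = np.array(rows)
y = np.array(scores)

COND, TASK, INTER = [0, 1], [2, 3], [4, 5, 6, 7]   # column positions in X


def ss_regression(columns):
    """SSregression for a model with an intercept plus the chosen columns."""
    design = np.column_stack([np.ones(len(y)), X[:, columns]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return np.sum((fitted - y.mean()) ** 2)


ss_full = ss_regression(COND + TASK + INTER)
ss_interaction = ss_full - ss_regression(COND + TASK)
ss_task = ss_full - ss_regression(COND + INTER)
ss_condition = ss_full - ss_regression(TASK + INTER)

print(ss_full, ss_condition, ss_task, ss_interaction)
```

Because this simulated design has equal cell sizes, the three differences should add up to the full-model SSregression.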
From here I get the following models:
Effect        Reduced model contains
Interaction   Main effects (Condition + Task)
Task          Condition + Interaction
Condition     Task + Interaction
But we aren't done.
Yes we are for class. I have left the rest in for completeness, but I don't think that it adds much for clarity the first time through.
Now we will go back to the full model.
(Note: the table lists Inter4 before Inter3--just due to the way I happened to create them.)
This gives us the regression equation
Y = -2.193 Nonsmoke + 0.519 Delayed - 8.615 Patreg + 20.519 Cognit + 1.948 Inter1
- 7.719 Inter2 - 0.563 Inter3 + 0.637 Inter4 + 18.259
Suppose that we take a subject from cell11. This person would have the following data for these variables. 1 0 1 0 1 0 0 0
Therefore, his/her expected score would be
Y = -2.193(1) + 0.519(0) - 8.615(1) + 20.519(0) + 1.948(1)
- 7.719(0) - 0.563(0) + 0.637(0) + 18.259 = 9.399 =?= 9.400
In other words, we have reproduced the mean of cell11, which is what we wanted to do.
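The same arithmetic as a short Python sketch, using the coefficients from the regression equation above and the coded values for a cell11 subject. The dictionary layout is just one convenient way to pair each coefficient with its variable.

```python
# Coefficients copied from the regression equation in these notes.
coefficients = {
    "Nonsmoke": -2.193, "Delayed": 0.519, "Patreg": -8.615, "Cognit": 20.519,
    "Inter1": 1.948, "Inter2": -7.719, "Inter3": -0.563, "Inter4": 0.637,
}
intercept = 18.259

# Coded values for a subject in cell11: 1 0 1 0 1 0 0 0
cell11 = {
    "Nonsmoke": 1, "Delayed": 0, "Patreg": 1, "Cognit": 0,
    "Inter1": 1, "Inter2": 0, "Inter3": 0, "Inter4": 0,
}

predicted = intercept + sum(coefficients[name] * cell11[name] for name in coefficients)
print(round(predicted, 3))  # 9.399, within rounding of the cell mean of 9.400
```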
I will quit here for now.
(I'll then go on to unequal sample sizes and the analysis of covariance.)
Last revised: 04/15/02