
I want to spend a bit of time going back over what I wanted you to get out of the lab on Thursday. I'll go quickly, because I spent a fair amount of time on these specific points on Thursday.
First Example:
- This was a study of the life expectancy of people in different occupational clusters, based on the smoking behavior of the clusters.
- Ask if there are other variables that might control mortality for these groups other than smoking.
- It is possible that occupations tend to cause both smoking and early death--e.g. stressful aspects of the job.
Data:
| Occupational Group | Mortality | Smoking |
|---|---|---|
| Farmers, foresters, and fishermen | 77 | 84 |
| Miners and quarrymen | 137 | 116 |
| Gas, coke and chemical makers | 117 | 123 |
| Glass and ceramics makers | 94 | 128 |
| Furnace, forge, foundry, and rolling mill workers | 116 | 155 |
| Electrical and electronics workers | 102 | 101 |
| Engineering and allied trades | 111 | 118 |
| Woodworkers | 93 | 113 |
| Leather workers | 88 | 104 |
| Textile workers | 102 | 88 |
| Clothing workers | 91 | 104 |
| Food, drink, and tobacco workers | 104 | 129 |
| Paper and printing workers | 107 | 86 |
| Makers of other products | 112 | 96 |
| Construction workers | 113 | 144 |
| Painters and decorators | 110 | 139 |
| Drivers of stationary engines, cranes, etc. | 125 | 113 |
| Laborers not included elsewhere | 133 | 146 |
| Transport and communications workers | 115 | 128 |
| Warehousemen, storekeepers, packers, and bottlers | 105 | 115 |
| Clerical workers | 87 | 79 |
| Sales workers | 91 | 85 |
| Service, sport, and recreation workers | 100 | 120 |
| Administrators and managers | 76 | 60 |
| Professionals, technical workers, and artists | 66 | 51 |
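If you want to check the table against the summary values quoted in these notes without SPSS, here is a minimal Python sketch. The list names `smoking` and `mortality` and the helper `pearson_r` are my own labels, not SPSS output; the numbers are copied from the table in table order.

```python
from math import sqrt
from statistics import mean

# One entry per occupational group, in the order of the table.
smoking = [84, 116, 123, 128, 155, 101, 118, 113, 104, 88, 104, 129, 86,
           96, 144, 139, 113, 146, 128, 115, 79, 85, 120, 60, 51]
mortality = [77, 137, 117, 94, 116, 102, 111, 93, 88, 102, 91, 104, 107,
             112, 113, 110, 125, 133, 115, 105, 87, 91, 100, 76, 66]


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(smoking, mortality)
print(f"N = {len(smoking)}")
print(f"mean smoking = {mean(smoking):.2f}, mean mortality = {mean(mortality):.2f}")
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

The means and the squared correlation printed here should agree with the 109.00, 102.9, and .513 used in these notes.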
First I want to look at the scatterplot.
- There are several points to make about this figure.
- Note the range of the mortality values. They represent the life expectancy for that occupation.
- That is a huge spread.
- Remember that a score of 60 means that their life expectancy is 60% of average.
- Note that mortality increases with smoking.
- The relationship is quite linear.
- I'm not worried about the heterogeneity of variance. Our sample is much too small to know whether or not that is a problem.
- Note that the correlation = sqrt(.513) = .716.
- This just repeats information that we already knew (except for the st. error of estimate, which I'll return to).
- Because each point represents a group mean on that variable, the correlation is higher than we would expect if each point represented an individual observation.
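- That is directly the result of the CLT (Central Limit Theorem), which states that the standard error of a mean is sigma / sqrt(n), so group means are much less variable than individual observations.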
- The following figure is completely hypothetical, but it is intended to represent the point I am making. The small red dots represent the individual observations, while the large green dots represent the means at each level of smoking.
- The squared correlation that is shown is for the small individual dots. The squared correlation for the means would be .984.
- It is important to recognize that "composite" data do not necessarily reflect the results you would get from individual observations.
- That does not in any way invalidate this study; it just says to be careful with your interpretation.
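The point about composite data can be illustrated with a small simulation. This is a sketch on made-up numbers, not the data behind the hypothetical figure: the ten smoking levels, the 50 people per level, the slope of 2, and the noise standard deviation of 20 are all assumptions chosen only to make the effect visible.

```python
import random
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


rng = random.Random(0)           # fixed seed so the run is repeatable
levels = list(range(1, 11))      # hypothetical smoking levels
per_level = 50                   # hypothetical individuals per level

ind_x, ind_y = [], []            # individual observations
grp_x, grp_y = [], []            # one mean per smoking level
for level in levels:
    ys = [2.0 * level + rng.gauss(0, 20) for _ in range(per_level)]
    ind_x.extend([level] * per_level)
    ind_y.extend(ys)
    grp_x.append(level)
    grp_y.append(mean(ys))

r2_individuals = r_squared(ind_x, ind_y)
r2_means = r_squared(grp_x, grp_y)
print(f"r squared, individuals: {r2_individuals:.3f}")
print(f"r squared, group means: {r2_means:.3f}")
```

Averaging removes most of the within-level noise, so the squared correlation for the means comes out larger than the one for the individual observations.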
- Testing a null hypothesis
- This is the analysis of variance testing the null hypothesis that there is no relationship between smoking and mortality.
- H0 : ρ = 0 (the population correlation is zero)
- Elaborate on this.
- The F statistic is a test on this null hypothesis. If the null is true the F would be somewhere around 1 - 3. Here F = 24.228 on 1 and 23 df.
- The probability of an F this large or larger, if the null hypothesis were true, is .000 (actually it is .000057)
- With a probability that small, we will reject the null and conclude that there is a linear relationship between smoking and mortality.
- This means that we will conclude that the two variables are not independent.
- [As an aside, for my extreme fictitious data, the test on the individual observations would not be significant: p = .07, whereas the test using means would certainly be: p = .001.]
- Note that the analysis of variance table contains the sums of squares for regression and for residual (error).
- SStotal is the sum of squared deviations of Yi around the mean of Y.
- Some of that variability is random noise, and some of it is real--due to the fact that some occupations smoke more than others. We are going to break those two parts out.
- SSregression is the sum of squared deviations of the predicted values of Yi around the mean of Y. In other words, this is the variability in Y that we would expect to see because of variability in X.
- SSresidual is the variation of the obtained Yi values around the predicted Y values.
- Notice that Y - Yhat are just the usual residuals--the degree to which our prediction did not match up with what we got. We just square and sum these.
- The better our prediction, the smaller the residual. The smaller the residual, the smaller the sum of squared residuals.
- What if we took the standard deviation of the residuals?
- This would be sqrt(SSresidual / (N - 2)) = sqrt(MSresidual) = sqrt(150.306) = 12.26
- This is the same value (and the same statistic) as the standard error of estimate reported above.
- I'll say something later about why the denominator is N-2 instead of N-1.
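The breakdown of SStotal into SSregression and SSresidual can be reproduced by hand. The following is a minimal sketch in Python rather than SPSS; the variable names are mine, and the data are the 25 occupational groups from the table.

```python
from statistics import mean

smoking = [84, 116, 123, 128, 155, 101, 118, 113, 104, 88, 104, 129, 86,
           96, 144, 139, 113, 146, 128, 115, 79, 85, 120, 60, 51]
mortality = [77, 137, 117, 94, 116, 102, 111, 93, 88, 102, 91, 104, 107,
             112, 113, 110, 125, 133, 115, 105, 87, 91, 100, 76, 66]

n = len(smoking)
mean_x, mean_y = mean(smoking), mean(mortality)

# Least-squares line for predicting mortality from smoking.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(smoking, mortality))
sxx = sum((x - mean_x) ** 2 for x in smoking)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in smoking]

# The three sums of squares described above.
ss_total = sum((y - mean_y) ** 2 for y in mortality)
ss_regression = sum((p - mean_y) ** 2 for p in predicted)
ss_residual = sum((y - p) ** 2 for y, p in zip(mortality, predicted))

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / 1
ms_residual = ss_residual / (n - 2)
f_stat = ms_regression / ms_residual
se_estimate = ms_residual ** 0.5

print(f"SStotal = {ss_total:.2f}")
print(f"SSregression = {ss_regression:.2f}, SSresidual = {ss_residual:.2f}")
print(f"F(1, {n - 2}) = {f_stat:.3f}")
print(f"standard error of estimate = {se_estimate:.2f}")
```

The two parts add up to SStotal, and the F printed here should match the 24.228 reported by SPSS.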
Note that the F statistic is simply MSregression / MSresidual
Define MS
r² (the coefficient of determination)
If SStotal is all of the variation in Y
and
SSregression is that variation in Y that can be attributable to variation in X, then
r² = SSregression / SStotal = .513
Thus we say that r² is the percentage of variability in Y that can be accounted for by variability in X.
The Regression Equation
I talked in class about Y = bX + a
b is the slope
The rate at which the line rises or falls
The change in Y for a one unit change in X
a is the intercept.
The value of Yhat when X = 0.
SPSS gives us these values, except that they label them differently.
a = "constant"
b = "variable name" = "Smoking"
In this case Yhat = 0.472*Smoking + 51.464
As smoking increases by 1 point, mortality increases by .472 points.
So, if 100 is the average rate of smoking
Yhat(100) = .472*100 + 51.464, which would yield a predicted mortality score of 98.664.
That says that someone with an average rate of smoking doesn't quite have an average rate (100) of mortality.
That bothered me until I realized that for our data the two averages are not 100.
Let X = 109.00, which is the mean of smoking for our data.
Then Yhat = 109*0.472 + 51.464 = 102.9, which is the mean mortality in the sample.
Suppose some occupation has a mean smoking score of 119, which is 10 points above the mean of the whole group.
Then Yhat = 119*0.472 + 51.464 = 107.63 = 4.72 points above average for the sample.
This is just 10.0*0.472.
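The arithmetic above can be checked with a few lines of Python. This sketch simply plugs the rounded SPSS coefficients quoted in these notes into the regression equation; the function name `predict_mortality` is my own.

```python
# Rounded coefficients as quoted in the notes.
SLOPE = 0.472       # b, labeled "Smoking" by SPSS
INTERCEPT = 51.464  # a, labeled "constant" by SPSS


def predict_mortality(smoking_score):
    """Predicted mortality (Yhat) for a given smoking score."""
    return SLOPE * smoking_score + INTERCEPT


for score in (100, 109, 119):
    print(f"smoking = {score}: predicted mortality = {predict_mortality(score):.3f}")

# A 10-point difference in smoking changes the prediction by 10 * slope.
difference = predict_mortality(119) - predict_mortality(109)
print(f"difference for 10 points of smoking = {difference:.2f}")
```

Because the equation is linear, the difference between any two predictions depends only on the slope and the distance between the two X values, not on the intercept.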
Statistical Significance in Regression Equation
We can test the slope and intercept for significance, the same way we do the correlation coefficient.
The table above gives those tests as t tests.
This is the normal Student's t test, but applied to these coefficients.
We can see from the table above that the intercept has a t = 4.796, and the slope has a t = 4.922.
Both of these are significant--meaning that both coefficients are significantly different from 0.
Explain what this means.
If the slope is not zero, it means that we predict that mortality will go up if smoking goes up.
But this is just saying that there is a relationship between smoking and mortality, which is what the correlation told us.
Not surprisingly, the test on the slope and the test on the correlation are the same thing.
They are tested by an F and a t, but in this case the F is just the square of the t.
24.228 = 4.922²
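To see where the two t values come from, and why F is the square of the t on the slope, here is a minimal sketch using the usual standard-error formulas for simple regression. The variable names are mine, and this is a hand calculation, not a reproduction of SPSS internals.

```python
from math import sqrt
from statistics import mean

smoking = [84, 116, 123, 128, 155, 101, 118, 113, 104, 88, 104, 129, 86,
           96, 144, 139, 113, 146, 128, 115, 79, 85, 120, 60, 51]
mortality = [77, 137, 117, 94, 116, 102, 111, 93, 88, 102, 91, 104, 107,
             112, 113, 110, 125, 133, 115, 105, 87, 91, 100, 76, 66]

n = len(smoking)
mean_x, mean_y = mean(smoking), mean(mortality)
sxx = sum((x - mean_x) ** 2 for x in smoking)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(smoking, mortality))

slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_residual = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(smoking, mortality))
ms_residual = ss_residual / (n - 2)

# Standard errors of the two coefficients.
se_slope = sqrt(ms_residual / sxx)
se_intercept = sqrt(ms_residual * (1 / n + mean_x ** 2 / sxx))

t_slope = slope / se_slope
t_intercept = intercept / se_intercept

# F from the analysis of variance: SSregression / MSresidual.
f_stat = (slope ** 2 * sxx) / ms_residual

print(f"t (slope) = {t_slope:.3f}, t (intercept) = {t_intercept:.3f}")
print(f"t (slope) squared = {t_slope ** 2:.3f}, F = {f_stat:.3f}")
```

Both t values are on N - 2 = 23 df, the same error df as the F.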
Residuals
First of all, we can use SPSS to generate all of the predicted values and the residuals.
They did this on Thursday.
The predicted values are just the values that we get from the regression equation.
The residuals are just the difference between actual mortality and predicted mortality
e.g. 77.00 - 91.08726 = -14.08726
SPSS gives you the option of "standardized residuals," which are just the residuals divided by their standard deviation.
The average residual will be 0.00, because we draw our line to underestimate as much as we overestimate.
The variability of our residuals is our "residual variance"
If we took the variance of res_1 it would be the same as MSresidual except that it would be divided by N - 1 instead of N - 2.
If you multiply 144.043 by (24/23) you get 150.306 = MSresidual
The residuals are what are left over after we predict Y from X, so they are that part of Y that cannot be predicted from X. That means that they are uncorrelated with (linearly independent of) X.
This will be important next semester when we talk about the analysis of covariance, where we basically (not exactly) adjust a variable by taking its residual when predicted by a "nuisance" variable that we want to control. This gives us a variable that is independent of our nuisance variable.
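The properties of the residuals listed above can be verified directly. This is a minimal Python sketch, not SPSS output; `residuals` plays the role of the saved variable res_1, and the other names are mine.

```python
from statistics import mean, variance

smoking = [84, 116, 123, 128, 155, 101, 118, 113, 104, 88, 104, 129, 86,
           96, 144, 139, 113, 146, 128, 115, 79, 85, 120, 60, 51]
mortality = [77, 137, 117, 94, 116, 102, 111, 93, 88, 102, 91, 104, 107,
             112, 113, 110, 125, 133, 115, 105, 87, 91, 100, 76, 66]

n = len(smoking)
mean_x, mean_y = mean(smoking), mean(mortality)
sxx = sum((x - mean_x) ** 2 for x in smoking)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(smoking, mortality))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = actual mortality minus predicted mortality.
residuals = [y - (intercept + slope * x) for x, y in zip(smoking, mortality)]

mean_residual = mean(residuals)
var_n_minus_1 = variance(residuals)                        # divides by N - 1
ms_residual = sum(e ** 2 for e in residuals) / (n - 2)     # divides by N - 2

# Cross-product of the residuals with X; zero means they are uncorrelated.
cross_product = sum((x - mean_x) * e for x, e in zip(smoking, residuals))

print(f"first residual = {residuals[0]:.5f}")
print(f"mean residual = {mean_residual:.6f}")
print(f"variance (N - 1) = {var_n_minus_1:.3f}, MSresidual = {ms_residual:.3f}")
print(f"cross-product with smoking = {cross_product:.6f}")
```

Multiplying the N - 1 variance by 24/23 converts it to MSresidual, and the cross-product with smoking is zero apart from rounding error.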
If we correlate our variables we get:
Discuss the pattern in this table.
The data we had on Thursday were the data from students who had not read the passage before answering.
Why would we care about the correlation?
If the test scores did not correlate with SAT, they are measuring something quite different than the SAT is measuring. To the extent that they do correlate, we are measuring some of the same skills.
The correlation matrix follows:

The scatterplot comes next.

Discuss which variable should go on which axis.
The correlation (r = .532) is significant, though not as large as in the previous example.
I ran a resampling program on the correlation coefficient.
The results follow:
Notice that the CI does not include 0.
Notice how wide the CI actually is.
Notice the skewness in the sampling distribution.
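For anyone who wants to try resampling a correlation themselves, here is a sketch of a generic percentile bootstrap. Two loud assumptions: the raw scores for this example are not reproduced in these notes, so the sketch uses the smoking and mortality data from the first example as a stand-in, and this is a textbook percentile method that may differ from the resampling program mentioned above. The function names, the 2000 resamples, and the seed are my own choices.

```python
import random
from math import sqrt
from statistics import mean

# Stand-in data (first example), NOT the test and SAT scores.
smoking = [84, 116, 123, 128, 155, 101, 118, 113, 104, 88, 104, 129, 86,
           96, 144, 139, 113, 146, 128, 115, 79, 85, 120, 60, 51]
mortality = [77, 137, 117, 94, 116, 102, 111, 93, 88, 102, 91, 104, 107,
             112, 113, 110, 125, 133, 115, 105, 87, 91, 100, 76, 66]


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for r, resampling pairs."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # sample cases with replacement
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        # r is undefined if a resample happens to have no variability.
        if len(set(xs)) > 1 and len(set(ys)) > 1:
            stats.append(pearson_r(xs, ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


ci_low, ci_high = bootstrap_ci(smoking, mortality)
print(f"observed r = {pearson_r(smoking, mortality):.3f}")
print(f"95% percentile interval: {ci_low:.3f} to {ci_high:.3f}")
```

Whole cases are resampled, so each X stays paired with its own Y. Because r is bounded at 1, the bootstrap distribution of a sizable positive r tends to have a longer tail toward 0, which is the kind of skewness noted above.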
The regression is given in the following table.


- The F tells us the same thing as the test on the slope.
- r is significant at α = .05
- The regression line is given by Yhat = 0.058*Test + 11.423
- The Standardized Regression coefficient (β)
- This is the coefficient we would get if the data were standardized first.
- It means that a one standard deviation change in Test is associated with a .532 standard deviation change in SAT.
- With one predictor it will always be equal to r, but not when we have multiple predictors.
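The claim that the standardized coefficient equals r with one predictor is easy to check numerically. The raw scores for this example are not in these notes, so this sketch uses the smoking and mortality data from the first example instead; that substitution and the variable names are my assumptions.

```python
from math import sqrt
from statistics import mean, stdev

# Stand-in data (first example), NOT the test and SAT scores.
smoking = [84, 116, 123, 128, 155, 101, 118, 113, 104, 88, 104, 129, 86,
           96, 144, 139, 113, 146, 128, 115, 79, 85, 120, 60, 51]
mortality = [77, 137, 117, 94, 116, 102, 111, 93, 88, 102, 91, 104, 107,
             112, 113, 110, 125, 133, 115, 105, 87, 91, 100, 76, 66]

mean_x, mean_y = mean(smoking), mean(mortality)
sxx = sum((x - mean_x) ** 2 for x in smoking)
syy = sum((y - mean_y) ** 2 for y in mortality)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(smoking, mortality))

slope = sxy / sxx                 # raw-score slope b
r = sxy / sqrt(sxx * syy)         # Pearson correlation

# Standardizing both variables rescales the slope by sd(X) / sd(Y).
beta = slope * stdev(smoking) / stdev(mortality)

print(f"b = {slope:.3f}, beta = {beta:.3f}, r = {r:.3f}")
```

The raw slope depends on the units of X and Y, while the standardized slope does not, which is why it reduces to r here.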
Comparing Correlations
- For the Katz study, the correlation when they did not read the passage was .532, and the correlation when they did read the passage was .691. The sample sizes were 38 and 17, respectively.
- I gave the wrong value for the correlation in the no-passage condition in the lab that I handed out. I have since corrected it.
- Discuss the sampling distribution of the difference between two independent correlations.
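- Fisher's transformation: r' = 0.5 * ln[(1 + r) / (1 - r)] = arctanh(r)
- where arctanh = the inverse hyperbolic tangent
- The test statistic is z = (r1' - r2') / sqrt(1/(N1 - 3) + 1/(N2 - 3))
- Here r1' = 0.593, r2' = 0.850, and z = -0.257 / 0.316 = -0.81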
This is not significant.
- Interpret this result.
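The comparison of the two correlations can be sketched in Python with the standard Fisher r-to-z test for independent samples. The function name `compare_independent_correlations` is my own; the correlations and sample sizes are the ones given above.

```python
from math import atanh, sqrt
from statistics import NormalDist


def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two independent correlations."""
    z1, z2 = atanh(r1), atanh(r2)              # Fisher's transformation
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))     # standard error of the difference
    z = (z1 - z2) / se
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


# No-passage condition: r = .532, N = 38; passage condition: r = .691, N = 17.
z, p = compare_independent_correlations(0.532, 38, 0.691, 17)
print(f"z = {z:.3f}, two-tailed p = {p:.3f}")
```

With N = 17 in one group the standard error is large, so even a difference of .16 between the two correlations falls well short of significance.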
Last revised: 11/03/01