
Correlation and Regression
10/30/01
Announcements:
- Hand back exams
- Hand back any assignments
Introduction
- We have been looking at differences between means and at the chi-square test of the
independence of two variables.
- Now we are going to look at the relationship between two variables.
- We will use two examples. The first is the relationship between beta-endorphin levels
12 hours before surgery and 10 minutes before surgery. Are high levels at one reading
associated with high levels at the other? (We ran a t test on these data about two
weeks ago.) The second is the relationship between SAT scores and
performance on an SAT-like test when the subjects have not read the passage on which the
questions are based.
Prediction and Relationships
- We want to ask if Y is some function of X, where X and Y
are two different variables.
- Discuss differences between correlation and regression
- Correlation is the word we usually use when we want a single measure of the degree
of relationship between two variables.
- Regression is the word we usually use when we want an equation relating the variables.
- When we have only one predictor, the two approaches tend to blur into one--we almost
never use regression without also speaking of the correlation coefficient. When we have
multiple predictors, we are much more interested in the regression side of things.
- Y is almost always thought of as a dependent variable beyond the
experimenter's control.
- In regression, X is usually (traditionally) thought of as a fixed variable,
even when it really isn't.
- This is called the linear regression model.
- In correlation, X is usually thought of as a random variable.
- This is called the bivariate normal model.
- I'm deliberately using a small sample example just to keep things simple. But don't get
the idea that small samples are a good idea.
- The following data refer to beta-endorphin levels 12 hours and 10 minutes before
surgery. Notice that they are paired by patient. (These are real data.)
Subject | 12 Hours Before | 10 Min. Before | Gain
1 | 10.0 | 20.0 | 10.0
2 | 6.5 | 14.0 | 7.5
3 | 8.0 | 13.5 | 5.5
4 | 12.0 | 18.0 | 6.0
5 | 5.0 | 14.5 | 9.5
6 | 11.5 | 9.0 | -2.5
7 | 5.0 | 18.0 | 13.0
8 | 3.5 | 6.5 | 3.0
9 | 7.5 | 7.5 | 0.0
10 | 5.8 | 6.0 | 0.2
11 | 4.7 | 25.0 | 20.3
12 | 8.0 | 12.0 | 4.0
13 | 7.0 | 15.0 | 8.0
14 | 17.0 | 42.0 | 25.0
15 | 8.8 | 16.0 | 7.2
16 | 17.0 | 52.0 | 35.0
17 | 15.0 | 11.5 | -3.5
18 | 4.4 | 2.5 | -1.9
19 | 2.0 | 2.0 | 0.0
- We could run a t test here, but we did that before.
- It would address an entirely different question.
- We would presumably like to look at the relationship between people's beta-endorphin
scores at the two times.
- Did people who started out high stay high?
- What would it mean if they didn't?
- The first thing we could do is to plot the data.
- The 10 min. data go on the ordinate, because it is logical to predict
forward, not backward, in time.

- Here we see that there is a positive relationship between the two variables--we'll talk
about significance later.
- If we want a measure of the degree of this relationship, the correlation is r = 0.699.
- As we'll see later, the relationship is significant.
- What does that mean?
- In this particular example both of the variables are random--we don't know
what the values of X, or Y, will be before the experiment
begins.
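The correlation quoted above can be checked directly from the table. This is a minimal sketch in plain Python, not the package output used in class; the names `hours_12`, `min_10`, and `pearson_r` are my own labels for illustration.

```python
import math

# Beta-endorphin levels from the table above, paired by patient.
hours_12 = [10.0, 6.5, 8.0, 12.0, 5.0, 11.5, 5.0, 3.5, 7.5, 5.8,
            4.7, 8.0, 7.0, 17.0, 8.8, 17.0, 15.0, 4.4, 2.0]
min_10 = [20.0, 14.0, 13.5, 18.0, 14.5, 9.0, 18.0, 6.5, 7.5, 6.0,
          25.0, 12.0, 15.0, 42.0, 16.0, 52.0, 11.5, 2.5, 2.0]


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp_xy / math.sqrt(ss_x * ss_y)


r = pearson_r(hours_12, min_10)
print(f"r = {r:.3f}")
```

Swapping the two arguments gives the same value, because the correlation treats X and Y symmetrically. That is one way it differs from regression.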
Example with Fixed X
- This is really a regression problem.
- Data from Langlois and Roggman (1990) on page 411 of the text.
- Describe study
- Here I have entered 1, 2, ..., 5 as the power of 2 that gives the number of
pictures averaged in each composite (2, 4, 8, 16, or 32 faces). The dependent
variable is the mean rated attractiveness of the photographs.
Condition | Attract
1 | 2.201
1 | 2.411
1 | 2.407
1 | 2.403
1 | 2.826
1 | 3.380
2 | 1.893
2 | 3.102
2 | 2.355
2 | 3.644
2 | 2.767
2 | 2.109
3 | 2.906
3 | 2.118
3 | 3.226
3 | 2.811
3 | 2.857
3 | 3.422
4 | 3.233
4 | 3.505
4 | 3.192
4 | 3.209
4 | 2.860
4 | 3.111
5 | 3.200
5 | 3.253
5 | 3.357
5 | 3.169
5 | 3.291
5 | 3.290
Notice that there is no sampling error in X, whereas there was in
the previous example.
What does that statement mean?
The scatterplot for these data is given below.
Notice how judged attractiveness increases with the number of faces
included in the composite.
Notice how the variability of data points decreases
as we increase X. This is a no-no from the point of view of assumptions behind
correlation and regression. It will also be a problem with the analysis of variance.
The correlation is of roughly the same order as in the previous example
(r = .56, versus .699 there), and it is significant.
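The points about shrinking variability and the size of the correlation can be checked from the table. This is a rough sketch in plain Python; the dictionary `attract` and the other variable names are my own, and the standard deviations are ordinary sample values (N - 1 in the denominator).

```python
import math
import statistics

# Mean rated attractiveness, six composites per condition (table above).
attract = {
    1: [2.201, 2.411, 2.407, 2.403, 2.826, 3.380],
    2: [1.893, 3.102, 2.355, 3.644, 2.767, 2.109],
    3: [2.906, 2.118, 3.226, 2.811, 2.857, 3.422],
    4: [3.233, 3.505, 3.192, 3.209, 2.860, 3.111],
    5: [3.200, 3.253, 3.357, 3.169, 3.291, 3.290],
}

# Spread of Y within each fixed level of X.
sd_by_condition = {c: statistics.stdev(ys) for c, ys in attract.items()}
for c, sd in sd_by_condition.items():
    print(f"condition {c}: mean = {statistics.mean(attract[c]):.3f}, sd = {sd:.3f}")

# Flatten to paired (x, y) lists and compute Pearson r.
x = [c for c, ys in attract.items() for _ in ys]
y = [v for ys in attract.values() for v in ys]
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
r = sp_xy / math.sqrt(ss_x * ss_y)
print(f"r = {r:.2f}")
```

The decline in spread is not perfectly monotonic (condition 2 is the most variable), but the last two conditions are far tighter than the first three. That is the heterogeneity of variance the notes warn about.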
Third Example--Smoking and Low Birthweight.
I chose this example because it is
one that psychologists deal with, and relates to an important health
problem.
The question is the relationship
between age and low-birthweight (we know they are related), and what happens
when mothers do, and do not, smoke.
Data from http://www.healthystart.net/factualcharts/1999stats/lbwbyraceandorigin.htm
Data on smoking mothers (pooled
across 48 states; DV = % low birthweight):

Mothers who do not smoke:

Notice several things:
- Neither relationship is exactly linear, though we get away with a straight line in the
first one.
- Both relationships are essentially the same, but exaggerated.
- Notice the difference in the mean %.
- I don't quite know what to make of these data, but they are interesting.
- If you get pregnant, don't smoke, especially if you are old and creaky.
- I'm not above a little drum beating.
- To beat another drum, Minitab's monthly web page just reported:
"According to 'The World's Women 2000: Trends and Statistics' (a United Nations
compilation of the latest data documenting progress for women worldwide), an African
woman's lifetime risk of dying from pregnancy-related causes is
1 in 16; in Asia, 1 in 65; and in Europe, 1 in 1,400."
Final Example--Breast Cancer as a function of
Solar Radiation
- These data were taken from a 1991 Newsweek article.
- One of my favorite examples.

The Correlation Coefficient
- The covariance is a measure of how two variables vary
together, but it is an "unscaled" measure:
$cov_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$
- This is a definitional formula, and we probably won't see it again.
- Discuss the correlation coefficient and its calculation.
- Why is this one negative?
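Here is a small sketch of what "unscaled" means, assuming the standard definitional formula (cross-products of deviations, summed and divided by N - 1). The Newsweek data are not reproduced in these notes, so it reuses the beta-endorphin data; the function names are mine. Changing the units of X changes the covariance but leaves r alone, and the sign of both comes from the sum of cross-products.

```python
import math

# Beta-endorphin data from the first example (12 hours and 10 minutes before surgery).
hours_12 = [10.0, 6.5, 8.0, 12.0, 5.0, 11.5, 5.0, 3.5, 7.5, 5.8,
            4.7, 8.0, 7.0, 17.0, 8.8, 17.0, 15.0, 4.4, 2.0]
min_10 = [20.0, 14.0, 13.5, 18.0, 14.5, 9.0, 18.0, 6.5, 7.5, 6.0,
          25.0, 12.0, 15.0, 42.0, 16.0, 52.0, 11.5, 2.5, 2.0]


def covariance(x, y):
    """Sample covariance: summed cross-products of deviations over N - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Covariance rescaled by the two standard deviations."""
    s_x = math.sqrt(covariance(x, x))  # covariance of X with itself is its variance
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)


cov_xy = covariance(hours_12, min_10)
r = correlation(hours_12, min_10)

# Change the units of X (multiply by 10): the covariance changes, r does not.
rescaled = [10 * v for v in hours_12]
cov_rescaled = covariance(rescaled, min_10)
r_rescaled = correlation(rescaled, min_10)

print(f"cov = {cov_xy:.2f}, r = {r:.3f}")
print(f"after rescaling X: cov = {cov_rescaled:.2f}, r = {r_rescaled:.3f}")
```

A negative correlation arises the same way: if high values of X tend to go with low values of Y, most cross-products are negative, so the covariance and r are negative.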
The Adjusted Correlation Coefficient
- I want them to know what this is, but I don't want them to
go away thinking that we use it very often. (We rarely do.)
- What we want is an unbiased estimate of the correlation in
the population:
$r_{adj} = \sqrt{1 - \frac{(1 - r^2)(N - 1)}{N - 2}}$
- Comment that we very rarely use the adjusted coefficient,
even though most programs print it out.
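To give a feel for how small the adjustment is, here is a sketch that assumes the usual one-predictor form of the adjustment, r_adj = sqrt(1 - (1 - r^2)(N - 1)/(N - 2)). The function name `adjusted_r` is mine, and setting a negative adjusted r-squared to zero is my own convention, not something from the text.

```python
import math


def adjusted_r(r, n):
    """Adjusted correlation for one predictor: sqrt(1 - (1 - r^2)(N - 1)/(N - 2))."""
    adj_r_squared = 1 - (1 - r ** 2) * (n - 1) / (n - 2)
    # Assumption: with a small r and a small N the adjusted r^2 can go negative;
    # report 0 in that case rather than take the root of a negative number.
    return math.sqrt(max(adj_r_squared, 0.0))


# Beta-endorphin example: r = 0.699 with N = 19 patients.
r_adj = adjusted_r(0.699, 19)
print(f"adjusted r = {r_adj:.3f}")
```

For the endorphin data the adjustment pulls r from .699 down to about .677. With a larger N the two values would be closer still.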
The Regression Line
- Here we are looking for the best straight line that can be
fit to these data.
- I have included those lines in the plots above.
- We want an equation of the form:
$\hat{Y} = bX + a$
where b = slope and a = intercept
Define slope and intercept.
- This is a general equation for any straight line.
- We solve a set of equations for a and b
such that
$\sum (Y - \hat{Y})^2$
is a minimum.
- There are an infinite number of lines with that slope, and
another infinite number of lines with that intercept, but only one line with both that
slope and intercept.
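As a sketch of the least squares idea, the slope and intercept can be computed from the usual closed-form solutions (b = sum of cross-products over the sum of squares of X, and a = mean of Y minus b times the mean of X). The example uses the beta-endorphin data because the cancer data are not listed in these notes; the function names are hypothetical.

```python
# Beta-endorphin data: X = 12 hours before surgery, Y = 10 minutes before.
hours_12 = [10.0, 6.5, 8.0, 12.0, 5.0, 11.5, 5.0, 3.5, 7.5, 5.8,
            4.7, 8.0, 7.0, 17.0, 8.8, 17.0, 15.0, 4.4, 2.0]
min_10 = [20.0, 14.0, 13.5, 18.0, 14.5, 9.0, 18.0, 6.5, 7.5, 6.0,
          25.0, 12.0, 15.0, 42.0, 16.0, 52.0, 11.5, 2.5, 2.0]


def least_squares(x, y):
    """Return (a, b) for the line Y-hat = bX + a that minimizes squared error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    b = sp_xy / ss_x
    a = mean_y - b * mean_x
    return a, b


def sum_squared_errors(a, b, x, y):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))


a, b = least_squares(hours_12, min_10)
best = sum_squared_errors(a, b, hours_12, min_10)

# Nudging either coefficient away from the solution makes the fit worse.
worse_slope = sum_squared_errors(a, b + 0.1, hours_12, min_10)
worse_intercept = sum_squared_errors(a + 0.5, b, hours_12, min_10)

print(f"slope b = {b:.3f}, intercept a = {a:.3f}")
print(f"SSE: best = {best:.1f}, slope + 0.1 = {worse_slope:.1f}, "
      f"intercept + 0.5 = {worse_intercept:.1f}")
```

The slope here is close to 2, which says that a one-unit difference 12 hours before surgery goes with roughly a two-unit difference 10 minutes before.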

SPSS Analysis of these Cancer and Solar Radiation Data



- Discuss all parts of this printout:
- Include the Anova table and explain what's going on
- Ask what an intercept of 0 would mean. (In this case I
can't imagine that it would mean much, because I can't imagine a case where solar
radiation really = 0.)
- Discuss the slope
- What if the slope were greater or less than it is?
- What if the slope were 0?
- What if we were plotting the same general variable on both
axes (as we did with endorphins) and we had a slope of 1.0? What would that mean?
- Point out the tests on these coefficients.
- Go back to the regression line and discuss "least
squares."
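The pieces of a regression printout can be sketched by hand. This is not the SPSS output for the cancer data, which are not listed in these notes. It uses the attractiveness data from the fixed-X example, with the standard formulas for the Anova partition and for the standard error of the slope; the variable names are mine.

```python
import math

# Attractiveness data from the fixed-X example (condition: six ratings each).
attract = {
    1: [2.201, 2.411, 2.407, 2.403, 2.826, 3.380],
    2: [1.893, 3.102, 2.355, 3.644, 2.767, 2.109],
    3: [2.906, 2.118, 3.226, 2.811, 2.857, 3.422],
    4: [3.233, 3.505, 3.192, 3.209, 2.860, 3.111],
    5: [3.200, 3.253, 3.357, 3.169, 3.291, 3.290],
}
x = [c for c, ys in attract.items() for _ in ys]
y = [v for ys in attract.values() for v in ys]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_total = sum((yi - mean_y) ** 2 for yi in y)

# Coefficients of Y-hat = bX + a.
b = sp_xy / ss_x
a = mean_y - b * mean_x

# Anova table: total variation splits into regression and residual parts.
ss_regression = b * sp_xy
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
df_regression, df_residual = 1, n - 2
ms_residual = ss_residual / df_residual
f_ratio = (ss_regression / df_regression) / ms_residual

# t test on the slope; with one predictor, t squared equals F.
se_b = math.sqrt(ms_residual / ss_x)
t_slope = b / se_b

print(f"b = {b:.3f}, a = {a:.3f}")
print(f"SS regression = {ss_regression:.3f}, SS residual = {ss_residual:.3f}, "
      f"SS total = {ss_total:.3f}")
print(f"F(1, {df_residual}) = {f_ratio:.2f}, t = {t_slope:.2f}")
```

With 28 df the two-tailed .05 critical value of t is about 2.05, so a slope of this size would be judged significant, which agrees with the earlier statement about r for these data.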
Before Thursday they should read
Chapter 9, and pay special attention to Fisher's transformation of r, and
how that transformation allows us to test hypotheses.
Last revised: 10/29/01