GreenBlueBar.gif GreenBlueBar.gif

R Code for Chapter 9

Correlation


The first two sets of code below plot Figure 9-5 and then Figure 9-8. In Figure 9-5 you see something that you have not seen before. You have the command "reg <- lm(Heart ~ Pace)", followed by the line "abline(reg = reg, col = "red")". Let's start with the second one. That draws what is called a regression line through the data. It is simply the best fitting line for those data points. But where does that line come from? How does R know what line to draw? It obtains it from the previous command. That line just tells R to fit the regression of Heart predicted from Pace. Now here is something new. When I tell it to calculate "reg <- lm(Heart ~ Pace)", it calculates all sorts of stuff. All that stuff is stored away in an "object" called "reg." When it gets to the next command it goes back to that object, finds the equation for the best fitting line, which has been stored in there, and uses that. But there is all sorts of other stuff in "reg" that it doesn't use, but that you will see us use, or at least look at, as we go along.

In Figure 9.8 I have drawn three plots. One for males, one for females, and one for the combined data. What I want you to see here is how I separated out the data for males and females. A command like "HtWt <- Height[Sex == 1] " tells R to create a variable called HtWt from the Height scores, but only those Height scores where the value of Sex = 1. I then do something similar for the other variables. Then I correlated Height and Weight for males. But I don't want the answer to seven decimal places, so I round it off to three. Notice how I did that.

Now I get a bit fancy and ahead of myself. The section of code that begins "mult.model <- lm(Height + Sex) " really belongs in Chapter 11, but I am putting it here in anticipation. This tells R to predict Weight from both Height and Sex at the same time. So we will see what Height can do if you control for Sex, and what Sex can do if you control for Height.

I then go on to put the data together but to use different colored symbols for males and females. I want you to see two things here. First of all you will see "par(mfrow = c(2,1)." You have seen this before, and it tells R to set up the graphics screen with two rows and one column. Then I create my plot of data for males. Then you see "par(new = TRUE). I commented about that earier. It means that I am going to make another plot, and I want that plotted right on top of the first one. (To me "new" should mean make a new plot, but it really means the opposite. Hmmm.)


#  Pace of Life  Figure 9-5
Pace.life <- read.table("http://www.uvm.edu/~dhowell/fundamentals9/DataFiles/Fig9-5.dat", header = TRUE)
attach(Pace.life)
correl <- cor(Heart, Pace)
correl <- round(correl, 3)
plot(Pace, Heart, xlab = "Pace of Life", ylab = "Incidence of Heart Disease", 
  col = "dark red" )
reg <- lm(Heart~Pace)
abline(reg = reg, col = "red")
legend(16, 28, paste("r = ",correl), bty = "n")

# Figure 9-8   I have plotted the three plots on separate screens
HtWt <- read.table("http://www.uvm.edu/~dhowell/fundamentals9/DataFiles/
Fig9-8.dat", header = TRUE)
attach(HtWt)
Males.Ht <- Height[Sex == 1]
Males.Wt <- Weight[Sex == 1]
Females.Ht <- Height[Sex == 2]
Females.Wt <- Weight[Sex == 2]
Correl.males <- round(cor(Males.Ht, Males.Wt),3)
plot(Males.Ht, Males.Wt, xlab = "Height", ylab = "Weight", main = "Males", 
  xlim = c(61, 75), ylim = c(90, 220) )
reg <- lm(Males.Wt ~ Males.Ht)
abline(reg = reg)
legend(64, 200, paste("r = ",Correl.males), bty = "n")
Correl.females <- round(cor(Females.Ht, Females.Wt),3)
plot(Females.Ht, Females.Wt, xlab = "Height", ylab = "Weight", main = "Females",
 xlim = c(61, 75), ylim = c(90, 220) )
reg <- lm(Females.Wt~Females.Ht)
abline(reg = reg)
legend(64, 200, paste("r = ",Correl.females), bty = "n")

# Combined Data, 3
Correl.all <- round(cor(Height, Weight),3)
plot(Height, Weight, xlab = "Height", ylab = "Weight", main = "Combined",
 xlim = c(61, 75), ylim = c(90, 220) )
reg <- lm(Weight ~ Height)
abline(reg = reg)
legend(64, 200, paste("r = ",Correl.all), bty = "n")

mult.model <- lm(Weight ~ Height + Sex)
par(mfrow = c(2,1))

plot(Males.Ht, Males.Wt, xlab = "Height", ylab = "Weight", main = "Combined",
 xlim = c(61, 75), ylim = c(90, 220), col = "blue" )
legend(64, 225, "Blue = Male", bty = "n", col = "blue")
par(new = TRUE)
plot(Females.Ht, Females.Wt, xlab = "Height", ylab = "Weight", main = "",
 xlim = c(61, 75), ylim = c(90, 220), col = "red" )
reg <- lm(Weight ~ Height)
abline(reg = reg)
legend(64, 200, "Red = Female", bty = "n", col = "red")
legend(61, 212, paste("r = ",Correl.all), bty = "n")


# Figure 9.10
par(mfrow = c(1,1))
nreps = 10000
n = 500
correl <- numeric(nreps)   # Create space to store the results.
for (i in 1:nreps) {
  x <- rnorm(n, 0, 1)
  y <- rnorm(n,0,1)
  # y <- rnorm(n, 0, 1) + .30*x   # The correlations will average about .28 
  # if you remove the #.
  correl[i] <- cor(x,y)
}
hist(correl, main = "Distribution of Correlation Coefficients",
 xlab = paste("Correlation Coefficient\nN = ",n), col = "#22748A", breaks = 50)

CI = quantile(correl, probs = c(.025, .975))
cat("2.5% and 97.5% Confidence limits =", CI)

In the text we saw a t test on the statistical significance of the correlation between of life and heart disease. We can make that same test using R. The function we want is called "cor.test()", where the arguments to that function are the names of the two variables and whether you want a two-tailed test. You can see this information just by typing "?cor.test."

# Testing the significance of r for Pace data
data <- read.table("http://www.uvm.edu/~dhowell/fundamentals9/DataFiles/Fig9-5.dat", header = TRUE)
attach(data)
cor.test(Heart, Pace, alternative = "two.sided")

The code next shows the calculations for several of the exercises in the book. In Exercise 9.1 the only thing new is that I removed a variable from the data frame. You can't correlate the name of a country with anything, so I removed it with "data <- data[-1]." The "data[-1]" tells it to remove the first variable from data. The "-" says "remove" and the 1 identifies the variable. The argument "use = "complete.obs" tells it to use only cases that have observations on all variables."

The code for the data on Down syndrome calculates the percentage of cases at each age where the infant has Down syndrome and then plots against observed and ranked data. The next two plots are straight-forward and don't require explanation. The final snippet of code is interesting because it provides a different data set on wine consumption and heart disease from the one used in the text. In this example, r = -.84, which agrees nicely with the earlier set of data where r = -.78. The last two lines are interesting. Earlier I said that when R ran a regression it stored all sorts of stuff away where it could get at it. In that case we needed the slope and intercept of the line, and went right to the object we named "reg" to find them. It knew just where to look because whoever wrote that "plot" function told it where to look. Well it does the same thing here, except that I typed "b <- coef(lm(HeartDisease ~ Consumption)) ." It ran off, did the regression that the lm() function told it to do, and stored the slope and intercept in this regression object under the name of coef, for coefficients. There are two of them, so when I typed "b <- coef(...)" it pulled the two coefficients out of that object and called them b. Since there were two of them, they were b[1] and b[2], where square brackets indicate subscripts. Now it can draw the "abline" using those values. The way that I did it earlier was a bit simpler, but this shows you a bit more about what it is doing. (You may have noticed that I find some strange things interesting. 😃 )



## Exercise 9.1
data <- read.table("http://www.uvm.edu/~dhowell/fundamentals9/DataFiles/Ex9-1.dat", header = TRUE)
attach(data)
plot(InfMort ~ Income)
reg <- lm(InfMort~Income)
abline(reg)
data <- data[-1]    #Remove string variable
cor(data, use = "everything")    # use gets around missing data

DownData <- read.table("http://www.uvm.edu/~dhowell/fundamentals9/DataFiles/Ex9-10.dat", header = TRUE)
attach(DownData)
pctDown <- Down/Births
plot(Age,Births)
plot(Age, pctDown)
ranks <- rank(pctDown)
plot(Age, ranks)
cor(Age, ranks)

# Ex9.16
KatzData <- read.table("http://www.uvm.edu/~dhowell/fundamentals9/DataFiles/Ex9-16.dat", header = TRUE)
attach(KatzData)
reg <- lm(Score ~ SATV)
plot(Score ~ SATV)
abline(reg = reg)
correl <- round(cor(Score, SATV), 3)
legend(700, 45, paste("r = ", correl), bty = "n")

# Ex9.20
AddData <- read.table("http://www.uvm.edu/~dhowell/fundamentals9/DataFiles/Add.dat", header = TRUE)
attach(AddData)
reg <- lm(GPA ~ ADDSC)
plot(GPA ~ ADDSC)
abline(reg = reg)
correl <- round(cor(ADDSC, GPA), 3)
legend(70, 3.75, paste("r = ", correl), bty = "n")


#  A second wine and heart disease dataset
#  From http://stat.fsu.edu/~hill/STA1013/handouts/PQ5.htm
#  I don't know the source.
# Data in MoreWine.dat

data <- read.table(file.choose(), header = TRUE)
attach(data)
model1 <- summary(lm(HeartDisease~Consumption))
correlation <- cor(HeartDisease, Consumption)
plot(HeartDisease~Consumption, xlab = "Wine Consumption", ylab = "Heart Disease")
legend(5, 250,legend = c("r = ", format(cor(HeartDisease, Consumption), digits = 2)),
   bty = "n")
title("Wine Consumption and Heart Disease")
b <- coef(lm(HeartDisease~Consumption))
abline(b[1], b[2])

GreenBlueBar.gif GreenBlueBar.gif dch: