
R Code for Chapter 10

R Regression


The first two sets of code are relatively straightforward. The scale for the stem-and-leaf diagrams was set to 1 to reproduce the plots in the text. I set the boxplots to light gray just to be different. Here I used "par(mfrow = c(2,1))" to put both plots on the same screen, one above the other. When it came to the scatterplot I set par back to c(1,1) so that the plot would fill the whole space and look better. I tried to be cute by plotting stars rather than points or circles for the individual data points. If you want to do more of this kind of thing, type ?plot and, near the bottom, click on "points." At the bottom of that screen you will see the codes that produce different shapes. I used "pch = 8". Usually for parameters like that you put the character in quotes, but if you do so here you will plot the number 8 for each data point, so leave out the quotes. The "pch" parameter stands for "plot character."
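To see the difference that the quotes make, here is a small demonstration of my own (it is not part of the chapter's code, and the x and y values are just made up for the illustration):

### A quick look at pch -- my own toy example, not from the text
x <- 1:10
y <- x + rnorm(10)
par(mfrow = c(1,2))
plot(x, y, pch = 8, main = "pch = 8")        # stars
plot(x, y, pch = "8", main = "pch = '8'")    # literally plots the digit 8 at each point
par(mfrow = c(1,1))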


### Table 10.1 
stress.data <- read.table("http://www.uvm.edu/~dhowell/fundamentals9/DataFiles/Tab10-1.dat", header = TRUE)
attach(stress.data)
head(stress.data)
stem(Stress, scale = 1)
stem(Symptoms, scale = 1)
par(mfrow = c(2,1))
boxplot(Stress, horizontal = TRUE, col = "lightgray")
boxplot(Symptoms, horizontal = TRUE, col = "lightgray")

### Figure 10.1
model1 <- lm(Symptoms ~ Stress)         # lm represents 'linear model.'
summary(model1)                                    # print out the results
par(mfrow = c(1,1))
plot(Symptoms ~ Stress, main = "Symptoms as a Function of Stress", xlab = "Stress",
   ylab = "Symptoms", pch = 8)
abline(model1, col = "red")                   # This is why I needed to use lm()
legend(35,160, expression(hat(Y) == 0.7831*Stress + 73.891), bty = "n")
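If you want to see that equation in action, you can ask for a predicted value directly. The following lines are my own aside, using a Stress score of 30 purely as an illustration:

### Predicted Symptoms for Stress = 30 -- my own illustration
predict(model1, newdata = data.frame(Stress = 30))
# The same answer you get by computing 73.891 + 0.7831 * 30 (roughly 97.4) by hand.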


In the next plot, which is the plot on narcissism, notice how I set that up. I asked for NPI plotted as a function of Year. That is what the "~" is all about. You can change that command to "plot(Year, NPI)" and see what you get, but I assure you that you won't like it; I sketch the comparison just after the code below. I could give an explanation of why the plots are done this way, but trial and error is one of the better ways to get what you want. You could type ?plot, go to the section on "See also," and then click on "plot.formula" to see a discussion of this topic.



### Narcissism

Year <- c(1982, 1986, 2002, 2003, 2004, 2005, 2006, 2007)
NPI <- c(.39, .38, .37, .38, .38, .38, .38, .39)
correl <- cor(Year, NPI)                     # correlation between Year and mean NPI score
mod1 <- lm(NPI ~ Year)                       # regress NPI on Year
plot(NPI ~ Year, type = "l", ylim = c(.30, .40))
summary(mod1)
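Here is the comparison that I promised above. It is my own aside, just to show how the formula form and the x, y form of plot() differ when you leave everything else at its defaults:

### Formula form versus x, y form of plot() -- my own aside
par(mfrow = c(1,2))
plot(NPI ~ Year, type = "l", ylim = c(.30, .40), main = "plot(NPI ~ Year)")
plot(Year, NPI, main = "plot(Year, NPI)")    # points, with axis limits chosen automatically
par(mfrow = c(1,1))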
  

The next section deals with the actual data that Sir Francis Galton collected on the heights of parents and their children. The data file contains the "midparent" height, which is the average of mother and father. This is a very famous data set and played an important role in psychology in the first half of the 20th century. Even now it illustrates a very important point.

Galton collected the heights for 928 families (Galton didn't think small), and these can be found on the Web site for this book under the name Galton.dat. One of the first things you might notice when you run this analysis is that the mean heights of the midparents and the children are very nearly equal. As a group they don't really change very much. The standard deviation for the children is larger, but that would happen if they are not all of the same age. Part of the variability is due to age.

The correlation is a bit lower than I would expect, being .46, which means that the "predicted" height of children is only partly a function of the height of their parents. Yes, overall the heights of parents and their children are correlated, but we would expect that. Many other things enter into the final height. The slope of the regression of child height on midparent height is .64, which is less than 1. Tall parents have shorter children and short parents have taller children--on average. This is what Galton was referring to when he spoke of "regression to mediocrity." In the first plot I draw both the actual regression line and the line that we would have if the slope were 1.0. Notice that they are quite different.

The plot that I just described is pretty messy because the heights are rounded off and many data points overlap. So I then went ahead and found the quantiles (the lowest score, the 25th percentile, the 50th percentile, the 75th percentile, and the highest score) for each variable. I did this to remove a lot of the variability in the data. If you draw this plot you again see that the slope is less than one, so it tells roughly the same story.
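If quantile() is new to you, here is a tiny example of my own showing what it returns by default (the five-number summary):

### What quantile() returns -- my own toy example
quantile(c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20))
#   0%  25%  50%  75% 100% 
#  2.0  6.5 11.0 15.5 20.0 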



### Regression to the mean
### Using Galton's data on heights of parents and their children

DF <- read.table(file = "http://www.uvm.edu/~dhowell/fundamentals9/DataFiles/Galton.dat",
 header = TRUE)
attach(DF)
mean(Midparent); sd(Midparent)
mean(Child); sd(Child)
cor(Midparent, Child)
model1 <- lm(Child~Midparent)
summary(model1)
plot(Child ~ Midparent)
abline(model1, col = "red")
abline(0,1, col = "blue")

qp <- quantile(Midparent)
qc <- quantile(Child)
model2 <- lm(qc~qp)
plot(qc~qp, ylim = c(60,70), xlim = c(60,70) )
abline(model2)                        # Plot best fitting line
abline(0,1)                           # Plot line with slope of 1
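One way to see the regression effect numerically, and this is my own aside rather than something from the chapter, is to predict child height for a noticeably short and a noticeably tall set of midparents:

### Predicted child height for short and tall midparents -- my own aside
predict(model1, newdata = data.frame(Midparent = c(64, 72)))
# Because the slope (about .64) is less than 1, both predictions fall
# closer to the overall mean than the midparent heights do.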

In Figure 10.1 and its analysis I basically just plotted the data. I had to create the regression model to allow me to draw the regression line, but I didn't discuss it very much. In the next block of code I go back to that analysis, but focus on the printout of the regression.

The first thing that I did was simply to ask for a printout of model1. That was really not very exciting, as you can see. All that it gave me was the slope and intercept. I don't know why anyone would think that that is all I wanted, but that's what I got. (Note: The fact that I wrapped the command in print() is irrelevant. I would have had the same result if I had merely typed "model1.") But when I asked for a summary of model1, I obtained something much more like what I would expect. If you look at the SPSS printout in Figure 10.3, you will see almost exactly the same results. The order of the printout is a bit different, but that's not important.

At the bottom of the code and printout is something that I have not discussed in the book. It takes us a bit further into R than I normally go, but this is a good place to do so.


### Table 10.1 again, in conjunction with Figure 10.3
stress.data <- read.table("http://www.uvm.edu/~dhowell/fundamentals9/DataFiles/Tab10-1.dat", header = TRUE)
attach(stress.data)
model1 <- lm(Symptoms ~ Stress)     # lm stands for 'linear model.'
print(model1)

############### Printout ############
Call:
lm(formula = Symptoms ~ Stress)

Coefficients:
(Intercept)       Stress  
    73.8896       0.7831  

print(summary(model1))                  # print out the results using summary

############### Summary Results #######
Call:
lm(formula = Symptoms ~ Stress)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.347 -13.197  -1.070   6.755  82.352 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  73.8896     3.2714  22.587  < 2e-16 ***
Stress        0.7831     0.1303   6.012 2.69e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.56 on 105 degrees of freedom
Multiple R-squared:  0.2561,	Adjusted R-squared:  0.249 
F-statistic: 36.14 on 1 and 105 DF,  p-value: 2.692e-08
##############################################################
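As a quick check of my own on that coefficients table, the t value for Stress is just the estimate divided by its standard error:

### Reproducing the t value for Stress by hand -- my own check
0.7831 / 0.1303      # about 6.01, which matches the 6.012 above except for rounding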

names(model1)
############### Printout ###########
names(model1)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"       
 [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"      


At the bottom of that code I asked for "names(model1)." I said earlier that as the lm() function performed its duties, it stored away all sorts of stuff. For one thing it stored the regression coefficients (slope and intercept). That is all you get when you ask it to print out model1. But it also calculated all of the predicted values, subtracted them from the obtained values, and stored those answers away as "residuals." It stored a number of other things that you probably don't care about now, but that might be useful in the future. If I had typed "names(summary(model1))", I would see a somewhat different set of results, including "sigma".


> names(summary(model1))
 [1] "call"          "terms"         "residuals"     "coefficients"  "aliased"       "sigma"        
 [7] "df"            "r.squared"     "adj.r.squared" "fstatistic"    "cov.unscaled" 
 

Notice that there are all sorts of other things stored there. Now if in a new calculation I needed the standard error of the regression, I could just include it in some equation by inserting "summary(model1)$sigma." That may not look like something that you might want to do, but it is a perfectly reasonable thing for some people to want. The reason that I wrote this section is simply to illustrate that when R performs a set of calculations, it stores much of what it computes in "objects," and those objects can be called on to pull out specific statistics. Now I'll stop bothering you about objects--at least for a while.
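Before I do, here is one last short sketch of my own showing how a few of those stored pieces get pulled out of the two objects:

### Pulling stored results out of the model objects -- my own sketch
model1$coefficients           # the intercept and slope
head(model1$residuals)        # the first few residuals
summary(model1)$sigma         # the residual standard error (17.56 in the printout above)
summary(model1)$r.squared     # the same R-squared that summary() reports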

