Doing Graphics in R

One of the things that R does best is graphics, but whole books have been written around R graphics and I can't come close to covering the whole field. I will instead cover the basic graphs and present some detail on using low-level functions to improve the appearance and the clarity of a graph.

While I was working on this page I stumbled across a web page by Frank McCown at Harding University. That page is available at http://www.harding.edu/fmccown/r/. I thought that it was particularly helpful, in part because he goes one step at a time. I have to admit that in some ways it is better than mine.

First we Need Some Data

I will use the Add.dat data found on the my Web site. This file contains data on an attention deficit disorder score, gender, whether the child repeated a grade, IQ, level and grades in English, GPA, and whether the person exhibited social problems or dropped out of school. They came from a study that Hans Huessy and I published way back in 1985. First I need to read in those data.

data <- read.table("http://www.uvm.edu/~dhowell/fundamentals9/
    DataFiles/Add.dat", header = T)
attach(data)

The simplest command for plotting the variables is the "plot()" command. I could ask it to plot the whole data frame (data), but that would give you a graphic that no mother could love unless it was absolutely needed. (Try it if you like.) Let's work, instead, with only one or two variables. The student's grade point average looks like a good candidate, so let's create a histogram of it. In fact, I will create several histograms to show you different things that I can do.

First I will set up the graphics screen so that I can show four different plots at once. I'm doing that so that I can paste all of them on one figure and save space. To do this I use the "par(mfrow = c(2,2))" command to have two rows and two columns.

Histograms

The first plot will just be a plain old basic histogram. Nothing fancy. These commands are

hist(GPA)
#------------------------------------------------------------------------
### Now add some nice labels.
hist(GPA, main = "Grade Point Average", xlab = "GPA in 9th grade",
  density = 10, col = "blue", breaks = 12)
#------------------------------------------------------------------------
### Make it even prettier
hist(GPA, main = "Grade Point Average", xlab = "GPA in 9th grade",
  density = 10, col = "blue", breaks = 12, probability = TRUE)
#------------------------------------------------------------------------
### And now get overly fancy to show how to add a neat legend and distribution.  
# I know that the mean and variance are 2.45 and 0.74
meanvalue <- expression(paste(bar(x) == 2.45))  # set this up to write out
variance <- expression(s^2 == 0.74)                  # nice text using symbols
legend(1,0.7, c(meanvalue, variance))                # The first two entries are X and Y coordinates (1 and 0.7)                               
lines(density(GPA), col = "red")                     # Fit a "best fitting" curve
#------------------------------------------------------------------------

I'm sure that you could easily figure out how to present those vertical bars in red, how to put in a different mean value, and so on. The "breaks" command specifies how many bars you want. Try changing that and see what happens. One thing that might puzzle you is the vertical axes. For the top two, I have plotted the frequency for each bar. Scores between 2.5 and 3.0 occurred a bit over 25 times, for example. In the bottom two graphs I have plotted density on the Y axis. Think of this as the "relative probability" at which each value occurs. I say more about this in the text.

Scatterplots

You might reasonably wonder how a child's GPA is related to his or her IQ. For that we want a scatterplot, with GPA on the Y axis and IQ on the X axis. In the following command you can read "~" as meaning "as a function of." I didn't want some boring old circles or dots to represent the data points, so I set the plot character to 7, which gave me those nice little boxes with an x in them. Try changing to some other value.

But I wanted to add a best fitting line to those points. To get that I had to jump way ahead in the book to calculate a regression line and then plot that line with the "abline() function. You can ignore that if you wish. Finally, I wanted to put the actual correlation on the figure, but I didn't know where there would be space. So I used the "locator" function. It plots the scatterplot and then sits and waits. When I use my cursor to click on a particular open space on the screen, it writes the legend there. That's kind of neat.


plot(GPA ~ IQ, main = "GPA as a Function of IQ", sub = "9th grade data", pch = 7)
regress <- lm(GPA ~ IQ)            # So that we can add a regression li
abline(regress)
r <- round(cor(GPA, IQ), digits = 3)
text(locator(1), paste("r = ",r))  #Cursor will change to cross. 
				   #Click where you want r printed.

Barplots

When we have a continuous variable, such as GPA, and a discrete variable, such as EngL (the level of English class the student was in), we would often like to plot one against the other. That is not particularly hard to do. First we want to specify that EngL is a factor, so that R doesn't treat it as continuous. Then we want to make up a table giving the mean GPA, standard deviation, and sample size at each level of EngL. We can do both of these things in the code below.

The first half of the code, down through the barplot command, produces a standard bar graph. After that I got a bit fancy. We already know what the sample means are, but this part of the code sets confidence limits on each mean. I don't discuss these until several chapters later, but the code is here if you ever want it in the future. Otherwise you can ignore it. The "tapply" command can be read to say "create a table where GPA is broken down by EngL, and the mean is calculated for each category. (The others calculate the standard deviations and sample sizes.)

### Barplots
library(Hmisc)        # Needed for "errbar()"
EngL <- factor(EngL)
levels(EngL) <- c("Coll Prep", "Normal", "Remedial")
Grades <- tapply(GPA, EngL, mean)
stdev <- tapply(GPA, EngL, sd)
n <- tapply(GPA, EngL, length)
barplot(height = Grades)
top <- Grades + stdev/sqrt(n)
bottom <- Grades - stdev/sqrt(n)
barplot(Grades, ylim = c(0,3.5), ylab = "Grade Point Average", xlab = "English Level")
errbar(x = c(1,2,3), y = Grades, yplus = top, yminus = bottom, xlab = "English Course 
Level", ylab = "Grade Point Average", add = TRUE)

Line Graphs

I will start with a very simple line graph and then move on to something more interesting. Line graphs are a very nice way to show how something changes over time or over the levels of another variable. We will use a dataset given early in the book based on an example of identifying rotated characters as a function of the degree of rotation. The code for reading these data is below, but you probably don't want to get too involved in the syntax. From the graph it is clear that as a character is rotated, it takes longer to identify whether or not it is a specific letter. If you just want a very basic line graph, you can simply use the command "plot(means, type = "l"). The rest of the code focuses on making things pretty. (I have a small confession. It took me a lot of work to get this to come out just the way I wanted. You can play with it with a good bit of trial and error. or you can just use the basic plot.)

data <- read.table("http://www.uvm.edu/~dhowell/
      fundamentals9/DataFiles/Tab3-1.dat", header = TRUE)
attach(data)
means <- tapply(RTsec, Angle, mean)    
     # Mean reaction time for each angle of rotation
print(means)
	0      20       40       60       80      100      120      140     160     180 
1.408000 1.298333 1.470833 1.663500 1.655500 1.738833 1.730000 1.796167 1.611500 1.872333 
plot(means, main = "Means as Function of Angle of Rotation", 
     xlab = "Angle of Rotation", ylab = "Mean Reaction Time", 
     type = "l",
     axes = FALSE)
     axis(1, at = 1:10, lab = c("0", "20", "40", "60", "80", 
          "100", "120", "140", "160", "180"))
     axis(2, at = c(0, 1.2, 1.4, 1.6, 1.8, 2.0))

Now lets go ahead to something only a bit more advanced. I am going to jump ahead to include what is called an interaction plot. I am doing so because this is a function that I think that you should know about, even if you don't have a use for it until later. The basic idea is that we have a variable (call it score) whose mean changes over conditions. Moreover, the change across conditions varies for different groups. To take a specific example from Chapter 13, Spilich and his students looked at how task performance varied as a function of smoking (non-smokers, delayed smokers, and active smokers). They looked to see if changes due to smoking varied across different tasks. Basically, this is three line plots on the same graph, one for each type of task. These data are found in Sec13-5.dat.

data <- read.table("http://www.uvm.edu/~dhowell/methods8/DataFiles/Spilich.dat", header = TRUE)
                   
head(data)  #Ignore variable labeled "covar"
Task Smkgrp score covar
1    1      1     9   107
2    1      1     8   133
3    1      1    12   123
4    1      1    10    94
5    1      1     7    83
6    1      1    10    86 

attach(data)
Task <- factor(Task, levels = c(1, 2, 3), labels = c("PattReg", 
                                                     "Cognitive", "Driving"))
#  Notice how I supply labels for the various levels.
SmokeGrp <- factor(SmokeGrp, levels = c(1,2,3), labels = 
                      c("Nonsmoking", "Delayed", "Active"))
interaction.plot(x.factor = SmokeGrp, 
                 trace.factor = Task, 
                 response = Errors, 
                 fun = mean)

Cute Stuff

I have spent a lot of time figuring out the easiest way to add mathematical symbols to text and to include either a predefined value of some variable or the value that the program has just calculated. So you are going to have to either put up with a section on how to do that or else skip it. At least it is here if you need it. You certainly don't need it now.

To add math symbols you need to use a function like "expression", "substitute", "bquote", and others. I hate to say it, but these are not described well in any source that I have seen, especially in the help menus. (I get them to work mainly by trial-and-error or by copying what someone else did--and many of those people say they did it by trial and error themselves.) So if you need to do something like I do below, just take my code and substitute your own commands and variables in place of mine. And then see if it works. And then fix it again and see if it works, ... (I have been down this path many times.) The following does work, but I would hate to have people know how long it took me to get that equation to come out the way I wanted it.

What I have done is to draw a chi-square distribution on 3 df and then write in the equation for the generic chi-square as a legend. The code and result are shown below.

 
x <- (0:200)/10
a <- dchisq(x,3)          
     #get the density (height) of chisq at each value of x
	 z <- expression(chi^2)
myequation <- expression(paste(chi^2, " = ", frac(1,2^(frac(k,2))*Gamma(frac(k,2) ) )*
chi^2^((k/2)-1)*e^frac(-chi^2,2)))
plot(a ~ x, type = "l", ylab = "density", xlab = z)
legend(5, .20, legend = myequation, bty = "n")    
     # bty = "n" tells it no box around the legend

A Related Problem

It is one thing to know in advance what you want to write on a figure. (e.g. you want to write that chi-square = 2.45.) But what if you want to write a mathematical expression, such as the Greek chi-square, and then follow it with a variable computed from the data. (You don't know what that value will be, so you can't just type it in your command.) In other words, instead of using legend(7,.2,"chi-square = ",67.73), you want to write "chi-square" in Greek, and then follow it by the actual calculated value. That is a bit trickier. You will need to use "bquote" here. (Typing ?bquote is not likely to leave you any more informed.) The code follows. I am drawing 5000 observations from a chi-square distribution, sorting them, and finding the value that cuts off the top 5%. That value will change slightly with each run because these are random draws.

nreps = 5000
csq <- rchisq(nreps,3)                   
     # Draw 5000 values from a chi-sq dist on 3 df
csq <- sort(csq)                         
     # Arrange the obtained values in increasing order
cutoff <- .95*nreps
chisq = round(csq[cutoff], digits = 3)   
     # The 950th value in ordered series
hist(csq,xlim=c(0,20),ylim=c(0,.3), 
main = "Chi-square on 3 degrees of freedom",breaks = 50, 
+ probability = TRUE, xlab = "Chi-Square") 
     #The + sign supplied by R to indicate continuation of 
     #previous line.
legend(7,.2,bquote(paste(chi[.05]^2  == .(chisq))))
lines(density(csq), col = "blue")

The combined code for all of these graphics is given below. You can easily cut and paste this into any editor and run it in R. I have added the command "par(ask = TRUE)" at the top. This means that after printing the first graphic it will pause when you ask for the next graphic and wait for you to tell it to print that graph. You will see the word "next" in little (very little) type in the upper left corner of the graph. Just click on that. (You can comment the "par(ask = TRUE)" line out if you want to submit commands one by one rather than as a batch. You may have to use the console to enter "par(new = FALSE)" if you have run the code before and have "TRUE" stuck in the on position. The other trick is that the fifth graphic has a "locator function." This means that nothing more will happen until you put the cursor somewhere on the graph where you want the text and click once.

# Graphics page for Methods 9
data <- read.table("http://www.uvm.edu/~dhowell/methods9/
    DataFiles/Add.dat", header = T)
attach(data)
#-----------------------------------------------------------
hist(GPA)
#------------------------------------------------------------------------
### Now add some nice labels.
hist(GPA, main = "Grade Point Average", xlab = "GPA in 9th grade",
  density = 10, col = "blue", breaks = 12)
#------------------------------------------------------------------------
### Make it even prettier
hist(GPA, main = "Grade Point Average", xlab = "GPA in 9th grade",
  density = 10, col = "blue", breaks = 12, probability = TRUE)
#------------------------------------------------------------------------
### And now get overly fancy to show how to add a neat legend and distribution.  
# I know that the mean and variance are 2.45 and 0.74
meanvalue <- expression(paste(bar(x) == 2.45))  # set this up to write out
variance <- expression(s^2 == 0.74)                  # nice text using symbols
legend(1,0.7, c(meanvalue, variance))                # The first two entries are X and Y coordinates (1 and 0.7)                               
lines(density(GPA), col = "red")                     # Fit a "best fitting" curve
#------------------------------------------------------------------------

plot(GPA ~ IQ, main = "GPA as a Function of IQ", sub = "9th grade data", pch = 7)
regress <- lm(GPA ~ IQ)            # So that we can add a regression li
abline(regress)
r <- round(cor(GPA, IQ), digits = 3)
text(locator(1), paste("r = ",r))  #Cursor will change to cross. 
				   #Click where you want r printed.


#Adding mathematical symbols and values

a <- dchisq(x,3)          
     #get the density (height) of chisq at each value of x
	 z <- expression(chi^2)
myequation <- expression(paste(chi^2, " = ", frac(1,2^(frac(k,2))*Gamma(frac(k,2) ) )*
chi^2^((k/2)-1)*e^frac(-chi^2,2)))
plot(a ~ x, type = "l", ylab = "density", xlab = z)
legend(5, .20, legend = myequation, bty = "n")    
     # bty = "n" tells it no box around the legend
   

# But this won't allow us to have the value of a variable 
# printed. So We use bquote to do this. (I don't understand 
# bquote, but it works. Calculate the value that cuts off the 
# top 5% of this particular distribution of 1000 values

nreps = 5000
csq <- rchisq(nreps,3)            # Draw 5000 values from a 
     # chi-sq dist on 3 df
csq <- sort(csq)
cutoff <- .95*nreps
chisq = round(csq[cutoff], digits = 3)   
     # The 950th value in ordered series
hist(csq,xlim=c(0,20),ylim=c(0,.3), main = "Chi-square 
on 3 degrees of freedom",
  breaks = 50, probability = TRUE, xlab = "Chi-Square")
legend(7,.2,bquote(paste(chi[.05]^2  == .(chisq))), 
bty = "n")
lines(density(csq), col = "blue")

Specific Topics

dch: