Entering Data in R

David C. Howell


Downloading the Raw Data Files

The first thing that you need to do in data analysis is to obtain the data. It will save you a lot of time and aggravation if you download all of the data files for this book to your computer. The previous page, which you have probably looked at, discusses downloading those data files and two other sets of files. Having done this, you can open a file without even being connected to the Internet.

For those who have not already read those directions, go to http://www.uvm.edu/~dhowell/methods9/DataFiles/DataFilesASCII.zip. When you click on it, it will automatically download to your "Download" folder. Move it to the appropriate folder, and then click on it. Then click on the "Extract" button near the top of your screen. That is all there is to it.

Reading from a File

There are many ways of reading data into R, and I am only going to discuss a few of them. Most texts or Web pages on R will tell you about other ways. The basic idea is quite simple. You can set up a default directory (e.g. Stat305/DataFiles), as I suggested in the previous section, and write a command to read a particular file name. Or you can enter an entire Internet URL and have R read the file from there. Finally, you can just tell R to open a file, and it will present a window that allows you to locate and click on that file. And if you only have a very small amount of data, it is probably easiest to simply type it in.
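For the last of these options, typing in a small amount of data, a minimal sketch might look like the following. The variable names (score, group, smallData) and the values are hypothetical and serve only as an illustration.

```r
# Hypothetical scores typed in by hand; the values are for illustration only.
score <- c(12, 15, 9, 20, 14)
group <- c("a", "a", "b", "b", "b")

# Combine the two vectors into a data frame, one row per case.
smallData <- data.frame(score, group)

print(smallData)   # show the whole data frame
nrow(smallData)    # number of cases
```

For anything longer than a handful of values, reading from a file is usually less error-prone than typing.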

Suppose that you took my advice and downloaded the zipped data file for the Methods book. In my particular case the unzipped files are saved in a directory (folder) named "methods9/DataFiles." If I want to open the Add.dat file from that folder I can do it in several different ways. I show three options below. The first reads from the Internet. The second specifies a directory on your computer and reads from that. The third pops up a window and has you navigate to the file you want. The phrase "header = TRUE" indicates that the first line of the data file contains variable names. That is true of nearly all the files that I provide. Be sure that "TRUE" is capitalized.


From the Internet:
myData <- read.table("http://www.uvm.edu/~dhowell/methods9/DataFiles/Add.dat", header = TRUE)

From your computer:
setwd("~/Dropbox/methods9/DataFiles")
myData <- read.table("Add.dat", header = TRUE)

Finding a file without specifying where it is:
myData <- read.table(file.choose(), header = TRUE)

Specifying a default directory saves a lot of time and aggravation. The setwd() line tells R that when I ask for a file, it should look in the folder named "~/Dropbox/methods9/DataFiles" on my machine. The easiest way to set this up is to go to the Session menu in RStudio and use it to set the default (working) directory. You can then copy the resulting command from the output and place it at the top of any future R code. (This way you know that you are using the address that your machine wants, whether it is "DataFiles," "methods9/DataFiles," "Dropbox/methods9/DataFiles," "~/Dropbox/methods9/DataFiles," or whatever.) The next line tells R to open "Add.dat," and, because that's all I wrote, it will look in the default directory. The "header = TRUE" tells R that the first line of that file contains variable names. It is important to note that when you use slashes in a file path or URL, they have to be the kind I have used here, often called "forward slashes."
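The default-directory approach can be tried without the book's files. The sketch below is only an illustration: the folder, the file name demo.dat, and its contents are hypothetical stand-ins written to a temporary location, not one of the files I provide.

```r
# Create a temporary folder and a tiny stand-in data file (hypothetical contents).
demoDir <- file.path(tempdir(), "DataFiles")   # file.path() joins with forward slashes
dir.create(demoDir, showWarnings = FALSE)
writeLines(c("ID Score", "1 45", "2 50", "3 49"), file.path(demoDir, "demo.dat"))

oldDir <- setwd(demoDir)   # set the default directory; setwd() returns the previous one
getwd()                    # shows the address your machine is actually using

# Because only a file name is given, R looks in the default directory.
demoData <- read.table("demo.dat", header = TRUE)
names(demoData)            # "ID" "Score", taken from the first line of the file

# With header = FALSE the first line would be read as data instead.
noHeader <- read.table("demo.dat", header = FALSE)
nrow(noHeader)             # one more row than demoData

setwd(oldDir)              # restore the original working directory
```

Saving the old directory and restoring it at the end is a design choice for this demonstration only; in your own scripts you would normally set the directory once at the top and leave it.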

Using the Data

Once you have the data, you will want to use them. Each of the examples above loads your data into what is called a data frame. In these examples it is called "myData." A data frame is basically a box that contains data on a bunch of variables. If you were to list the objects that are available in R by typing ls(), you would get the second line in the following box. You can see that the only thing there is "myData." You don't see any of the variables that are within that data frame. If you now type the next command below, head(myData), you will see the first six lines of the data frame with all of the variable names. (Typing just "myData" would show the whole file.) However, if you entered print(ADDSC) you would be told that there is no such object. So what is wrong?


 ls()
[1] "myData"

 head(myData)
  CaseNum ADDSC Gender Repeat  IQ EngL EngG  GPA SocProb Dropout
1       1    45      1      0 111    2    3 2.60       0       0
2       2    50      1      0 102    2    3 2.75       0       0
3       3    49      1      0 108    2    4 4.00       0       0
4       4    55      1      0 109    2    2 2.25       0       0
5       5    39      1      0 118    2    3 3.00       0       0
6       6    68      1      1  79    2    2 1.67       0       1
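The point that the variables live inside the data frame can be checked directly. The sketch below is an illustration under an assumption: it rebuilds a stand-in for myData from three of the columns in the six rows printed above, rather than reading the full file.

```r
# Stand-in for myData: three columns from the six rows shown by head().
myData <- data.frame(CaseNum = 1:6,
                     ADDSC   = c(45, 50, 49, 55, 39, 68),
                     GPA     = c(2.60, 2.75, 4.00, 2.25, 3.00, 1.67))

names(myData)                # the variable names stored inside the data frame
"ADDSC" %in% names(myData)   # TRUE: ADDSC is a column of myData
exists("ADDSC")              # FALSE in a fresh session: no separate object has that name
str(myData)                  # compact summary of each variable's type and first values
```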

ADDSC and the other variables are locked away in that box. If you want to see, and use, those variables you can do several things. See the box below.


print(myData$ADDSC)
[1] 45 50 49 55 39 68 69 56 58 48 34 50 85 49 51 53 36 62 46 50 47 50 44 50 29 49 26 85 53 53 72
[32] 35 42 37 46 48 46 49 65 52 75 58 43 60 43 51 70 69 65 63 44 61 40 62 59 47 50 50 65 54 44 66
[63] 34 74 57 60 36 50 60 45 55 44 57 33 30 64 49 76 40 48 65 50 70 78 44 48 52 40

OR,
ADDSC <- myData$ADDSC

print(ADDSC)
[1] 45 50 49 55 39 68 69 56 58 48 34 50 85 49 51 53 36 62 46 50 47 50 44 50 29 49 26 85 53 53 72
[32] 35 42 37 46 48 46 49 65 52 75 58 43

OR,
attach(myData)    # Please don't use this.

The first example obtains ADDSC by prepending the name of the data frame and a "$." That works just fine, but it is awkward. You will often see me use the second method. I simply make a copy of the variable named ADDSC in the data frame. Then it is a variable separate from the data frame and can be addressed by its actual name. The third method is a very bad one, but it makes all of the variables available by just entering their names. To see why I object to this method, look at attaching.html.
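The first two methods can be compared side by side in a short sketch. It assumes a stand-in data frame built from the six ADDSC and GPA values in the head() output rather than the full file. The third step uses with(), which is not one of the methods described above; it is offered only as one possible alternative to attach().

```r
# Stand-in for myData, using the six rows shown by head().
myData <- data.frame(ADDSC = c(45, 50, 49, 55, 39, 68),
                     GPA   = c(2.60, 2.75, 4.00, 2.25, 3.00, 1.67))

# 1. Prefix the data frame name and "$".
mean(myData$ADDSC)

# 2. Copy the variable out of the data frame, then use its own name.
ADDSC <- myData$ADDSC
mean(ADDSC)

# 3. Not from the text: with() evaluates an expression inside the data frame,
#    so the name ADDSC is looked up there without attach().
with(myData, mean(ADDSC))

# The copy is independent: changing it does not change the data frame.
ADDSC[1] <- 99
myData$ADDSC[1]   # still 45
```

One thing to keep in mind with the second method is that the copy and the data frame can drift apart, as the last two lines show, so any later change has to be made in the place you actually intend to analyze.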

The following is a list of available files that tell you more about R and its use.
