Using the R Programming Environment

Introduction

In recent years a programming environment named R has become more and more popular. Its ancestor was written under the name "S" at Bell Labs by Chambers, Becker, and Wilks, starting in the mid-1970s, with the version most people know dating from the late 1980s. Along the way the companion programming environment named R was introduced (by Ross Ihaka and Robert Gentleman at the University of Auckland in the early 1990s), along with a commercial product named S-PLUS. There are some minor differences among the three, but nothing major. R appears to have made S obsolete, and I don't think that you can even find it anymore. We will concentrate on R because it is freely available, with a huge collection of functions that other people have added to it and are continuing to add. I don't know the formal distinction between a language and a programming environment. People would call Fortran a language, but they call R a programming environment. Similarly, given my (old) background, I would call what we write a "program," whereas others would call it a code snippet or a command file. I'll stick with "program" even if that is out-of-date.

It is important to recognize that R came out of the Unix operating system environment, which explains some of the features you will come across. For example, Unix (and Linux) is case-sensitive, and so is R, so "Print" and "print" are two different commands, and "Print" doesn't exist and will give an error message. To those of us who grew up with early programming languages or with Windows and the Mac, Unix takes a bit of getting used to. If someone tries to get you to write the command 'print("Hello World")' or refers to a dummy file name as "foo" or "foobar," it's a good bet you are up against a Unix creature. (They all love instructing you to write out "Hello world" as your first step. Can't we be more creative?)
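To see the case-sensitivity point for yourself, here is a minimal sketch. It assumes a fresh R session in which nobody has defined a function named "Print"; the variable name msg is just something I made up to hold the error message.

```r
# print() exists in base R; Print() does not, because names are case-sensitive.
print("Hello World")

# exists() checks whether a name is defined without triggering an error.
exists("print")
exists("Print")

# Calling the wrong name raises an error. tryCatch() captures the message
# instead of stopping, so we can look at what R complains about.
msg <- tryCatch(Print("Hello World"),
                error = function(e) conditionMessage(e))
msg
```

The same rule applies to your own variable names, so "Score" and "score" would be two different objects.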

Whereas most people generally write a program and then execute it, Unix types frequently like to work with what is called "the Command Line." This means that you type a command and it is executed, then you type the next command and it is executed, etc. We will do a lot of that, but it takes some getting used to. We will also combine commands into a "program" and execute that all at once. Finally, Unix creatures have a command called "man" which prints out help ("manual") pages. So if you don't know how to change your working directory you type "man cd" and it will tell you. (Of course that assumes that you know that the name of the command is "cd," but doesn't everyone know that?) R uses the same kind of help system, although the command is "help(setwd)" or, equivalently, "?setwd". That's great because help is always available, but it's bad because the help pages are not always as clear as you would like. In fact some of them make no sense to me.
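As a rough illustration of working at the command line, each line below can be typed and run one at a time. The sketch switches to R's temporary directory only so that it does not depend on any folder on your machine; old_dir is a name I chose for the example.

```r
# Ask for the help page on setwd(); both forms bring up the same page.
help(setwd)
?setwd

# Remember where we are, move somewhere else, check, and move back.
old_dir <- getwd()    # current working directory
setwd(tempdir())      # tempdir() is a scratch folder that R creates for the session
getwd()               # confirm the change
setwd(old_dir)        # return to where we started
```

In practice you would replace tempdir() with the path to your own folder, written in quotation marks.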

These pages will not make you an accomplished R programmer. I hope that they will at least make you sort of a half-assed programmer. If they do, there are lots of books that will help you take it from there. My intent is to show you how to read in data, how to transform them if necessary, and how to use them to perform statistical calculations. Although R is not a statistical language, its greatest development has been in the fields of statistics, about which I know a reasonable amount, and bioinformatics, about which I know less than nothing. There is almost nothing in statistics that you can't do in R, and if you want to do something even slightly complicated, such as computing a logistic regression, someone has already been there ahead of you and written functions to do that. You just have to call the function and give it the right information.
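To give a feel for what "call the function and give it the right information" means, here is a small sketch of a logistic regression using glm(), which comes with R. The data set (mtcars, also supplied with R) and the choice of variables are mine, picked only for illustration, and fit is an arbitrary name.

```r
# mtcars ships with R. am is transmission (0 = automatic, 1 = manual)
# and wt is the car's weight in thousands of pounds.
# family = binomial asks glm() for a logistic regression.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Print the usual table of coefficients, standard errors, and tests.
summary(fit)

# Pull out just the estimated coefficients (on the log-odds scale).
coef(fit)
```

The "right information" here is three things: the model formula, the data frame that holds the variables, and the family that tells glm() the outcome is binary.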

Good texts to use

If you want a decent text for R, and I hope you do, there are a couple that I can recommend. Everitt and Hothorn (2006), A Handbook of Statistical Analyses Using R, is a good gentle introduction. Maindonald and Braun (2007), Data Analysis and Graphics Using R: An Example-Based Approach, second edition, is quite good, but in many places the statistical analyses overwhelm the learning of R, and the code you are looking for is buried without much explanation in an example. Finally, perhaps the best of the straight R books is Crawley (2007), The R Book. I recently bought Norman Matloff's "The Art of R Programming," and I like it a lot. Many books about R have titles such as "Learning Statistics by way of R." Matloff's book is more on the order of "Learning R by way of Statistics." However, Matloff often assumes that you know commands that he hasn't yet discussed. It is an excellent book for learning a BETTER way of doing something, but it is probably not the book to start with. And then there is R in Action by Robert Kabacoff. It does a nice job of threading the way between learning statistics and learning R. That is probably my second choice for the best R book. But first of all, look for tutorials on the Web. There are lots of them and some are quite good. Just ask Google. One that I would recommend is http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf, which covers quite a bit of material in a very comprehensible way. But one that I recently discovered and like even better is by Kelly Black at http://www.cyclismo.org/tutorial/R/

First you need some data

Well, that isn't completely true, but data will help. Most of my examples that use data involve data files from Statistical Methods for Psychology, 8th ed. (There is a similar source for data from the 8th edition of Fundamental Statistics for the Behavioral Sciences.) You can download these data from DataFilesASCII.zip. I would strongly recommend that you create a folder on your machine named R-Stuff, or something like that, and put the (unzipped) data files in a sub-folder of that folder named Data. (You will have to click on the downloaded file to unzip it.) If I were really organized, that's what I would do. But unfortunately I'm not really organized. Having the data in that one place makes it much easier to get at the files when you need them.
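Here is a sketch of what that folder layout looks like from inside R. To keep it harmless, it builds R-Stuff under R's temporary directory, and the file name example.dat and its contents are made up for the illustration; they are not one of the files in the zip archive.

```r
# Build the suggested layout under a temporary directory so that nothing
# on your machine is touched. In practice, put R-Stuff wherever you like.
base_dir <- file.path(tempdir(), "R-Stuff")
data_dir <- file.path(base_dir, "Data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Write a tiny made-up data file (hypothetical name and values).
example_file <- file.path(data_dir, "example.dat")
writeLines(c("id score", "1 10", "2 14", "3 9"), example_file)

# Read it back. header = TRUE says the first line holds the variable names.
dat <- read.table(example_file, header = TRUE)
dat

# List what is in the Data folder.
list.files(data_dir)
```

With the real files in place, only the path and the file name in read.table() would change.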

An Outline of these Pages

I am going to split these pages into several different units just so that no unit becomes too long. You can always click your way there from here. I will begin with a page on downloading R and related files. As I said elsewhere, if you can install iTunes you can install R. But along with R it is helpful, but not required, to have a good editor. I will discuss a couple of those in that section.

Next I will examine a simple example in which you enter some commands, set up some data, and run an analysis. Because this is the beginning, and many people will be using these pages alongside an ongoing statistics course, the first few examples will involve fairly elementary statistics. In this section I am not going to say much about the specific commands we will use. I just want you to see what can be done.

In the following section I will lay out the basic information about reading in data, creating new variables, doing some simple calculations, and printing out results. This section will mainly focus on data manipulation, which R is very good at. I cannot possibly burden you with everything that R will do, but we will cover the basics.
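As a small taste of what that involves, the sketch below creates a new variable, does a couple of simple calculations, and prints the results. The numbers are invented, and the names pre, post, and gain are only for illustration.

```r
# Made-up scores for five hypothetical participants.
dat <- data.frame(pre  = c(10, 12, 9, 15, 11),
                  post = c(13, 12, 11, 18, 10))

# Create a new variable from existing ones.
dat$gain <- dat$post - dat$pre

# Do some simple calculations.
mean_gain <- mean(dat$gain)
sd_gain   <- sd(dat$gain)

# Print out the results.
print(dat)
cat("Mean gain:", mean_gain, "\n")
cat("SD of gain:", round(sd_gain, 2), "\n")
```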

One of the things that R does best is graphics. We will have a whole section devoted to creating meaningful graphs. My goal is to give you annotated code so that you can later steal that code, change the variable names and the text, and produce the same kinds of graphs. Personally I find it easiest to learn by looking at what someone else did and then adapting it to my needs. That is what this section will attempt to do.
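In that spirit, here is a short annotated sketch of the kind of code you might adapt. It uses the mtcars data that come with R, the titles and labels are my own, and it writes the graph to a PDF in the temporary directory (plot_file is a made-up name) so that it runs anywhere.

```r
# Send the graph to a PDF file. Drop the pdf() and dev.off() lines
# if you would rather see the graph on screen.
plot_file <- file.path(tempdir(), "mtcars-scatter.pdf")
pdf(plot_file)

# Scatterplot of miles per gallon against weight.
plot(mtcars$wt, mtcars$mpg,
     main = "Fuel economy by weight",   # title at the top
     xlab = "Weight (1000 lbs)",        # x-axis label
     ylab = "Miles per gallon",         # y-axis label
     pch  = 19)                         # solid dots

# Add the least-squares regression line as a dashed line.
abline(lm(mpg ~ wt, data = mtcars), lty = 2)

# Close the file so that it is actually written.
dev.off()
```

To reuse it, change the two variables, the data frame in lm(), and the text of the title and labels.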

Specific Topics

dch:
