
Using the attach( ) command


Attach( ) is Usually a Lousy Idea

Many books that you might consult on R will include in their code the attach() command. I have used that command for several years, and found that it really is a bad idea. So I use it only rarely in my material, and when I do I feel bad. But I need to explain why, because it seems like such a nice simple thing to do.

Suppose that you read in data using the following command.


setwd("~/Dropbox/methods9/DataFiles")
myData <- read.table("Add.dat", header = TRUE )

You saw that command on the page entitled "ReadingData.html." This will create a data frame named "myData." You can think of it as a box that holds a number of variables: "CaseNum", "ADDSC", "Gender", "Repeat", "IQ", and several more. In something like SPSS or Stata, you could simply ask the program to calculate the mean of ADDSC, for example, because all of the variables are readily available for your use. But in R you cannot simply type "mean(ADDSC)." The program will tell you that there is no such variable, even though you know full well that there is. In order to address "ADDSC," you need to get at it in some way, and there are good ways and bad ways of doing so. The most common way in introductory books on R is to use "attach(myData)." This essentially makes a copy of every variable in myData and puts those copies in a place (called an "environment") where you can address them by name. After doing so you can type "mean(ADDSC)", for example, and get the mean of that variable. That's nice, but very dangerous if you do a lot of fiddling, like I do, to get the whole program to finally run the way you want.
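Here is a minimal sketch of what "makes a copy" means in practice. The data frame below is made up for illustration (it is not the real Add.dat file), and the names before, on_path, m1, m2, and m3 are just labels that I invented for this example.

```r
# Hypothetical stand-in for the Add.dat data; the values are invented.
myData <- data.frame(ADDSC = c(45, 50, 55), IQ = c(100, 105, 110))

before <- exists("ADDSC")            # FALSE in a fresh session: ADDSC lives only inside myData

attach(myData)                       # copies the columns into a new place on the search path
on_path <- "myData" %in% search()    # TRUE: the attached copy is now on the search path
m1 <- mean(ADDSC)                    # works now, using the attached copy

# Change the data frame AFTER attaching it.
myData$ADDSC <- myData$ADDSC + 10
m2 <- mean(ADDSC)                    # still the old values: the attached copy did not change
m3 <- mean(myData$ADDSC)             # the updated values in the data frame itself

detach(myData)                       # clean up
```

The thing to notice is that m2 and m3 disagree. The attached variables are a snapshot taken at the moment you called attach(), so later edits to the data frame are not seen by the attached copies.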

Let's examine attach() a bit more. Suppose that you create a variable named depvar, e.g. depvar <- 12. Then you read in a data frame that itself has a variable named depvar = c(3, 6, 8, 7), and you attach it. You now ask R to print depvar. Who wins?? Well, you will get 12, so the variable you created wins and the attached (3, 6, 8, 7) is hidden behind it. Now you have another data frame that has a variable named depvar = c(23, 54, 34, 76), and you attach that next. Now you have three possible versions of depvar, i.e. 12, c(3, 6, 8, 7), and c(23, 54, 34, 76). Type print(depvar). Who wins now?? Still 12, because a variable that you create yourself always has priority over anything attached. But if you remove that variable with rm(depvar), the winner is the second attachment. Hmmm! If you create a variable, it has priority over any attachment. If you attach two data frames, the second has priority.

Example code you can play with:
depvar <- c(3, 6, 8, 7)
data1 <- data.frame(depvar)
data1
#   depvar
# 1      3
# 2      6
# 3      8
# 4      7
depvar <- 12
print(depvar)
# [1] 12
attach(data1)
# The following object is masked _by_ .GlobalEnv:
#
#     depvar
depvar
# [1] 12   (if I create depvar <- 12, that has priority over the attached version)

### So the first guy there wins, right?  Not necessarily
detach(data1)
rm(depvar)
data2 <- data.frame(depvar = c(23, 54, 34, 76))
data2
#   depvar
# 1     23
# 2     54
# 3     34
# 4     76
ls()
# [1] "data1" "data2"
depvar
# Error: object 'depvar' not found
# (because nothing is attached and depvar itself has been deleted)
attach(data1)
attach(data2)
# The following object is masked from data1:
#
#     depvar
print(depvar)
# [1] 23 54 34 76

I'm not really expecting you to memorize this result. I use it to add to my complaint that attach() is confusing and shouldn't be used. In the bad old days when I was writing R code, I would read some data, attach the data frame, and get the mean of the variable named, for example, "dv." BUT, lots of my problems have a variable named "dv" because it stands for "dependent variable." And lots of data frames get named "data" because that seems like such an obvious choice. Now assume that I go on and read in the next file that I want to work with. Suppose I name that data frame "data" as well, and I probably name my dependent variable "dv." Then suppose that I attach that in the code that I am writing and go on to get its mean. In both cases I get the correct means. Now, being a good responsible person, I remember to detach that data frame with "detach(data)." Great!! The data frame is still there, but not the individual variables. If I type mean(dv), I will get an error message telling me that there is no such variable. Good.

But suppose that I forgot to detach data the first time I ran it, but I did detach it the second. (This often happens when you run your code, find an error message, correct the error, and rerun.) Now I have used attach two times, and I have the first version hidden under the second. Detaching the second allows the first version to pop into availability. The first version of "dv" is still attached in the background, so that is the one that will be called up any time I ask for dv. And that is not what I want. Then I try to run my second problem again, but this time I get a really strange mean of dv. That's because I may be addressing the previous version of dv, not the one that goes with this particular problem. And I keep studying my code and convince myself that, of course, it must be right, so why is the answer wrong?

That example is a bit foreshortened because you would probably do a bunch of other stuff with each problem while you were at it, which just gives you more time to get forgetful and careless.

But now I have other problems to work on, so I put in some more code, read some more data, and ask for the mean of a variable named "dv." But what happens? I get the mean from the first problem. Where did that come from? I don't want that mean!!

When you attach a data frame you often get what appears to be an error message in the printout. But it is not an error message. It says something like "The following object is masked from data". It does not say that the new dv replaced the earlier dv. It says that it masked it. In other words, it has temporarily hidden that version of dv. But when we later detach(data), we are detaching the second set, and that allows the original dv to bounce back again and confuse us completely. And the more you work with a set of data, the more likely you are to end up with some variable that is no longer the variable you want. And it is so hard to figure out why you can't get the right answer. (Hint: If you find yourself with this problem, it is probably easiest to simply close R and reopen it with a clean slate. RStudio will let you do that very easily with Session/Restart R.)
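The stale-variable problem is easy to reproduce. This is only a sketch with invented numbers; the names mean_second, mean_after, and n_left are labels I made up for the example.

```r
# First problem: read some data, attach it, and FORGET to detach.
data <- data.frame(dv = c(1, 2, 3))       # invented values
attach(data)

# Second problem: same names, different numbers.
data <- data.frame(dv = c(10, 20, 30))    # invented values
attach(data)                              # R reports that dv is masked from the earlier "data"
mean_second <- mean(dv)                   # the second problem's dv, as intended

detach(data)                              # detaches only the most recent copy
mean_after <- mean(dv)                    # no error: the FIRST problem's dv has bounced back
n_left <- sum(search() == "data")         # how many copies of "data" are still attached

detach(data)                              # clean up the leftover copy
```

After the first detach() I would expect an error from mean(dv). Instead I quietly get the mean from the earlier problem, which is exactly the trap described above.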

You might think that you can easily get out of this problem by using "detach()." But don't be fooled. If you are as careless a typist as I am, you may have to run your code four or five times until you get it right, and each run attaches the data frame again. You may then have to enter detach(data1) several times to clear everything out. But there is a way around this. If you are using RStudio, go to RStudio/Preferences and check the box that tells it to always reopen the windows that were open when you quit. (You only have to do that once.) Now when you get really frustrated because you can't find the stupid error, go to Session/Restart R and run that. You won't lose anything important, but you will have erased all of the old stuff, not just what you erase with rm(list = ls()). When you issue this restart command, R will close and then immediately reopen with your code ready to run. (Part of the problem here is that it is perfectly reasonable to think that when you remove all variables with rm(list = ls()), you have removed all of it. But, no, that won't remove the variables of data frames that have been attached. You need detach() for that.)
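If you would rather not restart, here is one way you might clear out every attached copy at once. It is a sketch, not the only way to do it: the data are invented, and still_there and gone are names I made up. It relies on detach() accepting a quoted name when character.only = TRUE.

```r
data1 <- data.frame(dv = c(4, 5, 6))   # invented values
attach(data1)
attach(data1)                          # simulates rerunning the script without detaching

rm(list = ls())                        # removes data1 itself, but not the attached copies
still_there <- exists("dv")            # TRUE: dv is still reachable through the search path

# Keep detaching until no copy named "data1" is left on the search path.
while ("data1" %in% search()) {
  detach("data1", character.only = TRUE)
}
gone <- !exists("dv")                  # TRUE once every attached copy is gone
```

Typing search() at the console at any time will show you what is currently attached, which is a quick way to check whether an old data frame is still lurking.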

So what do we do???

Other than restarting, there are a couple of ways around this problem. Suppose that we have a data frame named data1, for example, with a variable called "Score." We do not have to attach data1; we can add the data frame's name to the variable name. In other words, we can say something like "mean(data1$Score)", and R knows to go into the data frame named data1 and get what we need. Very clever.
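As a small illustration, with made-up scores (m_dollar and m_brackets are names I invented):

```r
data1 <- data.frame(Score = c(12, 15, 9, 14))   # invented values

m_dollar   <- mean(data1$Score)        # reach into data1 with the $ operator
m_brackets <- mean(data1[["Score"]])   # the same column, addressed by its name as a string
```

Both lines pull the same column out of data1, and nothing is attached, so there is nothing to forget to detach.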

Alternatively you can look up the commands "with()" and "within()." They allow you to specify the data frame in which the commands that you give them will be run. You will see me use those occasionally in the code that I write. Even better, many functions, including some in the slightly more advanced packages, will allow you to add "data = data1" when you invoke the function. You cannot type mean(dv, data = data1), nor can you type plot(dv, data = data1), because when you hand those functions a bare variable name they look for dv before they ever get to data1. But functions that accept a formula do take a data argument, so you can type something like boxplot(dv ~ group, data = data1), where group is a grouping variable in data1. That is my preference, but it will not work for all functions.
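Here is a rough sketch of those alternatives. The data frame, the grouping variable group, and the new column dv_centered are all invented for the example. I use aggregate() to show the "data =" argument only because it prints numbers rather than drawing a plot.

```r
# Invented data: six scores in two groups.
data1 <- data.frame(
  dv    = c(3, 6, 8, 7, 5, 9),
  group = factor(c("a", "a", "a", "b", "b", "b"))
)

# with(): evaluate one expression using the variables inside data1.
m <- with(data1, mean(dv))

# within(): like with(), but returns a modified copy of the data frame.
data1 <- within(data1, dv_centered <- dv - mean(dv))

# Formula interface: the function looks up dv and group inside data1 for you.
group_means <- aggregate(dv ~ group, data = data1, FUN = mean)

# The same idea works for plotting functions that accept a formula, e.g.
# boxplot(dv ~ group, data = data1)
```

None of these leave anything on the search path, so there is no detach() to remember afterward.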

But you are going to say that is too much typing. One way that you will see me get around this problem is to read in "data1." Then, if you don't have many variables that you want to use, you can enter something like dv <- data1$dv. Now you still have a copy of dv in data1, but you also have dv as a separate variable. You can use it without playing with the data frame. If you create a new "dv," it will replace the old one rather than just jumping the line in front of it. If you look at the code that I write for each chapter, you will see that I used that approach more and more as I worked through the book. It may be messy to have to define individual variables, but it is much safer. You will also see that I very often issue a command like rm(list = ls()). That command will clean out the variables that I have created so that I can start with clean copies, and I strongly recommend doing that often. It will NOT clean out the "attached" variables. So if you do have to use attach(data1), use detach(data1) as soon as possible.
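A short sketch of that approach, again with invented numbers (first_mean and second_mean are names I made up):

```r
data1 <- data.frame(dv = c(3, 6, 8, 7))       # invented values
dv <- data1$dv                                # an ordinary variable, separate from data1
first_mean <- mean(dv)

data2 <- data.frame(dv = c(23, 54, 34, 76))   # invented values
dv <- data2$dv                                # REPLACES the old dv; nothing is hidden behind it
second_mean <- mean(dv)
```

Because dv is an ordinary variable, assigning to it again overwrites it. There is no earlier version waiting in the background to bounce back, and rm(list = ls()) really does remove it.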

You won't guess how much time I spent on the stupid problem of what the book should do with "attach()" as a general approach. Other guys use attach() and don't apologize. Well, I'm trying to do them one better. This is the best that I can do.



dch