
R Code for Chapter 6

Exploring Group Means with t


This chapter is mainly about the t test, with several interesting graphs. The chapter begins with data from a study by Everitt (1994) on the treatment of anorexia. Notice that I am only looking at the group that underwent family therapy. Below I will show the code to enter the data for the control group and the Cognitive Behavioral treatment, and you can apply an analysis similar to the one given here.

In Figure 6.1 in the book you see several figures created from a small data set. In the code below I drew repeated samples, first with n = 5 and then with n = 30. My goal was to illustrate how the distribution of means changes with increases in sample size. Notice that as the sample size increases, the spread of the distribution of means decreases. Notice also the Q-Q plots, which illustrate that with increased sample size the distribution of sample means becomes noticeably more normal. The code for this plot follows. I again begin this code with a statement that deletes (removes) old variables that may be hanging around and getting in our way. That is always a good thing to do. All you need to do is enter rm(list = ls()). Another thing to notice is that I frequently use par(mfrow = c(2,1)), or some other combination of numbers, to cut the output screen to the desired shape. This particular command would create a screen with 2 rows and 1 column. Notice also that I have occasionally condensed the code slightly by combining two commands on the same line with a semicolon between them. This is an example where it is easy to generate the data within R, rather than load them from a file.

I believe that this is the first time that I have used a command like:

for (i in 1:nreps) {
   means5[i] <- mean(sample(y,5))
   means30[i] <- mean(sample(y,30))
}

In the old days this was called a "do loop." It begins with i = 1 and computes means5[1] (the first mean of 5 observations) and means30[1]. Then it sets i to 2 and computes means5[2] and means30[2]. Next it computes means5[3], and so on, all the way up to means30[nreps], where nreps = 10,000. So we have 10,000 means from samples of size 5 and 10,000 means from samples of size 30. It is a very efficient way of doing something 10,000 times and saving the 10,000 answers. The commands that it repeats 10,000 times are those between the { }. Before I can run this, however, I have to create variables (means5 and means30) that will store the means that I calculate. I also have to create a variable (here called y) that R can sample from to get the data for each trip through the loop. On the web page for this chapter I show how I created y as a normal distribution with a specified mean and standard deviation. Notice that in the following example I sample from a uniform distribution, though I could choose a different distribution if I wished.

After I have run through the loop and have the 10,000 means for each sample size, I plot them with a histogram. The resulting means run from roughly 0 to 100, so I create a variable (x) that contains the sequence of integers from 0 to 100. Then, given the mean and standard deviation of the obtained means, I compute the height of the normal distribution for each value of x. Once I have that I can plot y as a function of x. You have seen the plotting of Q-Q graphs earlier.


R Code

# Figures 6.1a and b
rm(list = ls())      # It is a good idea to clean house like this occasionally.
nreps <- 10000
y <- runif(5000, 0, 100)   # y will be 5000 numbers uniformly distributed between 0 and 100
hist(y, col = "blue", density = 10, main = "Uniform Distribution")  # Plot distrib. of y
# Sample y repeatedly  with sample size = 5 and 30
means5 <- numeric(nreps); means30 <- numeric(nreps)
for (i in 1:nreps) {
     means5[i] <- mean(sample(y,5))
     means30[i] <- mean(sample(y,30))
}
x <- seq(0,100)   # Preparatory to fitting normal distribution
xbar5 <- round(mean(means5),2) ; sd5 <- round(sd(means5),2)
xbar30 <- round(mean(means30),2) ; sd30 <- round(sd(means30),2)

par(mfrow = c(2,2))    #Set up 2 X 2 plotting
hist(means5, col = "blue", density = 10, 
     main = "Sampling Distribution\n of mean, n = 5", breaks = 20, freq = FALSE,
     ylim = c(0,.05), xlim = c(0,100), xlab = "Mean")
legend(60, .035, paste("mean =", xbar5), cex = .85, bty = "n")
legend(60, .032, paste("sd =", sd5), cex = .85, bty = "n")
legend(60, .029, paste("n = 5"), cex = .85, bty = "n")
par(new = TRUE)    # This will superimpose the following plot.
y <- dnorm(x,mean = xbar5, sd = sd5)   # Add a normal distribution
plot(y~x, type = "l",ylim = c(0,.05), xlim = c(0,100), xlab = "", ylab = "")
qqnorm(means5, main = "Q-Q plot for n = 5")
qqline(means5)

##### Now repeat this with n = 30.
hist(means30, col = "blue", density = 10, 
     main = "Sampling Distribution \n of mean, n = 30", 
     breaks = 20, freq = FALSE,  xlim = c(25,75), ylim = c(0,.10), xlab = "Mean")
legend(55, .08, paste("mean =", xbar30), cex = .85, bty = "n")
legend(55, .07, paste("sd =", sd30), cex = .85, bty = "n")
legend(55, .06, paste("n = 30"), cex = .85, bty = "n")
par(new = TRUE)
y <- dnorm(x,mean = xbar30, sd = sd30)
plot(y~x, type = "l", ylim = c(0,.10), xlim = c(25,75),xlab = "", ylab = "")
qqnorm(means30, main = "Q-Q plot for n = 30")
qqline(means30)

Now I want to produce the distribution shown in Figure 6.2. This again uses only the data from the Family Therapy group. The code in Chapter6.R reads in data from all three groups, but it is easy just to enter the data by hand, which I do immediately below. Notice that after I plot the histogram I set par(new = TRUE), which allows the next plot to be superimposed on the first one. I first superimposed a normal distribution. Then I used the density function to superimpose a best-fitting curve, which takes into account that there is a bump in the lower half of the distribution.
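If you are entering the data by hand, the following lines will do it. (These are Everitt's Family Therapy weights; you can check the entry by noting that the mean gain they produce, 7.26, matches the t test output shown later on this page.)

FT.Before <- c(83.8, 83.3, 86.0, 82.5, 86.7, 79.6, 76.9, 94.2, 73.4, 80.5,
               81.6, 82.1, 77.6, 83.5, 89.9, 86.0, 87.3)
FT.After  <- c(95.2, 94.3, 91.5, 91.9, 100.3, 76.7, 76.8, 101.6, 94.9, 75.2,
               77.8, 95.5, 90.7, 92.5, 93.8, 91.7, 98.0)
FT.Gain <- FT.After - FT.Before   # n = 17, mean gain = 7.26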


par(mfrow = c(2,1))
hist(FT.Gain, freq = FALSE, ylim = c(0, .08), xlim = c(-10, 25))
par(new = TRUE)   # Plot on top of last plot.
curve(dnorm(x, mean=mean(FT.Gain), sd=sd(FT.Gain)), 
      col="darkblue", lwd=2, add=TRUE, yaxt="n")
par(new = TRUE)
plot(density(FT.Gain), main = "", ylim = c(0,.08), xlim = c(-10,25), xlab = "")
_____________________________________________________________

I next show a plot of the relationship between the amount of weight a girl gained and the weight at which she started treatment. Does a girl who started out weighing somewhat more than others also gain more than others? (The answer, you will see, is no.) Most of this is clear, but I want to make one point about drawing the regression line (the straight line that best fits the data). We first have to calculate what the regression line would be. We do that with reg <- lm(Gain ~ Before), which you have not seen before and which is discussed in Chapter 9. When that code runs it will create an "object" named "reg." That object contains the coefficients for the equation of the best-fitting regression line, the residuals, the fitted values (which you don't yet need to know about), and all sorts of other stuff. When I type abline(reg), R knows to look in reg, find the regression coefficients, compute the form of the straight line, and draw it on the figure. That's pretty clever. Notice how I add the correlation in the legend; I have to compute it and store it in a variable (correl) first, which I do in the code below.


rm(list = ls())
par(mfrow = c(2,2))
data <- read.table("Tab6-1.dat", header = TRUE)
Before <- data$Before
After <- data$After
Gain <- After - Before
mean(Gain)
plot(Before, Gain, 
     main = "Correlation between Amount Gained \nand Pretreatment Weight")   
reg <- lm(Gain ~ Before)
abline(reg)
correl <- round(cor(Before, Gain), 2)   # Compute the correlation for the legend
legend(85, 18, paste("r = ", correl), bty = "n")


Now I want to plot the distribution of the Family Therapy data, and then use standard confidence intervals and violin plots to explore the data a bit more. The error.bars() command includes all sorts of stuff between the parentheses. Not all of this is needed. I suggest typing ?error.bars to find out what all of it is about. You can easily adapt this to handle the other two conditions. I also include the paired sample t test, using difference scores and then again using pre- and post-scores.


### Now let's look at Gain with Family Therapy, the t test, and the error bars
library(psych)   # Needed for "error.bars"
par(mfrow = c(2,3))
hist(FT.Gain)
error.bars(FT.Gain, eyes = FALSE, xlab = "", ylab = "Weight Gain", ylim= c(-1, 15 ),
           main = "95% Confidence Limits on \n Weight Gain", arrow.col = "red", alpha = .05)
legend(.5, 1.3, "H(0)", bty = "n")
# Coordinates are from trial and error
error.bars(FT.Gain, eyes = TRUE, xlab = "", ylab = "Weight Gain", ylim= c(-1, 15 ),
           main = "95% Confidence Limits on \nWeight Gain", col = "lightblue", alpha = .05)

legend(0,.9,"______________________________________", bty = "n")
legend(.5, 1.3, "H(0)", bty = "n")

### Now to the t-test
t.test(FT.Gain,  alternative = "two.sided", mu = 0, conf.level = .95)

### OR, EQUIVALENTLY, USING BEFORE AND AFTER INSTEAD OF GAIN,
t.test(x = FT.After, y = FT.Before, paired = TRUE, alternative = "two.sided",
 mu = 0, conf.level = .95)
 __________________________________________________________________
 
 One Sample t-test

data:  FT.Gain
t = 4.1849, df = 16, p-value = 0.0007003
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  3.58470 10.94471
sample estimates:
mean of x 
 7.264706 

> 
> ### OR, EQUIVALENTLY, USING BEFORE AND AFTER INSTEAD OF GAIN,
> t.test(x = FT.After, y = FT.Before, paired = TRUE, alternative = "two.sided",
+        mu = 0, conf.level = .95)
__________________________________________________________________

	Paired t-test

data:  FT.After and FT.Before
t = 4.1849, df = 16, p-value = 0.0007003
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  3.58470 10.94471
sample estimates:
mean of the differences 
               7.264706 

 ______________________________________________________________
 

This code shows several things: (1) the plot of error bars, including violin plots; (2) the calculation of t tests for two related, or dependent, samples, either by working with the "before" and "after" data or by working with "gain" scores; and (3) that the t test, using the t.test() function, produces 95% confidence intervals by default, although you can change that to any interval you want (just enter "?t.test" from your console). I loaded the "psych" library because it handles error bars nicely.
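For example, to get a 99% interval instead of the default (a line of my own, not in Chapter6.R):

t.test(FT.Gain, conf.level = .99)   # Same t and p, but a wider interval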

Those data were for the Family Therapy group. Now I will enter data for the Cognitive-Behavior therapy group and the Control group. We can perform the same graphical analyses on these.

Control Group Data.

If you are using the code from Chapter6.R, you will already have read in the full data set as anorexia.Data.


Control.Before <- c(80.7, 89.4, 91.8, 74.0, 78.1, 88.3, 87.3, 75.1, 80.6, 78.4,
 77.6, 88.7, 81.3, 78.1, 70.5, 77.3, 85.2, 86.0, 84.1, 79.7, 85.5, 84.4,
 79.6, 77.5, 82.3, 89)
Control.After <- c(80.2, 80.1, 86.4, 86.3, 76.1, 78.1, 75.1, 86.7, 73.5, 84.6,
 77.4, 79.5, 89.6, 81.4, 81.8, 77.3, 84.2, 75.4, 79.5, 73.0, 88.3, 84.7, 81.4,
  81.2, 88.2, 78.8)
Control.Gain <- Control.After - Control.Before

Cognitive Behavioral Data


CogBehav.Before <- c(80.5, 84.9, 81.5, 82.6, 79.9, 88.7, 94.9, 76.3, 81.0, 80.5,
 85.0, 89.2, 81.3, 76.5, 70.0, 80.4, 83.3, 83.0, 87.7, 84.2, 86.4, 76.5, 80.2,
 87.8, 83.3, 79.7, 84.5, 80.8, 87.4)
CogBehav.After <- c(82.2, 85.6, 81.4, 81.9, 76.4, 103.6, 98.4, 93.4, 73.4, 82.1,
 96.7, 95.3, 82.4, 72.5, 90.9, 71.3, 85.4, 81.6, 89.1, 83.9, 82.7, 75.7, 82.6,
  100.4, 85.2, 83.6, 84.6, 96.2, 86.7)
CogBehav.Gain <- CogBehav.After - CogBehav.Before
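As a quick illustration of running the same analysis on the Control group (a sketch of my own, following the Family Therapy pattern above):

library(psych)   # If not already loaded
t.test(Control.Gain, alternative = "two.sided", mu = 0, conf.level = .95)
error.bars(Control.Gain, eyes = TRUE, xlab = "", ylab = "Weight Gain",
           main = "95% Confidence Limits on \nWeight Gain: Control", alpha = .05)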

I use code very similar to the above to do the same thing with the distribution of variances. I wanted to show you that the distribution of the variance is very skewed, which can be seen in both the histogram and the Q-Q plot. To create that code, you basically need to change "mean" to "var" and make a few other minor adjustments. I do not show the full code here, but you can see it if you download the R file at www.uvm.edu/dhowell/methods9/R-Code/Chapter5.R.
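A minimal sketch of that change (my own abbreviated version, not necessarily the code in the downloadable file):

# Sampling distribution of the variance -- note the skew
nreps <- 10000
y <- runif(5000, 0, 100)           # Same uniform population as above
vars5 <- numeric(nreps)
for (i in 1:nreps) {
   vars5[i] <- var(sample(y, 5))   # "var" replaces "mean"
}
par(mfrow = c(1,2))
hist(vars5, col = "blue", density = 10, breaks = 20, 
     main = "Sampling Distribution\n of variance, n = 5", xlab = "Variance")
qqnorm(vars5, main = "Q-Q plot for variances"); qqline(vars5)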

Next I want to use the data from Zhong to plot the error bars for the Control and Experimental group means, and also for the mean difference. Notice that I did something new here. I didn't want to use that awful attach(), so I used "with." The structure is with(dataframe, {a bunch of code}). This runs the code on the data frame, similar to what you would get with attach, but safer. The code follows. Notice that I prevented overlapping labels by setting "main" and "ylab" to blanks in the second section. (This is an example of code that I would copy and file away somewhere so that I can easily pull it out in the future. It is equivalent to replacing with(Zhong.data, { with attach(Zhong.data), following that with the instructions about error bars, etc., and then replacing the closing }) with detach(Zhong.data).)


### plot confidence interval for Control and Experimental
library(psych)
par(mfrow = c(1,1))
Zhong.data <- read.table("Tab6-5.dat", header = TRUE)
describe(Zhong.data)
with (Zhong.data,{
error.bars.by(Response, Condition, by.var = TRUE, eyes = FALSE,
 lines = FALSE, ylim = c(0,8), 
 xlab = "Condition", ylab = "Political Attitude Score or Difference",
 main = "Confidence Intervals for Ctl and Trt\n and for Mean Difference")
})

### plot confidence interval in Mean Difference of Control and Experimental
diff.stats <- data.frame(mean=0.68,se=0.218,n=c(61))
rownames(diff.stats) <- c("Difference")
par(new = TRUE)
error.bars(stats=diff.stats,xlab="",ylab="", 
           eyes = FALSE, ylim = c(0,8), main = "")

Now we come to some interesting graphics. I simply wanted to show you where the t distributions in Figure 6.3 came from. I used the function dt(x, df = 5) to ask for the density (height) of the t distribution with varying degrees of freedom. You can't tell the function to use an infinite df, so I set it at 1000. You couldn't possibly see the difference between that one and one that really did have an infinite number of degrees of freedom. At that point the t and normal distributions are identical.


#Plotting t distribution for various df

par(mfrow = c(1,1))
curve(dt(x, df = 1000), xlim = c(-5,5), ylim = c(0, .5))
par(new = TRUE)
curve(dt(x, df = 30), xlim = c(-5,5),ylim = c(0, .5))
par(new = TRUE)
curve(dt(x, df = 5), xlim = c(-5,5), ylim = c(0, .5))
legend(.60, .47, "df", bty = "n", cex = 1.7)
legend(.88, .45, "___", bty = "n")
legend(.83, .43, expression(infinity), cex = 1.4, bty = "n")
legend(.8, .41, "30", bty = "n", cex = 1.3)
legend(1, .39, "5", bty = "n", cex = 1.3)

Now we finally get serious about using t, confidence intervals, and effect sizes. Before I do that, the accompanying code does a couple of other things. One is to repeat the test on Everitt's data, and another is to show how samples bounce around with repeated sampling. The latter is similar to Figure 6.5, where we repeatedly draw from a "population" that reflects the original data. Then we continue by looking at a t test on two independent samples, plotting confidence intervals on mean differences, calculating d, and plotting confidence intervals on d. I don't show those graphics here, but you can see them just by running the code. Finally, I took some data from a study by Helzer and Pizarro (2011), which was another look at the concept of moral purity. (These data are discussed in the text.) I present a t test on those data. Here I have drawn the "forest plot" of the amalgamated results, but I have not shown the "funnel plot" because I have not described it and won't until Chapter 17.


# Figure 6.7 Plotting confidence intervals over repeated sampling 

j <- 24
xx <- seq(-5,25,.10)    # Setting up the axes
yy <- seq(2,9.5,.025)
n <- 17                 # Sample size
plot(yy, xx, type = "n", xlab = "Confidence Interval", 
     ylab = "Sample", ylim = c(0,25), bty = "n", xlim = c(-5, 25))
lines(c(7.26, 7.26),c(25,0))
legend(6.5, 25, legend = expression(mu),  bty = "n")
for (i in 1:25) { 
   x <- rnorm(n,7.26,7.16)
   se <- sd(x)/sqrt(n)
   b <- qt(c(.025, .975),n-1)  # b[1] = lower t cutoff,  b[2] = upper
   lower <- mean(x)+ b[1]*se
   upper <- mean(x)+ b[2]*se
   
   lines(c(lower,upper),c(j,j), col = "red")
   j <- j-1
}
lines(c(0, 0),c(0,25), col = "green")

#####################################################################


### plot confidence interval for Control and Experimental
### conditions.
library(psych)
par(mfrow = c(1,2))
with(Zhong.data, {   # Zhong.data was read in above
  error.bars.by(Response, Condition, by.var = TRUE, eyes = FALSE,
    lines = FALSE, ylim = c(0,8), main = "95% CI by Group",
    xlab = "Condition", ylab = "Political Attitude Score")
})

#####################################################################

### plot confidence interval for Control and Experimental
diff.stats <- data.frame(mean = 0.68, se = 0.218, n = 61)
rownames(diff.stats) <- c("Difference")
error.bars(stats=diff.stats,xlab="",ylab="Mean Difference", 
 eyes = FALSE, ylim = c(0,8),main = "95% CI \non Mean Difference")

#####################################################################
### CI on d
library(MBESS)
conf.intervald <-ci.smd(smd = .80, n.1 = 31, n.2 = 30, conf.level = .95)
cat("The confidence interval on our estimate effect size is \n")
print(conf.intervald)

### Plotting CI on d using plotrix
library(plotrix)
plotCI(x = 0.80, y = 0, ui = 1.319, li = 0.275, err = "x", ylim = c(-5,5),
  xlab = "Effect Size", main = "95% CI on Effect Size d", 
   ylab = "",slty= "dashed", xlim = c(0,1.5))
legend(0.2, 4, legend = "d = 0.80", bty = "n")
legend(0.2, 3, legend = "Upper Limit = 1.319\nLower Limit = 0.275", bty = "n")

#####################################################################

# Zhong, Bohns, Gino  Good lamps
lightmean <- 7.78;  dimmean <- 11.47
lightsd <- 3.09; dimsd <- 4.32
n <- 42

# Generate data with exactly the desired mean and sd.
# (lightsd = desired sd, but sd(light) = obtained sd, hence the rescaling.)
light <- rnorm(n, lightmean, lightsd)
light <- light*lightsd/sd(light)             # Rescale to the desired sd
light <- light - (mean(light) - lightmean)   # Shift to the desired mean
mean(light); sd(light)                       # Check: should be 7.78 and 3.09

dim <- rnorm(n, dimmean, dimsd)
dim <- dim*dimsd/sd(dim)                     # Rescale to the desired sd
dim <- dim - (mean(dim) - dimmean)           # Shift to the desired mean
mean(dim); sd(dim)                           # Check: should be 11.47 and 4.32


t.test(light, dim)
t.test(light, dim, var.equal = TRUE)

	Welch Two Sample t-test

data:  light and dim
t = -4.5024, df = 74.25, p-value = 2.447e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.322918 -2.057082
sample estimates:
mean of x mean of y 
     7.78     11.47 
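Since the text talks about calculating d, here is a quick sketch (my addition, not part of Chapter6.R) of Cohen's d for the light and dim groups, using the pooled standard deviation with equal ns:

sp <- sqrt((var(light) + var(dim))/2)   # Pooled standard deviation (equal n)
d <- (mean(dim) - mean(light))/sp
d                                       # Approximately 0.98 with these means and sds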


############


That is enough for now. The file Chapter6.R contains a few additional things that I have not discussed here.
