Adams, Wright, & Lohr (1996). Is homophobia associated with homosexual arousal? Journal of Abnormal Psychology, 105, 440-445.
The authors exposed homophobic and nonhomophobic heterosexual men to sexually explicit videotapes depicting heterosexual, male homosexual, and lesbian scenes, and recorded the participants' level of sexual arousal (in the standard way).
I believe that they defined "homophobic" as an extreme score on a scale that they devised, but I may be wrong. I will work only with the data from the male homosexual video, just to keep the example simple. I don't know the exact standard deviations, but I can make a pretty good guess by running the t test backwards to find the value of the standard error that would give that t, and then converting to a standard deviation. The means are taken from their graph.
I have drawn a set of data similar to theirs, though the means are not exact. The independent t test follows. Group 1 represents the homophobic subjects, while Group 2 represents the nonhomophobic subjects. I have imposed equal sample sizes for convenience and added 12 points to the means so that the observations would not come out negative. (It pains me to think what a negative number on this measure of arousal would represent.) Making the n's equal is a slight cheat, but not much of one; in other pages using this example you will find sample sizes of 35 and 29, with some slight difference in t.
(Adams et al. found means of 24.00 and 16.5. Mine differ from those only because I used random samples.)
The important question is:
Suppose that Adams et al. had it exactly right, and the two population means were 24.00 and 16.5, and the (pooled) population standard deviation was 12.03. What is the probability that they would get a significant result when running a study like this?
To put this another way, what is the power of this study if the parameters are as we think they are?
First, let's look at the problem in a crude, brute-force way. Suppose that we created populations with exactly these parameters, ran 100 studies drawn from those populations, and saw how often we rejected the null hypothesis.
This is really not a bad way of solving the problem. In fact, there are situations where it is clearly the simplest way to solve it.
SPSS is ideal for this task, as they will see on Thursday. We can just write a simple syntax file to do our calculations, and then look at the results.
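The original runs used an SPSS syntax file, which is not reproduced here. Purely as an illustrative sketch, the same brute-force idea looks like this in Python; the normal populations, the parameters from above, and n = 32 per group are my assumptions, not part of the original syntax file:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1996)
mu1, mu2, sigma, n = 24.00, 16.5, 12.03, 32   # parameters assumed above

# Run 100 simulated "studies": draw one sample per group and keep the
# independent-groups t value from each.
t_values = [stats.ttest_ind(rng.normal(mu1, sigma, n),
                            rng.normal(mu2, sigma, n))[0]
            for _ in range(100)]

print(np.mean(t_values))   # should land near 2.48
```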
I then entered all of the resulting t values, plus the 100 t values from last year, to create a file of 200 tests. The frequency distribution is given below:
(Well, I was going to put it here, but it was not legible. So I have printed it out.) It can be found at t values--frequency distribution.
Using SPSS, I calculated that the critical (two-tailed) value of t.025 on 62 df = +1.99897. At the 1% level it is +2.657.
(Explain why I say that I used SPSS to accomplish this calculation.)
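(For anyone working outside SPSS, the same critical values drop out of any t quantile function; for instance, in Python:)

```python
from scipy import stats

print(stats.t.ppf(0.975, df=62))   # 1.99897..., two-tailed alpha = .05
print(stats.t.ppf(0.995, df=62))   # 2.657...,   two-tailed alpha = .01
```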
Looking at the histogram of these results and using (my copy of) the frequency distribution, 31.5% of the t values are less than 1.99897, and 56.5% of them are less than 2.657.
This means that if the null hypothesis is false to exactly the extent that we believe, 31.5% of our experiments will not be significant at α = .05, and 56.5% will not be significant at α = .01. In other words, roughly one third to over half of the time we will fail to get a significant result, even though there really is a difference between the groups.
This is actually looking at the problem backward. We are talking about power, not Type II errors. So what we should really say is that at the .05 level we have a probability of 1 - .315 = .685 of rejecting the null. This is the power of the experiment. At α = .01, the probability of rejecting the null is 1 - .565 = .435.
This t distribution is an example of what is called a noncentral t distribution, because it is centered not at 0 but at some value greater than 0.
Suppose that the null hypothesis were really true. What would the t distribution look like if I drew 200 pairs of samples under those conditions?
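(Again only as a sketch, the simulation above can be rerun with both population means set equal; everything else is the same assumption as before:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2001)
sigma, n = 12.03, 32

# Under H0 the two populations share a mean, so the t values
# should pile up around 0 rather than around 2.48.
t_null = [stats.ttest_ind(rng.normal(24.0, sigma, n),
                          rng.normal(24.0, sigma, n))[0]
          for _ in range(200)]

print(np.mean(t_null), np.std(t_null))   # near 0, and a bit over 1
```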
(I have superimposed a normal distribution, which is not all that far off from what the true distribution would look like. It would be exact for infinite sample sizes.)
Note how this distribution differs from the previous one. It has just about the same shape and standard deviation, but its mean is approximately 0, whereas the other had a mean of 2.48. In other words, the distribution when H0 is false is displaced to the right.
I can plot the two distributions relative to each other. First I will replot the one above. They don't line up exactly, but you can imagine what they would look like.
The degree to which the second distribution is displaced relative to the first is called the noncentrality parameter, δ.
We have just used the brute-force method of calculating power. There should be a more elegant way, and there is.
Power as a function of:
Effect size: The effect size is a measure of the degree to which the null hypothesis is false.
Get them to tell me what they think the effect size might be in this example.
First, it has to depend on the difference between the two means.
Second, a difference between means must be expressed relative to the size of the standard deviation.
Third, we could include the sample size, but sample size is not really a measure of how false the null is; it is more a measure of how powerful our experiment is. So we will hold off on that until later.
d = (μ1 - μ2) / σ

where μ1 and μ2 are the population means and σ is the (common) population standard deviation.
Because we have estimates of these parameters, we can insert these estimates in the formula.
d = (24.00 - 16.5) / 12.03 = 7.5 / 12.03 = 0.62

This tells us how far apart the means of the populations are, scaled by the size of the standard deviation. In other words, the means are 0.62 standard deviations apart.
That tells us a lot of what we want to know, but it doesn't take sample size into account. But that is simple.
Define a new statistic, δ, which includes the sample size:

δ = d·√(n/2) = 0.62 × √(32/2) = 0.62 × 4 = 2.48

(I have used n = 32 because that is the average sample size, and it is very close to the harmonic mean of the sample sizes, which is technically the better estimate.)
(Notice that this value of δ is the same as our mean value of t.) To evaluate δ we need to go to tables of power, one of which is given in the text in the Appendices (p. 679). This table follows.
APPENDIX POWER: POWER AS A FUNCTION OF δ AND SIGNIFICANCE LEVEL (α)
  δ      .10    .05    .02    .01
 1.00    .26    .17    .09    .06
 1.10    .29    .20    .11    .07
 1.20    .33    .22    .13    .08
 1.30    .37    .26    .15    .10
 1.40    .40    .29    .18    .12
 1.50    .44    .32    .20    .14
 1.60    .48    .36    .23    .17
 1.70    .52    .40    .27    .19
 1.80    .56    .44    .30    .22
 1.90    .60    .48    .34    .25
 2.00    .64    .52    .37    .28
 2.10    .68    .56    .41    .32
 2.20    .71    .60    .45    .35
 2.30    .74    .63    .49    .39
 2.40    .78    .67    .53    .43
 2.50    .80    .71    .57    .47
 2.60    .83    .74    .61    .51
 2.70    .85    .77    .65    .55
 2.80    .88    .80    .68    .59
 2.90    .90    .83    .72    .63
 3.00    .91    .85    .75    .66
 3.10    .93    .87    .78    .70
 3.20    .94    .89    .81    .73
 3.30    .95    .91    .84    .77
 3.40    .96    .93    .86    .80
 3.50    .97    .94    .88    .82
 3.60    .98    .95    .90    .85
 3.70    .98    .96    .92    .87
 3.80    .98    .97    .93    .89
 3.90    .99    .97    .94    .91
 4.00    .99    .98    .95    .92
 4.10    .99    .98    .96    .94
 4.20     -     .99    .97    .95
 4.30     -     .99    .98    .96
 4.40     -     .99    .98    .97
 4.50     -     .99    .99    .97
 4.60     -      -     .99    .98
 4.70     -      -     .99    .98
 4.80     -      -     .99    .99
 4.90     -      -      -     .99
 5.00     -      -      -     .99
Table from Howell, D. C. (1997). Statistical Methods for Psychology (4th ed.). Belmont, CA: Duxbury.

For δ = 2.48, the table gives the power as approximately .70 for a two-tailed test at α = .05.
This says that if the parameters are as we expect them to be, we would expect to reject the null hypothesis 70% of the time when we run this experiment. In fact, in our sampling study we rejected it 1 - .315 = 68.5% of the time, which is certainly close.
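(As a check, and again only as an illustrative sketch, the power can be computed directly from the noncentral t distribution; the parameters are the same assumptions as above:)

```python
import numpy as np
from scipy import stats

d = (24.00 - 16.5) / 12.03        # effect size, about 0.62
n = 32                            # assumed per-group sample size
delta = d * np.sqrt(n / 2)        # noncentrality parameter, about 2.49
df = 2 * n - 2
tcrit = stats.t.ppf(0.975, df)    # two-tailed critical value at alpha = .05

# Power = P(reject H0) when t follows the noncentral t distribution.
power = (1 - stats.nct.cdf(tcrit, df, delta)
         + stats.nct.cdf(-tcrit, df, delta))
print(power)                      # roughly .69, close to the tabled .70
```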
Other Effect Sizes
For the two-sample test that we just did: δ = d·√(n/2), as above.
For a two-sample t test with unequal sample sizes (Adams et al.'s actual n's were 35 and 29), use the harmonic mean of the two sample sizes in place of n: nh = 2n1n2/(n1 + n2), so δ = d·√(nh/2). (A quick computational sketch follows this list.)
For a one-sample t test: δ = d·√n, where d = (μ1 - μ0)/σ.
For correlation with two variables, the effect size is ρ itself, and δ = ρ·√(N - 1).
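Here is the promised sketch for the unequal-n case; the harmonic-mean formula above is the assumption, and only the arithmetic is new:

```python
import numpy as np

d = (24.00 - 16.5) / 12.03       # effect size from above, about 0.62

# Adams et al.'s actual sample sizes were 35 and 29.
n1, n2 = 35, 29
nh = 2 * n1 * n2 / (n1 + n2)     # harmonic mean, about 31.7
delta = d * np.sqrt(nh / 2)      # noncentrality parameter, about 2.48
print(nh, delta)
```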
Estimating effect sizes without knowing the parameters.
Cohen (1988) gives very rough guidelines for estimating effect sizes. I give a table in the book, which looks as follows:

Effect size    d
Small          .20
Medium         .50
Large          .80
Explain this table in terms of the overlap of two normal distributions.
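(One way to make the overlap idea concrete: for two normal populations with a common σ whose means differ by d standard deviations, the overlapping proportion works out to 2Φ(-d/2). A quick sketch:)

```python
from scipy import stats

for d in (0.20, 0.50, 0.80):               # Cohen's small, medium, large
    overlap = 2 * stats.norm.cdf(-d / 2)   # overlapping proportion
    print(d, round(overlap, 2))            # about .92, .80, .69
```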
Notice that by this table our effect is somewhere between a medium and a large effect.
If students need to work with power, they should see Cohen's book. It provides a huge amount of material on each type of hypothesis test.
Cohen also presents values that are slightly more accurate than the approximations that I give, but the differences are small.
In the last few years there has been an increase in the use of power analyses. These come in two flavors: a priori analyses, run while designing a study to choose sample sizes, and post hoc (retrospective) analyses, run after the data are in. (SPSS provides the latter under the heading "Observed Power," which might be a better label.) We have basically been working with the latter. A couple of guys did a study, and we asked, "If their results reflect the population parameters, what probability did they have of finding a significant result?"
There is a lot of debate over this approach, but it is gaining acceptance.
I have a lot of trouble with post hoc power, especially when the result is not significant. Talking about power there is like saying, "The difference in my data is not reliable, but if it were, then ..." Do we really want to say that? Sometimes we do, but how do we tell when "sometimes" has occurred?
Software such as SPSS can calculate post hoc power. You have to hunt for it, but it's there (General Linear Model / Options).
Instead of looking at the t test on these means, we could run an ANOVA. If we did it with the one-way ANOVA in the Compare Means menu, we could not get power. But if we used the General Linear Model/Univariate approach, we'd be able to select power analyses, along with lots of other goodies.
Notice that their estimate of power agrees nicely with my estimate of .70.
Their "noncentrality parameter" of 6.219 isn't even close to mine (= 2.48). That's because it is the noncentrality of the F distribution. If we took the square root of it (=2.493) we would have something very close to mine.
Last modified: 10/22/01