# Power Analysis

## Announcements:

• Are there questions about the exam?
• How did it go?
• Are there any topics in Chapter 7 that students want me to go over?

## Power:

• I have spent a lot of time on testing the null hypothesis, and a lot of time on the idea of a Type I error and α, the probability of a Type I error.
• But we also need to worry about the probability of failing to reject the null hypothesis when it is false. This is a Type II error, and its probability is denoted by β.
• Power is the complement of a Type II error. It is the probability of rejecting the null hypothesis when it is false. As such, it is defined as Power = 1 − β.
• There is one major problem here. When we are worrying about Type I errors, we know the mean (or mean difference) if the null is true. It is usually 0. We can then go about calculating the probability of getting a significant result when the mean, or the difference between means, is 0.
• The problem with a Type II error, and power, is that if the null is false, we don't know how false it really is. Maybe the mean is 0.7, or maybe it is 5.9, or maybe it is 59.6. And it obviously makes a difference how false it is when it comes to calculating power.
• SO, we have another statistic to worry about, which in the last chapter we called the Effect Size. The effect size is simply a measure of the degree to which the null hypothesis is false.
• We cannot calculate the power of finding an effect until we know how big that effect is. In other words, we first need to know the effect size.
• We will have many different measures of effect size, depending on the statistical test we are using, but they all come down to the same issue of measuring the degree to which the null hypothesis is false, often in standard deviation units.
• As I have said before, effect size has taken on additional importance in recent years, apart from issues of power.
• APA has come out with a report (http://www.apa.org/journals/amp/amp548594.html ) urging editors and researchers to insist on effect size measures in research reports.
• That was why I made a point of including it on the exam.

### An Example:

Adams, Wright, & Lohr (1996). Is homophobia associated with homosexual arousal? Journal of Abnormal Psychology, 105, 440-445.

The authors exposed homophobic and nonhomophobic heterosexual men to videotapes of sexually explicit erotic stimuli consisting of heterosexual, male homosexual, and lesbian videotapes. They recorded the participants' level of sexual arousal (in the standard way).

I believe that they defined homophobic as extreme on a scale that they devised, but I may be wrong. I will only work with the data from the male homosexual video, just to make the example simpler. I don't know the exact standard deviations, but I can make a pretty good guess by running the t test backwards to find the value of the standard error that would give that t, and then converting to a standard deviation. The means are taken from their graph.

I have drawn a set of data that are similar to theirs, though the means are not exact. The independent t test follows. Group 1 represents the homophobic subjects, while Group 2 is the non-homophobic subjects. I have imposed equal sample sizes for our convenience and added 12 points to the means so that the observations would not come out negative. (It pains me to think what a negative number on this measure of arousal would represent.) (I have cheated a bit by making the n's equal, but not much. In other pages using this example you will find sample sizes of 35 and 29, with some slight difference in t.)

(Adams et al. found means of 24.00 and 16.5. Mine differ from those only because I used random samples.)

The important question is:

Suppose that Adams et al. had it exactly right, and the two population means were 24.00 and 16.5, and the (pooled) population standard deviation was 12.03. What is the probability that they would get a significant result when running a study like this?

To put this another way, what is the power of this study if the parameters are as we think they are?

First, let's look at the problem in a crude brute-force way. Suppose that we created populations with exactly these parameters, ran 100 studies from those populations, and counted how often we rejected the null hypothesis.

This is really not a bad way of solving the problem. In fact, there are situations where it is clearly the simplest way to solve the problem.

SPSS is ideal for this task, as the class will see on Thursday. We can just write a simple syntax file to do our calculations, and then look at the results.

SPSS Syntax:

```
new file.
input program.
loop #1 = 1 to 64.
   if $casenum le 32 group = 1.
   if $casenum ge 33 group = 2.
   do repeat response = r1 to r100.
      compute response = rv.normal(0,1).
      if group = 1 response = response*12.03 + 24.
      if group = 2 response = response*12.03 + 16.5.
   end repeat.
   end case.
end loop.
end file.
end input program.
save outfile "adamsequal.sav".
```

```
T-TEST GROUPS=group(1 2)
  /MISSING=ANALYSIS
  /VARIABLES=r1 to r100.
```
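The same brute-force approach can be sketched outside SPSS. Here is a rough Python equivalent: the sample sizes, means, and pooled SD come from the example above, while the number of replications, the random seed, and the use of scipy's pooled-variance t test are my own choices, not part of the original syntax.

```python
# Brute-force power estimate: repeatedly draw two samples from the assumed
# populations and count how often the two-sample t test rejects H0 at .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, mu1, mu2, sd = 32, 24.0, 16.5, 12.03   # parameters from the example
reps = 10_000

rejections = 0
for _ in range(reps):
    g1 = rng.normal(mu1, sd, n)           # homophobic group
    g2 = rng.normal(mu2, sd, n)           # nonhomophobic group
    t, p = stats.ttest_ind(g1, g2)        # pooled-variance t test, 62 df
    if p < .05:
        rejections += 1

power = rejections / reps
print(power)   # roughly .70
```

With 10,000 replications rather than 100, the sampling error in the power estimate is under one percentage point.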

I then entered all of the resulting t values, plus the 100 t values from last year, to create a file of 200 tests. The frequency distribution is given below:

(Well, I was going to put it here, but it was not legible. So I have printed it out.) It can be found at t values--frequency distribution.

Using SPSS, I calculated that the critical (two-tailed) value of t.025 on 62 df is ±1.99897. At the 1% level it is ±2.657.
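Any statistics package will reproduce these critical values; as a quick check, here is the same calculation in Python (scipy is my choice here, not part of the course software):

```python
# Critical two-tailed t values on 62 df.
from scipy import stats

t_05 = stats.t.ppf(0.975, 62)   # alpha = .05, two-tailed
t_01 = stats.t.ppf(0.995, 62)   # alpha = .01, two-tailed
print(t_05, t_01)               # approximately 1.99897 and 2.657
```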

(Explain why I say that I used SPSS to accomplish this calculation.)

Looking at the histogram of these results and using (my copy of) the frequency distribution: 31.5% of the t values are less than 1.99897, and 56.5% of them are less than 2.657.

This means that if the null hypothesis is false to exactly the extent that we believe, 31.5% of our experiments will not be significant at α = .05, and 56.5% will not be significant at α = .01.

This means that roughly one-third to one-half of the time we will fail to get a significant result, even though there is a true difference between the groups.

This is actually looking at the problem backward. We are talking about power, not Type II errors. So what we should really say is that at the .05 level we have a probability of 1 − .315 = .685 of rejecting the null. This is the power of the experiment. At α = .01, the probability of rejecting the null is 1 − .565 = .435.

This t distribution is an example of what is called a noncentral t distribution, because it is not centered at 0, but at some value different from (here, greater than) 0.

### The Central t Distribution

Suppose that the null hypothesis were really true. What would the t distribution look like if I drew 200 pairs of samples under those conditions?

(I have superimposed a normal distribution, which is not all that far off from what the true distribution would look like. It would be exact for infinite sample sizes.)

Note how this distribution differs from the previous one. It has just about the same shape and standard deviation, but its mean is approximately 0, whereas the other had a mean of 2.48. In other words, the distribution when H0 is false is displaced to the right.

I can plot the two distributions relative to each other. First I will replot the one above. They don't line up exactly, but you can imagine what they would look like.

The degree to which the second distribution is displaced relative to the first is called the noncentrality parameter, which is denoted δ.

### Direct Calculations

We have just used the brute-force method of calculating power. There should be a more elegant way, and there is.

Power as a function of:

Effect size: The effect size is a measure of the magnitude of the degree to which the null hypothesis is false.

Get them to tell me what they think the effect size might be in this example.

First, it has to depend on the difference between the two means.

Second, a difference between means must be expressed relative to the size of the standard deviation.

Third, we could include the sample size, but that is not really a measure of how false the null is; it is more a measure of how powerful our experiment is. So we will hold off on that until later.

d = (μ1 − μ2)/s, where s = the estimated standard deviation of the population(s).

Because we have estimates of these parameters, we can insert these estimates in the formula.

This tells us how far apart the means of the populations are, scaled by the size of the standard deviation. In other words, the means are 0.62 standard deviations apart.

That tells us a lot of what we want to know, but it doesn't take sample size into account. That, however, is simple to add.

Define a new statistic (δ) which includes the sample size: δ = d√(n/2). (I have used n = 32 because that is the average sample size, and it is very close to the harmonic mean of the sample sizes, which is technically better as an estimate.)

(Notice that this value of δ is essentially the same as our mean value of t.) To evaluate δ we need to go to tables of power, one of which is given in the text in the Appendices (p. 679). This table follows.
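The power can also be computed directly from the noncentral t distribution rather than interpolated from the table. A sketch in Python, assuming the equal-n relation δ = d√(n/2) and the parameter values from the example:

```python
# Exact power for the equal-n two-sample t test via the noncentral t.
import math
from scipy import stats

mu1, mu2, sd, n = 24.0, 16.5, 12.03, 32
d = (mu1 - mu2) / sd             # effect size, about 0.62
delta = d * math.sqrt(n / 2)     # noncentrality parameter, about 2.49
df = 2 * n - 2                   # 62

t_crit = stats.t.ppf(0.975, df)  # two-tailed, alpha = .05
# Power = P(|T'| > t_crit) where T' is noncentral t with parameter delta.
power = (1 - stats.nct.cdf(t_crit, df, delta)
         + stats.nct.cdf(-t_crit, df, delta))
print(round(d, 2), round(delta, 2))   # 0.62 2.49
print(power)                          # roughly .70
```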

APPENDIX POWER: Power as a function of δ and significance level (α), two-tailed.

| δ | α = .10 | α = .05 | α = .02 | α = .01 |
|------|-----|-----|-----|-----|
| 1.00 | .26 | .17 | .09 | .06 |
| 1.10 | .29 | .20 | .11 | .07 |
| 1.20 | .33 | .22 | .13 | .08 |
| 1.30 | .37 | .26 | .15 | .10 |
| 1.40 | .40 | .29 | .18 | .12 |
| 1.50 | .44 | .32 | .20 | .14 |
| 1.60 | .48 | .36 | .23 | .17 |
| 1.70 | .52 | .40 | .27 | .19 |
| 1.80 | .56 | .44 | .30 | .22 |
| 1.90 | .60 | .48 | .34 | .25 |
| 2.00 | .64 | .52 | .37 | .28 |
| 2.10 | .68 | .56 | .41 | .32 |
| 2.20 | .71 | .60 | .45 | .35 |
| 2.30 | .74 | .63 | .49 | .39 |
| 2.40 | .78 | .67 | .53 | .43 |
| 2.50 | .80 | .71 | .57 | .47 |
| 2.60 | .83 | .74 | .61 | .51 |
| 2.70 | .85 | .77 | .65 | .55 |
| 2.80 | .88 | .80 | .68 | .59 |
| 2.90 | .90 | .83 | .72 | .63 |
| 3.00 | .91 | .85 | .75 | .66 |
| 3.10 | .93 | .87 | .78 | .70 |
| 3.20 | .94 | .89 | .81 | .73 |
| 3.30 | .95 | .91 | .84 | .77 |
| 3.40 | .96 | .93 | .86 | .80 |
| 3.50 | .97 | .94 | .88 | .82 |
| 3.60 | .98 | .95 | .90 | .85 |
| 3.70 | .98 | .96 | .92 | .87 |
| 3.80 | .98 | .97 | .93 | .89 |
| 3.90 | .99 | .97 | .94 | .91 |
| 4.00 | .99 | .98 | .95 | .92 |
| 4.10 | .99 | .98 | .96 | .94 |
| 4.20 | -   | .99 | .97 | .95 |
| 4.30 | -   | .99 | .98 | .96 |
| 4.40 | -   | .99 | .98 | .97 |
| 4.50 | -   | .99 | .99 | .97 |
| 4.60 | -   | -   | .99 | .98 |
| 4.70 | -   | -   | .99 | .98 |
| 4.80 | -   | -   | .99 | .99 |
| 4.90 | -   | -   | -   | .99 |
| 5.00 | -   | -   | -   | .99 |

Table from Howell, D. C. (1997) Statistical Methods
for Psychology (4th ed.) Belmont, CA: Duxbury.

For δ = 2.48, the table gives the power as approximately .70 for a two-tailed test at α = .05.

This says that if the parameters are as we expect them to be, we would expect to reject the null hypothesis 70% of the time when we run this experiment. In fact, in our sampling study we rejected it 1 - .315 = 68.5% of the time, which is certainly close.

### Other Effect Sizes

For the two-sample test that we just did: d = (μ1 − μ2)/σ and δ = d√(n/2).

For a two-sample t test with unequal sample sizes, replace n with the harmonic mean of the two sample sizes, n̄h = 2n1n2/(n1 + n2). (Adams et al.'s actual sample sizes were 35 and 29.)

For a one-sample t test: d = (μ1 − μ0)/σ and δ = d√n.

For correlation with two variables, the effect size is the correlation ρ itself.
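As a worked check of the unequal-n case, here is a small Python computation using the harmonic mean of the two sample sizes. The harmonic-mean substitution is the standard approach; the value d = 0.62 comes from the example above.

```python
# Noncentrality for unequal group sizes: replace n with the harmonic mean.
import math

d = 0.62                            # effect size from the example
n1, n2 = 35, 29                     # Adams et al.'s actual sample sizes
n_h = 2 * n1 * n2 / (n1 + n2)       # harmonic mean, about 31.72
delta = d * math.sqrt(n_h / 2)
print(round(n_h, 2), round(delta, 2))   # 31.72 2.47
```

The result is essentially the same as the equal-n approximation (δ ≈ 2.48), which is why using the average sample size was harmless here.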

### Estimating Effect Sizes Without Knowing the Parameters

Cohen (1988) gives very rough guidelines about estimating effect sizes. I give a table in the book, which looks as follows.

| Effect Size | d | Percent Overlap |
|--------|-----|----|
| Small  | .20 | 85 |
| Medium | .50 | 67 |
| Large  | .80 | 53 |

Explain this table in terms of the overlap of two normal distributions.
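One way to make the overlap column concrete is to compute it. The values match the complement of Cohen's U1 measure of non-overlap for two normal distributions whose means differ by d standard deviations; that identification of the column with U1 is my reading of the table, not something stated in the notes.

```python
# Percent overlap of two unit-variance normal distributions separated by
# d standard deviations, as the complement of Cohen's U1 statistic.
from scipy.stats import norm

overlaps = {}
for label, d in [("Small", 0.20), ("Medium", 0.50), ("Large", 0.80)]:
    phi = norm.cdf(d / 2)                       # standard normal CDF at d/2
    u1 = (2 * phi - 1) / phi                    # Cohen's U1 (non-overlap)
    overlaps[label] = round(100 * (1 - u1))     # percent overlap
print(overlaps)   # {'Small': 85, 'Medium': 67, 'Large': 53}
```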

Notice that by this table our effect is somewhere between a medium and a large effect.

If students need to work with power, they should see Cohen's book. It provides a huge amount of material on each type of hypothesis test.

Cohen also presents values that are slightly more accurate than the approximations that I give, but the differences are small.

### Post-Hoc Power

In the last few years there has been an increase in the use of power analyses. (SPSS provides this under the heading "Observed Power," which might be a better label.) This has had two effects.

• An increase in the number of people who calculate the power of their experiment before they begin (or at least before they submit their grant and/or go to the appropriate university committees).
• An increase in journal editors' requests for an after-the-fact estimate of what the power of the experiment was.
• We have basically been working with the latter. A couple of guys did a study, and we asked "If their results reflect the population, what probability did they have of finding a significant result?"

There is a lot of debate over this approach, but it is gaining acceptance.

I have a lot of trouble with Post Hoc power, especially when the null is not significant. Talking about power here would be like saying "The difference in my data is not reliable, but if it were, then ..." Do we really want to say that? Sometimes we do, but how do we tell when the "sometimes" has occurred?

Software such as SPSS will calculate post-hoc power. You have to hunt for it, but it's there (General Linear Model / Options).

Instead of looking at the t test on these means, we could run an ANOVA. If we did it with the one-way ANOVA in the Compare Means menu, we could not get power. But if we used the General Linear Model/Univariate approach, we would be able to select power analyses, along with lots of other goodies.

Notice that their estimate of power agrees nicely with my estimate of .70.

Their "noncentrality parameter" of 6.219 isn't even close to mine (2.48). That's because it is the noncentrality parameter of the F distribution. If we take its square root (≈ 2.49) we get something very close to mine.