t-tests Continued

10/16/01

Announcements

Discuss exam

Distinguish between odds and odds ratio

Review how to read ASCII data. Students can practice on demo.dat.

Introduction to t tests for two independent samples

"Introduction" is a misnomer, because we have already seen independent samples t tests last Thursday.

Here we are simply expanding what we already know to the case of two independent groups. 

It is important that the observations be independent, which almost always means that they come from two different groups of subjects.

They saw an example of this test in lab on Thursday when they looked at Early and Late bilingual speakers.

Null Hypothesis

H0: µ1 = µ2

against alternative hypothesis

H1: µ1 ≠ µ2 (Two-tailed)

or

H1: µ1 < µ2 or H1: µ1 > µ2 (One-Tailed)

I will almost always talk about testing two-tailed hypotheses.

 Example

Jo Epping’s practicum presentation looked at cancer data. There is an argument in the literature that psychological factors play a role in survival from cancer. One important variable seems to be "avoidance." Patients with a high incidence of reported avoidance of thoughts related to cancer are believed to have a poorer outcome. The data are available at Jo.sav and Jo.dat. (If you are using Netscape and it won't open these files properly, hold down the shift key while you double click, and it will allow you to save the file on your machine.)

She divided her subjects into two groups. Group 1 was classed as a Success because those subjects were cancer free at follow-up. Group 2 was classed as Failed because those subjects either were not cancer free or had died at follow-up.

The dependent variable is the reported level of avoidance at Time 1; higher scores represent more avoidance.

Data

              Group 1      Group 2
              (Success)    (Fail)
Mean          14.41        17.00
Variance      21.247       19.647
ni            49           18

 

Student's t test for two independent samples

A t test asks about the probability of getting a difference between sample means at least as large as the one we obtained when we draw samples from infinitely large populations with equal means.

In the process, we will assume that our data were drawn from normally distributed populations with equal variances.

Assumptions:

Both samples come from populations with the same σ²

(This is reasonably important, but not absolutely essential.)

Both samples come from normally distributed populations

(This is less important)

The test is robust (define) against violations of these assumptions in most cases. BUT it is not robust when the assumptions are violated and you also have substantially unequal sample sizes.

By making fewer assumptions, the randomization tests (which I have discussed elsewhere) are not subject to the lack of robustness with assumption violations and unequal sample sizes. They may, however, be less powerful in the case where the assumptions are true. "You buy information with assumptions."

Often we can transform our data in some way to make them meet these assumptions.

They have already seen how to use sqrt and log transforms

Psychologists worry about whether transformations are appropriate; statisticians can’t believe that people would ever doubt that they are.

Formulae

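UNPOOLED

\[ t = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{\dfrac{s_i^2}{n_i} + \dfrac{s_j^2}{n_j}}} \]

on ni + nj - 2 df (or modified)

 

POOLED

\[ s_p^2 = \frac{(n_i - 1)s_i^2 + (n_j - 1)s_j^2}{n_i + n_j - 2}, \qquad t = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{s_p^2\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}} \]

on ni + nj - 2 df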

 

 Example

We'll go back to Jo Epping's data that we used before. Those data are reproduced below.

The dependent variable is the reported level of avoidance at Time 1.

Data

              Group 1      Group 2
              (Success)    (Fail)
Mean          14.41        17.00
Variance      21.247       19.647
ni            49           18

Unpooled t test

 

I have created two tables out of one by using "pivot table" features. The unequal variance table is shown above.

Traditional df = 49 + 18 - 2 = 65

Tables have 50 and 100 df. We could interpolate, but even if we had only 50 df the critical value would have been tα/2(50) = 2.009.

We can reject H0 because -2.097 falls outside the interval from -2.009 to +2.009.

The exact probability, as calculated from Probability Calculator, is .0397.
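As a rough check on the arithmetic, the unpooled t can be recomputed from the summary statistics alone. This is a minimal sketch, not the SPSS procedure: the function name `unpooled_t` is hypothetical, and it uses only the rounded means, variances, and sample sizes from the table.

```python
import math

def unpooled_t(mean1, var1, n1, mean2, var2, n2):
    """t for two independent samples, keeping the two variance estimates separate."""
    # Standard error of the difference between means, without pooling.
    se_diff = math.sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / se_diff

# Summary statistics from the table (Success first, then Fail).
t_unpooled = unpooled_t(14.41, 21.247, 49, 17.00, 19.647, 18)
print(round(t_unpooled, 3))       # -2.097

# 2.009 is the tabled two-tailed critical value for 50 df quoted in the notes.
print(abs(t_unpooled) > 2.009)    # True, so H0 is rejected
```

The sign is negative only because the Success mean was entered first; reversing the groups gives +2.097 and the same decision.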

We can conclude that the two groups differ in their reported level of avoidance. The group that recovered reported significantly less avoidance than did the group that did not recover. REMEMBER, THESE ARE REAL DATA.

Note: if we look at the Global Symptom Index, adjusted for medical problems, we find no difference between groups. We are looking at something more refined than just a bunch of symptoms.

However, when we do not assume that the variances are equal in the two populations, we have to adjust our degrees of freedom. The t that we compute is a legitimate t, but not on N1 + N2 - 2 df.

\[ df' = \frac{\left(\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}\right)^2}{\dfrac{\left(s_1^2/N_1\right)^2}{N_1 - 1} + \dfrac{\left(s_2^2/N_2\right)^2}{N_2 - 1}} \]

If you apply this formula to our data you get df' = 31.4, which we can round to 31. In that case the critical value of t (two-tailed) would be approximately ±2.04. We would still reject the null.
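The adjusted df can be verified numerically. The sketch below assumes that the formula referred to is the usual Welch-Satterthwaite approximation, which does reproduce the 31.4 reported here; the function name `welch_df` is hypothetical.

```python
def welch_df(var1, n1, var2, n2):
    """Adjusted (Welch-Satterthwaite) degrees of freedom for the unpooled t."""
    a = var1 / n1   # squared standard error of the first mean
    b = var2 / n2   # squared standard error of the second mean
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Variances and sample sizes from the table (Success, then Fail).
df_adj = welch_df(21.247, 49, 19.647, 18)
print(round(df_adj, 1))   # 31.4
```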

 

Pooled estimates

 

Here the df = N1 + N2 - 2 = 65, and the critical value would be approximately ±2.00.

Notice that we come to the same conclusion as did SPSS. The exact probability is .0434.
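The pooled computation can also be reproduced from the summary statistics. This is only an illustrative sketch: the function name `pooled_t` is hypothetical, and because it works from the rounded means and variances in the table rather than the raw data, its result need not match SPSS to the last decimal.

```python
import math

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Pooled-variance t for two independent samples; returns (t, pooled variance, df)."""
    df = n1 + n2 - 2
    # Weight each variance by its own degrees of freedom.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, pooled_var, df

t_pooled, pooled_var, df = pooled_t(14.41, 21.247, 49, 17.00, 19.647, 18)
print(round(pooled_var, 2), df)   # 20.83 65
print(round(t_pooled, 3))         # -2.059
```

The pooled t is slightly smaller in absolute value than the unpooled t of -2.097, but it still exceeds the critical value of about 2.00 on 65 df.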

 

To pool or not to pool

First of all, we need to consider tests on the equality of variances in the two populations. In the book I talk about tests due to Levene and to O'Brien. There is some reason to think that O'Brien's test is slightly better, but not all that much better. Levene's is the test that is more often implemented in software.

You can see Levene's test above. It creates an F statistic, and prints out the significance level of that statistic under the null hypothesis of equal population variances. You can see that we don't come close to rejecting that null, because the significance level is well above .05.

Students often ask when we should pool the variances and when we should not. I am uncomfortable saying that you should pool whenever Levene's test is not significant, but I don't have a better rule to suggest.

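Keep in mind that pooling doesn't do anything to the value of t if you have equal sample sizes. The computed t will be the same whether you pool or not, although the adjusted df for the unpooled test can still be smaller than N1 + N2 - 2.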

Also keep in mind that when you don't pool the variances, and then calculate an adjusted df, that adjusted df will never be smaller than the df for the smaller group. So, if the t would be significant for that minimum df, then it is significant with the adjusted df.
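Both points can be checked with a few lines of arithmetic. In this sketch the helper names are hypothetical, and the equal sample sizes of 30 per group are made up for illustration; only the variances and the actual sample sizes of 49 and 18 come from the notes.

```python
import math

def unpooled_se(var1, n1, var2, n2):
    """Standard error of the difference using separate variances."""
    return math.sqrt(var1 / n1 + var2 / n2)

def pooled_se(var1, n1, var2, n2):
    """Standard error of the difference using the pooled variance."""
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def welch_df(var1, n1, var2, n2):
    """Adjusted degrees of freedom for the unpooled t."""
    a, b = var1 / n1, var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

# Hypothetical equal sample sizes (30 per group) with the variances from the notes.
equal_n = math.isclose(unpooled_se(21.247, 30, 19.647, 30),
                       pooled_se(21.247, 30, 19.647, 30))
print(equal_n)   # True: same standard error, hence the same t

# Actual sample sizes: adjusted df lies between the smaller group's df and N1 + N2 - 2.
df_adj = welch_df(21.247, 49, 19.647, 18)
in_bounds = min(49, 18) - 1 <= df_adj <= 49 + 18 - 2
print(in_bounds)  # True (17 <= 31.4 <= 65)
```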

 

Effect Size

The American Psychological Association has recently come out with a statement that they want to see people present some measure of the magnitude of an effect. One possibility involves a squared correlation coefficient, which we will come to in Chapter 9. (I'll skip it here.) Another is Cohen's d, which probably should technically be attributed to Hedges, because Cohen's formula only included parameters.

\[ d = \frac{\bar{X}_1 - \bar{X}_2}{s_p} \]

where sp is the square root of the pooled variance estimate.

All that this formula is really doing is expressing the distance between the means of the two groups in terms of the size of the standard deviation. For example, d = .6 would simply say that the two group means were 6/10ths of a standard deviation apart.

For our data:

\[ d = \frac{14.41 - 17.00}{\sqrt{20.829}} = \frac{-2.59}{4.564} = -0.57 \]

Since we are talking about distance, and distance is always positive, I would drop the negative sign and conclude that the effect size is .57. This means that the difference in avoidance means for the two groups is a little more than half a standard deviation. This would be classed as a medium effect size.

Even if the units themselves are not particularly meaningful--and I don't get a lot of intuitive meaning out of an avoidance score of 14.41--the effect size measure has at least some meaning.

Demonstrate this in terms of overlapping distributions.
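One way to put numbers on the overlap is sketched below. It rests on assumptions that are not in the data themselves: both populations are treated as normal with equal standard deviations, and the means are placed d = .57 standard deviations apart. The variable names are hypothetical.

```python
from statistics import NormalDist

d = 0.57  # effect size from the notes, in pooled standard deviation units

# Assumed model: two normal populations with equal SDs and means d apart.
success = NormalDist(mu=0.0, sigma=1.0)
fail = NormalDist(mu=d, sigma=1.0)

# Shared area under the two density curves (1.0 would mean identical distributions).
overlap = success.overlap(fail)

# Proportion of the Fail population scoring above the Success population's mean.
above_success_mean = 1 - fail.cdf(success.mean)

print(round(overlap, 2), round(above_success_mean, 2))   # 0.78 0.72
```

Under these assumptions a medium effect still leaves the two distributions sharing roughly three quarters of their area, which is a useful reminder of how much individual scores overlap even when the means differ significantly.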

Confidence Limits

We calculated confidence limits on Thursday.

Confidence limits demarcate an interval with a known probability of including the population mean, or, in this case, the difference between two population means.

One way to think of confidence limits is in terms of how large our true difference could be and still give us a reasonable chance of obtaining the sample mean difference that we obtained. We had a difference of 2.59 points. If the true means had been 35 points apart, it is hard to believe that I could have come up with a sample difference of 2.59. On the other hand, if the true difference had been 2.50, I have no trouble believing that I could get a sample difference of 2.59. We ask how large a true difference would just barely lead us to not reject the null with a sample difference of 2.59. Then we ask the same question about how small a difference would be possible.

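We can see this from the formula for t:

\[ t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_{\bar{X}_1 - \bar{X}_2}} \]

Setting t to its critical values of ±2.00 on 65 df and solving for the difference in population means, with the sample difference of 2.59 taken as positive and a pooled standard error of 1.258, gives

\[ CI_{.95} = 2.59 \pm 2.00(1.258) = 2.59 \pm 2.516 \]

or limits of .074 and 5.106.

Those are very close to the SPSS results when you notice that SPSS expressed the former in E notation.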

If we did not pool, we would do the same thing, except the standard error of the difference would involve the non-pooled value, and the critical value of t would change because of a change in the df.
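A minimal sketch of the unpooled limits, computed from the summary statistics, follows. The critical value of 2.04 is the approximate two-tailed value for the adjusted df of 31 quoted earlier in these notes; the variable names are hypothetical.

```python
import math

mean_diff = 17.00 - 14.41                           # Fail minus Success, 2.59
se_unpooled = math.sqrt(21.247 / 49 + 19.647 / 18)  # non-pooled standard error

# Approximate two-tailed critical value of t for the adjusted df of 31.
t_crit = 2.04

lower = mean_diff - t_crit * se_unpooled
upper = mean_diff + t_crit * se_unpooled
print(round(lower, 3), round(upper, 3))             # 0.071 5.109
```

Here the smaller standard error and the larger critical value nearly cancel, so the unpooled limits are almost the same as the pooled limits of .074 and 5.106.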

We can roughly conclude that the probability is .95 that the true difference in mean avoidance scores between Survivors and those who are not Survivors is between .074 and 5.106.

That really is not strictly accurate. The procedure that we have used has a probability of .95 of giving us limits that include the parameter (or the difference in parameters). The parameter is a fixed (though unknown) value, and it has no sampling error. It doesn't jump into or out of our interval with some probability. It is our interval that is a random variable. As Good (1999) said, our confidence is in the method, not in the specific interval. At the same time, it is much easier to be sloppy, and not all that awful.
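The idea that the confidence is in the method can be illustrated by simulation. Everything about the populations in this sketch is hypothetical: I assume two normal populations with a common standard deviation of 4.56 and a true difference of 2.59, draw samples of 49 and 18 repeatedly, and count how often the pooled 95% limits include the true difference. The value 1.997 is the two-tailed .05 critical value of t on 65 df.

```python
import math
import random

random.seed(1)

TRUE_DIFF = 2.59    # hypothetical true difference in population means
SIGMA = 4.56        # hypothetical common population standard deviation
N1, N2 = 49, 18     # sample sizes from the notes
T_CRIT = 1.997      # two-tailed .05 critical value of t on 65 df
REPS = 2000

def mean_and_variance(scores):
    """Sample mean and unbiased sample variance."""
    m = sum(scores) / len(scores)
    v = sum((x - m) ** 2 for x in scores) / (len(scores) - 1)
    return m, v

covered = 0
for _ in range(REPS):
    g1 = [random.gauss(0.0, SIGMA) for _ in range(N1)]
    g2 = [random.gauss(TRUE_DIFF, SIGMA) for _ in range(N2)]
    m1, v1 = mean_and_variance(g1)
    m2, v2 = mean_and_variance(g2)
    pooled_var = ((N1 - 1) * v1 + (N2 - 1) * v2) / (N1 + N2 - 2)
    se = math.sqrt(pooled_var * (1 / N1 + 1 / N2))
    diff = m2 - m1
    # The limits change from sample to sample; TRUE_DIFF never does.
    if diff - T_CRIT * se <= TRUE_DIFF <= diff + T_CRIT * se:
        covered += 1

coverage = covered / REPS
print(coverage)   # proportion of intervals that include the true difference
```

The proportion should come out close to .95. Any single interval either contains the true difference or it does not; the .95 describes how often the procedure succeeds over repeated samples.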

One thing that I have not mentioned specifically elsewhere is the close tie between the t test and confidence limits. If you are running a t test on the difference between two means, a significant difference between means will guarantee that the confidence limits will not include 0.00, and vice versa.

Going back to difference scores

I got several people worried two years ago when I spoke about the questions that have arisen about difference scores. The basic idea was that Cronbach and Furby (1970) suggested difference scores (or change scores) are bad things. I also referred to a paper by ?? that purported to show that as the reliability of the difference scores increased, the power of the test decreased.

Current thinking has raised questions about both of these conclusions. Bruno Zumbo just wrote a paper, in press, that points to a bunch of holes in the first argument, and Zumbo and others wrote an earlier paper that pointed out that the reliability paradox really depends on how you create the unreliable data.

One thing to keep in mind is that an alternative approach, which is often very similar, is to use an analysis of variance on the posttest scores, with the pretest score as the covariate. I won't even talk about ANCOVA until half way through next semester. There is a paper by Dugard & Todman (1995), Educational Psychology, which argues that in this case an analysis of covariance is preferable. In one limited case, the two approaches are exactly equal. Zumbo gives some simple rules about when one approach is likely to be better than the other.

I don't want people to worry overly much about the problem of difference scores. I just wanted to make them aware that questions have been raised. But those questions were first raised 28 years ago, and the difference score is still alive and well.

 

Last revised: 10/16/01