Hypothesis Testing

9/18/01

Announcements:

  •  Hand back papers from Thursday.
  • There seemed to be problems on Thursday--straighten them out (see below).

Thursday's lab

The lab was intended to highlight the general theory of hypothesis testing, without using a specific test.

General outline

Of the 72 participants who died from AIDS over the course of one year, 26/72 = 36% were in the group treated with Ritonavir, and 46/72 = 64% were in the group that received a placebo.

Null hypothesis: Ritonavir does not alter the probability of dying; it is the same as the probability in the control group.

Put differently, if you die, you are as likely to come from the control group as from the Ritonavir group.

To know whether a 36%/64% split is very unusual when the null hypothesis is true, we want to "model" the results we would expect under the null.

  • IF the drug had absolutely no effect, we would expect that 50% (actually 49.82%) would fall in the drug group and 50% (actually 50.18%) would fall in the control group.
    • That means we would expect 36 deaths to have come from the drug group and 36 deaths from the control.
    • This is what would happen IF the null were true.
  • We know that in actual practice there will be variability around those values (perhaps 34 of the deaths would be in the drug group, or maybe even only 30.)
  • We want to know how likely it is that we would have an experiment where there is absolutely no drug effect, and yet we get a 26/46 split.

So, if the null is true, ideally 50% of the deaths should be from the control group, and 50% from the Ritonavir group.

We want to model the kind of results we would expect from a 50/50 split.

Draw 72 cases, corresponding to the 72 deaths.

Randomly assign those deaths to the Control and Ritonavir group with a 50/50 probability of falling into each group.

Do this repeatedly, corresponding to a "hypothetically" large number of replications of the experiment.

Determine the distribution of the number of deaths assigned to the Ritonavir group.

Calculate the frequency distribution of the number of deaths per experiment:

 

Plot the resulting histogram:

Count the number of experiments that had results as extreme as the ones we had--that is, the number of experiments with as few as 26 deaths in the Ritonavir group or as many as 46.

This is easiest to see from the frequency distribution:

0 to 26 deaths: 9 experiments

46 or more deaths: 10 experiments

Number at least as extreme = 9 + 10 = 19

Probability of a result at least as extreme under the null hypothesis = 19/1000 = 0.019 = 1.9%

This is a result that is very unlikely to occur by chance if the null hypothesis is true, so we reject H0.
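The sampling procedure described above can be sketched in Python. This is only an illustration, not the SPSS procedure used in the lab: the function names, the seed, and the use of Python's random module are all assumptions introduced here. Because the draws are random, the count of extreme experiments will differ somewhat from the 19 reported above.

```python
import random


def simulate_deaths(n_deaths=72, n_experiments=1000, p_drug=0.5, seed=2001):
    """For each simulated experiment, count how many of the n_deaths fall in
    the drug group when each death lands there with probability p_drug."""
    rng = random.Random(seed)  # seed is arbitrary; it only makes reruns repeatable
    return [
        sum(rng.random() < p_drug for _ in range(n_deaths))
        for _ in range(n_experiments)
    ]


def proportion_as_extreme(counts, low=26, high=46):
    """Proportion of experiments with <= low or >= high drug-group deaths."""
    extreme = sum(1 for c in counts if c <= low or c >= high)
    return extreme / len(counts)


counts = simulate_deaths()
p_two_tailed = proportion_as_extreme(counts)
print(f"Proportion at least as extreme as 26/46: {p_two_tailed:.3f}")
```

Raising n_experiments (for example to 100,000) should make the estimate steadier from run to run.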

I did this using 100,000 samples, instead of 1000, and came up with the following. It is presented only because it confirms the legitimacy of the 1000 sample case.

This distribution is shifted slightly to the right because of the way it is plotted around the midpoint of an interval. No problem.

(When I allowed for fewer intervals, I got a very strange distribution where every other frequency is high. This must have to do with rounding and with the way the random number generator works.)

Theoretical calculation

  • A glance at the figure will show that this is very close to a normal distribution.
  • I know, from material that we have not covered, that the mean of the distribution with an infinite number of samples would be 36, and the standard deviation is 4.2426.
  • We could translate "26 deaths" to a z score, z = (26 - 36)/4.2426 = -2.357, and use that to calculate the area under the normal distribution to the left of that z score.

  • From tables of the normal distribution, the probability of a z score less than -2.357 is .009.
  • For a two-tailed test of |z| > 2.357 we double this to .018.
    • Empirically I got p = .019, and theoretically I got .018. Close enough.
  • Emphasize this calculation and where the pieces, and the probability, come from.
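
The pieces of that calculation can be laid out as a short Python sketch. The mean and standard deviation quoted above agree with the usual binomial formulas, n*p and sqrt(n*p*(1-p)), with n = 72 and p = .50. The variable names are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

n_deaths = 72    # total deaths in the study
p_null = 0.5     # probability a death falls in the drug group under H0
observed = 26    # deaths actually observed in the Ritonavir group

mean = n_deaths * p_null                      # expected drug-group deaths under H0
sd = sqrt(n_deaths * p_null * (1 - p_null))   # binomial standard deviation
z = (observed - mean) / sd                    # distance from the mean in sd units

p_one_tailed = NormalDist().cdf(z)            # area to the left of z
p_two_tailed = 2 * p_one_tailed               # both tails

print(f"mean = {mean:.0f}, sd = {sd:.4f}, z = {z:.3f}")
print(f"one-tailed p = {p_one_tailed:.3f}, two-tailed p = {p_two_tailed:.3f}")
```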

 

Hypothesis Testing

For the last few years, psychologists have been all up in arms over the issue of whether or not hypothesis testing is a moral, ethical, honest way to make a living. The question has still not been resolved, but the report of the APA task force is at 


Basically, they explored those situations in which it makes sense to test a null hypothesis, and those situations in which it does not. This is an excellent article, and one which I would recommend that all students (and faculty) read carefully. The committee has done a good job of steering a middle course between those who wanted to ban hypothesis tests from the journals (an idea I consider crazy) and those who wanted to maintain the status quo.

I am not going to develop the arguments here, but the document is available for those who want to read it. One of the things that I have started doing, however, is to spend more time on looking at effect-size measures. We will see them throughout the course. The idea is to say something more than "The difference is significant."

I started out by talking about the lab that we did on Thursday. Next I will go to a study reported in Hoaglin, Mosteller, and Tukey (1983), on beta-endorphins and their role in pain.

Thursday's lab was a true hypothesis test, but without using any standard statistical procedure. It was the "ideal world" equivalent that other tests aim for. It also ties nicely to some statistical methods that I plan to introduce this semester--starting with the Hoaglin study.

Conclusions

  • From these results we know that when the drug is totally ineffective, a result as extreme as 26 has a (one-tailed) probability of .009 and a (two-tailed) probability of .019. (.019 is not exactly twice .009 simply because of sampling error.)
  • Therefore such outcomes are extremely unlikely to arise if the drug is ineffective.
  • Therefore it is more reasonable to assume that the 26/46 split did not come from a situation where the drug is ineffective.
  • That means that we will conclude that these results came from a situation where the drug is not ineffective. Therefore the drug is effective to some extent.
  • We will conclude that the drug has some effect in reducing the incidence of death among AIDS patients.
    • In fact, Ritonavir is one of the truly effective drugs, though it is most effective when it is used in combination with a suite of other drugs.

Key Concepts

The following concepts came up, directly or indirectly, in the last example. Explain each of them in turn.

  • Null hypothesis (H0)
    • The hypothesis that the probability of death under Ritonavir is equal to the probability of death under the control condition.
    • i.e. p(Ritonavir) = p(control) = .50
  • Alternative hypothesis (H1)
    • This is the hypothesis that is the contradiction of the null
    • It is the hypothesis that the drug group has a lower death rate (or a different death rate).
  • Research hypothesis
    • This is the hypothesis that we started out to investigate
    • It is almost always aligned with the Alternative hypothesis.
  • One-tailed test
    • This tests the hypothesis that the death rate in the drug group is lower than the death rate in the control group.
    • A second one-tailed hypothesis is that the death rate is higher than the rate in the control group.
      • Here that wouldn't make any sense--or would it?
  • Two-tailed test
    • This tests the hypothesis that the death rate in the drug group differs, in either direction, from the death rate in the control group.
  • One- versus two-tailed tests.
    • I'm not going to get into this argument.
    • For this course, we will almost always use two-tailed tests.
  • Type I error
    • The probability that we will falsely reject a true null hypothesis.
      • In this case, we know that 6 times out of 1000 we will actually have a true null but will get a 26 and reject.
    • We usually set a probability (e.g. .05) as a critical value.
      • When p < .05 we reject
      • When p > .05 we don't reject
    • This latter approach sets the probability of a Type I error at .05.
    • The probability of a Type I error is generally represented by alpha (α).
  • Type II error
    • The probability of retaining a false null hypothesis.
    • We can't calculate this for most cases because we need to know how "effective" the drug is.
      • give example
      • If the drug's true effect is such that 48% of the deaths fall in the drug group and 52% in the control group, we are not likely to detect that.
        • In our example, that would mean expecting 34.56, or about 35, deaths in the drug group, which has an empirical probability of .436.
        • Clearly, that result would not lead us to reject.
      • But it would be easy to detect an effect if the drug cures all AIDS cases.
      • This represents the fact that the probability of a Type II error depends heavily on just how effective the drug is--or just how much of a treatment effect we really have.
    • We did sort-of look at this in the lab when we drew samples where we expected the Ritonavir group to do twice as well. We could look to see how often those results produced 26 deaths in the Ritonavir group.
    • Briefly mention power and the effect of sample size.
    • The probability of a Type II error is generally denoted by beta (β).
  • Decision making
    • Draw the standard diagram on the board.

     

    •                 True State of the World
      Decision        H0 True      H0 False
      Reject          Type I       Power
      Retain          Correct      Type II
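
The dependence of Type II errors on the size of the effect can be illustrated with a small simulation. This sketch rests on assumptions that are not in the notes: it rejects H0 whenever the z score for the number of drug-group deaths is beyond ±1.96 (a two-tailed test at alpha = .05 using the normal approximation), and it reads "twice as well" as one third of the deaths falling in the Ritonavir group. The function names and seed are hypothetical.

```python
import random
from math import sqrt

N_DEATHS = 72
Z_CRIT = 1.96  # assumed two-tailed critical value for alpha = .05


def rejects_null(drug_deaths, n=N_DEATHS):
    """True if the count is far enough from n/2 to reject H0."""
    mean = n * 0.5
    sd = sqrt(n * 0.25)
    return abs((drug_deaths - mean) / sd) >= Z_CRIT


def rejection_rate(p_drug, n_experiments=2000, seed=0):
    """Proportion of simulated experiments that reject H0 when each death
    really falls in the drug group with probability p_drug."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        drug_deaths = sum(rng.random() < p_drug for _ in range(N_DEATHS))
        rejections += int(rejects_null(drug_deaths))
    return rejections / n_experiments


alpha_hat = rejection_rate(0.50)     # H0 true: every rejection is a Type I error
power_small = rejection_rate(0.48)   # tiny effect: rejections are rare
power_large = rejection_rate(1 / 3)  # large effect: rejections are common

for label, rate in [("p = .50", alpha_hat), ("p = .48", power_small),
                    ("p = 1/3", power_large)]:
    print(f"{label}: rejected {rate:.3f}, retained {1 - rate:.3f}")
```

When H0 is false, the "retained" proportion estimates beta and the "rejected" proportion estimates power.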

A Second example

  • Example from Hoaglin, Mosteller, and Tukey (1983)
  • Review the basic idea behind the experiment
    • Patients were measured for beta-endorphin levels 12 hours, and again 10 minutes, before surgery.
    • There were 19 patients
    • This represents a repeated-measures study (often called paired samples or matched groups.)
  • The data are below. I made 2 tiny changes to eliminate differences of 0.0.

  • Patient   12 hrs   10 min   Difference
       1       10.0      6.5       -3.5
       2        6.5     14.0        7.5
       3        8.0     13.5        5.5
       4       12.0     18.0        6.0
       5        5.0     14.5        9.5
       6       11.5      9.0       -2.5
       7        5.0     18.0       13.0
       8        3.5     42.0       38.5
       9        7.5      7.4       -0.1
      10        5.8      6.0        0.2
      11        4.7     25.0       20.3
      12        8.0     12.0        4.0
      13        7.0     52.0       45.0
      14       17.0     20.0        3.0
      15        8.8     16.0        7.2
      16       17.0     15.0       -2.0
      17       15.0     11.5       -3.5
      18        4.4      2.5       -1.9
      19        2.0      2.1        0.1
    Mean                            7.7
    St. dev.                       13.519
  •     What would students conclude just from looking at these data?
    • I would conclude that since most of the patients had higher endorphin levels just before surgery, the body must be increasing its production of endorphins in response to stress.
    • But perhaps this is just a fluke.
  • We could create a model of what we would expect if the null were true.
    • If the null were true, we would expect that the probability of a score going up would equal the probability of a score going down = p = .50.
    • Moreover, the magnitude of the positive scores should equal the magnitude of the negative scores, on average.
  • How could we test this hypothesis?
    • Start with what would happen if H0 were true
      • The probability of the 12 hr score being higher than the 10 min score would be .50, and vice versa
      • That means that about half of the difference scores would be positive and half negative.
      • That means that, on average, the mean difference score would be 0.
      • We can compare our mean difference score with 0.
    • Then we could figure out what the distribution of means of difference scores would look like, and compare our obtained mean difference to that.
    • There are several ways to do this, and they all lead to slightly different tests.
      • 1.  We could assume that we were sampling from a normal distribution of difference scores with a mean of 0 and a standard deviation = ???
      • 2.  We could assume that we were sampling from a normal distribution of difference scores with a mean of 0 and a standard deviation estimated by the standard deviation of our sample (13.519).
      • 3. We could do something like we did on Thursday based upon the assumptions in #2
        •  In other words, draw 19 scores from a normal distribution of mean 0 and sd = 13.519
        • Calculate their mean.
        • Repeat this process 1000 times and plot the result.
        • Compare our mean to the means we get when H0 is true.
        • Reject or retain the null.
        • This is not hard to do, but it is awkward in SPSS, so I didn't do it.
      • 4.  We could use a formula to tell us exactly what we would get if we did what I just described.
        • That would calculate a statistic that measures the distance between what we found (mean = 7.7) and what we would find if H0 were true, and express it as a function of the st. dev.
        • This is what a t test actually does.
        • Note: this says that the t test is really just a formulaic way of finding out what would happen if we drew all those samples.
        • When we do it this way, the probability of the data given the null = .023
      • 5. There is a problem with #3 and #4, in that they have us draw our samples from some normal population. But who said that the population of difference scores under the null would be normal?
        • Perhaps it is logarithmic, or exponential, or something else.
      • 6.  An Alternative
        • If the null were true, the 12 hour score is just as likely to be greater than the 10 min as it is to be less. 
          • Therefore, under H0 the difference score is just as likely to be + as -
          • We could model this by taking our difference scores and randomly assigning + and -
          • This would give a set of difference scores, and a mean difference, that is just as likely as the set we got if the null is true.
          • We could repeat this a very large number of times.
          • Then we could compare our obtained mean difference against this.
        • This has the advantage that we are not assuming that our distribution of differences is normally distributed.
        •  As you can see below, it is pretty hard to argue that the differences are normally distributed.
        • What I have done is to draw many samples (2000) where I let the sign of the difference be chosen randomly.
        • I could have simply plotted the means of the differences, but I chose to plot something which is a function of that. I divided each mean difference by its standard error, which is a function of the standard deviation of the differences.

    My results follow:

These results show the t values (my statistic) that we would expect if the null were true. It also shows the location of the obtained statistic ( = 3.354) and the probability of being more extreme than that ( = .003).
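
The random-sign approach (item 6 above) can be sketched as follows. This is an illustration under stated assumptions, not the program that produced the figures reported here: the difference scores are taken from the table as printed, the function names and seed are hypothetical, and the output is not expected to reproduce the reported values exactly.

```python
import random
from math import sqrt
from statistics import mean, stdev

# Difference scores (10 min minus 12 hrs) for the 19 patients, as tabled.
differences = [-3.5, 7.5, 5.5, 6.0, 9.5, -2.5, 13.0, 38.5, -0.1, 0.2,
               20.3, 4.0, 45.0, 3.0, 7.2, -2.0, -3.5, -1.9, 0.1]


def t_statistic(values):
    """Mean difference divided by its standard error."""
    return mean(values) / (stdev(values) / sqrt(len(values)))


def sign_flip_test(values, n_resamples=2000, seed=1983):
    """Randomly reassign + and - to each difference, as H0 allows, and see how
    often the resampled statistic is at least as extreme as the obtained one."""
    rng = random.Random(seed)
    observed = t_statistic(values)
    magnitudes = [abs(v) for v in values]
    extreme = 0
    for _ in range(n_resamples):
        resample = [m if rng.random() < 0.5 else -m for m in magnitudes]
        if abs(t_statistic(resample)) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples


observed_t, p_value = sign_flip_test(differences)
print(f"obtained statistic = {observed_t:.3f}, two-tailed p = {p_value:.4f}")
```

Nothing in this sketch assumes a normal population; the only randomness comes from the signs.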

Conclusions from this example

  • We have to set up a model that reflects the null hypothesis.
    • There is not just one legitimate model
  • Possible models
    • Data were drawn from a normal distribution
    • Data were drawn from some other specified distribution
    • Data were drawn from a distribution that looks like ours except for sign
      • Signs are arbitrary.
  • Each of these leads to some sort of (at least imaginary) sampling study
  • The t test that we will discuss in about 2 weeks is based on the first model
    • except that we do things by formula rather than by actual sampling.
  • An important field of statistics (Resampling statistics) is based on the third model.
  • There is no "right" approach.
    • A lot of the early statisticians, and many of the current ones, thought of the first model as an imperfect way of estimating what we would find under the third model.
    • They justified traditional parametric tests on the grounds that they did a good job of approximating the third model
    • The third model has only become practical once we got computers that could draw thousands of samples almost instantly.
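
The first model can be sketched both ways, by formula and by actual sampling. This is a minimal illustration under stated assumptions: samples of 19 are drawn from a normal population with mean 0 and standard deviation 13.519, and, rather than comparing raw means, each simulated mean is divided by its own standard error so that the simulated proportion is comparable to the probability from a t test. The function names and seed are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

N = 19               # number of patients
OBTAINED_MEAN = 7.7  # mean difference score from the data
SAMPLE_SD = 13.519   # standard deviation of the difference scores


def t_from(sample_mean, sample_sd, n):
    """Distance of the mean from 0 (its value under H0) in standard-error units."""
    return (sample_mean - 0.0) / (sample_sd / sqrt(n))


# By formula: one calculation on the obtained data.
t_obtained = t_from(OBTAINED_MEAN, SAMPLE_SD, N)


def simulate_null_t(n_samples=1000, seed=18):
    """By sampling: draw many samples under H0 and compute t for each."""
    rng = random.Random(seed)
    t_values = []
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, SAMPLE_SD) for _ in range(N)]
        t_values.append(t_from(mean(sample), stdev(sample), N))
    return t_values


t_values = simulate_null_t()
p_simulated = sum(abs(t) >= abs(t_obtained) for t in t_values) / len(t_values)
print(f"t = {t_obtained:.3f}, simulated two-tailed p = {p_simulated:.3f}")
```

With enough samples the simulated proportion should settle near the probability the t test gives by formula.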

Terminology

  • Parameters vs statistics
    • They should already have seen these.
    • Ask about the difference between populations and samples.
    • We know that the statistics (sample means) differ (They refer to actual data)
    • We want to know if the parameters (µ’s) differ
    • Expand on this
  • Random Sample of cases
    • Ask why we care. What are we taking a random sample of?
      • We are taking a random sample of mice, or at least randomly assigning our mice to the conditions.
      • We are not taking a random sample of drug levels, although we will later speak of designs in which we do that.
  • External Validity
    • With changes in the way we conduct research with human subjects, it is getting harder and harder to use random sampling, even among a fairly limited population.
    • We could have taken a random sample of the levels of the independent variable. Ask if that’s what they think we did. What difference would it have made?
  • Random assignment
    • ASK why we care about random assignment
    • ASK how it relates to Internal Validity
  • Research hypothesis (Alternative hypothesis) (H1)
    • ASK them to specify the research and null hypotheses
    • The research hypothesis could be that the groups differ in the number of errors they make. Or it could be that the dose response curve (define) increases up to a point and then drops. These are quite different hypotheses, and McClelland has argued that the sample sizes we use depend on our particular hypothesis--but I won’t go into that here.
    • ASK how we might design the study if we thought that increases in dose lead to continuous increases in recall (at least up to some reasonable limit.)
  • Null Hypothesis (H0)
    • The groups do not differ
  • Sampling error
    • They saw this last Thursday when they drew 1000 random samples of deaths due to AIDS in the Ritonavir group. They’ll see this again in a minute.
    • The change (or instability) in some statistic (e.g. the mean) over repeated sampling.
  • Sampling distribution of the mean
    • This is the distribution that would result if you drew an infinite number of samples and plotted their means—which they just did when we talked about the first model above.
    • We use sampling distributions because they tell us what kinds of values to expect for a statistic under certain conditions.
    • Insert normal distribution here
      • This is the sampling distribution of the mean, so the X axis really shows sample means, not individual observations.
      • I'm not happy with the section--I need to think about it.
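
Sampling error and the sampling distribution of the mean can be illustrated with a short simulation. Everything specific in this sketch is an assumption chosen for illustration: a normal population with mean 100 and standard deviation 15, samples of 25 cases, 2000 samples rather than an infinite number, and the function name and seed.

```python
import random
from math import sqrt
from statistics import mean, stdev

POP_MEAN = 100.0   # hypothetical population mean (the parameter)
POP_SD = 15.0      # hypothetical population standard deviation
SAMPLE_SIZE = 25   # cases per sample


def sample_means(n_samples=2000, seed=7):
    """Draw n_samples random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        mean([rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)])
        for _ in range(n_samples)
    ]


means = sample_means()

# The sample means (statistics) vary from sample to sample: sampling error.
observed_se = stdev(means)
# Theory says their standard deviation should be close to sigma / sqrt(n).
theoretical_se = POP_SD / sqrt(SAMPLE_SIZE)

print(f"mean of sample means = {mean(means):.2f}")
print(f"sd of sample means   = {observed_se:.2f} (theory: {theoretical_se:.2f})")
```

A histogram of the values in means approximates the sampling distribution, with sample means rather than individual observations on the X axis.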

     

    This is already much too long. I'll cut it here--and I bet I don't get this far.

    Last revised: 09/18/01