
Hypothesis Testing
9/18/01
Announcements:
- Hand back papers from Thursday.
- There seemed to be problems on Thursday--straighten them out (see below).
Thursday's lab
The lab was intended to highlight the general theory of hypothesis
testing, without using a specific test.
General outline
Of the 72 participants who died from AIDS over the course of one
year, 26/72 = 36% were in the group treated with Ritonavir, and 46/72 =
64% were in the group that received a placebo.
Null hypothesis: Ritonavir does not alter the probability of dying
beyond what would be the probability in the control group
Put differently, if you die, you are as likely to come from the
control group as from the Ritonavir group.
To know whether a 36%/64% split is very unusual when the null
hypothesis is true, we want to "model" the results we would
expect under the null.
- IF the drug had absolutely no effect, we would expect that 50% (actually 49.82%) would
fall in the drug group and 50% (actually 50.18%) would fall in the control group.
- That means we would expect 36 deaths to have come from the drug group and 36 deaths from
the control.
- This is what would happen IF the null were true.
- We know that in actual practice there will be variability around those values (perhaps
34 of the deaths would be in the drug group, or maybe even only 30.)
- We want to know how likely it is that we would have an experiment where there is
absolutely no drug effect, and yet we get a 26/46 split.
So, if the null is true, ideally 50% of the deaths should be from the
control group, and 50% from the Ritonavir group.
We want to model the kind of results we would expect from a 50/50
split.
Draw 72 cases, corresponding to the 72 deaths.
Randomly assign those deaths to the Control and Ritonavir group
with a 50/50 probability of falling into each group.
Do this repeatedly, corresponding to a "hypothetically"
large number of replications of the experiment.
Determine the distribution of the number of deaths assigned to
the Ritonavir group.
Calculate the frequency distribution of the number of deaths per
experiment:

Plot the resulting histogram:

Count the number of experiments that had results as extreme as
the ones we had--that is, the number of experiments with as few as
26 deaths in the Ritonavir group or as many as 46.
Easiest to see from the frequency distribution.
0 - 26 = 9
46 - 999 = 10
Number more extreme = 9 + 10 = 19
Probability of a result at least as extreme under the null
hypothesis = 19/1000 = 0.019 = 1.9%
This is a result that is very unlikely to occur by chance if
the null hypothesis is true, so reject H0 .
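The sampling procedure just described can be sketched in a few lines of Python. This is only an illustration, not the software used in the lab: the function names, the fixed seed, and the use of Python's standard random number generator are all assumptions. Because the draws are random, the proportion it prints will be close to, but not necessarily equal to, the 19/1000 reported above.

```python
import random

def simulate_null(n_deaths=72, n_experiments=1000, seed=1):
    """Number of Ritonavir-group deaths in each simulated experiment under H0."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    counts = []
    for _ in range(n_experiments):
        # Under the null, each death falls in the Ritonavir group with p = .50.
        counts.append(sum(rng.random() < 0.5 for _ in range(n_deaths)))
    return counts

def two_tailed_p(counts, low=26, high=46):
    """Proportion of experiments with as few as `low` or as many as `high` deaths."""
    extreme = sum(1 for c in counts if c <= low or c >= high)
    return extreme / len(counts)

counts = simulate_null()
print("experiments:", len(counts))
print("p (two-tailed):", two_tailed_p(counts))
```

Raising `n_experiments` (for example to 100,000) gives a smoother distribution and a more stable estimate of the probability.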
I did this using 100,000 samples, instead of 1000, and came up with the
following. It is presented only because it confirms the legitimacy of the 1000
sample case.

This distribution is shifted slightly to the right because of the way it
is plotted around the midpoint of an interval. No problem.
(When I allowed for fewer intervals, I got a very strange distribution
where every other frequency is high. This must have to do with rounding and
with the way the random number generator works.)
Theoretical calculation
- A glance at the figure will show that this is very close to a normal distribution.
- I know, from material that we have not covered, that the mean of the distribution with
an infinite number of samples would be 36, and the standard deviation is
4.2426.
- We could translate "26 deaths" to a z score, and use that to
calculate the area under the normal distribution to the left of that z score:
z = (26 - 36)/4.2426 = -2.357.

- From tables of the normal distribution, the probability of a z score less than
-2.357 is .009.
- For a two-tailed test of |z| > 2.357 we double this to
.018.
- Empirically I got p = .019, and theoretically I got .018.
Close enough.
- Emphasize this calculation and where the pieces, and the probability,
come from.
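To show where the pieces of the theoretical calculation come from, here is a minimal sketch using only Python's standard library. It assumes the binomial formulas for the mean (n times p) and the standard deviation (square root of n times p times 1 - p), which is where 36 and 4.2426 come from, and it uses the normal approximation without a continuity correction, as in the notes. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, p0 = 72, 0.5                     # 72 deaths, each .50 likely to be in the drug group
mean = n * p0                       # expected deaths in the drug group under H0 (36)
sd = sqrt(n * p0 * (1 - p0))        # binomial standard deviation (about 4.2426)

z = (26 - mean) / sd                # z score for "26 deaths"
p_one = NormalDist().cdf(z)         # area to the left of z under the standard normal
p_two = 2 * p_one                   # double it for a two-tailed test

print(f"mean = {mean}, sd = {sd:.4f}")
print(f"z = {z:.3f}, one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```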
Hypothesis Testing
For the last few years, psychologists have been all up in arms over the
issue of whether or not hypothesis testing is a moral, ethical, honest way to
make a living. The question has still not been resolved, but the report of the APA task force is at
Basically, they explored those situations in which it makes sense to test a
null hypothesis, and those situations in which it does not. This is an
excellent article, and one which I would recommend that all students (and
faculty) read carefully. The committee has done a good job of steering a
middle course between those who wanted to ban hypothesis tests from the
journals (an idea I consider crazy) and those who wanted to maintain the status
quo.
I am not going to develop the arguments here, but the document is available
for those who want to read it. One of the things that I have started doing,
however, is to spend more time on looking at effect-size measures. We will see
them throughout the course. The idea is to say something more than "The
difference is significant."
I started out by talking about the lab that we did on Thursday.
Next I will go to a study reported in Hoaglin, Mosteller, and Tukey (1983), on
beta-endorphins and their role in pain.
Thursday's lab was a true hypothesis test, but without using any standard statistical
procedure. It was the "ideal world" equivalent that other tests aim for.
It also ties nicely to some statistical methods that I plan to introduce this
semester--starting with the Hoaglin study.
Conclusions
- From these results we know that when the drug is totally ineffective, a result as
extreme as 26 has a (one-tailed) probability of .009, and a (two-tailed)
probability of .019. (.019 is not exactly twice .009 simply because of sampling
error.)
- Therefore such outcomes are extremely unlikely to arise if the drug is ineffective.
- Therefore it is more reasonable to assume that the 26/46 split did not come from a
situation where the drug is ineffective.
- That means that we will conclude that these results came from a situation where the drug
is not ineffective. Therefore the drug is effective to some extent.
- We will conclude that the drug has some effect in reducing the incidence of death among
AIDS patients.
- In fact, Ritonavir is one of the truly effective drugs, though it is most effective when
it is used in combination with a suite of other drugs.
Key Concepts
The following concepts came up, directly or indirectly, in the last example. Explain
each of them in turn.
- Null hypothesis (H0)
- The hypothesis that the probability of death under Ritonavir is equal to the probability
of death under the control condition.
- i.e. p(Ritonavir) = p(control) = .50
- Alternative hypothesis (H1)
- This is the hypothesis that is the contradiction of the null
- It is the hypothesis that the drug group has a lower death rate (or
a different death rate).
- Research hypothesis
- This is the hypothesis that we started out to investigate
- It is almost always aligned with the Alternative hypothesis.
- One-tailed test
- This tests the hypothesis that the death rate in the drug group is lower than
the death rate in the control group.
- A second one-tailed hypothesis is that the death rate is higher than the rate
in the control group.
- Here that wouldn't make any sense--or would it?
- Two-tailed test
- This tests the hypothesis that the death rate in the drug group is different from
(either lower or higher than) the death rate in the control group.
- One- versus two-tailed tests.
- I'm not going to get into this argument.
- For this course, we will almost always use two-tailed tests.
- Type I error
- The probability that we will falsely reject a true null hypothesis.
- In this case, we know that 6 times out of 1000 we will actually have a true null but
will get a 26 and reject.
- We usually set a probability (e.g. .05) as a critical value.
- When p < .05 we reject
- When p > .05 we don't reject
- This latter approach sets the probability of a Type I error at .05.
- The probability of a Type I error is generally represented by alpha
(α)
- Type II error
- The probability of falsely retaining a false null hypothesis.
- We can't calculate this for most cases because we need to know how "effective"
the drug is.
- give example
- If the drug actually means that 48% versus 52% of the deaths really should fall
in the drug group, we are not likely to detect that.
- In our example, that would mean expecting 34.56, or about 35, deaths, which has an
empirical probability of .436.
- Clearly, that result would not lead us to reject.
- But it would be easy to detect an effect if the drug cures all
AIDS cases.
- This represents the fact that the probability of a Type II error
depends heavily on just how effective the drug is--or just how
much of a treatment effect we really have.
- We did sort-of look at this in the lab when we drew samples where we
expected the Ritonavir group to do twice as well. We could look to see
how often those results produced 26 deaths in the Ritonavir group.
- Briefly mention power and the effect of sample size.
- This probability is generally denoted by beta (β)
- Decision making
- Draw the standard diagram on the board.
| Decision | True State of the World | |
| | H0 True | H0 False |
| Reject | Type I | Power |
| Retain | Correct | Type II |
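The way the Type II error depends on the size of the true effect can be sketched with an exact binomial calculation. Several things here are assumptions made for illustration: the rejection region (26 or fewer, or 46 or more, deaths in the drug group) is taken from the observed result rather than from a conventional .05 cutoff, and "the Ritonavir group doing twice as well" is read as each death having a 1/3 chance of falling in the drug group. The function name is hypothetical.

```python
from math import comb

def reject_probability(p_true, n=72, low=26, high=46):
    """P(X <= low or X >= high) when X ~ Binomial(n, p_true).

    X is the number of the n deaths that fall in the drug group.
    """
    def pmf(k):
        return comb(n, k) * p_true**k * (1 - p_true)**(n - k)
    lower_tail = sum(pmf(k) for k in range(0, low + 1))
    upper_tail = sum(pmf(k) for k in range(high, n + 1))
    return lower_tail + upper_tail

# When the null is true (p = .50) rejecting is a Type I error.
print(f"p = .50: P(reject) = {reject_probability(0.50):.3f}  (Type I error rate)")

# When the null is false, rejecting is correct (power); retaining is a Type II error.
for p_true in (0.48, 1 / 3):
    power = reject_probability(p_true)
    print(f"p = {p_true:.2f}: power = {power:.3f}, Type II error = {1 - power:.3f}")
```

A small effect (48% versus 52%) is almost never detected, while a large one is detected most of the time; a larger n would raise the power in both cases.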
A Second example
- Example from Hoaglin, Mosteller, and Tukey (1983)
- Review the basic idea behind the experiment
- Patients were measured for beta-endorphin levels 12 hours, and again
10 minutes, before surgery.
- There were 19 patients
- This represents a repeated-measures study (often called paired
samples or matched groups.)
- The data are below. I made 2 tiny changes to eliminate differences of
0.0
| Patient | 12 hrs | 10 min | Difference |
| 1 | 10.0 | 6.5 | -3.5 |
| 2 | 6.5 | 14.0 | 7.5 |
| 3 | 8.0 | 13.5 | 5.5 |
| 4 | 12.0 | 18.0 | 6.0 |
| 5 | 5.0 | 14.5 | 9.5 |
| 6 | 11.5 | 9.0 | -2.5 |
| 7 | 5.0 | 18.0 | 13.0 |
| 8 | 3.5 | 42.0 | 38.5 |
| 9 | 7.5 | 7.4 | -0.1 |
| 10 | 5.8 | 6.0 | 0.2 |
| 11 | 4.7 | 25.0 | 20.3 |
| 12 | 8.0 | 12.0 | 4.0 |
| 13 | 7.0 | 52.0 | 45.0 |
| 14 | 17.0 | 20.0 | 3.0 |
| 15 | 8.8 | 16.0 | 7.2 |
| 16 | 17.0 | 15.0 | -2.0 |
| 17 | 15.0 | 11.5 | -3.5 |
| 18 | 4.4 | 2.5 | -1.9 |
| 19 | 2.0 | 2.1 | 0.1 |
| Mean | | | 7.7 |
| st. dev. | | | 13.519 |
- What would students conclude just from looking at these
data?
- I would conclude that since most of the patients had higher endorphin
levels just before surgery, the body must be increasing its production
of endorphins in
response to stress.
- But perhaps this is just a fluke.
- We could create a model of what we would expect if the null were true.
- If the null were true, we would expect that the probability of a score
going up would equal the probability of a score going down = p =
.50.
- Moreover, the magnitude of the positive scores should equal the
magnitude of the negative scores, on average.
- How could we test this hypothesis?
- Start with what would happen if H0 were true
- The probability of the 12 hr score being higher than the 10 min
score would be .50, and vice versa
- That means that about half of the difference scores would be
positive and half negative.
- That means that, on average, the mean difference score would be 0.
- We can compare our mean difference score with 0.
- Then we could figure out what the distribution of means of difference
scores would look like, and compare our obtained mean difference to that.
- There are several ways to do this, and they all lead to slightly
different tests.
- 1. We could assume that we were sampling from a normal
distribution of difference scores with a mean of 0 and a standard
deviation = ???
- 2. We could assume that we were sampling from a normal
distribution of difference scores with a mean of 0 and a standard
deviation estimated by the standard deviation of our sample (13.519).
- 3. We could do something like we did on Thursday based upon the
assumptions in #2
- In other words, draw 19 scores from a normal
distribution of mean 0 and sd = 13.519
- Calculate their mean.
- Repeat this process 1000 times and plot the result.
- Compare our mean to the means we get when H0 true
- Reject or retain the null.
- This is not hard to do, but it is awkward in SPSS, so I didn't
do it.
- 4. We could use a formula to tell us exactly what we would
get if we did what I just described.
- That would calculate a statistic that measures the distance
between what we found (mean = 7.7) and what we would find if H0
were true, and express it as a function of the st. dev.
- This is what a t test actually does.
- Note: this says that the t test is really just a
formulaic way of finding out what would happen if we drew all
those samples.
- When we do it this way, the probability of the data given the
null = .023
- 5. There is a problem with #3 and #4, in that they have us draw
our samples from some normal population. But who said that the
population of difference scores under the null would be normal?
- Perhaps it is logarithmic, or exponential, or something else.
- 6. An Alternative
- If the null were true, the 12 hour score is just as likely to
be greater than the 10 min as it is to be less.
- Therefore, under H0 the difference score is
just as likely to be + as -
- We could model this by taking our difference scores and
randomly assigning + and -
- This would give a set of difference scores, and a mean
difference, that is just as likely as the set we got if
the null is true.
- We could repeat this a very large number of times.
- Then we could compare our obtained mean difference against
this.
- This has the advantage that we are not assuming that our
distribution of differences is normally distributed.
- As you can see below, it is pretty hard to argue that
the differences are normally distributed.

- What I have done is to draw many samples (2000) where I let
the sign of the difference be chosen randomly.
- I could have simply plotted the means of the differences, but
I chose to plot something which is a function of that. I divided
each mean difference by its standard error, which is a function
of the standard deviation of the differences.
My results follow:

These results show the t values (my statistic) that we would expect if
the null were true. They also show the location of the obtained statistic ( =
3.354) and the probability of being more extreme than that ( = .003).
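The random-sign procedure described in #6 can be sketched as follows. This is a minimal illustration and not the program that produced the results above: the function names, the seed, and the choice to count both tails are assumptions, so its output need not match the figures reported above. The difference scores are the ones in the data table.

```python
import random
from math import sqrt
from statistics import mean, stdev

# Difference scores (10 min minus 12 hrs) for the 19 patients.
diffs = [-3.5, 7.5, 5.5, 6.0, 9.5, -2.5, 13.0, 38.5, -0.1, 0.2,
         20.3, 4.0, 45.0, 3.0, 7.2, -2.0, -3.5, -1.9, 0.1]

def t_stat(xs):
    """Mean difference divided by its standard error."""
    return mean(xs) / (stdev(xs) / sqrt(len(xs)))

def sign_flip_test(xs, n_samples=2000, seed=1):
    """Two-tailed p from randomly reassigning + and - to each difference."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    observed = t_stat(xs)
    magnitudes = [abs(x) for x in xs]
    extreme = 0
    for _ in range(n_samples):
        # Under H0 each difference is as likely to be + as -.
        resample = [m if rng.random() < 0.5 else -m for m in magnitudes]
        if abs(t_stat(resample)) >= abs(observed):
            extreme += 1
    return observed, extreme / n_samples

observed, p = sign_flip_test(diffs)
print(f"obtained statistic = {observed:.3f}, p (two-tailed) = {p:.4f}")
```

Note that nothing in this sketch assumes the differences come from a normal distribution; the only model is that the sign of each difference is arbitrary under the null.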
Conclusions from this example
- We have to set up a model that reflects the null hypothesis.
- There is not just one legitimate model
- Possible models
- Data were drawn from a normal distribution
- Data were drawn from some other specified distribution
- Data were drawn from a distribution that looks like ours except for
sign
- Each of these leads to some sort of (at least imaginary) sampling study
- The t test that we will discuss in about 2 weeks is based on the
first model
- except that we do things by formula rather than by actual sampling.
- An important field of statistics (Resampling statistics) is based on the
third model.
- There is no "right" approach.
- A lot of the early statisticians, and many of the current ones,
thought of the first model as an imperfect way of estimating what we
would find under the last model.
- They justified traditional parametric tests on the grounds that they
did a good job of approximating the third model
- The third model has only become practical once we got computers that
could draw thousands of samples almost instantly.
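For comparison, the first model (data drawn from a normal distribution) can also be run as an actual sampling study rather than by formula. The sketch below assumes a normal population with mean 0 and standard deviation 13.519, draws samples of 19, and compares each sample's t to the obtained one. The function name and seed are illustrative, and Python's `random.gauss` stands in for whatever generator one might prefer.

```python
import random
from math import sqrt
from statistics import mean, stdev

n, obs_mean, obs_sd = 19, 7.7, 13.519

# The obtained statistic: mean difference divided by its standard error.
t_obs = obs_mean / (obs_sd / sqrt(n))

def normal_model_p(n_samples=1000, seed=1):
    """Two-tailed p from samples of n scores drawn from Normal(0, obs_sd)."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    extreme = 0
    for _ in range(n_samples):
        sample = [rng.gauss(0, obs_sd) for _ in range(n)]
        t = mean(sample) / (stdev(sample) / sqrt(n))
        if abs(t) >= abs(t_obs):
            extreme += 1
    return extreme / n_samples

print(f"obtained t = {t_obs:.3f}")
print(f"p (two-tailed, by sampling) = {normal_model_p():.3f}")
```

With enough samples the proportion settles near the value the t test gives by formula, which is the sense in which the formula replaces the sampling.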
Terminology