I am combining chapters 3 and 4, because they both deal with distributions.
The part of Chapter 4 that deals with hypothesis testing will come next.
One of the most common assumptions in statistics is that the data we are looking at (or the errors associated with each data point) are normally distributed. I will have much to say about this on Friday when I do Faculty Seminar, but for now I will be content with describing and using the normal distribution.
I'll start with a set of data that I collected on myself, and that I refer to in the text. I will then move on to another real dataset assembled by someone in UVM's Political Science department. The first study was a standard reaction time task, in which 1, 3, or 5 numbers were presented very briefly, then a single number was presented, and the task was to hit a button depending on whether the new number had been presented in the first set. The original experiment was done by Sternberg in 1966.
Explain the hypotheses and the study in more detail.
The data are available on the disk for the course as an ASCII file named Tab2.1. There are six variables (I think) named trial, RxTime, Nstim, Yesno, Cellcnt, Half. (The variable "Half" may not be there.) An SPSS sav file is available at RxTime.sav.
What do students think that the complete data set, ignoring levels of the independent variable, would look like?
I might suspect multimodal and skewed. Why?
There might be a bimodal distribution due to yes/no
There might be a trimodal distribution due to NStim.
If the two combine, I don't think that we would be able to pick that up here.
Many mixtures come out looking normal, though a bit flatter on the top.
A histogram for the reaction time variable for the complete data set using SPSS follows. A normal distribution has been superimposed.
Well, it isn't multimodal, but it is skewed.
I might think that this is because we have combined data across the three NStim conditions. But when I break them out separately I get:
These are hard to read because I crammed them on one line, but the pattern is clear--it isn't just that we are combining across NStim, although for the 3 and 5 stim cases the data are less skewed.
A common approach to skewed data, especially something like reaction times, is to transform them. For example: X = ln(rxtime) or X = sqrt(rxtime).
Ask them why taking a square root (or a log) might have this effect.
If the data are negative or zero, we usually add a constant before transforming.
That is not a problem here, because we couldn't get a negative or zero rxtime.
In SPSS
First do this within SPSS with an example that is badly positively skewed
e.g. 1 4 4 5 5 5 6 7 11 15 25 49 150
Compute lnrxtime = ln(rxtime).
Compute sqrtrxtm = sqrt(rxtime).
This gives us two new variables.
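The same two transformations can be sketched outside SPSS. This is a minimal Python illustration using the toy values listed above; the `skewness` helper is my own name, and it uses the simple moment formula, so its values will not match the small-sample-adjusted statistic that SPSS prints.

```python
import math

# Toy values from the example above (badly positively skewed).
rxtime = [1, 4, 4, 5, 5, 5, 6, 7, 11, 15, 25, 49, 150]

def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (0 for a symmetric distribution).

    Assumption: this is the plain moment formula, not the adjusted
    statistic that SPSS reports.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Same variable names as the SPSS Compute statements.
lnrxtime = [math.log(v) for v in rxtime]
sqrtrxtm = [math.sqrt(v) for v in rxtime]

skew_raw = skewness(rxtime)
skew_sqrt = skewness(sqrtrxtm)
skew_ln = skewness(lnrxtime)

print(f"raw: {skew_raw:.2f}  sqrt: {skew_sqrt:.2f}  ln: {skew_ln:.2f}")
```

Both transformations pull in the long right tail, and for these values the log pulls it in more than the square root does.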
Plotting the two new variables:
Sqrt(rxtime)
ln(rxtime)
I like this one better, so we will work with it. (There really isn't much to choose between.)
What do we know about normal distributions?
- First, we can see a truly normal distribution plotted in the above plots.
- We know that the distribution is symmetric.
- This obtained one is reasonably symmetric, but with a slight bump on the right.
- Half the scores should be above the mean.
- mean = 4.08 for our ln reaction time data.
- In fact, exactly 50% are at 4.08 or below, so that's good.
- We know something about other points, but first we need to define z scores.
Taking a raw score of 4.00, just as an example,
This means that a score of 4 is about 0.39 standard deviations below the mean.
We have tables of the normal distribution which tell us what percentage of the scores should be more (or less) than z standard deviations from the mean (i.e., have a z score greater than, or less than, .39).
The tables show that .348 should be in the smaller portion and .652 should be in the larger portion. Explain.
This means that if these data were normal, exactly 34.8% of the scores should be at or below 4.00. We actually have 34.3%--we're doing very well.
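The table lookup can be checked with a short Python sketch. The mean of 4.08 comes from these notes; the standard deviation of 0.205 is an assumption, back-solved from the reported z of about -0.39, because the notes do not state it. `z_score` is just an illustrative helper name.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

MEAN_LN = 4.08   # mean of ln reaction time, from the notes
SD_LN = 0.205    # ASSUMPTION: back-solved from z of about -0.39; not stated in the notes

z = z_score(4.00, MEAN_LN, SD_LN)

# Area of the standard normal distribution at or below z (the "smaller portion"),
# and the remaining area above z (the "larger portion").
smaller = NormalDist().cdf(z)
larger = 1 - smaller

print(f"z = {z:.2f}, smaller portion = {smaller:.3f}, larger portion = {larger:.3f}")
```

The two areas are what the printed table gives for z = .39; the code simply replaces the lookup.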
Quartiles:
Working things backward, we could ask what z scores, and then what raw scores, should cut off the 25th and 75th percentile.
From the tables, we know that z = ±0.675 cuts off the upper and lower 25%. (In the equation below, ".625" is a typo, and I'll fix it soon.) These correspond to raw scores of
From the printout I know that 26.3% and 78.7% lie below these two cutoffs. That's pretty close to 25% and 75%.
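Working backward from a percentile to a raw score can be sketched the same way. As before, the standard deviation of 0.205 is an assumption back-solved from the z score reported earlier, so the cutoffs printed here are illustrative and are not necessarily the values in my equation.

```python
from statistics import NormalDist

MEAN_LN = 4.08   # mean of ln reaction time, from the notes
SD_LN = 0.205    # ASSUMPTION: back-solved from the reported z; not stated in the notes

# z that leaves 75% of a standard normal distribution below it;
# by symmetry, -z75 leaves 25% below it.
z75 = NormalDist().inv_cdf(0.75)

# Convert back to raw scores: X = mean +/- z * sd.
lower_cut = MEAN_LN - z75 * SD_LN
upper_cut = MEAN_LN + z75 * SD_LN

print(f"z = +/-{z75:.3f}; cutoffs at {lower_cut:.2f} and {upper_cut:.2f}")
```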
We could figure this out for some of the other cuts if we wanted. The following table is from an SPSS Frequencies run, using the Options menu to select the desired statistics.
Notice that there is very good agreement between the results obtained from the z approximation and from the frequency distribution itself. Emphasize that this is an approximation unless the distribution is exactly normal.
There are two measures there (skewness and kurtosis) that should be 0.00 for a normal distribution. But I don't have any sense how far those actual values really are from 0.00. BUT, as a rough rule of thumb, if the skewness or kurtosis statistic is not more than twice as large as its standard error, the data are not reliably different from normal.
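The rule of thumb is easy to sketch in Python. I am assuming the usual textbook formulas for the adjusted skewness statistic and for the standard errors of skewness and kurtosis, which I believe are the ones SPSS reports; the function names are mine, and the data are the toy values from the transformation example, not the reaction times.

```python
import math

def adjusted_skewness(values):
    """Small-sample-adjusted skewness (assumed to match what SPSS prints)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)

def se_skewness(n):
    """Standard error of skewness; depends only on the sample size."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n):
    """Standard error of kurtosis; depends only on the sample size."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

def exceeds_rule_of_thumb(statistic, standard_error):
    """True when the statistic is more than twice its standard error."""
    return abs(statistic) > 2 * standard_error

toy = [1, 4, 4, 5, 5, 5, 6, 7, 11, 15, 25, 49, 150]
skew = adjusted_skewness(toy)
se_skew = se_skewness(len(toy))
flagged = exceeds_rule_of_thumb(skew, se_skew)

print(f"skewness = {skew:.2f}, SE = {se_skew:.3f}, flagged = {flagged}")
```

Because the standard errors depend only on N, the same skewness value can pass the rule in a small sample and fail it in a large one.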
Normal Probability Plots, or Quantile-Quantile (Q-Q) Plots
A common way for statisticians to examine the normality of a variable is to plot a Q-Q plot. I have plotted these below for the original RxTime and for the lnRxTime variables. If these were perfectly normally distributed, they would fall exactly on the straight line.
It is clear that the ln transformation is a major improvement.
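For anyone curious what goes into a Q-Q plot, here is a minimal Python sketch. It pairs each ordered observation with the normal quantile expected at that rank, using plotting positions of (i - 0.5)/n; that choice is an assumption, since packages differ in the exact positions they use. The correlation between the two columns is only a rough index of how straight the plot is, and the data are the toy values from the transformation example.

```python
import math
from statistics import NormalDist

def qq_points(values):
    """Pair each ordered observation with its expected standard normal quantile."""
    n = len(values)
    ordered = sorted(values)
    # ASSUMPTION: plotting positions (i - 0.5) / n; software packages vary.
    expected = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, ordered))

def pearson_r(pairs):
    """Pearson correlation between the two columns of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

toy = [1, 4, 4, 5, 5, 5, 6, 7, 11, 15, 25, 49, 150]
r_raw = pearson_r(qq_points(toy))
r_ln = pearson_r(qq_points([math.log(v) for v in toy]))

print(f"straightness of Q-Q plot: raw r = {r_raw:.2f}, ln r = {r_ln:.2f}")
```

Points from a normal sample fall near a straight line, so r is close to 1; the log-transformed toy values come much closer to that than the raw values do.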
The Minitab News Group Newsletter had a nice article about these. It is not available on the net, but I scanned it in for students in this course. It can be found at NormalPlots.html. Ryan and Joiner, at Minitab, also wrote a discussion of these plots. They said more than I suspect students want to know, but their article can be found at Normal Probability Plots.
Warning: The following may be misleading until we cover better ways of examining results like this.
Deborah Guber in Political Science published an article about 2 years ago that examines the relationship between the per pupil expenditure for public education and test score results on the SAT. She collected data from all 50 states. What I think is interesting about these data from the point of view of what we are doing today is the shape of the distributions. But first, let's look at the relationship she was trying to study. This is an oversimplistic approach (which was her point).
Now let's look at the individual variables.
Obviously, these variables are going to give us trouble if we need to rely on normality. Fortunately, we can do a lot with regression without normality.
One difficulty here is that a transformation is not going to sort out the data, at least for the Combined score. The data really are bimodal, and that is an important feature that I would not really like to eliminate.
Can students explain why the distribution might look like this? There is a good reason.
Exploratory Data Analysis
Going back to reaction time data.
One of the things that I would like to do is to compare the reaction times when subjects see different numbers of standard stimuli (1, 3, or 5).
One good way to do this is with a set of stem-and-leaf displays. I do this in the text for the complete data set, and I'm really kind of lazy about doing it three times more for the individual data sets. I'll leave that to the students.
A second way is to use boxplots. These are discussed in the chapter, though the emphasis there is on plotting for one group. I will do it first for the combined data, and then for the three groups separately.
Combined data--dependent variable is lnrxtime:
Note that the bulk of the distribution looks fine, but there are outliers at the top--especially observation # 284.
Explain:
- median
- ends of box (quartiles)
- whiskers
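- ends of whiskers (I suspected the 5th and 95th percentiles, but SPSS draws each whisker to the most extreme observation within 1.5 box lengths of the box.)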
- outliers
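The pieces of a boxplot listed above can be computed by hand. This is a minimal Python sketch, assuming the common 1.5 x IQR rule for the fences and Python's default quartile method, which may differ slightly from the hinges SPSS uses. It runs on the toy values from the transformation example, not on the reaction times.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Median, quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    ordered = sorted(values)
    # ASSUMPTION: Python's default quartile method; SPSS hinges can differ slightly.
    q1, med, q3 = quantiles(ordered, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "median": med,
        "box": (q1, q3),                       # ends of the box
        "whiskers": (inside[0], inside[-1]),   # most extreme values inside the fences
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }

toy = [1, 4, 4, 5, 5, 5, 6, 7, 11, 15, 25, 49, 150]
summary = boxplot_summary(toy)
print(summary)
```

With positively skewed data like these, the only points beyond the fences are at the top.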
Why don't you expect to find outliers at the bottom?
Grouped data
What do students expect about reaction time if the experimental hypothesis is true?
What is the null hypothesis????
I would expect longer reaction times as the number of standard stimuli increases.
If I used the log of the reaction times instead, I would get
which really isn't all that much different
Note how these data fit with the research hypothesis.
What might we conclude?
Time series data
These data were collected over time. I want to make sure that time didn't play a role.
Why might it play a role?
- Practice
- Fatigue
- Other physical or psychological changes
What would happen if it did play a role?
- Observations would not be independent, as we assume they are.
- The net effect of that is that we wouldn't really have as many observations as we think we do. (At least we wouldn't have N-1 df.)
How to plot
Each cell of the data is ordered by the cell number, which is the order, within a cell, in which the observations were recorded.
We could plot lnrxtime against cell number
There would be 6 points for each cell number--no problem.
Note that there do not appear to be systematic changes over trials. I fit a line to the data, but its rise is very slight. (A test on the slope is not significant.)
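Fitting that line is an ordinary least-squares regression of lnrxtime on cell number. A minimal Python sketch follows; the `fit_line` helper is my own, and the six values are hypothetical, made up only to show the calculation. They are not the course data.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# HYPOTHETICAL values for illustration only; not taken from Tab2.1.
cellcnt = [1, 2, 3, 4, 5, 6]
lnrxtime = [4.10, 4.05, 4.12, 4.02, 4.09, 4.07]

slope, intercept = fit_line(cellcnt, lnrxtime)
print(f"slope = {slope:.4f}, intercept = {intercept:.3f}")
```

A slope near zero means reaction time neither drifts down with practice nor up with fatigue; whether it differs reliably from zero takes a significance test, which this sketch does not do.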
Error Bars
In some fields of psychology, error bars are common. This is especially true in fields like learning and neuroscience.
SPSS won't let me draw the bars the way I would like to, so I'll start with just a bar graph.
Now I'll draw what SPSS calls an error bar graph.
Those bars are at the mean ± 2 standard errors of the mean, which I will define soon. For the moment, it is the variability that we would expect in the mean over repeated replications of the experiment.
Now imagine that you superimposed these two graphs on each other. That would give you a bar graph with some little lines going up and down. That is what most people are looking for.
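The ends of an error bar can be computed directly. This Python sketch assumes the usual estimate of the standard error of the mean, the sample standard deviation divided by the square root of N. It uses the log of the toy values from the transformation example as a stand-in for one group's scores.

```python
import math
from statistics import mean, stdev

def error_bar(values, n_se=2):
    """Return (lower, mean, upper), with bars n_se standard errors from the mean."""
    m = mean(values)
    # ASSUMPTION: standard error of the mean = sample sd / sqrt(N).
    se = stdev(values) / math.sqrt(len(values))
    return m - n_se * se, m, m + n_se * se

# Stand-in for one group's scores: log of the toy values used earlier.
toy_ln = [math.log(v) for v in [1, 4, 4, 5, 5, 5, 6, 7, 11, 15, 25, 49, 150]]

lower, center, upper = error_bar(toy_ln)
print(f"bar from {lower:.2f} to {upper:.2f} around a mean of {center:.2f}")
```

Running it once per NStim group would give the three bars that sit on top of the bar graph.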
Conclusions:
What do the data on reaction time tell us?
- First, they are reasonably well distributed, especially after we transform them.
- There is no particular trend over time, so we don't have to worry about that.
- The scores are higher when there are more stimuli in the comparison set.
- This suggests sequential processing.
- There are higher scores when the stimulus was not in the comparison set. (We didn't show this here.)
- This also suggests sequential processing.
- Any others?
What do the data on per pupil expenditures tell us?
- The data are weirdly distributed.
- The percentage of students taking the SAT plays a surprisingly important role.
For Thursday:
Be sure to read the material on hypothesis testing.
We're going to run a hypothesis test using SPSS, but without covering any of the more traditional hypothesis tests. I like this lab, and I think it illustrates the problem in a very clear way.
Last Revised: 09/10/01