# Regression to the Mean

Regression to the mean confuses many people, not just students. The classic example is from Galton, who predicted that tall parents would, on average, have children who are shorter than they are, while short parents would have children who are taller than they are. People's first problem is that they don't believe it. Once they are forced to see that it is true, their second thought is that over time the population must become more and more homogeneous in height, which we know not to be true.

Regression to the mean is a difficult problem to teach. It isn't hard to show that it is logically true, but it is hard to explain why we aren't all 5'8" tall. This page is a brief attempt to explain both. To do so I will use the data that Galton used, thanks to xxxxx. This data set has two variables. The first is called "midparent," and it is the average height of the two parents. The mother's height was first multiplied by a constant (Galton used 1.08) to adjust for the normal difference in heights between men and women. The second variable is "child," which is the height of their child, again adjusted for gender. Heights were rounded to the nearest half inch, which is why the resulting graph looks grid-like. The heights are plotted below, along with the regression line.

The best-fitting regression line is

PredChild = 46.135 + 0.326*Midparent

Thus if we take a set of parents whose midparent height is 72 inches, we predict that their child will be 46.135 + 0.326*72 = 69.61 inches tall. If we take a set of parents whose midparent height is 62 inches, their child is predicted to be 46.135 + 0.326*62 = 66.35 inches tall. Thus tall parents are predicted to have children who are shorter than they are, and short parents are predicted to have children who are taller than they are.
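The arithmetic above can be checked with a few lines of Python. This is only a sketch: the two coefficients come from the regression line given here, while the function name `predict_child` and the "crossover" calculation are illustrative additions. The crossover is simply the midparent height at which the line predicts a child exactly as tall as the parents; above it the prediction falls short of the parents, and below it the prediction exceeds them.

```python
# Coefficients taken from the regression line in the text.
INTERCEPT = 46.135
SLOPE = 0.326


def predict_child(midparent: float) -> float:
    """Predicted child height (inches) for a given midparent height (inches)."""
    return INTERCEPT + SLOPE * midparent


# Midparent height at which predicted child height equals midparent height:
# solve x = INTERCEPT + SLOPE * x for x.
crossover = INTERCEPT / (1 - SLOPE)

for midparent in (72, 62):
    pred = predict_child(midparent)
    direction = "shorter" if pred < midparent else "taller"
    print(f"midparent {midparent} in -> predicted child {pred:.2f} in ({direction} than parents)")

print(f"crossover height: {crossover:.2f} in")
```

Because the slope is less than 1, every prediction lands closer to the crossover height than the parents were, which is the regression effect in miniature.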

But maybe you don't believe this. Maybe you think that in general the slope should be steeper, even though these data gave a flattish slope. But think of regression with standardized data. Then the intercept will be 0 and the slope (often called beta for standardized data) will equal r, which in this case is 0.46. Here again, a parent who is 2 standard deviations above the mean will have a child who is predicted to be 2*0.46 = 0.92 standard deviations above the mean. The same holds, in the opposite direction, for shorter parents: a parent 2 standard deviations below the mean is predicted to have a child only 0.92 standard deviations below it. And we know that the standardized slope can never be greater than 1.00, because the correlation can never be greater than 1.00.
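If the claim that the standardized slope equals r seems too convenient, it can be verified numerically. The sketch below uses synthetic heights, not Galton's data: the mean, spread, and noise values are made-up assumptions chosen only to look roughly height-like, and the helper names (`standardize`, `slope`, `pearson_r`) are illustrative. It uses only the Python standard library.

```python
import random
import statistics


def standardize(values):
    """Convert values to z-scores (mean 0, standard deviation 1)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation between x and y."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Synthetic data (assumed parameters, NOT Galton's measurements).
rng = random.Random(0)
parent = [rng.gauss(68.0, 1.8) for _ in range(1000)]
child = [46.0 + 0.33 * p + rng.gauss(0.0, 2.2) for p in parent]

r = pearson_r(parent, child)
beta = slope(standardize(parent), standardize(child))

print(f"correlation r          = {r:.4f}")
print(f"standardized slope     = {beta:.4f}")
print(f"predicted z for z = +2 = {2 * beta:.4f}")
```

Whatever values the random draw produces, the two printed numbers agree up to floating-point error, and since r cannot exceed 1 in absolute value, neither can the standardized slope.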