

Multiple Regression #3

4/2/2002

Announcements

 

Multiple Regression Cont.

I'm going to cover a number of miscellaneous topics, each of which is important. I have moved over the material in mediation from last Tuesday, because that is an important topic. Then I need to talk about suppressor variables, because I am asked about them several times every year by graduate students. I want people to understand what they are and have a source to look at when the issue comes up. Then I talk about power, and the most important thing there is the question of "how many subjects/variable"--which isn't even the right question. The material that I have here is far more extensive than I could possibly cover in class. The idea is to have something as a future reference. Don't expect that I will go over it all.

Moderating and Mediating Relationships

This has become one of the "in" topics over the past 10 years, and it is critical that psychologists understand it. It applies to both experimental and clinical students.

The classic paper here is Baron, R.M. & Kenny, D. A. (1986) The moderator-mediator distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173-1182. This is a "must read" for anyone doing stuff with moderating or mediating relationships. You can also find at least one good page on this by searching on the web.

 

Moderating relationships

I discussed this last week, and have it here only because I want to put the two topics together.

A moderating relationship can be thought of as an interaction. It occurs when the relationship between variables A and B depends on the level of C. I gave the following example, using Sex as variable C. There is no reason at all why C needs to be a dichotomy; in practice it is usually a continuous variable.

 

To look for a moderating relationship you form a new variable, which is the product of the two predictors. 

For example, in the above we want to predict SAWBS (S) from Depression (D) and Gender (G). So create a variable (DG) = D × G. Then solve for the model

S = b1D + b2G + b3DG + b0

If the DG term is significant, we have a moderating relationship.

I suggested that you should always "center" your main effect variables before you multiply, and use the centered variables, together with their product, in the model.

You center a variable by subtracting its mean from all observations. 

The reason for centering is to break up the high correlations between each of the main effect variables and the interaction term.
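As a sketch of the whole procedure (the variable names and simulated data below are mine, not the SAWBS data), centering and forming the product might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
dep = rng.normal(10, 3, n)                    # hypothetical "depression" scores
gender = rng.integers(0, 2, n).astype(float)  # hypothetical 0/1 gender code

# Simulate an outcome whose slope on dep differs by gender (a moderating effect)
y = 2.0 + 0.5 * dep + 1.0 * gender + 0.8 * dep * gender + rng.normal(0, 1, n)

# Center the main-effect variables, then form the product of the CENTERED terms
dep_c = dep - dep.mean()
gen_c = gender - gender.mean()
inter = dep_c * gen_c

# Centering breaks up the collinearity between a main effect and the product
r_raw = np.corrcoef(gender, dep * gender)[0, 1]   # large (around .9 here)
r_cent = np.corrcoef(gen_c, inter)[0, 1]          # near zero

# Fit the moderated model; the interaction coefficient is unchanged by centering
X = np.column_stack([np.ones(n), dep_c, gen_c, inter])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(round(b[3], 2))   # recovers roughly the 0.8 used to generate the data
```

A significance test on b[3] (e.g., its t statistic from a full regression printout) is the test for moderation.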

 

Mediating Relationships

A mediating relationship is quite a different thing, although "moderating" and "mediating" are often confused. A mediating relationship is one in which the path relating A to C is mediated by a third variable (B).

We all know that older drivers, up to a point, are safer than younger drivers. But I'm sure that we don't think that the aging (some would say deterioration) of the body, or the mere passage of time, somehow leads to safer driving. What happens, as all right thinking people will agree, is that age leads to wisdom, and wisdom leads to safer drivers. Hence "wisdom" is the mediating variable that explains the correlation between age and safe driving. (You can guess that this example was dreamed up by an older (more mature) driver.)

I'm going to use the example from Esther Leerkes' work, which is also the example in the text. I use it because it is such a nice clear one.

Leerkes and Crockenberg (1999) were studying the relationship between how a new mother was raised by her own mother 20+ years before (maternal care) and the new mother's level of self-efficacy as a mother. The idea is that if your mother showed high levels of maternal care toward you, you would feel more confident of your ability to mother your own child.

Indeed, the correlation between Maternal Care and Self-Efficacy was .272, which is significant at p < .01.

But Leerkes expected that this relationship was mediated by self-esteem, such that if you received good maternal care, you will have good self-esteem, and good self-esteem will, in turn, lead you to have high self-efficacy.

This is illustrated in the figure.

 

Next we will run three regressions.

1.  Predict Self-Efficacy from Maternal Care

 

2. Predict Self-Esteem from Maternal Care; then Self-Efficacy from Self-Esteem.

 

3. Now predict Self-Efficacy from both Maternal Care and Self-Esteem. (I am also repeating the path from Maternal Care to Self-Esteem.)

 

Notice that the path from Maternal Care to Self-Efficacy is no longer significant. When we also use Self-Esteem as a predictor, that carries the weight of the regression and we no longer need Maternal Care.

 

Baron and Kenny (1986) laid out four conditions for mediation. First, maternal care has to predict self-efficacy. Second, maternal care has to predict self-esteem. Third, self-esteem has to predict self-efficacy. Finally, the path between maternal care and self-efficacy has to decrease (preferably to non-significance, though it usually doesn't fall that far) when self-esteem is added.
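The regression steps can be sketched on simulated data (the data-generating model below is invented purely to illustrate full mediation; it is not the Leerkes data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
care = rng.normal(size=n)                      # "maternal care" (simulated)
esteem = 0.5 * care + rng.normal(size=n)       # mediator depends on care
efficacy = 0.5 * esteem + rng.normal(size=n)   # outcome depends only on the mediator

def slopes(y, *xs):
    """OLS slopes (intercept dropped) of y on the given predictors."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

c, = slopes(efficacy, care)             # step 1: care -> efficacy (total effect, ~ .25)
a, = slopes(esteem, care)               # step 2: care -> esteem (~ .5)
cp, b = slopes(efficacy, care, esteem)  # step 3: both predictors together
# With full mediation, cp (the direct path) drops toward zero while b stays large
print(round(c, 2), round(a, 2), round(cp, 2), round(b, 2))
```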

We see that all of this happened. But how do we know if the drop in the maternal care --> self-efficacy path was significant? We don't, really, but we have a different test. We ask if the maternal care --> self-esteem --> self-efficacy path is significant. Baron and Kenny give a test for this, although they laid it out better elsewhere.

The coefficient for the maternal care --> self-esteem --> self-efficacy path is equal to the product of the two betas, which is .403 × .380 = 0.153. We also know that the standard error of this (combined) path is given by the Sobel formula: SE = sqrt(b1²s2² + b2²s1²), where b1 and b2 are the two path coefficients and s1 and s2 are their standard errors. (The standard errors are taken from the printout, and given in the following table.)

 

Now that we have the path and its standard error, we can calculate a t statistic: t = (b1 × b2)/SE.

This is a t on N - 3 df.

This is clearly significant, so we can conclude that there is a mediating path running through self-esteem.
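A minimal Sobel calculation, using the two betas from the example but with hypothetical standard errors (the actual SEs come from the printout, which is not reproduced here):

```python
import math

# Sobel test for the indirect (mediated) path.  The two path coefficients are
# from the example (.403 and .380); the standard errors are HYPOTHETICAL
# placeholders standing in for the values on the printout.
b1, s1 = 0.403, 0.095   # maternal care -> self-esteem
b2, s2 = 0.380, 0.100   # self-esteem -> self-efficacy

indirect = b1 * b2                                  # .403 * .380 = 0.153
se = math.sqrt(b1**2 * s2**2 + b2**2 * s1**2)       # Sobel standard error
t = indirect / se                                   # compare to t on N - 3 df
print(round(indirect, 3), round(t, 2))
```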

 

Suppressor Variables

This is a topic that won't go away. I have a number of people ask me about it each year, and it is a hard thing to jump back into when you are walking down the hall on the way to get coffee.

Almost everything that I will say here is based in some way on Cohen and Cohen (1983), pages 84-91. They do the best job I know of working through the issues, and I am shamelessly copying some of their ideas.

Classic example:

I want to give an example so people know what the issue is, but then I want everyone to forget the example for a few minutes, and be open to related, but different, situations.

Suppose that both X1 and X2 are positively correlated with Y. That means that if either of those variables increases, we expect to see Y increase. But suppose that the regression equation comes out as

Y = 1.3X1 - 2.4X2 + 12.78

In this equation we see that the prediction for Y actually decreases as X2 increases, which is counter to what we would expect. This is one of the things that we mean by "suppression." Why would it happen?

Background

We will start out with some things that don't have specifically to do with suppression. Assume that the variables are scored such that the predictors each correlate positively with Y. That is not a real restriction. It is one of those "Assume without loss of generality" things that mathematicians like.

a.  Complete independence: R2Y.12 = 0

 

b.  Partial independence: R2Y.12 = 0 but r12 ≠ 0

c.  Partial independence: r12 = 0, rY2 = 0, rY1 ≠ 0

d.  Partial independence again: both rY1 and rY2 ≠ 0, but r12 = 0

There is nothing particularly difficult in the above examples.

e. Normal situation, redundancy: no simple correlation = 0

Each semi-partial correlation, and the corresponding beta, will be less than the simple correlation between Xi and Y. This is because the variables share variance and influence on Y.

f.  Classical suppression:  rY2 = 0

Here the presence of X2 will increase the multiple correlation, even though it is not correlated with Y. (You can't visualize the multiple R2 on these ballantines.) What happens is that X2 suppresses some of what would otherwise be error variance in X1.

Cohen gives the following, because rY2 = 0:

R2Y.12 = r2Y1/(1 - r212)

But because r212 must be greater than 0, the denominator will be less than 1.0. That means that r2Y.1 must be less than R2Y.12. In other words, even though X2 is not correlated with Y, having it in the equation raises the R2 from what it would have been with just X1.

The general idea is that there is some kind of noise (error) in X1 that is not correlated with Y, but is correlated with X2. By including X2 we suppress (account for) this noise, and leave X1 as an improved predictor of Y.
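A quick simulation of classical suppression (all names are mine): X2 is uncorrelated with Y in the population, yet adding it raises R2 because it soaks up the noise in X1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
signal = rng.normal(size=n)
noise = rng.normal(size=n)

y = signal              # Y depends only on the signal
x1 = signal + noise     # X1 is the signal contaminated by noise
x2 = noise              # X2 measures only the noise, so rY2 = 0 in the population

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

r2_x1 = r_squared(x1[:, None], y)                   # about .50
r2_both = r_squared(np.column_stack([x1, x2]), y)   # essentially 1.0
print(round(r2_x1, 2), round(r2_both, 2))
```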

Cohen's classic example (maybe it was Darlington's) is of a speeded test of history. We want to predict knowledge of historical facts. We give a test which supposedly tests that. But some people will do badly just because they read very slowly, and don't get through the exam. Others read very quickly, and do all of the questions. We don't think that reading speed has anything to do with how much history you know, but it does affect your score. We want to "adjust" scores for reading speed, which is like asking for the correlation between true historical knowledge and test score, controlling for reading speed.

g.  Net suppression: all rs are positive

I have trouble understanding Cohen's point with this one, though I have in the past. This is the traditional example where X2 is positively correlated with Y, but has a negative regression coefficient. The primary purpose of X2 is to suppress the error variance in X1, rather than doing much about Y. As you can see in this example, X2 has much more in common with the error variance in X1 than it does with the "good" variance in Y. This is like the previous example, except that there is some Y, X2 overlap.

 

h.  Cooperative suppression: r12 < 0

This is the case where the two predictors are negatively correlated with each other, but both are positively correlated with Y.

This is a case where each variable will account for more of the variance in Y when it is in an equation with the other than it will when it is presented alone.

Cohen and Cohen point out that this is the ideal for which we often search, usually in vain. The best way to get a high multiple R is to find two variables that are positively correlated with Y, but are negatively correlated with each other.

Cohen and Cohen suggest that one indication of suppression is a standardized regression coefficient (bi) that falls outside the interval 0 < bi < rYi. (Notice that I am talking about the standardized coefficient. Notice also that we need to reverse the order of that inequality for negative correlations.)

To paraphrase Cohen and Cohen (1983), if Xi has a (near) zero correlation with Y, we are talking about possible classical suppression. If its bi is opposite in sign to its correlation with Y, we are looking at net suppression. And if its bi exceeds rYi and is of the same sign, we are looking at cooperative suppression.

It is very difficult to find significant suppression, but that doesn't stop faculty and students from coming to me for explanations of unusual results. The problem is something like the next problem that I'll deal with. For both suppression and power, the statisticians tell us that certain things are unlikely to happen, but they have a habit of happening anyway.

Power and Sample Size

These two topics obviously belong together, in that you can't talk about power without talking about sample size. On the other hand, much has been said about sample size without putting it in the context of power. For example, there is a long-standing, and most likely incorrect, rule of thumb that says that you need at least 10 observations per variable. This appears to be saying that your multiple regression will be invalid if you don't meet that rule, but really it is a rule about power. It is really saying that you don't have much of a chance of finding a significant relationship unless your n is that large, which is quite different from saying that your regression won't be legitimate.

First I'm going to cover power from the traditional approach--effect size, etc.

Power Calculation

We need a measure of effect size. This is true of any calculation of power.

This effect size will be called f2

First consider the situation where we want the power for a significant R2

We want the probability that our overall multiple R, with 4 predictors and 40 subjects, will be significant if the true R2 = .35.

Define f2 = R2/(1-R2)

let p = number of predictor variables.

Let v = N – p – 1 = number of df for error

define l = f2(p + v + 1)

Look l up in the tables Cohen gives.

For our example,

R = .59; R2 = .35

f2 = .35/(1 – .35) = .35/.65 = .54

p = 4

v = 40 – 4 – 1 = 35

l = f2(p + v + 1) = .54(4 + 35 + 1) = .54 × 40 = 21.6

Round l down to 20 to be conservative.

Round v down to 20 (the nearest tabled value) to be conservative.
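The calculation above, in code:

```python
# Cohen's effect size and noncentrality for the worked example:
# 4 predictors, N = 40, anticipated R2 = .35.
R2 = 0.35
p = 4
N = 40
v = N - p - 1           # 35 df for error

f2 = R2 / (1 - R2)      # 0.538..., i.e. .54 after rounding
l = f2 * (p + v + 1)    # p + v + 1 = N, so l = f2 * 40 = 21.5 (21.6 if f2 is rounded first)
print(round(f2, 2), round(l, 1))
```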

A copy of one of Cohen’s tables is on the following transparency. (note: I have copied this from Cohen for classroom use. It is a copyrighted table.)

Table 9.3.2

Power of the F Test, u = 1 to 9

a = .05

(u = number of predictors; v = df for error; table entries are power × 100)

                                     l
  u     v      2    4    6    8   10   12   14   16   18   20

  1    20     27   48   64   77   85   91   95   97   98   99
       60     29   50   67   79   88   92   96   98   99   99
      120     29   51   68   80   88   93   96   98   99   99
        ∞     29   52   69   81   89   93   96   98   99   99

  2    20     20   36   52   65   75   83   88   92   95   97
       60     22   40   56   69   79   87   91   95   97   98
      120     22   41   57   71   80   87   92   95   97   98
        ∞     23   42   58   72   82   88   93   96   97   99

  3    20     17   30   44   56   67   75   82   87   91   94
       60     19   34   49   62   73   81   87   92   95   97
      120     19   35   50   64   75   83   89   93   95   97
        ∞     19   36   52   65   76   84   90   93   96   98

  4    20     15   26   38   49   60   69   76   83   87   91
       60     17   30   44   57   68   77   83   89   92   95
      120     17   31   46   58   70   78   85   90   93   96
        ∞     17   32   47   60   72   80   87   91   94   96

  5    20     13   23   34   44   54   63   71   78   83   87
       60     15   27   40   52   63   72   80   86   90   93
      120     16   29   41   54   65   75   82   87   91   94
        ∞     16   29   43   56   68   77   84   89   93   95

  6    20     12   21   30   40   50   59   66   73   79   84
       60     14   25   37   48   59   68   76   83   87   91
      120     14   27   39   50   62   71   79   85   89   93
        ∞     15   27   40   53   64   74   81   87   91   94

  7    20     11   19   28   37   46   54   62   69   75   80
       60     13   24   35   45   56   65   73   80   85   89
      120     13   25   37   47   59   68   76   82   87   91
        ∞     14   25   38   50   61   71   79   85   89   93

  8    20     10   18   26   34   42   50   58   65   71   76
       60     12   23   33   43   52   62   70   77   83   87
      120     12   24   35   45   55   65   73   80   85   89
        ∞     13   24   36   48   59   68   77   83   88   92

  9    20     10   17   24   32   39   47   54   61   68   73
       60     11   21   31   41   50   58   67   74   80   85
      120     11   22   33   44   53   62   71   78   83   88
        ∞     13   23   34   45   56   66   74   81   86   90

 

(Modified slightly from Cohen, 1988, for class purposes only.)

 

So power for our example = .91. We have a very high probability of finding a significant correlation if the parameters are as I have specified them.

An alternative approach

But I said above that f2 = R2/(1-R2). Therefore R2 = f2/(1+f2)

This translates to small, medium, and large effects of

              f2      R2
  Small      .02     .02
  Medium     .15     .13
  Large      .35     .26

Remember that I am showing R2 in the right column, not R.
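The conversion behind the right-hand column:

```python
def f2_to_r2(f2):
    """Invert f2 = R2 / (1 - R2) to get R2 = f2 / (1 + f2)."""
    return f2 / (1 + f2)

for label, f2 in [("small", 0.02), ("medium", 0.15), ("large", 0.35)]:
    print(label, round(f2_to_r2(f2), 2))
```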

Maxwell (2000) raised some interesting questions about these values, and whether they are the correct ones.

My problem with what I have just presented is that it answers a question that I am rarely asked. By that I mean that we were computing the sample size, or power, for finding a significant R2. But we usually assume that the overall R2 will be significant. What we are much more likely to care about is what difference a new variable makes on top of other variables that are already there.

 

Squared Semi-partial Correlations

Suppose that we want to examine the question we had earlier, where we are looking at what Agg3 can contribute to the relationship between TV23 and Agg13.

This, then, is the squared semi-partial correlation.

We can convert the squared semi-partial to l in a way analogous to what we did earlier:

l = [(Rf2 – Rr2)/(1 – Rf2)](p + v + 1)

Notice that this is roughly the same formula for l, except that the numerator is the difference in R2 rather than R2 itself, and the denominator is the unexplained variance from the full model.

The analysis I ran earlier showed

 

So we have an increase in R2 of .104. Suppose that I had been smart enough to predict that.

Then l = (.104/.888) × (2 + 424 + 1) = .117 × 427 = 50.01

The power for this is off the chart, because I have such a huge sample size.

But suppose that my sample size was 50.

Then l = .117 × (2 + 47 + 1) = 5.85, which I’ll round to 6

power = approx. .55.
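The same arithmetic in code (the full and reduced R2 values of .112 and .008 are implied by the .104 gain and the .888 unexplained variance above):

```python
def lam(r2_full, r2_reduced, n):
    """Noncentrality l for testing the gain of added predictor(s).
    f2 is the gain over the unexplained variance of the FULL model,
    and p + v + 1 = N."""
    f2 = (r2_full - r2_reduced) / (1 - r2_full)
    return f2 * n

print(round(lam(0.112, 0.008, 427), 1))   # the full sample: l is about 50
print(round(lam(0.112, 0.008, 50), 2))    # N = 50: l is about 5.9
```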

 

Remember that with the simpler case we really had what can be written as

l = [(R2 – 0)/(1 – R2)](p + v + 1)

So the only real difference is that this one is looking at the increase above something else.

The problem we have is that Cohen expressed f2 in terms of either r2 or the squared partial correlation (not the semi-partial). We can’t just ask about a medium increase, which we will call f2, but must ask about an increase of .13 relative to where we were with one predictor. In other words, we need R2 as well as the gain. I come back to this below.

That makes things messy unless we are willing to guess at both correlations.

 

Power from a different perspective.

When we speak of power, we are speaking of the probability of rejecting the null hypothesis when it is false.

We should always calculate the power of our experiment before we begin, so as to maximize our chances of finding what we are looking for.

Are there any rules of thumb for estimating sample sizes when we don’t have power calculations?

Why should we need a rule of thumb if we have a table?

The simple answer is that we shouldn’t!!

What are these rules of thumb all about?

They are rules of thumb about POWER. They are not rules of thumb about something else.

So, if you run a multiple regression with a small sample size, you are foolish.

BUT, if an editor sends you a letter rejecting your paper because the significant result that you found was based on too small a sample, he or she is foolish.

Green’s analysis:

Green, S.B. (1991) How many subjects does it take to do a regression analysis? Multivariate Behavioral Research, 26, 499-510.

Evaluated numerous rules of thumb in terms of sample sizes needed for adequate power.

He found that most of them were too conservative.

He concluded that if you must have some rules, perhaps

N > 50 + 8p is OK for the overall multiple regression

N > 104 + p is OK for partial correlation of one variable with another, holding all other predictors constant. Notice that we are talking about partial, and not semi-partial, coefficients. It makes a difference, as I show below.

These are the rough sample sizes needed for a medium effect with power = .80.
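Green's two rules as one-liners (the function names are mine):

```python
def green_overall(p):
    """Green (1991): minimum N for testing the overall multiple R."""
    return 50 + 8 * p

def green_partial(p):
    """Green (1991): minimum N for testing one predictor's partial correlation."""
    return 104 + p

print(green_overall(2), green_partial(2))   # 66 106
```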

To demonstrate this, assume that we have p = 2 and N = 106. Then N = 104 + p.

Here it gets complicated.

The values for small, medium, and large are for either R2 or for the squared partial correlation, not the squared semi-partial.

Suppose r2 with one predictor is .40. Thus we leave 60% unaccounted for. If we add a predictor and go to R2 = .50, then the squared semi-partial is .50 - .40 = .10, but the squared partial is (.50 - .40)/.60 = .167

Suppose r2 with one predictor is .70. Thus we leave 30% unaccounted for. If we add a predictor and go to R2 = .80, then the squared semi-partial is .80 - .70 = .10, but the squared partial is (.80 - .70)/.30 = .333.
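The two worked cases, as a small helper (names are mine):

```python
def gains(r2_reduced, r2_full):
    """Return (squared semi-partial, squared partial) for an added predictor."""
    semi = r2_full - r2_reduced
    partial = semi / (1 - r2_reduced)
    return semi, partial

for reduced, full in [(0.40, 0.50), (0.70, 0.80)]:
    semi, part = gains(reduced, full)
    print(round(semi, 3), round(part, 3))   # same semi-partial, different partials
```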

So we can’t really talk about the power for a significant semi-partial without knowing what the correlation without that variable was.

So let’s assume that with one predictor r2 = .40, and that with a second predictor R2 goes to .478.

Then the squared partial = (.478 - .40)/(1-.40) = .078/.60 = .13

But here the squared semi-partial is .078.

Then using Cohen's formulae, 

 f2 = .078/(1-.478) = .15

l = f2(p + v + 1) = f2(N) = .15(106) = 13.78

But this gives a power of approx. .90, rather than the .80 that Green was aiming for.

Apparently his rule of thumb falls apart when there are few predictors.

If we had 5 predictors and expected an increase from .40 to .478, then power would be about .80.

Maxwell (2000) Psychological Methods, wrote an extensive paper on this issue. It is an excellent paper, though a rather discouraging one. Maxwell calculates that we need very substantial sample sizes to have an acceptable degree of power (power = .80).

Maxwell turns one of Cohen's equations around, does a quickie calculation, and comes up with

f2 = (Rf2 – Rr2)/(1 – Rf2)

where Rf2 represents the full model, and Rr2 represents the reduced model. The numerator is just the squared semi-partial.

He then calculates that for power = .80 with a reasonable number of degrees of freedom, we can let l = 7.85. Since l = f2(p + v + 1) = f2N, we can write

N = 7.85/f2

as our estimate of sample size.
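Putting the fixed l together with l = f2 × N gives a rough sample-size sketch (the function name is mine, and this is my reconstruction of the shortcut, not Maxwell's own code):

```python
def needed_n(r2_full, r2_reduced, l=7.85):
    """Approximate N for power .80 to detect the gain of an added predictor,
    solving l = f2 * N with l fixed at 7.85."""
    f2 = (r2_full - r2_reduced) / (1 - r2_full)
    return l / f2

print(round(needed_n(0.478, 0.40)))   # the earlier .40 -> .478 example
```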

 

From here on, Maxwell examined a whole range of approaches to sample size. His results are very discouraging, in the sense that he projects very large required sample sizes.

I don't doubt Maxwell's arithmetic, but I am left wondering about his conclusions. I have worked with a lot of regression problems over the years, and significant results are much more common than his numbers would show. And I mean "meaningful" significant results. The significant predictors are the ones you would expect, not some weird variable that popped up.

Maxwell partially addresses this by showing that even with only medium-level correlations, and p = 5 and N = 100, the probability of finding at least one significant predictor is .84. It doesn't completely answer my problem, because the predictors that I see being significant are usually the ones that theory would expect--at least my post hoc theory :-).

 

The basic rules of thumb are conflicting and don't really work. The idea of "10 subjects per predictor variable" seems like a bad idea. Green's N = 50 + 8p for the overall multiple R, and   N = 104 + p for partial R are probably too simplistic, but they're better than nothing. (Maxwell would probably have a fit if he saw that.)

Stepwise Regression

I can summarize my remarks here as "stepwise is unwise." End of discussion.

 

Regression diagnostics

I can't possibly have time to cover this, so I won't try.

Review the material in the book. It is pretty extensive.

 

Last revised: 04/02/2002

 
