

Logistic Regression

4/9/02

Announcements

Logistic Regression

I am building on the foundation that I hope I laid on Thursday. 

Definition: Logistic regression is a technique for making predictions when the dependent variable is a dichotomy, and the independent variables are continuous and/or discrete.

We are not really restricted to dichotomous dependent variables, because the technique can be modified to handle polytomous logistic regression, where the dependent variable can take on several levels. We have just exhausted my knowledge of the subject, but students can look in Hosmer and Lemeshow.

I am going to use the example from the text, because I want to have something they have seen before.

Alternatives:

Discriminant analysis

This is a more traditional approach, and the student’s advisor may first suggest that route. Here the idea is that we are using one or more independent variables to predict "group membership," but there is no difference between "group membership" and "survivor/non-survivor."

The problem with discriminant analysis is that it requires certain normality assumptions that logistic regression does not require. In addition, the emphasis there is really on putting people in groups, whereas it is easier to look at the underlying structure of the prediction ("what are the important predictors?") when looking at logistic regression. (Psychologists are rarely interested in the specific prediction, but in the role that different variables play in that prediction.)

Linear Regression

We could use plain old linear regression with a dichotomous dependent variable.

We have already seen this back in Chapter 9 when we talked about point-biserial correlation.

In fact, it works pretty well when the probability of survival varies only between .20 and .80. It falls apart at the extremes, though probably not all that badly.

It assumes that the relationship between the independent variable(s) and the dependent variable is linear, whereas logistic regression assumes that the relationship is logistic (S-shaped) and is linear only on the log-odds scale. The reason linear regression works in the non-extreme case is that the logistic curve is quite linear in the center. (Illustrate on board.)

Example

Epping-Jordan, Compas, & Howell (1994)

We were interested in looking at cancer outcomes as a function of psychological variables—specifically intrusions and avoidance behavior.

The data are available at logisticreg.sav

The emphasis here was on the variables, rather than on the prediction.

I'm going to start with one predictor, and then move to multiple predictors.

Variables

I have discussed some of these variables before in other contexts, so I shouldn’t need to go over them all.

What we are really interested in are Intrusions and Avoidance, but I need to start with a simple example, so I will start with the Survival Rating as the sole predictor. This also has the advantage of allowing me to ask if those psychological variables have something to contribute after we control for disease variables. (This is another example of what we mean when we speak of hierarchical regression.)

We can plot the relationship between Outcome and Survival Rating, but keep in mind that there are overlapping points. To create this figure I altered the data for the outcome variable to let 1 = success and 0 = failure (no improvement or worse). I don't know why I did that, but I am too lazy to redraw the graphs--especially the second one. This is an important point, because the results depend on which end of the dichotomy we predict.

I have used a sunflower plot here. Every line in the "dot" represents a case, so if we have a dot, we have one case; a vertical line = 2 cases; a cross = 3 cases; etc. Notice that as we move from left to right we have most of the cases at Outcome = 0, then cases spread roughly equally between Outcome = 0 and 1, and then most of the cases at Outcome = 1.

Draw logistic function on this figure. The following was cut and pasted, after much work with an image editor, from the text. It plots a theoretical continuous outcome (Y) as a function of a predictor (X).

[Figure: censor.gif -- a theoretical continuous outcome (Y) as a function of a predictor (X), with part of each distribution shaded black]

(Note that on the left we have only tiny increases in the amount of the curve that is shaded black. In the center we have major differences in the amount of black. On the right we again have only minor differences in the amount of black.)

Censored Data

Explain what censored data are, using the figure  above.

Explain how this leads to sigmoidal data.
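(If I want the curve on a screen instead of the board, a minimal Python sketch will do. The intercept and slope here are arbitrary, chosen only to produce an S-shape over a SurvRate-like range.)

```python
import numpy as np
import matplotlib.pyplot as plt

# Arbitrary illustrative coefficients -- any b0 and b1 give the same S-shape
b0, b1 = -4.0, 0.1
x = np.linspace(0, 80, 200)                  # a range of predictor values
p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))     # the logistic function

plt.plot(x, p)
plt.xlabel("Predictor (X)")
plt.ylabel("Probability")
plt.title("The logistic (sigmoidal) function")
plt.show()
```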

Ask them when they would expect to see censored data in what they do.

  • Plain old boring pass-fail measures
  • Extravert--introvert
  • DWI versus not DWI
  • Those admitted to graduate school versus those not admitted.

Odds Ratios

The way most of us think about data like this is in terms of probabilities. We talk about the probability of survival.

But it is equally possible to think in terms of the odds of survival, and it works much better, statistically, if we talk about odds.

Odds Survival = Number Survived/Number Not-survived

or, equivalently,

Odds Survival = p(survival)/(1–p(survival))

As an aside, if the odds of survival are given as above, then the probability of survival = p(survival) = odds/(1 + odds)

One reason why we like working with odds is that odds, unlike probabilities, have no ceiling: as X changes, the odds can keep increasing without limit, whereas, as we saw, probabilities are a sigmoidal function of X that flattens out near 0 and 1. A second advantage is that if we plot the log of the odds against X, the relationship will be linear, which is always nice.
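Just to nail down the arithmetic, here is a minimal Python sketch of the conversions (the probabilities are made up purely for illustration):

```python
import numpy as np

p = np.array([0.10, 0.50, 0.80, 0.96])   # some made-up probabilities of survival

odds = p / (1 - p)              # odds of survival
log_odds = np.log(odds)         # the logit: ln(p/(1 - p))
back_to_p = odds / (1 + odds)   # and back again

print(odds)        # roughly 0.111, 1.0, 4.0, 24.0
print(log_odds)    # roughly -2.197, 0.0, 1.386, 3.178
print(back_to_p)   # recovers the original probabilities
```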

If we had an unlimited number of subjects, and therefore lots of subjects at each survival rating, we could calculate these odds. But we don’t have an unlimited number of points, and therefore we can’t really get them for every point. But that doesn’t mean we can’t operate as if we could. (Here is where the magic comes in. They are not going to see simple formulae for slope and intercept, like the ones they see in regression.)

Draw figure on the board plotting odds rather than probabilities.

At the very least, we have problems with probabilities at the high (and low) end. Once you get high enough you can't really get much higher in terms of probability. If a score of 70 gives you a probability of .96 of survival, a score of 80 can, at most, move you up .04. That isn't the case with odds, because odds have no theoretical upper limit.

Now we have to go one step further and get the log odds.

Log odds will allow the relationship I discussed just above to become linear.

log odds survival = ln(odds) = ln(p/(1 - p))

Notice that, by tradition, we use the natural logarithm rather than log10. (There is no great reason why this couldn't have been worked out in base 10 logs, except that statisticians and mathematicians like natural logs.)

This is often called the logit, or the logit transform.

We will work with the logit, and will solve for the equation

ln(p/(1 - p)) = ln(odds) = logit = b0 + b1*SurvRate

This is just a plain old linear equation because we are using logs. That’s why we switched to logs in the first place. The equation would not be linear in terms of odds or probabilities, as we saw in the graph above.

b0 is the intercept, and we usually don’t care about it.

b1 is the slope, and is the change in the log odds for a one unit change in SurvRate.

We will solve for all of this by magic, since a pencil and paper solution is out of the question. We will use an iterative solution. (Explain)
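For anyone who wants to peek behind the magic, the idea is Newton-Raphson (iteratively reweighted maximum likelihood). A minimal Python sketch of the general algorithm follows; this is not SPSS's actual code, just the idea:

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit ln(p/(1-p)) = b0 + b1*x by Newton-Raphson (iterative maximum likelihood)."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])   # a column of 1s for b0, plus the predictor
    b = np.zeros(2)                                           # start at b0 = b1 = 0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))       # current predicted probabilities
        w = p * (1.0 - p)                      # weights for this iteration
        grad = X.T @ (y - p)                   # gradient of the log likelihood
        info = X.T @ (X * w[:, None])          # information matrix
        b = b + np.linalg.solve(info, grad)    # one Newton step; repeat until the estimates settle down
    return b                                   # b[0] is the intercept, b[1] the slope
```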

Graphing the relationships

I have talked about the shape of these distributions. I want students to understand why we go to all this work, so I will jump ahead and calculate the predicted probability of success as a function of Survrate.

Emphasize that I am jumping ahead here.

To do this I just calculated the predicted log odds from SurvRate, then took exp(log odds) to get the odds, and then took prob = odds/(1 + odds).

Notice that SPSS has taken it on itself to predict nonsurvival rather than survival. It just bases that on the way the data are coded.

Notice it is sigmoidal in shape. (I could exaggerate it if I put in some cases with even lower SurvRate.)

Now plot as odds against SurvRate

That is very uninteresting, but I did it along the way. (Odds do extreme things at the extremes.)

Now plot ln(odds) against SurvRate

Notice that this is now linear.

That was the point of all this. I wanted to show that a set of data that we would probably think of as curvilinear can be made linear, but we have to remember that in doing so the thing that we are trying to predict is no longer probability of success, but log odds of success. But there is nothing to prevent us from converting the result we obtain back into statistics, such as odds or probabilities, that we are more used to dealing with.

Running Logistic Regression with SPSS

We did much of this on Thursday, so some will be just review.

Step 1 with SPSS

Intercorrelation Matrix of Predictors

Remember that these are linear relationships, but it gives us an idea of where we are starting.

I have simplified the output, but the sample size was always 66, and significance is shown by asterisks.

Now we need to run the Logistic Regression itself, with Outcome as the dv and Survrate as the predictor.
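(As an aside, for anyone working outside SPSS, roughly the same run in Python with statsmodels looks like the sketch below. I am assuming the variables in logisticreg.sav are named OUTCOME and SURVRATE; check the actual names in the file.)

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_spss("logisticreg.sav")        # requires the pyreadstat package
y = df["OUTCOME"]                            # 0/1 dependent variable (recode first if it comes in with value labels)
X = sm.add_constant(df["SURVRATE"])          # the predictor plus an intercept term

result = sm.Logit(y, X).fit()                # iterative maximum likelihood, as sketched earlier
print(result.summary())                      # coefficients, Wald tests, etc.
print(-2 * result.llf, -2 * result.llnull)   # -2 log likelihood for this model and for the intercept-only model
```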

SPSS Logistic Regression

I am using SPSS version 10.1 for some of what follows, and version 9 for the rest. You can tell which is which, because 10.1 has prettier tables.

NOTE what they have done. I coded Worse/NoChange = 1, and they converted it to zero. Improved was a 2, and they changed it to 1. But they have kept the order intact.

Block 1: Method = Enter

In the tables above, the 40.022 is a test on the significance of this step--does the model fit better now that we have added one (or perhaps more) variables?

The value of 37.323 is a test on whether there is still variability in the data to be explained. There is, but that doesn't detract from the usefulness of SURVRATE. 

The 17.756 is another test on whether Survrate is a significant predictor.

What follows is version 9, so that people can see what is actually happening. I will skip that for class, because there is too much to cover.

Beginning Block Number 0. Initial Log Likelihood Function

-2 Log Likelihood 77.345746

* Constant is included in the model.

Discuss this printout in detail.

2. The next thing that we see is "–2 Log Likelihood".

This is a model with just an intercept included. It is like testing a linear regression model with just Ŷ = b0 in it.

That model is very uninteresting, but it gives us a base to start from.

–2 Log Likelihood = 77.346 is a chi-square statistic on 1 df, which is clearly significant. But we don’t care about its significance here. A significant result means that the model does not fit the data adequately, just as a traditional chi-square test is significant when an independence model does not fit adequately.

3. Then SPSS enters SurvRate as an independent variable and reports another chi-square:

–2 Log Likelihood = 37.323

This is a model with both the intercept and SurvRate included. It is like testing a linear regression model with Ŷ = b0 + b1*X1 in it, where X1 = SurvRate.

This is a test on whether the new model, with SurvRate added, fits the data. A significant chi-square would say that it does not fit the data completely, though that certainly doesn't mean that it doesn't fit better than the previous model.

This is a chi-square on df = number of predictors + 1 (the constant)  = 2, and the test is significant.

But we aren’t so much interested in whether it is a perfect fit as we are in whether the model with SurvRate in it fits better than the model without SurvRate. For that test we just find the amount of improvement in chi-square.

Improvement = 77.346 – 37.323 = 40.023

This is itself a chi-square on 2 – 1 = 1 df, because we have added one predictor, and is certainly significant.

In other words, SurvRate adds significantly to the prediction of Outcome. (This is so much clearer in version 9.0 than in version 10.0.)

4. I deleted the classification table from the output. I think that they are generally quite misleading, because even dreadful data can sometimes have a high correct classification percentage.

5. The Regression Equation

Log (odds Survival) = –.0812*SurvRate + 2.6836

This means that whenever two people differ by one point in SurvRate, the log odds of survival differ by –.0812

Notice that this interpretation is the same as for normal regression, except that we are predicting log odds.

Take someone with a SurvRate = 50. Then

log odds = –.0812(50) + 2.6836 = –1.3764

odds = e^(-1.3764) = .2525

This means that they are only .25 times as likely to die as to survive. (It is important to keep in mind whether we are predicting death or survival.)

If we take the inverse we have 1/.25 = 4.0, which means that with a 50 you are 4 times more likely to live than die.

Keep in mind that this is the odds, not the odds ratio. So you are 4 times more likely to live than you are to die--it is not contrasting you with someone else.

Now someone with a 51 would have

log odds = –.0812(51) + 2.6836 = –1.4576

odds = e^(-1.4576) = .2328

The difference in log odds is –.0812, which is the coefficient.

But, what does that mean?

Notice that as Survrate increased the odds decreased. BUT these are the odds of NOT surviving. In other words, SPSS has chosen its own definition of "survival." You always have to watch out for this in logistic regression, regardless of the program you use.

But if odds = p/(1-p), then p = odds/(1 + odds)

For someone with a 50, p = .2524/1.2524 = .20

For someone with a 51, p = .2328/1.2328 = .19

If you have a SurvRate of 50, you are not too likely to die: the probability of getting worse is .20, so the probability of improving is .80. If your survival rating increases to 51, the probability of dying decreases a tiny bit to .19. Thus higher survival ratings are associated with lower probabilities of dying.
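If you want to check this arithmetic for yourself, a few lines of Python will do it, using the coefficients from the output above:

```python
import numpy as np

b0, b1 = 2.6836, -0.0812            # intercept and slope from the output above

for survrate in (50, 51):
    log_odds = b0 + b1 * survrate   # predicted log odds of NOT surviving
    odds = np.exp(log_odds)         # back to odds
    p = odds / (1 + odds)           # and on to the probability of getting worse
    print(survrate, round(log_odds, 4), round(odds, 4), round(p, 2))
    # 50: -1.3764, odds about .25, p = .20;  51: -1.4576, odds about .23, p = .19

print(round(np.exp(b1), 3))         # 0.922: the factor by which the odds change for each point of SurvRate
```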

The only way I know of for being sure which direction things are going is to calculate a couple of probabilities and make sure you know what they mean. (You could read the manual, but who does that :-) )

We could calculate the probability of surviving for every subject using the above equation. In fact, SPSS will do that for us and SAVE all of the predicted values. We can then make a scatterplot of predicted values against SurvRate.

Notice that the probabilities (as calculated from the log odds) stay between 0 and 1, and behave in just the ways I've been talking about. This again makes it obvious that we are plotting the probability of getting worse, since it wouldn't make sense for the probability of survival to decrease as the rating of survival increases.

Notice the sigmoidal curve we have been talking about.

More about the coefficients:

There is another way to look at this printout that is not about probabilities. To the extreme right of the coefficient (the log odds ratio), we see 0.922. (Thus a one point increase in SurvRate multiplies the odds of death by .922.) We can say that a one point increase in SurvRate reduces the log odds of death by .0812, and e^(-.0812) = 0.922.

In other words, the entry to the right is exp(b), where b is the change in the log odds.

Notice also that we have a test on the significance of the coefficients. This test is labeled "Wald," and it is (sort of) a chi-square test. It isn't exactly distributed as chi-square, but nearly.

Here Wald = 17.7558 on 1 df, which is significant.

Notice that the Wald chi-square (17.7558), which asks if SurvRate is significant, doesn't agree with the change in chi-square (40.022), which asks if adding SurvRate leads to a better fit. Blame this on Wald: it is not a great test, and it tends to be conservative. (The comparable tests in linear regression, F and t, are exactly equivalent, but that is not true in logistic regression.) The change in chi-square is the better test, but if we had added two variables instead of one, we would need Wald to tell us about each one individually.

Predicting Group Membership.

We could make a prediction for every subject, and then put each subject with p > .50 in the "non-survival" group, and everyone with p < .50 in the "survival" group. This is shown below.

 

In this figure we have shown what actually happened, and you can see that a few people with a predicted value less than .50 actually got worse, and a few above .50 actually got better. But there isn't a huge difference between predicted and actual.

SPSS actually gives us a table of outcomes in the printout, and this shows that 86.36% of the predictions were correct.

Classification tables usually have an important "feel good" component, but that can be very misleading. It is easy to come up with data where almost everyone survives, and then all we have to do to get a great %correct is to predict that everyone will survive. We will be pretty accurate, but not particularly astute. 

People shouldn't be impressed when I say that my Howell Test of Galactic Threat (HTGT) is extremely accurate because I am never wrong. (I simply make the same prediction for everyone, namely that they will not die from being hit on the head by a meteor, and I haven't been wrong yet.)
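To make that point concrete, here is a toy sketch (the 10% rate and the data are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
actual = rng.random(1000) < 0.10        # made-up outcomes: only about 10% of people have the bad outcome
predicted = np.zeros(1000, dtype=bool)  # my "model": predict the good outcome for absolutely everyone

accuracy = np.mean(predicted == actual)
print(accuracy)                         # roughly .90 correct -- impressive looking, and completely uninformative
```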

Multiple Independent Variables

Epping-Jordan, Compas, and Howell (1994) were not really interested in the prediction of survival, although that’s a good thing. They really wanted to know what role Avoidance and Intrusions played in outcomes.

Here we have another hierarchical regression question.

Get them to see why we want to look at SurvRate first.

The approach we will take is the hierarchical one of first entering SurvRate and then adding one or more other variables, such as Avoid or Intrus. The first part we have already seen.

I could just enter both Survrate and Avoid at the same time to see what I get. But by adding them at separate stages (using "next" in the dialog box) I can get more useful information.

First enter SurvRate, and then add Avoid, with Outcome as the dv.

Block 2: Method = Enter

At the first step, with just SurvRate, –2 Log Likelihood = 37.323. (We saw this several pages back.) With two predictors, –2 Log Likelihood = 32.206. The difference between these is a test of whether Avoid adds something to the prediction over and above SurvRate. This difference is 5.118, which is shown above. It is on 1 df, and is significant at p = .0237. Thus Avoidance adds to the prediction (it actually subtracts from survivability) after we control for the medical variables that are included in SurvRate.

Regression equation

We can see that the optimal regression equation is

log odds(worse) = -0.082*SurvRate + .133*Avoid + 1.196

We can also see the Wald test on these coefficients. Note that the test on Avoid gives a p = .035, which is somewhat different from the more accurate p = .024 that we found above.

If we want to go from log odds to odds, we see the result on the right.

e^(-0.0823) = .9210

e^(0.1325) = 1.1417

Thus a one point difference in SurvRate multiplies the odds of dying by .9210, when we control for Avoid. Likewise, a one point increase in Avoid multiplies the odds of dying by 1.1417 when we control for SurvRate.

This makes sense, because we would expect the odds of dying to decrease (multiply by < 1) as SurvRate increases, but the odds of dying to increase (multiply by > 1) as Avoid increases.

CONCLUSION  Even after we control for the degree of illness (Survrate), avoidance is a bad thing.

What if we add Intrus as well?

The following is a greatly abbreviated output, just focusing on our problem. What I have done is to put in SurvRate at step 1, and both Avoid and Intrus at step 2. Thus the significance test is on whether the two variables together add significantly. They don't (p = .059).

Notice that the test of adding Intrus and Avoid after Survrate is not quite significant (p = .059). Why not?

Although Avoid has much to offer, Intrus has almost nothing. We have increased the change in LR-chi-square a bit more by adding Intrus, as well as Avoid, at this stage (from 5.118 to 5.673), but we have spent a degree of freedom to do this. Whereas 5.118 on 1 df was significant, 5.673 on 2 df is not. (It would need to exceed 5.99).
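The chi-square bookkeeping in these hierarchical comparisons is easy to reproduce. A sketch, using the –2 log likelihood values reported above:

```python
from scipy.stats import chi2

dev_survrate   = 37.323    # -2 log likelihood with SurvRate only
dev_with_avoid = 32.206    # -2 log likelihood with SurvRate and Avoid

# Does Avoid add anything beyond SurvRate?  (1 df)
print(chi2.sf(dev_survrate - dev_with_avoid, df=1))   # about .024

# Do Avoid and Intrus together add anything beyond SurvRate?  (change = 5.673, 2 df)
print(chi2.sf(5.673, df=2))                           # about .059
```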

Note that Wald still calls Avoid significant.

We would be better off going back to the one predictor case.

The following is an e-mail exchange that I received last year. I think that it brings up some interesting points. I don't expect you to remember it all, but I would like you to remember that it is here, and refer to it if you need something like r-squared.


>A colleague using multiple logistic regression would like to have:

>(1) an overall measure of the explanatory power of the model, such as
>proportion of variance explained in linear regression, and ...

This issue has been considered extensively in the literature. Apparently
there is little consensus and no true R^2 measure in logistic regression.
Here are a couple of approaches you may consider.

(a) Obtain predicted values--probabilities--of your outcome and calculate
the correlation between predicted probabilities and observed (a
point-biserial correlation). The correlation between predicted Y and
observed Y is exactly what R represents in linear regression.

[For our data, r = .751, r^2 = .564.]

Agresti, A. (1996). An introduction to categorical data analysis, Wiley.
(p. 129) discusses this approach.

(b) Use the model deviance (-2 log likelihood) to calculate a reduction in
error statistic. The deviance is analogous to sums of squares in linear
regression, so one measure of proportional reduction in error--I
think--that is similar to adjusted R^2 in linear regression would be:

pseudo R^2 = (DEV(null) - DEV(model))/DEV(null)

where DEV is the deviance, DEV(null) is the deviance for the null model
(intercept only), and DEV(model) is the deviance for the fitted model.
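[For our one-predictor run, approach (b) works out as follows, using the deviances reported earlier in these notes:]

```python
# -2 log likelihoods from the SurvRate-only run earlier in these notes
dev_null, dev_model = 77.346, 37.323

pseudo_r2 = (dev_null - dev_model) / dev_null
print(round(pseudo_r2, 3))   # about .517
```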

There exists a number of methods for calculating pseudo R^2 values. A good
discussion can be found in Maddala, G.S. (1983), Limited-dependent and
qualitative variables in economics, Cambridge.

There are many published articles on this topic. Here are just a few.

Nagelkerke, N.J.D. (1991). A note on a general definition of the
coefficient of determination. Biometrika, 78, 3, 691-692.

Agresti, A. (1986). Applying R^2 type measures to ordered categorical data.
Technometrics, 28, 2, 133-138.

Laitila, T. (1993). A pseudo-R^2 measure for limited and qualitative
dependent variable models. Journal of Econometrics, 56, 341-356.

Cox, D.R., & Wermuth, N. (1992). A comment on the coefficient of
determination for binary responses. The American statistician, 46, 1, 1-4.



>(2) a way to compare the contributions of two independent variables when
>both (and possibly other variables as well) are in the model, such as
>incremental R square in linear regression.

For effect sizes I understand that the odds-ratio is the measure of choice.
I don't know, however, how to determine an appropriate comparison of
odds-ratios for two continuous predictors on different scales.

Another possibility would be to look at the change in model deviance
attributed to both variables.

Another would be to calculate something called the structure coefficient.
To do this, I think one calculates the predicted values for the outcome
(predicted probabilities in this case), and correlating these predicted
probabilities with each independent variable. Those with the strongest
correlations would be the ones contributing most to the model.

I'm not an expert on this topic, so I welcome anyone willing to offer
corrections to my comments; I'd like to better understand issues.
___________________________________________________________________
Bryan W. Griffin
Phone: 912-681-0488
E-Mail: bwgriffin@gasou.edu
WWW: http://www2.gasou.edu/edufound/bwgriffin/

*********************************************

From: Darran Caputo <dcaputo@banet.net>

In response to (1).

- I find the most straightforward way to understand the explanatory power of your model is a confusion matrix.

- A confusion matrix is a 2-by-2 of Actual and Predicted values of the outcome under study.

- A confusion matrix is a function of the CUTOFF chosen to classify observations.

- The question is: How well does the classifier perform for a given CUTOFF?

- Once you choose a given CUTOFF your confusion matrix is determined.

- This leads to many interesting statistics:

a. Sensitivity = (# actual positive and predicted positive)/(# actual positives) ~ P(PP|AP)

b. Specificity = (# actual negative and predicted negative)/(# actual negatives) ~ P(PN|AN)


The next step is to plot Sensitivity on the vertical axis and 1 - Specificity on the horizontal axis for a range of cutoffs (predicted probabilities between 0 and 1). This curve is called a Receiver Operating Characteristic (ROC) curve. The area under the curve is equivalent to the C-statistic reported in SAS Proc Logistic. The C-statistic ranges from .5 to 1, where C = 1 for a perfect model and C = .5 for a model no better than random classification.
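[A minimal sketch of these quantities for a single cutoff; the actual outcomes and predicted probabilities here are made up purely for illustration:]

```python
import numpy as np

actual      = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])             # 1 = the outcome of interest
predicted_p = np.array([.9, .8, .4, .3, .2, .6, .1, .7, .2, .5])   # predicted probabilities from some model

cutoff = 0.5
predicted = predicted_p > cutoff

sensitivity = np.sum(predicted & (actual == 1)) / np.sum(actual == 1)    # P(PP | AP)
specificity = np.sum(~predicted & (actual == 0)) / np.sum(actual == 0)   # P(PN | AN)
print(sensitivity, specificity)   # .75 and about .83 for these made-up numbers
```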

***********************

A better analogy of R2 is Nagelkerke's R2 (many other authors also had a hand in this).  It is 1 - exp(-LR/n) where LR is the likelihood ratio chi-square for the whole model and n is the number of observations (not the number of "events").

See

Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78, 691-692.
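[For our one-predictor run, LR = 40.02 and n = 66, so this index works out to roughly .45:]

```python
import numpy as np

LR, n = 40.023, 66                      # likelihood ratio chi-square and sample size from earlier
print(round(1 - np.exp(-LR / n), 3))    # about .455
```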

The index used by SPSS is a mixture of two different types of chi-squares. Even if it only used LR chi-squares (partial LR and total LR), there is a problem.

See Schemper, M. (1990). The explained variation in proportional hazards regression (correction in 81:631, 1994). Biometrika, 77, 216-218.

and Schemper, M. (1992). Further results on the explained variation in proportional hazards regression. Biometrika, 79, 202-204.

----------------------------------------------------------------------------

Frank E Harrell Jr
Professor of Biostatistics and Statistics
Division of Biostatistics and Epidemiology
Department of Health Evaluation Sciences
University of Virginia School of Medicine
hesweb1.med.virginia.edu/biostatistics.html


References:

Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.




 

Last revised: 04/08/02