Lab 5: More regression diagnostics.

NOTE: if you’re having trouble getting the right results because of data issues, you can download the correct version, including the H2Odens variable, at Data_2006\NR245\NR245_backup.mdb\BG_GF_LC_census2 and then skip ahead to the regression

This week we’ll do another set of regressions. The first regression looks at the predictors of tree cover at the block group level.

  1. In Splus import your BG_GF_Census layer that you used in lab 3. Let’s run a simple univariate regression to see how population density (POP00.SQMI) predicts tree cover. First go to Options>>general settings and under the computations tab change Print Digits to 10. So, go to statistics>>regression>>linear and input: P.coarse.veg~ POP00.SQMI. Is the variable significant? Report its coefficient, t-statistic and the model’s R-squared. What does this result mean in terms of the expected effect of population density on tree cover? Now add some more variables to make it multivariate. Input P.coarse.veg~ POP00.SQMI+ P.HS.+MED.HH.INC+P.SFDH+P.Protland+d2down+d2ramp. The variables mean population density, percent of high school graduates, median household income, percent single family detached houses, percent protected land in the block group, distance to downtown and distance to highway onramp.  Copy and paste the results. What happened to R-Squared? Briefly describe which variables are significant and whether they have a positive or negative influence on the dependent variable.
  2. We know that water bodies greatly influence the distribution of trees, so let’s see if such a variable has any effect. Open Arc Catalog and copy from the share drive a hydrology layer to your geodatabase (Data_2006\Database\Hydrology\Hydrology.mdb\Surface_Water\NHDlinesGF; remember to use copy>>paste in Arc Catalog). Load up BG_GF_Census plus your hydro layer in Arc Map. Open the Spatial Analyst toolbar and hit Spatial analyst>>density. Choose NHD_lines_GF as your input, none as population field, kernel as density type, search radius as 500 meters, area units as square kilometers, and 30 meters as output cell size. Leave the output raster as temporary. Then, Using zonal statistics (Spatial analyst toolbar>>zonal statistics) to summarize average stream density by block group.

Your zonal stats should look something like this. . Now, in the BG_GF_Census attribute table, create a new field called “H2Odens” set to double and use the field calculator to fill in the values so they are equal to densitywat.MEAN. Then remove BG_GF_Census from Arc Map.

  1. Now let’s redo the regression. In S Plus go to statistics>>regression>>linear. Using BG.GF.census2 as your data set do:

 P.coarse.veg~ H2Odens+P.HS.+POP00.SQMI+MED.HH.INC+P.SFDH+P.Protland+d2down+d2ramp. Before hitting apply, choose to do the residuals vs fit and sqrt abs residual vs fit and Residuals Normal QQ plots under the residuals tab (see chapter 6 of the SPlus Guide to Statistics volume 1 for more information on normality plots; help>>online manuals). Also, under the “model” tab, where it says “save as”, type lin1. Hit apply.

    1. cut and paste the table of coefficients
    2. Report the R-squared value and compare to the previous model (the same, but without H2Odens)
    3. What happened to the population density variable? If a change happened, explain why think that might be.
    4. take a screencapture of each of the residual plots.
    5. describe the pattern you see. What might these patterns indicate? What does the QQ plot suggest about normality?
    6. Interpret the coefficient on d2ramp. Assuming the units are meters what does a one unit change in that coefficient mean in terms of a change in the response variable? Remember that the dependent variable is a percentage.
    7. Run a shapiro Wilke test for normality. At the command line type shapiro.test (lin1$resid). Report the W and P values. A p value less than .05 means we can reject the null hypothesis of normality at the 95% confidence level.
  1. You’ll note that some variables are now insignificant. Just to show how transformations of an independent variable can impact other independent variables, go ahead and log the distance to downtown variable. We do this, because distance to downtown can get quite large as you go far north in the watershed, so presumably this scale problem could be biasing our results. Logging the variable will put it into a much smaller scale should reduce the instability of the model. To do this, just type in the model formula window, changing d2down to log(d2down). In “save as” change the model name to linlog1. If you look at the plots, you’ll note that this has little effect on nonconstant variance or normality. But it does affect the coefficients and R-squared. Report the new R-squared and whether the d2down variable is now significant.
  2.  Now, let’s try logging the reponse variable. Do that by simply typing in the model formula box to change P.coarse.veg to log(P.coarse.veg), also changing “save as” to loglog1.  Run the model again and look at how the residuals have changed. Describe the change and what it might signify. What happened to R-squared, though? This transformation helps in some ways, but hurts in others. Let’s try to figure out the best transformation. To do that, we’ll use a method called a BoxCox Transformation. Here’s some good background on that method: http://jas.fass.org/cgi/reprint/76/3/847.pdf . This method is used to find an optimal power transformation of a variable where variables in a linear function are transformed according to

                                       The parameter  is estimated through maximum likelihood to find the optimal transformation of a variable, in this case the dependent. This should yield the transformation that allows the model to best meet many of the assumptions of linear regression, such as normality and constant variance. Once that parameter is estimated, the dependent variable can be transformed according to that equation, although there are various “special cases” where =1, =0, =.5, or =-1. in which the model becomes one of the classic functional forms: 1 corresponds to the linear model, 0 to natural log transformation, -1 to reciprocal transformation, and .5 to square root transformation. The Box Cox transformation is generally used to transformation the dependent (and sometimes independent) variable so it optimizes the model, but it becomes hard to interpret the coefficients. So, if your lambda is near one of those values for the special cases, it’s often common to simply do that “special case” transformation. In this sense, the Box Cox becomes like a diagnostic to help us determine the optimal transformation of the data. Open the command window in S Plus (window>>command window) and then copy the following into the command window, making sure to include the last bracket

boxcox<-

function(g, i=seq(-2,2,by=0.1))

{

  y <- g$fit + g$res

  x <- model.matrix(g)

  qrf <- lsfit(x, y, int=F)$qr

  sly <- sum(log(y))

            n <- length(y)

            ll <- rep(0, length(i))

            for(j in 1:length(i)) {

                        l <- i[j]

                        if(l == 0)

                                    ty <- log(y)

                        else ty <- (y^l - 1)/l

                        cat(dim(qrf$qr$qr))

                        res <- qr.resid(qrf, ty)

                        ll[j] <- ( - n * log(sum(res^2)/n))/2 + (l - 1) * sly

            }

            list(lambda = i, loglik = ll)

}

 

Then in the commands window type (not including the first prompt arrow on each line)

> a <-boxcox(linlog1)

>plot (a$lambda, a$loglik)

Take a look at the resulting parabola. At what lambda does the curve appear to hit its max? To test for sure type

> a$lambda[a$loglik==max(a$loglik)]

  1. Which of the 4 critical values named above is it nearest? If it requires a transformation (that is, not linear), then transform the left side of the equation using the appropriate transformation (we could transform both sides, but in this case only the response really needs it; however do make sure that d2down is still logged as in your previous regressions). The syntax for log transform of the dependent variable would be log(P.coarse.veg) and square root would be sqrt(P.coarse.veg) and reciprocal would be 1/(P.coarse.veg). If it suggests a linear model, then no need to do anything. Before hitting apply, under save as in the window, type transform. Again choose to plot out residuals vs fit, abs sqr rt residuals vs. fit and Residual Normal QQ plots AND also go to the results tab and choose to save the residuals in your BG.BF.Census layer. Hit Apply.  If there are other columns already named residuals in your BG.GF.Census table, you may want to rename that newly created column (right click>>properties), to something like residbox. Now:
    1. Screencapture each plot
    2. Describe the differences in the plots with previous plots and whether you think this model appears to be “better”
    3. Describe whether there were any differences in R-squared, or sign or significance of the coefficients.
  2. While this improves things, it does not totally solve our problems, though. First, our residuals are still somewhat non-normal. The way we can confirm this is by doing another Shapiro-Wilke test for normality. At the command line type shapiro.test (transform$resid). What does ours say? Based on the p value, does the data seem to be more or less normally distributed than the last time we ran this test, in step 3?
  3. Does this solve spatial autocorrelation?  Let’s quickly check this out in S-Plus. Go to file>>load module and click spatial then OK. Go to Spatial>>spatial neighbors. Check nearest neighbors radio button, BG.GF.Census under Data set, X.Centroid under variable 1, Y.Centroid under variable 2, neighborBG1, under Save in, and everything else as is. Click OK. Then go to Spatial>>spatial correlations, choose BG.GF.Census as the data set, residbox (or whatever you called the residuals column you just created) as the variable) and neighborBG1 as the neighbor object. Keep everything else the same and click apply. Note these results only give you a Moran statistic and P value. What do these results suggest?
  4. Using last week’s instructions, output the residuals from your transformed model to a new table. Remember to only output two columns—residuals and BKGKEY. Join it to your BG_GF_Census layer in Arc Map and then plot out the residuals and take a screencapture. If you want, you can also compare the Moran’s I Z scores of residuals for last week’s and this week’s models. If you do, you’ll note that this week’s model results in a slightly lower z score, but both are still highly autocorrelated.
  5. Submit and upload your document.