Lab 5: More regression diagnostics.

Due Feb 22

This week we’ll do another set of regressions. The first regression looks at the predictors of tree cover at the block group level.

  1. In S-Plus, import the download.mdb geodatabase from earlier. Let’s run a simple univariate regression to see how population density (POP00.SQMI) predicts tree cover. First go to Options>>general settings and, under the computations tab, change Print Digits to 10. Then go to statistics>>regression>>linear and input: P.coarseveg~POP00.SQMI. The variable should be highly significant. [Q1] Interpret the coefficient: given that population density is in people per square mile and P.coarseveg is a proportion (where 0.1 = 10%), what predicted effect would a 10,000 person/sq mile increase have on the percentage of tree cover? Now add some more variables to make it multivariate. Input P.coarseveg~POP00.SQMI+P.HS.+MED.HH.INC+P.SFDH+P.Protland+d2down+d2ramp. The variables are population density, percent of high school graduates, median household income, percent single family detached houses, percent protected land in the block group, distance to downtown, and distance to the highway onramp. [Q2] Copy and paste the results. What is the R-squared? Put a single asterisk next to each term that is significant at the 95% confidence level and two asterisks next to each term that is significant at the 99% confidence level (they should go to the right of the p values).
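To see the arithmetic behind [Q1], here is a sketch in Python with a made-up slope (the value of beta here is hypothetical; use the coefficient S-Plus actually reports):

```python
# Hypothetical output from the univariate regression (the slope is made up):
#   P.coarseveg = intercept + beta * POP00.SQMI
beta = -8e-06            # hypothetical: change in tree-cover proportion per person/sq mi
delta_density = 10_000   # increase of 10,000 people per square mile

delta_cover = beta * delta_density    # predicted change in the proportion (0.1 = 10%)
delta_cover_pct = delta_cover * 100   # the same change expressed in percentage points
```

So a hypothetical slope of -8e-06 would imply about an 8-percentage-point drop in tree cover per additional 10,000 people per square mile; plug your own coefficient into the same arithmetic.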
  2. We know that water bodies greatly influence the distribution of trees, so let’s see if such a variable has any effect. Open ArcCatalog and copy a hydrology layer from the share drive (NR245\lab_data.mdb\NHD_Lines_GF) to your geodatabase; remember to use copy>>paste in ArcCatalog. Make sure to do this step in C:\temp. Load up BG_GF_Census plus your hydrology layer in ArcMap. Go to Spatial analyst>>density>>kernel density. Choose NHD_Lines_GF as your input, none as the population field, 500 meters as the search radius, square kilometers as the area units, and 30 meters as the output cell size. Then click "Environment settings">>"raster analysis" and set the mask to BG_GF_Census. Save the output raster as hydrodens. Then use Zonal Statistics as Table (Spatial analyst>>zonal>>zonal statistics as table) to summarize average stream density by block group. Use BG_GF_Census as the zone dataset, BKG_KEY_1 as the zone field, and hydrodens as your value raster. Save the output table as something like “waterzonal.” Then join the resulting table to BG_GF_Census. Now, in the BG_GF_Census attribute table, create a new field called “H2Odens” set to double and use the field calculator to fill in the values so they are equal to waterzonal.MEAN. To check that you have done this correctly, open the attribute table and sort it ascending by Block Group. Then scroll over to the right and look at the H2Odens field. If you take the sum of the column (right click>>Statistics), it should be around 183 or so. Now make a graduated color map of H2Odens by block group with 5 classes, using a green-to-blue color ramp, and take a screen capture.
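Conceptually, Zonal Statistics as Table just averages the value raster within each zone id. A toy numpy version of that mean-by-zone step (arrays invented for illustration) looks like:

```python
import numpy as np

# Toy 3x3 rasters: block-group ids (zones) and stream-density values on the same grid
zones = np.array([[1, 1, 2],
                  [1, 2, 2],
                  [3, 3, 3]])
density = np.array([[0.2, 0.4, 1.0],
                    [0.0, 0.8, 0.6],
                    [0.1, 0.1, 0.4]])

# mean density per zone, analogous to the MEAN column in the zonal table
zonal_mean = {z: density[zones == z].mean() for z in np.unique(zones)}
```

Joining that per-zone mean back onto the block groups is exactly what the table join and the H2Odens field calculation accomplish in ArcMap.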
  3. Now let’s redo the regression. In S-Plus, re-import the data, making sure to give the table a new name, like BG.GF.Census2 (name it in the “To data set” field in the import interface). First, let’s do a correlation matrix to see how correlated the different X variables are. Go to statistics>>data summaries>>correlations and control-click to select all your X variables. Then change Method to Handle Missing Values to Omit. [Q3] Report the correlation coefficient between population density and H2O density. Then go to statistics>>regression>>linear. Do:

 P.coarseveg~H2Odens+P.HS.+POP00.SQMI+MED.HH.INC+P.SFDH+P.Protland+d2down+d2ramp. Before hitting apply, choose to do the Residuals vs Fit and Residuals Normal QQ plots under the residuals tab (see chapter 6 of the S-Plus Guide to Statistics, volume 1, for more information on normality plots; help>>online manuals). Also, under the “model” tab, where it says “save as”, type lin1. Hit apply. [Q4]

    1. Cut and paste the table of coefficients, putting one, two, or no asterisks by each row, as above.
    2. What happened to the population density variable? If a change happened, explain why you think that might be.
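The correlation matrix built in statistics>>data summaries>>correlations is just pairwise Pearson r between the X variables. A quick sketch of the same idea with synthetic data (names and numbers invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
pop_density = rng.uniform(100, 20_000, n)   # synthetic people per sq mi
# income made to fall with density, plus noise, so the two are correlated
med_hh_inc = 60_000 - 1.5 * pop_density + rng.normal(0, 8_000, n)

# 2x2 Pearson correlation matrix; the off-diagonal entry is r(pop_density, med_hh_inc)
corr = np.corrcoef(pop_density, med_hh_inc)
r = corr[0, 1]
```

A strongly positive or negative r between two X variables is a warning sign of multicollinearity if both enter the same regression, which is why the lab has you inspect the matrix before running the model.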

Now take a screen capture of each of the residual plots. [Q5] Describe the pattern you see on the residuals plot. What problem does it suggest?

Run a Shapiro-Wilk test for normality. At the command line type shapiro.test(lin1$resid). [Q6] Report the W and P values. What does this tell you about the normality of the data?
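To see what the W and p values mean, scipy’s implementation of the same Shapiro-Wilk test behaves like this on a clearly normal versus a clearly skewed sample (synthetic data, not the lab’s residuals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_resid = rng.normal(0.0, 1.0, 200)    # residuals consistent with normality
skewed_resid = rng.exponential(1.0, 200)    # right-skewed residuals

w_norm, p_norm = stats.shapiro(normal_resid)
w_skew, p_skew = stats.shapiro(skewed_resid)
# W near 1 means close to normal; a small p value rejects normality
```

In the lab, a small p value from shapiro.test(lin1$resid) likewise means the residuals fail the normality assumption.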

  1. You’ll note that some variables are now insignificant. Just to show how transformations of an independent variable can impact other independent variables, go ahead and log the distance-to-downtown variable. We do this because distance to downtown can get quite large as you go far north in the watershed, so presumably this scale problem could be biasing our results. Logging the variable puts it on a much smaller scale, which should reduce the instability of the model. To do this, just edit the model formula window, changing d2down to log(d2down). In “save as” change the model name to linlog1. If you look at the plots, you’ll note that this has little effect on nonconstant variance or normality. But it does affect the coefficients and R-squared. [Q7] Is the d2down variable now significant and, if so, at what significance level? What is the R-squared?
  2. Now, let’s try logging the response variable. Do that by simply editing the model formula box to change P.coarse.veg to log(P.coarse.veg), also changing “save as” to loglog1. Run the model again and look at how the residuals have changed. Describe the change and what it might signify. What happened to R-squared, though? This transformation helps in some ways, but hurts in others. Let’s try to figure out the best transformation. To do that, we’ll use a method called the Box-Cox transformation, which finds an optimal power transformation of a variable. Variables in a linear function are transformed according to

                    y(λ) = (y^λ - 1)/λ     when λ ≠ 0
                    y(λ) = log(y)          when λ = 0

The parameter λ is estimated through maximum likelihood to find the optimal transformation of a variable, in this case the dependent. This should yield the transformation that allows the model to best meet many of the assumptions of linear regression, such as normality and constant variance. Once that parameter is estimated, the dependent variable can be transformed according to that equation, although there are various “special cases” where λ = 1, λ = 0, λ = 0.5, or λ = -1, in which the model becomes one of the classic functional forms: 1 corresponds to the linear model, 0 to the natural log transformation, -1 to the reciprocal transformation, and 0.5 to the square root transformation. The Box-Cox transformation is generally used to transform the dependent (and sometimes independent) variable so that it optimizes the model, but the resulting coefficients become hard to interpret. So, if your lambda is near one of those special-case values, it’s common to simply do that “special case” transformation instead. In this sense, the Box-Cox method becomes a diagnostic to help us determine the optimal transformation of the data. Open the command window in S-Plus (window>>command window) and then copy the following into it, making sure to include the last bracket:


boxcox <- function(g, i = seq(-2, 2, by = 0.1))
{
        # g is a saved linear model object (e.g. linlog1); i is the grid of lambda values
        y <- g$fit + g$res               # reconstruct the response from the fitted model
        x <- model.matrix(g)
        qrf <- lsfit(x, y, int = F)$qr   # QR decomposition, reused for each lambda
        sly <- sum(log(y))
        n <- length(y)
        ll <- rep(0, length(i))
        for(j in 1:length(i)) {
                l <- i[j]
                if(l == 0)
                        ty <- log(y)
                else ty <- (y^l - 1)/l
                res <- qr.resid(qrf, ty)
                # profile log-likelihood of lambda = l
                ll[j] <- ( - n * log(sum(res^2)/n))/2 + (l - 1) * sly
        }
        list(lambda = i, loglik = ll)
}



Then in the command window type (not including the prompt arrow at the start of each line)

> a <- boxcox(linlog1)

> plot(a$lambda, a$loglik)

Take a look at the resulting parabola. At what lambda does the curve appear to hit its max? To test for sure type

> a$lambda[a$loglik==max(a$loglik)]
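If you want to sanity-check what the S-Plus function is doing, here is a rough numpy translation of the same profile log-likelihood grid search (my own naming; it refits the transformed response by least squares at each lambda, just as the lsfit/qr.resid calls do):

```python
import numpy as np

def boxcox_loglik(y, X, lambdas=np.arange(-2.0, 2.05, 0.1)):
    """Box-Cox profile log-likelihood over a grid of lambdas.

    y must be strictly positive; X is the model matrix (include an intercept column)."""
    n = len(y)
    sly = np.sum(np.log(y))
    ll = np.empty(len(lambdas))
    for j, lam in enumerate(lambdas):
        # lambda = 0 is the log transform; use a tolerance since the grid is floating point
        ty = np.log(y) if abs(lam) < 1e-8 else (y**lam - 1.0) / lam
        beta, *_ = np.linalg.lstsq(X, ty, rcond=None)
        res = ty - X @ beta
        ll[j] = -n / 2.0 * np.log(np.sum(res**2) / n) + (lam - 1.0) * sly
    return lambdas, ll

# Synthetic check: y is log-linear in x, so the optimum should land near lambda = 0
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 300)
X = np.column_stack([np.ones(300), x])
y = np.exp(1.0 + 2.0 * x + rng.normal(0.0, 0.3, 300))
lams, ll = boxcox_loglik(y, X)
best = lams[np.argmax(ll)]
```

Plotting lams against ll gives the same parabola you see in S-Plus, with its peak at the optimal lambda.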


Take a screen capture of the plot. [Q8] What is the optimum lambda value and which transformation does it suggest? If it requires a transformation (that is, not linear), then transform the left side of the equation using the appropriate transformation (we could transform both sides, but in this case only the response really needs it; do make sure, however, that d2down is still logged as in your previous regressions). The syntax for a log transform of the dependent variable would be log(P.coarse.veg), square root would be sqrt(P.coarse.veg), and reciprocal would be 1/(P.coarse.veg). If it suggests a linear model, then there is no need to do anything. Before hitting apply, under "save as" in the window, type "transform." Again choose to plot the residuals vs fit and Residuals Normal QQ plots AND also go to the results tab and choose to save the residuals in your BG.GF.Census2 data set. Hit Apply. If there is already a column named residuals in your BG.GF.Census2 table, you may want to rename the newly created column (right click>>properties) to something like residbox. Now: [Q9]

    1. Describe the differences between these plots and the previous plots, and whether you think this model appears to be “better”.
    2. Describe whether there were any differences in R-squared, or in the sign or significance of the coefficients.
  1. While this improves things, it does not totally solve our problems. First, our residuals are still somewhat non-normal. We can confirm this by doing another Shapiro-Wilk test for normality. At the command line type shapiro.test(transform$resid). What does ours say? Based on the p value, do the data seem to be more or less normally distributed than the last time we ran this test, in step 3? (no need to answer)
  2. Does this solve spatial autocorrelation? Let’s quickly check this in S-Plus. Go to file>>load library, click spatialstat, then OK. Go to Spatial>>spatial neighbors. Check the nearest neighbors radio button, choose BG.GF.Census2 under Data set, X.Centroid under variable 1, Y.Centroid under variable 2, and neighborBG1 under Save in, leaving everything else as is. Click OK. Then go to Spatial>>spatial correlations, choose BG.GF.Census2 as the data set, residbox (or whatever you called the residuals column you just created) as the variable, and neighborBG1 as the neighbor object. Keep everything else the same and click apply. Note these results only give you a Moran statistic and P value. [Q10] Present your Moran statistic results. What do these results suggest?
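The Moran statistic itself is simple to compute by hand. A minimal sketch with a binary neighbor matrix (toy data, not the S+SpatialStats routine):

```python
import numpy as np

def morans_i(x, W):
    """Moran's I = (n / sum(W)) * (z' W z) / (z' z), with z = x - mean(x)."""
    z = x - x.mean()
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

# 20 block groups in a chain; each neighbors the next (symmetric 0/1 weights)
n = 20
W = np.zeros((n, n))
i = np.arange(n - 1)
W[i, i + 1] = W[i + 1, i] = 1.0

trend = np.arange(n, dtype=float)                      # smooth spatial gradient
checker = np.where(np.arange(n) % 2 == 0, -1.0, 1.0)   # values alternate between neighbors
```

A smooth gradient like trend gives Moran’s I near +1 (strong positive autocorrelation, the pattern to worry about in regression residuals), while the perfectly alternating pattern gives I = -1.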
  3. Output the residuals from your transformed model to a new table by clicking the Results tab in your regression, checking Residuals, and choosing to save them in BG.GF.Census2. Then go to file>>export data>>to file and save as a dBASE file. Only output two columns, residuals and BKGKEY; do this under the Filter tab. Join the table to your BG_GF_Census layer in ArcMap, plot out the residuals, and take a screen capture. If you want, you can also compare the Moran’s I Z scores of the residuals for last week’s and this week’s models. If you do, you’ll note that this week’s model results in a slightly lower z score, but both are still highly autocorrelated.
  4. Submit and upload your document.