**Lab 5: More regression diagnostics.**

**Due Feb 22**

This week we’ll do another set of regressions. The first regression looks at the predictors of tree cover at the block group level.

- In S-Plus, import the download.mdb geodatabase from earlier. Let's run a simple univariate regression to see how population density (POP00.SQMI) predicts tree cover. First go to Options>>general settings and, under the computations tab, change Print Digits to 10. Then go to statistics>>regression>>linear and input: P.coarseveg~POP00.SQMI. The variable should be highly significant. **[Q1] Interpret the coefficient: given that population density is in people per square mile and P.coarseveg is a proportion (where 0.1 = 10%), what predicted effect would a 10,000 person/sq mile increase have on the percentage of tree cover?** Now add some more variables to make it multivariate. Input: P.coarseveg~POP00.SQMI+P.HS.+MED.HH.INC+P.SFDH+P.Protland+d2down+d2ramp. The variables are population density, percent high school graduates, median household income, percent single-family detached houses, percent protected land in the block group, distance to downtown, and distance to the nearest highway onramp. **[Q2] Cut and paste the results. What is the R-squared? Put a single asterisk next to each term that is significant at the 95% confidence level and two asterisks next to each term that is significant at the 99% confidence level (they should go to the right of the p values).**
- We know that water bodies greatly influence the distribution of trees, so let's see if such a variable has any effect. Open Arc Catalog and copy a hydrology layer from the share drive (NR245\lab_data.mdb\NHD_Lines_GF) to your geodatabase; remember to use copy>>paste in Arc Catalog. Make sure to do this step in C:\temp. Load up BG_GF_Census plus your hydro
shapefile in Arc Map. Go to Spatial analyst>>density>>kernel
density. Choose NHD_lines_GF as your input, none as population field,
search radius as 500 meters, area units as square kilometers, and 30
meters as the output cell size. Then click on "Environment settings"
>>"raster analysis" and set the mask to BG_GF_Census. Save
the output raster as hydrodens. Then use Zonal Statistics as Table
(Spatial analyst>>zonal>>zonal statistics as table) to
summarize average stream density by block group. Use BG_GF_Census as the
zone dataset, BKG_KEY_1 as the zone field, and your hydrodens as your
value raster. Save the output table as something like “waterzonal.” Then
join the resulting table to BG_GF_Census. Now, in the BG_GF_Census
attribute table, create a new field called “H2Odens” set to double and use
the field calculator to fill in the values so they are equal to
waterzonal.MEAN. To ensure that you have done this correctly, open the
attribute table and sort it ascending by Block Group. Then scroll over to
the right and look at the H2Odens field. If you take the sum of that column (right
click>>statistics), it should be around 183 or so.
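Under the hood, Zonal Statistics as Table is just a group-by average: every raster cell falls in a zone (here, a block group) and carries a value (here, stream density), and the MEAN column reports the average cell value per zone. A minimal sketch of that idea in Python, with made-up zone IDs and cell values rather than the actual hydrodens raster:

```python
from collections import defaultdict

# Hypothetical (zone_id, cell_value) pairs standing in for block-group IDs
# and kernel-density values sampled from a raster like hydrodens
cells = [
    ("BG01", 0.2), ("BG01", 0.4),
    ("BG02", 0.0), ("BG02", 0.6), ("BG02", 0.3),
]

# Accumulate per-zone sums and counts, then take the mean --
# this is what the MEAN column of the zonal table reports
totals = defaultdict(float)
counts = defaultdict(int)
for zone, value in cells:
    totals[zone] += value
    counts[zone] += 1

zonal_mean = {zone: totals[zone] / counts[zone] for zone in totals}
print(zonal_mean)
```

Joining waterzonal.MEAN back to the block groups is then just a lookup of each block group's key in a table like `zonal_mean`.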
**Now make a graduated color map of H2Odens by block group, with 5 classes and a green-to-blue color ramp, and take a screen capture.**
- Now let's redo the regression. In S-Plus, re-import the data, making sure to
give the table a new name, like BG.GF.Census2 (name it in the “To data
set” in the import interface). First, let’s do a correlation matrix to see
how correlated the different X variables are. Go to statistics>>data
summaries>>correlations and control-click to select all your X
variables. Then change "Method to Handle Missing Values" to Omit.
**[Q3] Report the correlation coefficient between population density and H2O density.** Then go to statistics>>regression>>linear. Do:

P.coarseveg~H2Odens+P.HS.+POP00.SQMI+MED.HH.INC+P.SFDH+P.Protland+d2down+d2ramp.
Before hitting apply, choose to do the residuals vs fit and Residuals Normal QQ
plots under the residuals tab (see chapter 6 of the SPlus Guide to Statistics
volume 1 for more information on normality plots; help>>online manuals).
Also, under the “model” tab, where it says “save as”, type lin1. Hit apply. **[Q4]**

**Cut and paste the table of coefficients, putting one, two, or no asterisks by each row, as above. What happened to the population density variable? If a change happened, explain why you think that might be.**

**Now take a screen capture of each of the residual plots. [Q5] Describe
the pattern you see in the residuals plot. What problem does it suggest?**

Run a Shapiro-Wilk test for
normality. At the command line type shapiro.test(lin1$resid). **[Q6] Report the W and p values. What does this tell you about the normality
of the residuals?**

- You'll note that some variables are now insignificant. Just to show how
transformations of an independent variable can affect other independent
variables, go ahead and log the distance-to-downtown variable. We do this
because distance to downtown can get quite large as you go far north in
the watershed, so presumably this scale problem could be biasing our
results. Logging the variable puts it on a much smaller scale and should
reduce the instability of the model. To do this, just edit the model
formula window, changing d2down to log(d2down). In "save as," change the
model name to linlog1. If you look at the plots, you'll note that this has
little effect on nonconstant variance or normality. But it does affect the
coefficients and R-squared.
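To see why logging d2down tames its scale, compare a near and a far block group before and after the transform (illustrative distances in meters, not actual watershed values):

```python
import math

near = 1_000.0   # hypothetical block group close to downtown
far = 50_000.0   # hypothetical block group at the far edge of the watershed

# Raw scale: the far observation is 50x the near one, so it can dominate
# the least-squares fit
raw_ratio = far / near
print(raw_ratio)

# Log scale: the same pair now differs by less than a factor of 2, so
# extreme distances no longer exert outsized leverage
log_ratio = math.log(far) / math.log(near)
print(round(log_ratio, 3))
```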
**[Q7] Is the d2down variable now significant, and if so, at what significance level? What is the R-squared?**
- Now let's try logging the response variable. Do that by simply editing the model formula box to change P.coarse.veg to log(P.coarse.veg), also changing "save as" to loglog1. Run the model again and look at how the residuals have changed. Describe the change and what it might signify. What happened to R-squared, though? This transformation helps in some ways but hurts in others. Let's try to figure out the best transformation. To do that, we'll use a method called the Box-Cox transformation. Here's some good background on the method: http://jas.fass.org/cgi/reprint/76/3/847.pdf . The method finds an optimal power transformation of a variable, where the variable y is transformed according to

y(λ) = (y^λ − 1)/λ when λ ≠ 0
y(λ) = log(y) when λ = 0

The parameter λ is estimated through maximum likelihood to find the optimal transformation of a variable, in this case the dependent variable. This should yield the transformation that best allows the model to meet the assumptions of linear regression, such as normality and constant variance. Once λ is estimated, the dependent variable can be transformed according to that equation, although there are various "special cases," λ = 1, λ = 0, λ = 0.5, and λ = −1, in which the model becomes one of the classic functional forms: 1 corresponds to the linear model, 0 to the natural log transformation, −1 to the reciprocal transformation, and 0.5 to the square root transformation. The Box-Cox transformation is generally used to transform the dependent (and sometimes an independent) variable so that it optimizes the model, but the resulting coefficients become hard to interpret. So if your lambda is near one of those special-case values, it is common to simply apply that "special case" transformation instead. In this sense, Box-Cox becomes a diagnostic that helps us determine the optimal transformation of the data. Open the command window in S-Plus (window>>command window) and then copy the following into the command window, making sure to include the last bracket:

    boxcox <- function(g, i = seq(-2, 2, by = 0.1))
    {
        # g: a fitted linear model; i: grid of lambda values to profile over
        y <- g$fit + g$res              # reconstruct the observed response
        x <- model.matrix(g)            # design matrix of the fitted model
        qrf <- lsfit(x, y, int = F)$qr  # QR decomposition for fast refits
        sly <- sum(log(y))              # Jacobian term of the likelihood
        n <- length(y)
        ll <- rep(0, length(i))
        for(j in 1:length(i)) {
            l <- i[j]
            if(l == 0)
                ty <- log(y)            # lambda = 0: log transform
            else ty <- (y^l - 1)/l      # otherwise the Box-Cox power transform
            res <- qr.resid(qrf, ty)    # residuals of the transformed response
            ll[j] <- ( - n * log(sum(res^2)/n))/2 + (l - 1) * sly
        }
        list(lambda = i, loglik = ll)   # profile log-likelihood at each lambda
    }
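If you want to check what the S-Plus function is doing, here is a rough pure-Python analogue for the simplest case, an intercept-only model, where the residuals are just deviations from the mean of the transformed response; the function above performs the same computation with the full regression design matrix. The data below are made up for illustration:

```python
import math

def boxcox_loglik(y, lambdas):
    """Profile log-likelihood of the Box-Cox parameter, intercept-only model."""
    n = len(y)
    sly = sum(math.log(v) for v in y)  # sum of log(y), the Jacobian term
    ll = []
    for lam in lambdas:
        # Box-Cox transform: log(y) at lambda = 0, else (y^lam - 1)/lam
        if lam == 0:
            ty = [math.log(v) for v in y]
        else:
            ty = [(v ** lam - 1) / lam for v in y]
        mean_ty = sum(ty) / n
        rss = sum((t - mean_ty) ** 2 for t in ty)  # residual sum of squares
        ll.append(-n / 2 * math.log(rss / n) + (lam - 1) * sly)
    return ll

# Strongly right-skewed toy data: the maximizing lambda should sit near 0,
# pointing at the log transformation
y = [1.0, 2.0, 3.0, 7.0, 20.0, 55.0, 148.0]
grid = [round(-2 + 0.1 * k, 1) for k in range(41)]  # -2.0 to 2.0 by 0.1
ll = boxcox_loglik(y, grid)
best = grid[ll.index(max(ll))]
print(best)
```

As in the S-Plus version, you scan a grid of lambdas and take the one with the highest log-likelihood; with skewed data like this the maximum lands close to λ = 0.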

Then in the commands window type (not including the first prompt arrow on each line)

> a <- boxcox(linlog1)

> plot(a$lambda, a$loglik)

Take a look at the resulting parabola. At what lambda does the curve appear to hit its max? To test for sure type

> a$lambda[a$loglik==max(a$loglik)]

**Take a screen capture of the plot.** **[Q8] What is the optimum lambda value and which transformation does it
suggest?** If it requires a transformation (that is, not linear), then
transform the left side of the equation using the appropriate transformation
(we could transform both sides, but in this case only the response really needs
it; however do make sure that d2down is still logged as in your previous
regressions). The syntax for log transform of the dependent variable would be
log(P.coarse.veg) and square root would be sqrt(P.coarse.veg) and reciprocal
would be 1/(P.coarse.veg). If it suggests a linear model, then no need to do
anything. Before hitting apply, under "save as" in the window, type
"transform." Again choose to plot the Residuals vs Fit and Residuals Normal QQ
plots AND also go to the results tab and choose to save the residuals
in your BG.GF.Census2 data set. Hit Apply.
If there is already a column named residuals in your BG.GF.Census2
table, you may want to rename the newly created column (right
click>>properties) to something like residbox. Now: **[Q9]**

**Describe the differences between these plots and the previous plots, and whether you think this model appears to be "better." Describe whether there were any differences in R-squared or in the sign or significance of the coefficients.**
- While this improves things, it does not totally solve our problems. First, our residuals are still somewhat non-normal. We can confirm this by doing another Shapiro-Wilk test for normality. At the command line type shapiro.test(transform$resid). What does ours say? Based on the p value, does the data seem to be more or less normally distributed than the last time we ran this test, in step 3? (no need to answer)
- Does this solve spatial autocorrelation? Let's quickly check this out in
S-Plus. Go to file>>load library, click spatialstat, then OK. Go to
Spatial>>spatial neighbors. Check the nearest neighbors radio button,
choose BG.GF.Census under Data set, X.Centroid under variable 1,
Y.Centroid under variable 2, and neighborBG1 under Save in, leaving
everything else as is. Click OK. Then go to Spatial>>spatial
correlations, choose BG.GF.Census as the data set, residbox (or whatever
you called the residuals column you just created) as the variable, and
neighborBG1 as the neighbor object. Keep everything else the same and
click apply. Note these results only give you a Moran statistic and p
value.
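For intuition, Moran's I compares each zone's residual with its neighbors' residuals: values near +1 mean neighboring block groups have similar residuals (positive spatial autocorrelation), values near −1 mean they alternate. A hand-rolled sketch on four hypothetical zones in a row, with binary contiguity weights:

```python
def morans_i(values, weights):
    """Moran's I: (n / S0) * sum_ij w_ij * d_i * d_j / sum_i d_i^2,
    where d_i are deviations from the mean and w_ij are spatial weights."""
    n = len(values)
    mean = sum(values) / n
    d = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # total weight
    num = sum(weights[i][j] * d[i] * d[j]
              for i in range(n) for j in range(n))
    den = sum(di ** 2 for di in d)
    return (n / s0) * (num / den)

# Four zones in a row; each zone's neighbors are the zone(s) adjacent
# to it (binary contiguity weights, zeros on the diagonal)
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

# Residuals rising smoothly across the row: positive autocorrelation
print(morans_i([1.0, 2.0, 3.0, 4.0], W))

# Residuals that alternate: strong negative autocorrelation
print(morans_i([1.0, 4.0, 1.0, 4.0], W))
```

S-Plus computes the same kind of statistic from the neighbor object you built, along with a p value for the null hypothesis of no spatial autocorrelation.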
**[Q10] Present your Moran statistic results. What do these results suggest?**
- Output the residuals from your transformed model to a new table by clicking the Results tab in your regression, checking Residuals, and choosing to save them in BG.GF.Census2. Then go to file>>export data>>to file and save as a dBASE file. Output only two columns, residuals and BKGKEY; do this under the Filter tab. Join the table to your BG_GF_Census layer in Arc Map, plot out the residuals, and **take a screen capture**.
- Submit and upload your document.