Lab 5: More regression diagnostics.
NOTE: if you’re having trouble getting the right results because of
data issues, you can download the correct version, including the H2Odens
variable, at Data_2006\NR245\NR245_backup.mdb\BG_GF_LC_census2 and then skip ahead
to the regression
This week we’ll do another set of regressions.
The first regression looks at the predictors of tree cover at the block group
level.
- In Splus import your BG_GF_Census
layer that you used in lab 3. Let’s run a simple univariate
regression to see how population density (POP00.SQMI) predicts tree cover.
First go to Options>>general settings and under the computations tab
change Print Digits to 10. So, go to
statistics>>regression>>linear and input: P.coarse.veg~
POP00.SQMI. Is the variable significant? Report its coefficient,
t-statistic and the model’s R-squared. What does this result mean in terms
of the expected effect of population density on tree cover? Now add some
more variables to make it multivariate. Input P.coarse.veg~
POP00.SQMI+ P.HS.+MED.HH.INC+P.SFDH+P.Protland+d2down+d2ramp.
The variables mean population density, percent of high school graduates,
median household income, percent single family detached houses, percent
protected land in the block group, distance to downtown and distance to
highway onramp. Copy and paste the
results. What happened to R-Squared? Briefly describe which variables are
significant and whether they have a positive or negative influence on the
dependent variable.
- We
know that water bodies greatly influence the distribution of trees, so
let’s see if such a variable has any effect. Open Arc Catalog and copy
from the share drive a hydrology layer to your geodatabase
(Data_2006\Database\Hydrology\Hydrology.mdb\Surface_Water\NHDlinesGF;
remember to use copy>>paste in Arc Catalog). Load up BG_GF_Census plus your hydro layer in Arc Map. Open
the Spatial Analyst toolbar and hit Spatial analyst>>density. Choose
NHD_lines_GF as your input, none as population
field, kernel as density type, search radius as 500 meters, area units as
square kilometers, and 30 meters as output cell size. Leave the output
raster as temporary. Then, Using zonal statistics (Spatial analyst
toolbar>>zonal statistics) to summarize average stream density by
block group.
Your zonal stats should look
something like this.
. Now, in the BG_GF_Census attribute table, create a new field called
“H2Odens” set to double and use the field calculator to fill in the values so
they are equal to densitywat.MEAN. Then remove BG_GF_Census from Arc Map.
- Now
let’s redo the regression. In S Plus go to statistics>>regression>>linear.
Using BG.GF.census2 as your data set do:
P.coarse.veg~
H2Odens+P.HS.+POP00.SQMI+MED.HH.INC+P.SFDH+P.Protland+d2down+d2ramp.
Before hitting apply, choose to do the residuals vs
fit and sqrt abs residual vs
fit and Residuals Normal QQ plots under the residuals tab (see chapter 6 of the
SPlus Guide to Statistics volume 1 for more
information on normality plots; help>>online manuals). Also, under the
“model” tab, where it says “save as”, type lin1. Hit apply.
- cut
and paste the table of coefficients
- Report
the R-squared value and compare to the previous model (the same, but
without H2Odens)
- What
happened to the population density variable? If a change happened,
explain why think that might be.
- take a screencapture of each
of the residual plots.
- describe the pattern you see. What might these
patterns indicate? What does the QQ plot suggest about normality?
- Interpret
the coefficient on d2ramp. Assuming the units are meters what does a one
unit change in that coefficient mean in terms of a change in the response
variable? Remember that the dependent variable is a percentage.
- Run
a shapiro Wilke test for normality. At the command line type shapiro.test (lin1$resid). Report the W and P values.
A p value less than .05 means we can reject the null hypothesis of
normality at the 95% confidence level.
- You’ll
note that some variables are now insignificant. Just to show how
transformations of an independent variable can impact other independent
variables, go ahead and log the distance to downtown variable. We do this,
because distance to downtown can get quite large as you go far north in
the watershed, so presumably this scale problem could be biasing our
results. Logging the variable will put it into a much smaller scale should
reduce the instability of the model. To do this, just type in the model
formula window, changing d2down to log(d2down).
In “save as” change the model name to linlog1. If you look at the plots,
you’ll note that this has little effect on nonconstant
variance or normality. But it does affect the coefficients and R-squared.
Report the new R-squared and whether the d2down variable is now
significant.
- Now, let’s try logging the reponse variable. Do that by simply typing in the
model formula box to change P.coarse.veg to log(P.coarse.veg), also
changing “save as” to loglog1. Run
the model again and look at how the residuals have changed. Describe the
change and what it might signify. What happened to R-squared, though? This
transformation helps in some ways, but hurts in others. Let’s try to
figure out the best transformation. To do that, we’ll use a method called
a BoxCox Transformation. Here’s some good
background on that method: http://jas.fass.org/cgi/reprint/76/3/847.pdf
. This method is used to find an optimal power transformation of a
variable where variables in a linear function are transformed according to
The parameter
is estimated through maximum likelihood
to find the optimal transformation of a variable, in this case the dependent.
This should yield the transformation that allows the model to best meet many of
the assumptions of linear regression, such as normality and constant variance.
Once that parameter is estimated, the dependent variable can be transformed
according to that equation, although there are various “special cases” where
=1,
=0,
=.5, or
=-1. in which the model becomes one of the
classic functional forms: 1 corresponds to the linear model, 0 to natural log
transformation, -1 to reciprocal transformation, and .5 to square root
transformation. The Box Cox transformation is generally used to transformation
the dependent (and sometimes independent) variable so it optimizes the model,
but it becomes hard to interpret the coefficients. So, if your lambda is near
one of those values for the special cases, it’s often common to simply do that
“special case” transformation. In this sense, the Box Cox becomes like a
diagnostic to help us determine the optimal transformation of the data. Open
the command window in S Plus (window>>command window) and then copy the
following into the command window, making sure to include the last bracket
boxcox<-
function(g,
i=seq(-2,2,by=0.1))
{
y <- g$fit + g$res
x <- model.matrix(g)
qrf <- lsfit(x, y, int=F)$qr
sly <- sum(log(y))
n
<- length(y)
ll <- rep(0, length(i))
for(j in 1:length(i)) {
l
<- i[j]
if(l == 0)
ty <- log(y)
else ty <- (y^l
- 1)/l
cat(dim(qrf$qr$qr))
res <- qr.resid(qrf, ty)
ll[j] <- ( - n * log(sum(res^2)/n))/2
+ (l - 1) * sly
}
list(lambda = i, loglik = ll)
}
Then in the commands window type
(not including the first prompt arrow on each line)
> a <-boxcox(linlog1)
>plot (a$lambda,
a$loglik)
Take a look at the resulting
parabola. At what lambda does the curve appear to hit its max? To test for sure
type
> a$lambda[a$loglik==max(a$loglik)]
- Which
of the 4 critical values named above is it nearest? If it requires a
transformation (that is, not linear), then transform the left side of the
equation using the appropriate transformation (we could transform both
sides, but in this case only the response really needs it; however do make
sure that d2down is still logged as in your previous regressions). The
syntax for log transform of the dependent variable would be log(P.coarse.veg) and square
root would be sqrt(P.coarse.veg)
and reciprocal would be 1/(P.coarse.veg). If it
suggests a linear model, then no need to do anything. Before hitting
apply, under save as in the window, type transform. Again choose to plot
out residuals vs fit, abs sqr
rt residuals vs. fit
and Residual Normal QQ plots AND also go to the results tab and choose to
save the residuals in your BG.BF.Census layer.
Hit Apply. If there are other
columns already named residuals in your BG.GF.Census
table, you may want to rename that newly created column (right
click>>properties), to something like residbox.
Now:
- Screencapture each plot
- Describe
the differences in the plots with previous plots and whether you think
this model appears to be “better”
- Describe
whether there were any differences in R-squared, or sign or significance
of the coefficients.
- While
this improves things, it does not totally solve our problems, though.
First, our residuals are still somewhat non-normal. The way we can confirm
this is by doing another Shapiro-Wilke test for
normality. At the command line type shapiro.test
(transform$resid). What does ours say? Based on the p value, does the data
seem to be more or less normally distributed than the last time we ran this
test, in step 3?
- Does
this solve spatial autocorrelation?
Let’s quickly check this out in S-Plus. Go to file>>load
module and click spatial then OK. Go to Spatial>>spatial neighbors.
Check nearest neighbors radio button, BG.GF.Census
under Data set, X.Centroid under variable 1, Y.Centroid under variable 2, neighborBG1, under Save
in, and everything else as is. Click OK. Then go to Spatial>>spatial
correlations, choose BG.GF.Census as the data
set, residbox (or whatever you called the
residuals column you just created) as the variable) and neighborBG1 as the
neighbor object. Keep everything else the same and click
apply. Note these results only give you a Moran statistic and P
value. What do these results suggest?
- Using
last week’s instructions, output the residuals from your transformed model
to a new table. Remember to only output two columns—residuals and BKGKEY.
Join it to your BG_GF_Census layer in Arc Map
and then plot out the residuals and take a screencapture.
If you want, you can also compare the Moran’s I Z scores of residuals for
last week’s and this week’s models. If you do, you’ll note that this
week’s model results in a slightly lower z score, but both are still
highly autocorrelated.
- Submit
and upload your document.