Lab 9. Spatial clustering, PCA and spatial lag regression

Due Wed April 11.

1.       Download BG_BACI_BACO.shp, BG_BACI_BACO_Blank.shp, and BGBACIBACO.sav from the NR 245/lab9 directory online into C:\temp


2.       Spatial Cluster analysis. Open SAM 4.0, go to file>>open>>open shapefile, and load BG_BACI_BACO.shp. You should see an interface called "data settings." (If you don't, click on the fifth icon from the left, which looks like a stack of papers.) Go to the connectivity matrix tab. Click "create/edit". Click the connectivity criterion tab and choose "Gabriel criterion." Now click "Create," then "Close." Then click Structure>>Cluster and Spatial Cluster. In the interface, control-click to choose the following variables: P_BLK, P_BACH, MED_AGE, MED_HH_INC, P_OWNOCC, P_SFDH, P_PROTLAND, and P_HH_RUR. Choose 6 clusters, and make sure "spatially constrained" is checked, with "Gabriel criterion" selected below it. Keep everything else at the default. Click Calculate. This might take a while, so go have a cup of coffee, or start on the next step while this is running. Q1. Present the group size of each cluster. Don’t click on the graphical results; it will probably crash. Instead, hit the X on the cluster interface to close it. It will then ask whether you want to save the 1 unsaved variable. Click Yes and call the field Cluster. Then click on the data save as/exportation button (fourth icon from the left on the main menu). Uncheck everything but joinID and Cluster. Choose DBF as the output file format. Click the little disk icon, browse to your lab 9 directory, and name the file clusters. Then click Export. Now open ArcMap and join the BG_BACI_BACO_Blank layer to this table using JoinID. Finally, plot and screen-capture the Cluster field using unique values symbology and a high-contrast color scheme.
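To see what the Gabriel criterion is doing when it builds the connectivity matrix, here is a minimal Python sketch. It is not SAM's code; it assumes the textbook definition (two points are neighbors if no third point falls strictly inside the circle whose diameter is the segment between them), and the four centroids are made up for illustration.

```python
from itertools import combinations

def gabriel_edges(points):
    """Return index pairs (i, j) that are neighbors under the Gabriel criterion."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        (x1, y1), (x2, y2) = points[i], points[j]
        # Circle whose diameter is the segment from point i to point j
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        r2 = ((x1 - x2) ** 2 + (y1 - y2) ** 2) / 4
        # The pair is blocked if any third point lies strictly inside that circle
        blocked = any(
            (px - cx) ** 2 + (py - cy) ** 2 < r2
            for k, (px, py) in enumerate(points)
            if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges

# Hypothetical block-group centroids (not from the lab data)
centroids = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1), (1.0, 3.0)]
edges = gabriel_edges(centroids)
print(edges)
```

Here point 2 sits between the others, so it blocks the direct links among points 0, 1, and 3 and ends up connected to all three. The spatially constrained clustering only merges units that are linked in this way.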


3.       Principal Components/Factor Analysis. Open SPSS/PASW Statistics. When it prompts you to open a file at startup, choose BGBACIBACO.sav. Click Analyze>>Dimension reduction>>factor. Now let's choose variables that will be dimensionally reduced into an index that relates to socio-economic population characteristics. Shift-click the following, then click the right arrow to add them to the variables window: MED_HH_INC, P_OWNOCC, P_VAC, P_SFDH, MED_VAL_AL, P_BLK, PBACH, P_transit, PEMP, P_HH_RUR. Then click on "Descriptives," check "Coefficients" and "Significance levels," then click "continue." Next click on "extraction" and under "based on eigenvalue" choose greater than 0.6, then "continue." Then choose "scores," check "Save as variables," and click "continue." Then click "OK". Now interpret what you see. Q2. Look at the correlation matrix. Which two variables have the highest correlation (excluding the 1’s)? Look at the Communalities table. Which variable has the highest proportion of its variance explained by these principal components? Then look at the Total Variance Explained table. How many principal components have an eigenvalue above .6? Copy and paste that table and report what percentage of the cumulative variance is explained by the components that are above .6. Finally, look at the component matrix and describe one variable that has a strong negative influence on component 1 and one that has a strong positive influence. Now look at your data matrix. You should see several new columns. Those are your principal components. Now, let’s export this. Go to file>>save as and in “save as type” choose comma delimited (csv). Then hit save. Now, let’s quickly map those out. Open ArcMap, load BG_BACI_BACO_blank, and join this new table using joinID. Plot and screen-capture Factor 1 using graduated color symbology. Think about what this map is telling you in light of the factor loadings, although you don’t need to report it.
Now export this map from ArcMap (right-click in the table of contents>>data>>Export data) and save the output as a shapefile called PCA.shp in your lab 9 directory.
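The arithmetic behind the Total Variance Explained table can be sketched in a few lines of Python. This is only an illustration: the five eigenvalues below are invented (they are not the lab's results), and the sketch assumes the extraction is run on the correlation matrix, so the eigenvalues sum to the number of variables.

```python
def retained_components(eigenvalues, threshold=0.6):
    """Count components above the eigenvalue cutoff and their cumulative % of variance."""
    total = sum(eigenvalues)  # equals the number of variables for a correlation matrix
    kept = [ev for ev in eigenvalues if ev > threshold]
    cumulative_pct = 100 * sum(kept) / total
    return len(kept), cumulative_pct

# Made-up eigenvalues for a toy 5-variable example
toy_eigenvalues = [2.5, 1.2, 0.7, 0.4, 0.2]
n_kept, pct = retained_components(toy_eigenvalues)
print(n_kept, round(pct, 1))
```

In this toy case three components clear the 0.6 cutoff. Lowering the cutoff keeps more components and raises the cumulative percentage; raising it does the opposite.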


4.       Spatial Lag Regression. Now open the new version of OpenGeoDa, which you will download from the NR245/lab9 directory. Save it in C:\temp or on an external drive. Double-click it and it should run. Go to file>>open shapefile, browse to PCA, and open it. Try plotting a variable on the map: go to Map>>standard deviation, choose lncrime, and click OK (no need to screen-capture). Now go to Tools>>weights>>create. Choose ObjectID as the Weights File ID variable. Then click on Rook Contiguity and keep the order at 1. Click “Create” and name the matrix “PCArook.” Next, click Methods>>regress. Check the Moran’s I value box. Click OK. Now let’s run the regression. We’ll start with a linear regression. Choose lncrime (log of crime) as the dependent variable. Choose your four PCs plus TC_E_P (tree percentage), P_Agpr (percent agriculture), and P_protland as independent variables. Choose “classic” as the type of regression. Check “weights file” and choose PCArook. Then click “run”. Look at the output. When it’s done, click “Save to table” and choose the residuals to save to your table. Then click “results” and look at the output. First, take a look at some of the regression diagnostics. Q3. What’s the R-squared? Are all variables significant? Report the multicollinearity condition number. What does this tell us about our model based on what we learned in class? Why would we expect this result given our use of PCs? Next, answer some questions about the spatial diagnostics for the model. Report the Moran’s I test on the error (residuals): what is its significance, and what does this tell us? What about the Lagrange multiplier tests on lag and error? What does the “Diagnostics for Spatial Dependence for Weight Matrix” section tell you about whether the spatial lag or spatial error model is likely to be better, and why?
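If it helps to see what the Moran’s I statistic on the residuals measures, here is a small Python sketch. It is not OpenGeoDa's implementation: it uses a toy 3 x 3 grid in place of the block groups, builds first-order rook neighbors (cells sharing an edge), assumes row-standardized weights with every cell having at least one neighbor, and leaves out the significance test. The residual values are invented.

```python
def rook_neighbors(n_rows, n_cols):
    """Map each grid cell to the cells that share an edge with it (rook, order 1)."""
    nbrs = {}
    for r in range(n_rows):
        for c in range(n_cols):
            nbrs[r * n_cols + c] = [
                rr * n_cols + cc
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < n_rows and 0 <= cc < n_cols
            ]
    return nbrs

def morans_i(values, nbrs):
    """Moran's I with row-standardized weights (every unit needs a neighbor)."""
    mean = sum(values) / len(values)
    z = [v - mean for v in values]
    # Cross-product of each deviation with the average deviation of its neighbors
    num = sum(z[i] * sum(z[j] for j in js) / len(js) for i, js in nbrs.items())
    den = sum(zi * zi for zi in z)
    # With row-standardized weights the n/S0 scaling factor equals 1
    return num / den

grid = rook_neighbors(3, 3)
clustered = [1, 1, 1, 0, 0, 0, -1, -1, -1]       # similar residuals sit together
checkerboard = [1, -1, 1, -1, 1, -1, 1, -1, 1]   # neighbors always differ
i_clustered = morans_i(clustered, grid)
i_checker = morans_i(checkerboard, grid)
print(round(i_clustered, 3), round(i_checker, 3))
```

The clustered pattern gives a positive value and the checkerboard a negative one. Residuals with no spatial pattern would give a value near zero, which is what a well-specified model should produce.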


Next we’ll run a spatial error regression. Keep everything the same, including keeping the weights box checked. Then choose Spatial error as the model and hit run. It will calculate for a while. When it’s done, click “Save to table” and choose the residuals. Now run the spatial lag model. Again save the residuals to the table. This time, view the results. The new results window will give you the results for all the models you’ve run so far. Q4. Report the R-squared for the two spatial models. Which is higher? Furthermore, what do the Akaike Info Criterion and Log Likelihoods from the models tell you about which model is probably the best? Is this consistent with the message given by the diagnostics from the OLS? Next, report the coefficients on the autocorrelation parameters for both the error and lag models and whether they are significant (note that the error model autocorrelation parameter appears in the variable/coefficients table as Lambda, and the lag parameter appears in two places: as Lag coeff/Rho under the summary output and as W_lncrime at the top of the variables table). Report what happened to the tree variable TC_E_P across the three models. Why do you think that difference might arise?
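As a reminder of how the Akaike Info Criterion relates to the log likelihood, here is a short Python sketch using the standard formula AIC = 2k - 2 ln L, where k is the number of estimated parameters. The log likelihoods and parameter counts below are invented and the model labels are deliberately generic; use the values from your own results window.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical (log likelihood, parameter count) pairs, not lab results
models = {
    "model A": (-520.0, 8),
    "model B": (-495.0, 9),
    "model C": (-500.0, 8),
}
scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

A higher (less negative) log likelihood lowers the AIC, while each extra parameter adds a penalty of 2, so a model with one more parameter has to improve the log likelihood by more than 1 to come out ahead.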


Finally, plot a LISA for each of the three residuals you saved by clicking space>>univariate LISA. Do one for each residual, choosing to plot only the cluster map. Take a screen capture of each. In what way do these maps show that the two spatial models are a clear improvement on the original OLS model?
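The categories on a LISA cluster map come from comparing each unit's value with the average of its neighbors. The Python sketch below shows only that classification step, under simplifying assumptions: four made-up residuals arranged in a line, a hand-written neighbor list, and no permutation test, whereas the cluster map in the software colors only the units that pass a significance test.

```python
def lisa_quadrants(values, nbrs):
    """Label each unit by the sign of its deviation and of its neighbors' mean deviation."""
    mean = sum(values) / len(values)
    z = [v - mean for v in values]
    labels = []
    for i, zi in enumerate(z):
        lag = sum(z[j] for j in nbrs[i]) / len(nbrs[i])  # spatial lag of unit i
        own = "High" if zi >= 0 else "Low"
        around = "High" if lag >= 0 else "Low"
        labels.append(f"{own}-{around}")
    return labels

# Hypothetical residuals for four units in a row: 0 - 1 - 2 - 3
residuals = [2.0, 1.0, -1.0, -2.0]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = lisa_quadrants(residuals, neighbors)
print(labels)
```

High-High and Low-Low labels mark units whose residuals resemble their neighbors' residuals, which is the kind of leftover clustering to look for when comparing the three maps.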


5.       Package everything up and upload.