Lab 8

Cluster Analysis of Geographic Data

NR245

1. Open S-PLUS. Open the object browser using its toolbar button. Load the BG_GF_CENSUS2 table you last used in Lab 6. (If you’re not sure which one it is, go back to Data_2006\NR245\download.mdb and download BG_GF_Census2. This is the same as BG_GF_Census, but it has the stream density variable.)
2. Now we’ll try a simple cluster analysis. Click Statistics>>Cluster Analysis>>K-Means. You should now get an interface where you can choose the parameters for the model. We’ll start with a cluster analysis based on two variables: percent tree (P.coarse.veg) and population density (POP00.SQMI). Select them by control-clicking the variables you want. Set the number of clusters to 4 and keep Max Iterations at 10.

The iterations matter because k-means classifies observations into groups iteratively, based on n-dimensional distance: it calculates the centroid of each group, assigns each observation to the group with the nearest centroid, then recalculates the centroids. It must do this iteratively because it can’t know where the centroid of a group is until it knows who’s in the group, which is a bit of a paradox. So it calculates the centroids from the current group membership (which might start off being quite wrong), reassigns observations to groups based on the new centroids, and so on. Later on you can play with changing this parameter. The more iterations, the more refined the group membership estimates become, which can be important when you have many dimensions.

[Q1] From the model output, say how many observations there are in each cluster. Also paste the small table showing the mean value by cluster of percent tree cover and population density into your results. Give a very brief narrative description of each class (e.g. “low tree cover and medium population density”).
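If it helps to see the iteration spelled out, here is a minimal Python sketch of the assign-then-recalculate loop described in step 2. It is not what S-PLUS runs internally, only an illustration of the idea. The function name `kmeans`, the random starting centroids, and the six made-up block groups (percent tree cover, population density in thousands per square mile) are all assumptions for the example.

```python
import math
import random

def kmeans(points, k, max_iter=10, seed=0):
    """Illustrative k-means: alternate between assigning points and recomputing centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k randomly chosen observations
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each observation joins the group with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # memberships stopped changing, so the centroids will not move either
        labels = new_labels
        # Update step: each centroid becomes the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a group ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Hypothetical data: (percent tree cover, population density in thousands per sq. mile)
points = [(60.0, 1.0), (55.0, 1.5), (65.0, 0.8), (10.0, 9.0), (12.0, 10.0), (8.0, 11.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Even when both starting centroids happen to fall in the same true group, the memberships in this toy data sort themselves out within a few passes, which is the point of the Max Iterations setting.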
4. Now try adding more variables. Add MED.HH.INC and P.SFDH (percent single family detached homes). Run a bunch of clusterings with this set of variables using different cluster numbers between 3 and 6. When you find the best model, again check the “cluster membership” box and save the cluster output in your table (change the name of the heading to PAM2). Leave everything else as the default. [Q3] What number of classes maximizes the silhouette score? (Note that more than one cluster number may do this, in which case report one or all of them.) Present the silhouette score and screencapture the silhouette plot for that clustering. If you want, feel free to save more cluster membership columns for other models.
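For reference, the silhouette score used in step 4 can be sketched in a few lines of Python. This uses the standard definition: for each observation, a is the mean distance to the other members of its own cluster, b is the mean distance to the members of the nearest other cluster, and the width is (b - a) / max(a, b). The function name, the toy data, and the two hand-made labelings are assumptions for illustration. The sketch needs at least two clusters, and it is not necessarily how S-PLUS computes or reports the value.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette width over all observations (ranges from -1 to 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so divide by len(own) - 1)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Hypothetical data: two well-separated groups of three observations each
points = [(60.0, 1.0), (55.0, 1.5), (65.0, 0.8), (10.0, 9.0), (12.0, 10.0), (8.0, 11.0)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible grouping
poor_labels = [0, 0, 1, 1, 1, 0]   # mixes the two groups
good = silhouette_score(points, good_labels)
poor = silhouette_score(points, poor_labels)
print(good, poor)
```

A labeling that matches the natural grouping scores close to 1, while one that mixes the groups scores much lower. Comparing the score across cluster numbers, as step 4 asks, follows the same logic.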
5. Let’s plot this out in ArcMap. Using the approach you learned in previous labs, export BG.GF.Census to a file, keeping only BG.KEY and the two cluster membership columns (Filter tab) and saving as a dbf. Then, in ArcGIS, join it to your Gwynns Falls block group layer (BG_GF_Census or BG_GF_LC) and plot out both cluster columns using unique values symbology (remember to click “add all values”). Use a high-contrast color palette. Make a screencapture of each plot.

6. Now we’ll do clustering in SPSS using their snazzy “two step cluster analysis.” This method is better than K means for a number of reasons: 1) it can create clusters based on categorical and numeric data; 2) it can automatically select the number of clusters based on optimization if you want; 3) it is efficient with large data sets; 4) it uses multi-model inferential statistics (AIC or BIC) to select the best number of clusters. Start by downloading a new data set from the share drive, called BGforSPSS.sav, located in the Data_2006\NR245\SPSS folder (note that you’ll download it to your NR245 folder using Windows Explorer). Although it’s very similar to BG.GF.Census, it has some slight differences (in number of records) that SPSS is very sensitive to, so please use this file. Open PASW Statistics (SPSS) 18. On the “What would you like to do” screen, make sure the Open an Existing Data Source radio button is checked and click OK. Then browse to your BGforSPSS.sav file. On the resulting screen go to Analyze>>Classify>>TwoStep Cluster. Note that very good support on using clustering in SPSS can be found at http://www2.chass.ncsu.edu/garson/PA765/cluster.htm.
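To see how AIC and BIC can disagree about the best number of clusters, here is a small Python sketch using the textbook formulas AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of model parameters, n the number of cases, and ln L the log-likelihood. The candidate models, parameter counts, and log-likelihoods below are made up for illustration, and SPSS’s own parameter counting for two step clustering may differ.

```python
import math

def aic(log_lik, n_params):
    """Akaike's Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_cases):
    """Bayesian Information Criterion: lower is better."""
    return n_params * math.log(n_cases) - 2 * log_lik

n_cases = 500  # hypothetical number of block groups

# Hypothetical candidates: number of clusters -> (number of parameters, log-likelihood)
candidates = {
    2: (12, -1500.0),
    3: (18, -1480.0),
    4: (24, -1470.0),
    5: (30, -1466.0),
}

best_aic = min(candidates, key=lambda k: aic(candidates[k][1], candidates[k][0]))
best_bic = min(candidates, key=lambda k: bic(candidates[k][1], candidates[k][0], n_cases))
print("AIC picks", best_aic, "clusters; BIC picks", best_bic)
# With these made-up numbers, AIC selects 4 clusters and BIC selects 3.
```

Because ln(n) exceeds 2 once n is larger than about 7 cases, BIC charges more per extra parameter than AIC does, so it tends to settle on fewer clusters.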

7. Let’s start with a simple clustering: cluster based on percent coarse veg, percent fine veg, and income (MED_HH_INC) by block group. Shift-click those variables in the left window and click the right arrow to bring them into the “continuous variables” box. Check Log-likelihood as the distance measure, choose “determine automatically” as the number of clusters (15 max), and choose Akaike’s Information Criterion under the clustering criterion (this is less conservative than BIC and will generally result in more classes).

Click OK. Now let’s look at the output. Double-click the box in the output that says “Model summary.” It will open a window. [Q4] First, report the number of clusters, and report the percent of cases and number of cases in each cluster (hover the mouse over the pie slices to get the number). Then, in the right window, change “view” to predictor importance. [Q5] Report which is the most important predictor of cluster membership. Then, in the left window, change the “view” to “clusters” and click on the icon above for “copy visualization data”. Paste that visualization into your homework and [Q6] report the percent coarse veg for each cluster. Next, click on one of the P.coarseveg boxes and screencapture the histogram that results in the right window. Do the same for one of the MED.HH.Inc boxes. [Q7] Describe what these histograms are telling you. Finally, find the cluster with the highest median income and click on its heading in the top row of the table in the left window. In the right window you should see a “cluster comparison.” Paste that into your homework and [Q8] describe what it is telling you. For instance, what does the cluster comparison show you about the income of this cluster group?

8. Now we’ll add some more variables and see what happens. Do another two step cluster analysis. Do everything the same except change the variables. Get rid of p.fine.veg, but keep MED.HH.INC and P.coarse.veg. Also add P.SFDH, d2down, and Robb05. Then add “DESC.15” (that’s PRIZM 15 with names) to the “categorical variable” list. Hence, we’ll be clustering based on continuous and categorical variables. Now look at the output. [Q9] Report: a) the number of clusters, b) the most important predictor in terms of clustering, c) the average income of the highest income cluster, d) the number of cases in that cluster, and e) the average p.coarseveg value for that high income cluster. Then, for that same cluster, click on the DESC.15 box in the left window under that cluster's column and take a screencapture of the resulting graph in the right window. [Q10] Report which PRIZM group is most represented in this high income cluster and which is most represented for the entire sample of clusters (pay attention to the legend, which tells you the difference between light and dark bars).

9. Assemble your text and screencaptures and make a PDF to upload.