Lab 8

Cluster Analysis of Geographic Data

NR245

 

 

  1. Open S-PLUS. Open the object browser with the object browser (OB) button. Load the BG_GF_Census2 table you last used in Lab 6. (If you’re not sure which one it is, go back to Data_2006\NR245\download.mdb and download BG_GF_Census2; this is the same as BG_GF_Census, but it has the stream density variable.)
  2. Now we’ll try a simple cluster analysis. Click Statistics>>Cluster Analysis>>K means. You should now get an interface where you can choose the parameters for the model. We’ll start with a cluster analysis based on two variables: percent tree cover (P.coarse.veg) and population density (POP00.SQMI). Select both by control-clicking them. Set the number of clusters to 4 and keep Max Iterations at 10. The iterations are needed because k means classifies observations into groups iteratively, based on n-dimensional distance: it calculates the centroid of each group, assigns each observation to the group with the nearest centroid, then recalculates the centroids. It must do this iteratively because it can’t know where the centroid of a group is until it knows who’s in the group, which is a bit of a paradox. So it calculates the centroids from the current group membership (which might start off quite wrong), reassigns observations to groups based on the new centroids, and so on. Later on you can play with changing this parameter. The more iterations, the more refined the group membership estimates become, which can matter when you have many dimensions. [Q1] From the model output, report how many observations are in each cluster. Also paste the small table showing the mean percent tree cover and population density by cluster into your results. Give a very brief narrative description of each class (e.g. “low tree cover and medium population density”).
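If the iteration in step 2 is hard to picture, here is a minimal Python sketch of the assign-then-recalculate loop. It is not what S-PLUS runs internally: the starting centroids (simply the first k observations), the function name kmeans, and the six data points are all assumptions made up for illustration.

```python
import math

def kmeans(points, k, max_iter=10):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Assumption for this sketch: start from the first k observations.
    # Real software usually picks starting centroids differently.
    centroids = [tuple(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each observation joins the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # memberships stopped changing, so we are done
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids

# Made-up (percent tree cover, population density) pairs, not lab data.
data = [(5, 9000), (60, 500), (30, 4000), (8, 8500), (65, 400), (35, 4200)]
labels, centroids = kmeans(data, k=3)
print(labels)     # cluster index for each observation
print(centroids)  # mean of each variable by cluster
```

Because the two made-up variables are on very different scales, the distances here are driven almost entirely by population density. Whether and how to rescale variables is a separate decision that this sketch does not address.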
  3. Next, let’s try the “partitioning around medoids” method. This method uses “medoids” rather than centroids. It is more robust to outliers than k means because each cluster is represented by a medoid, an actual observation chosen to minimize the total dissimilarity to the other members, rather than by a mean that outliers can pull around. The method also works better with ordinal or ratio-scaled variables, although we’re not using those here. Most importantly, it allows for some diagnostics that can help you determine the strength of the structure of each clustering you try. In particular, it can help you assess whether you’re using the right number of clusters and the right variables to define the clusters. This time we won’t save cluster memberships until we have a better idea of what makes a good cluster analysis. Before starting, however, we have to do a little data management. Open your BG.GF.Census2 table, right-click in the header of the first blank column to the right, and click “insert column.” Under name put coarseveg, under fill expression put “P.coarseveg*100” (spelling may vary), and under column type put integer. Also, under H2ODENS, look for an NA value and if you have one, change it to 0 (otherwise, you won’t be able to append the cluster membership column in this step). We’re going to see if we can create a clustering that differentiates block groups based on different dimensions of tree cover: the amount of tree cover, whether it comes from residential areas or riparian/park areas, and whether it comes from areas with lots of single family homes. Note that as we add more variables, we expect to need more classes to maintain fit, because there are more dimensional combinations to account for. As we add variables, we expect the fit to go down, so we can’t go on fit alone; if we can add variables and have only minimal losses to fit, that’s a really good sign. Now go to Statistics>>Cluster Analysis in S-PLUS and choose partitioning around medoids.
Shift-click to select coarseveg and H2Odens. Accept the defaults, except choose 3 as the number of clusters. Under the Plots tab, check the cluster plot and the silhouette plot. Take a look at the silhouette score on the graph. Now try the same thing for 4 through 10 clusters and [Q2] report which number of clusters yielded the highest silhouette score and what that score was. Interpret what the silhouette plot and clusplot are telling you about this clustering, making sure to address what signs you are looking for in both plots to indicate strong or weak clustering. How does the average silhouette score relate to the silhouette plot? Finally, take a screencapture of the silhouette plot and clusplot for the clustering with the highest silhouette score. Now we’ll rerun this clustering, saving the cluster membership in your table. Go to the Results tab of the clustering interface and check “cluster membership”, saving it in BG_GF_Census2. Click Apply. To keep track of your saved results, open your table and rename the newly created cluster column at the far right to PAM1 (right-click on the heading>>Properties).
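For reference, the silhouette width of a single observation is usually defined as s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The sketch below computes it in plain Python for four made-up points under two different labelings. The function name, the data, and the use of Euclidean distance are assumptions for illustration, and the code assumes there are at least two clusters.

```python
import math

def silhouette_widths(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every observation."""
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, leaving p out.
        mean_dist = {}
        for c in clusters:
            others = [q for j, q in enumerate(points)
                      if labels[j] == c and j != i]
            if others:
                mean_dist[c] = sum(math.dist(p, q) for q in others) / len(others)
        own = labels[i]
        if own not in mean_dist:
            widths.append(0.0)  # common convention for a one-member cluster
            continue
        a = mean_dist[own]  # how close p is to its own cluster
        b = min(d for c, d in mean_dist.items() if c != own)  # nearest other
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Made-up points: two tight pairs that are far apart.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_widths(pts, [0, 0, 1, 1])  # labels follow the pairs
bad = silhouette_widths(pts, [0, 1, 0, 1])   # labels split each pair

avg_good = sum(good) / len(good)
avg_bad = sum(bad) / len(bad)
print(round(avg_good, 4), round(avg_bad, 4))
```

Widths near 1 mean an observation sits well inside its cluster, while negative widths mean it is on average closer to another cluster than to its own.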
  4. Now try adding more variables. Add MED.HH.INC and P.SFDH (percent single family detached homes). Run several clusterings with this set of variables, using different numbers of clusters between 3 and 6. Leave everything else as the default. When you find the best model, again check the “cluster membership” box and save the cluster output in your table (rename the new column heading to PAM2). [Q3] What number of classes maximizes the silhouette score? (Note that more than one cluster number may do this, in which case report one or all of them.) Present the silhouette score and screencapture the silhouette plot for that clustering. If you want, feel free to save more cluster membership columns for other models.
  5. Let’s plot this out in ArcMap. Using the approach you learned in previous labs, export the BG.GF.Census2 table to a file, keeping only BG.KEY and the two cluster membership columns (Filter tab) and saving as a dbf. Then, in ArcGIS, join it to your Gwynns Falls block group layer (BG_GF_Census or BG_GF_LC) and plot both cluster columns using unique values symbology (remember to click “add all values”). Use a high-contrast color palette. Make a screencapture of each plot.
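Conceptually, the join in step 5 matches each block group in the layer to the row of the exported table with the same key and appends the cluster columns to it. The Python sketch below shows that idea only; it is not ArcGIS code, and the function name, field names, and values are made up for illustration.

```python
def attribute_join(layer_rows, table_rows, key):
    """Append the table's columns to each layer row that shares the key."""
    lookup = {row[key]: row for row in table_rows}
    extra_fields = [f for f in table_rows[0] if f != key]
    joined = []
    for row in layer_rows:
        match = lookup.get(row[key])
        out = dict(row)
        for field in extra_fields:
            # Layer rows with no match in the table get an empty value.
            out[field] = match[field] if match else None
        joined.append(out)
    return joined

# Hypothetical block groups and exported cluster memberships.
layer = [
    {"BG_KEY": "A", "AREA": 1.2},
    {"BG_KEY": "B", "AREA": 0.8},
    {"BG_KEY": "C", "AREA": 2.0},
]
table = [
    {"BG_KEY": "A", "PAM1": 1, "PAM2": 3},
    {"BG_KEY": "B", "PAM1": 2, "PAM2": 3},
]
joined = attribute_join(layer, table, "BG_KEY")
for row in joined:
    print(row)
```

If some block groups come back with empty cluster values after your own join, a likely cause is a key that is missing from the exported table or stored in a different format.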

 

 

  6. Now we’ll do clustering in SPSS using their snazzy “two step cluster analysis.” This method is better than k means for a number of reasons: 1) it can create clusters based on categorical and numeric data; 2) it can automatically select the number of clusters based on optimization if you want; 3) it is efficient with large data sets; 4) it uses multi-model inferential statistics (AIC or BIC) to select the best number of clusters. Start by downloading a new data set from the share drive, called BGforSPSS.sav, located in the Data_2006\NR245\SPSS folder (note that you’ll download it to your NR245 folder using Windows Explorer). Although it’s very similar to BG.GF.Census, it has some slight differences (in number of records) that SPSS is very sensitive to, so please use this file. Open PASW Statistics (SPSS) 18. On the “What would you like to do?” screen, make sure the Open an Existing Data Source radio button is checked and click OK. Then browse to your BGforSPSS.sav file. On the resulting screen go to Analyze>>Classify>>TwoStep Cluster. Note that very good support on using clustering in SPSS can be found at http://www2.chass.ncsu.edu/garson/PA765/cluster.htm.
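To see how AIC and BIC can disagree about the best number of clusters, here is a small Python sketch using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of estimated parameters, n the number of cases, and ln L the log-likelihood. The log-likelihoods, parameter counts, and sample size below are invented for illustration; SPSS computes its own values for the two step procedure.

```python
import math

def aic(log_lik, n_params):
    """Akaike's Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: lower is better."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical candidate models: (clusters, log-likelihood, parameters).
candidates = [(2, -520.0, 6), (3, -500.0, 9), (4, -492.0, 12), (5, -490.0, 15)]
n_obs = 300  # hypothetical number of cases

best_aic = min(candidates, key=lambda m: aic(m[1], m[2]))[0]
best_bic = min(candidates, key=lambda m: bic(m[1], m[2], n_obs))[0]
print("AIC picks", best_aic, "clusters; BIC picks", best_bic)
```

With these made-up numbers AIC selects 4 clusters and BIC selects 3. BIC charges ln(n) per parameter instead of 2, so once n is larger than about 7 it penalizes extra clusters more heavily.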

 

  7. Let’s start with a simple clustering: we’ll cluster block groups based on percent coarse veg, percent fine veg, and income (MED_HH_INC). Shift-click those variables in the left window and click the right arrow to bring them into the “continuous variables” box. Check Log-likelihood as the distance measure, choose “determine automatically” for the number of clusters (15 max), and choose Akaike’s Information Criterion under the clustering criterion (this is less conservative than BIC and will generally result in more classes).

 

Click OK. Now let’s look at the output. Double-click the box in the output that says “Model summary.” It will open a window. [Q4] First, report the number of clusters and the percent of cases and number of cases in each cluster (hover the mouse over the pie slices to get the number). Then, in the right window, change “view” to predictor importance. [Q5] Report which is the most important predictor of cluster membership. Then, in the left window, change “view” to “clusters” and click on the icon above for “copy visualization data.” Paste that visualization into your homework and [Q6] report the percent coarse veg for each cluster. Next, click on one of the P.coarseveg boxes and screencapture the histogram that appears in the right window. Do the same for one of the MED.HH.Inc boxes. [Q7] Describe what these histograms are telling you. Finally, find the cluster with the highest median income and click on its heading in the top row of the table in the left window. In the right window you should see a “cluster comparison.” Paste that into your homework and [Q8] describe what it is telling you. For instance, what does the cluster comparison show you about the income of this cluster group?

 

 

  8. Now we’ll add some more variables and see what happens. Do another two step cluster analysis, keeping everything the same except the variables. Get rid of p.fine.veg, but keep MED.HH.INC and P.coarse.veg. Also add P.SFDH, d2down, and Robb05. Then add “DESC.15” (that’s PRIZM 15 with names) to the “categorical variable” list. Hence, we’ll be clustering based on both continuous and categorical variables. Now look at the output. [Q9] Report: a) the number of clusters, b) the most important predictor in terms of clustering, c) the average income of the highest income cluster, d) the number of cases in that cluster, and e) the average p.coarseveg value for that high income cluster. Then, for that same cluster, click on the DESC.15 box in the left window under that cluster's column and take a screencapture of the resulting graph in the right window. [Q10] Report which PRIZM group is most represented in this high income cluster and which is most represented across the entire sample (pay attention to the legend, which tells you the difference between light and dark bars).

 

  9. Assemble your text and screencaptures and make a PDF to upload.