Lab 8

Cluster Analysis of Geographic Data

NR245

 

 

  1. Open S-Plus. Open the object browser with the OB button. Load the BG_GF_CENSUS2 table you last used in Lab 6.
  2. Now we’ll try a simple cluster analysis. Click Statistics>>Cluster Analysis>>K-Means. You should now get an interface where you can choose the parameters for the model. We’ll start with a cluster analysis based on four land cover variables: percent tree (P.coarse.veg), percent grass (p.fine.veg), percent pavement (p.pavement) and stream density (H2Odens). Select them by control-clicking the variables you want. Set the number of clusters to 3 and keep Max iterations at 10. The iterations matter because k-means classifies observations into groups iteratively: it calculates the centroid of each group in n-dimensional space, assigns each observation to the group with the nearest centroid, and then recalculates the centroids. It must do this iteratively because it can’t know where the centroid of a group is until it knows who’s in the group, which is a bit of a paradox. So it calculates the centroids based on the current group membership (which might start off being quite wrong), reassigns observations to groups based on the new centroids, and so on. Later on you can play with changing this parameter. More iterations give more refined group membership estimates, which can become important when you have many dimensions. Now, from the model output (it should be in a report), say how many observations there are in each cluster. Also paste the small table showing mean value by cluster of percent forest and med density residential into your results. Does it look like each cluster really has very different combinations of values?
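If it helps to see the iteration spelled out, here is a minimal Python sketch of the assign-then-recalculate loop described above. It is not what S-Plus runs internally: the function name kmeans, the choice of the first k observations as starting centroids, and the six made-up (percent tree, percent pavement) pairs are all assumptions for illustration.

```python
import math

def kmeans(points, k, max_iter=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    # Assumption: start from the first k observations. Real software usually
    # chooses starting centroids more carefully.
    centroids = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for passes in range(1, max_iter + 1):
        # Assignment step: each observation joins the group with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # memberships stopped changing, so further passes add nothing
        labels = new_labels
        # Update step: each centroid moves to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty group keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids, passes

# Hypothetical block groups: (percent tree, percent pavement). Not real data.
points = [(80, 5), (75, 10), (20, 60), (15, 70), (82, 4), (18, 65)]
labels, centroids, passes = kmeans(points, k=2)
print(labels, centroids, passes)
```

On toy data like this the loop stops well before 10 passes because memberships stop changing. Raising Max iterations only matters when the groups have not yet settled.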

Steps 3 and 4 are optional. They are for those who can’t do the SPSS steps below, and they count for extra credit. Step 5 just points you to information on Partitioning Around Medoids and is only for your information.

  3. Now try some different combinations of k-means and store the results. This time, each time you run one, under the Results tab check the “cluster membership” button and choose to “save in” your BG-GFCensus2 layer. First try the same variables as in the last step but with 6 clusters. Then try using a mix of land cover and socio-economic variables instead of just land cover. Choose p.coarse.veg, p.pavement, MED.HH.INC, P.SFDH (percent single family detached homes) and MED.YR.ALL (median year of housing construction). Try this with 3 and 6 clusters. As you do these, keep notes on what you did for each clustering and the order you did them in. You'll need that for the next step.
  4. Now look at the table for your data in S-Plus. You should see a bunch of new columns at the end for cluster membership. Right click on each column heading and rename the columns based on the order you did them (the first ones you did are to the left, the last ones to the right). Call the field headings something that allows you to recognize them, for instance K3SocEco for a k-means 3-cluster analysis with both socio-economic and ecological variables. Then go to export data>>to file. Under the filter tab choose “preview column names” and choose only your cluster membership columns and the BKG_KEY field. Then output it to a new dbf table. Close S-Plus and open ArcMap. Load up your BGGFCensus layer and do a tabular join with the new dbf table. Then plot out three of the different clusterings by block group using categorical mapping and screencapture each, describing in the caption what variables were used to create each clustering.
  5. In the past we used the Partitioning Around Medoids (PAM) function in S-Plus for this exercise. However, as of release 8.0, the Silhouette plot in PAM (the most important output) has an error and doesn’t work. To learn about how PAM worked click here for the old part of the lab. Instead, we’ll do clustering in SPSS. First, you’ll export your BG_GF_Census2 file to SPSS format. Click file>>export data>>to file and save that file as file type SPSS data file - Windows OS.
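The export-and-join workflow in the optional steps above boils down to matching rows on a shared key. Here is a rough Python sketch of what a tabular join on BKG_KEY does conceptually. The helper name tabular_join, the key values, and the attribute values are hypothetical, and ArcMap’s own join has options that are not modeled here.

```python
def tabular_join(layer_rows, cluster_rows, key="BKG_KEY"):
    """Attach cluster-membership columns to layer rows that share the same key."""
    # Index the exported cluster table by its key for quick lookup.
    lookup = {row[key]: row for row in cluster_rows}
    joined = []
    for row in layer_rows:
        match = lookup.get(row[key], {})  # no match: the row keeps only its own fields
        extra = {name: value for name, value in match.items() if name != key}
        joined.append({**row, **extra})
    return joined

# Hypothetical block-group layer attributes (made-up keys and values).
layer = [
    {"BKG_KEY": "A001", "MED_HH_INC": 42000},
    {"BKG_KEY": "A002", "MED_HH_INC": 87000},
    {"BKG_KEY": "A003", "MED_HH_INC": 61000},
]
# Hypothetical exported table: only the key and the renamed membership columns.
clusters = [
    {"BKG_KEY": "A002", "K3SocEco": 1, "K6SocEco": 4},
    {"BKG_KEY": "A001", "K3SocEco": 2, "K6SocEco": 5},
]
joined = tabular_join(layer, clusters)
for row in joined:
    print(row)
```

Note that the order of rows in the exported table does not matter, only the key values do. That is why you keep the BKG_KEY field in the export.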

 

  6. Now we’ll do clustering in SPSS using their snazzy “two step cluster analysis.” This method is better than k-means for a number of reasons: 1) it can create clusters based on both categorical and numeric data; 2) it can automatically select the number of clusters based on optimization if you want; 3) it is efficient with large data sets; 4) it uses multi-model inferential statistics (AIC or BIC) to select the best number of clusters. To use it, open SPSS. Then go to File>>Open>>Data and browse to the file you just saved. On the resulting screen go to Analyze>>Classify>>TwoStep Cluster. Note that very good support on using clustering in SPSS can be found at http://www2.chass.ncsu.edu/garson/PA765/cluster.htm.

 

  7. Let’s start with a simple clustering: let’s cluster based on percent coarse veg, percent fine veg, and income by block group. Shift-click those variables in the left window and click the right arrow to bring them into the “continuous variables” box. Check Log-likelihood as the distance measure, choose “determine automatically” as the number of clusters (15 max) and choose Akaike’s Information Criterion under the clustering criterion (this is less conservative than BIC and will generally result in more classes). It should look something like:

 

Then click “output” and check the first three boxes under “statistics”. Hit continue and then click “plots”. Check “within cluster percentage chart,” “cluster pie chart” and “rank variable importance.” Under Rank variables check the “by cluster” radio button and under the importance measure check the chi-square or t-test radio button. Then check the confidence level box (it should say 95 under percentage). Click continue and then at the original interface click OK.
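To see why AIC tends to allow more classes than BIC, here is a small Python sketch using the standard textbook definitions (AIC = -2 lnL + 2p, BIC = -2 lnL + p ln n). The log-likelihoods, parameter counts and sample size are invented for illustration. Simply picking the lowest score is also a simplification, since SPSS’s auto-clustering weighs other ratios as well.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: each parameter costs a flat 2."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: each parameter costs ln(n_obs)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Hypothetical candidates: number of clusters -> (log-likelihood, parameter count).
candidates = {
    2: (-1250.0, 12),
    3: (-1215.0, 18),
    4: (-1203.0, 24),
    5: (-1200.0, 30),
}
n_obs = 500  # hypothetical number of block groups

aic_scores = {k: aic(ll, p) for k, (ll, p) in candidates.items()}
bic_scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}

# Simplification: take the lowest score under each criterion.
best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
print("AIC prefers", best_by_aic, "clusters; BIC prefers", best_by_bic)
```

Because ln(500) is a bit over 6, each extra parameter costs roughly three times as much under BIC as under AIC in this made-up example, so BIC settles on fewer clusters.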

 

  8. Now let’s look at the output. First, look at the auto-clustering table. This is how SPSS chooses the optimum number of clusters. Basically it’s looking for the number that results in the best combination of low (but not necessarily the lowest) AIC, high ratio of distance measures and high ratio of AIC changes. If you look at the next table (cluster distribution) it will show you that four clusters were chosen. Going back to the first table, report what the AIC, ratio of AIC changes and ratio of distance measures were for the four-cluster model. Next look at the “cluster distribution” table and report how many observations there were per cluster. Then check out the cluster profiles table. The first thing you might notice is that the standard deviation numbers appear “starred out.” This is because there are too many decimal places. So, we’ll need to edit the table a little. Double click on the box surrounding the table. Then control-click all the standard deviation cells so they are highlighted like this.

Then click format>>cell properties and under decimals choose 3. Now you should have a table with all the numbers showing. Take a screencapture (note that in SPSS you don’t need to use Snagit to screencapture. Just right click on an object and click “copy” and then paste into Word) and interpret what the mean and standard deviation are telling you in general terms for each cluster. Based on this, report and discuss the difference between cluster 1 and cluster 4. In what way are they distinctly different? Next, screencapture the simultaneous 95% confidence interval for means plot for P.fine.veg. Interpret what this plot is saying about what is the same and different and how you know from it. Finally, look at the clusterwise importance graphs. Screencapture the one for P.fine.veg and interpret it. What do the dotted lines signify and which values cross them? What is the significance of a negative vs. positive significant t statistic? Note also that you should have the cluster membership for each observation now stored in the table as the last column, called something like “TSC_XXXX.” You can change the name of that by clicking on the “variable view” tab at the lower left and scrolling down to the last variable. There you can change the name. Call it cluster1. Then go back to the “data view” tab.
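As background for reading the importance graphs, here is a hedged Python sketch of one way a per-cluster t statistic can be computed: compare the cluster’s mean with the overall mean, scaled by the cluster’s standard error. Treat the exact formula as an assumption about what the plot shows, not as SPSS’s documented method. The function name cluster_t and the fine-vegetation percentages are made up.

```python
import math
import statistics

def cluster_t(cluster_values, overall_mean):
    """t = (cluster mean - overall mean) / (cluster sd / sqrt(cluster size))."""
    n = len(cluster_values)
    mean = statistics.fmean(cluster_values)
    sd = statistics.stdev(cluster_values)  # sample standard deviation
    return (mean - overall_mean) / (sd / math.sqrt(n))

# Hypothetical percent fine vegetation for block groups in two clusters.
cluster_a = [10, 12, 14, 12]
cluster_b = [30, 34, 32, 32]
overall_mean = statistics.fmean(cluster_a + cluster_b)

t_a = cluster_t(cluster_a, overall_mean)
t_b = cluster_t(cluster_b, overall_mean)
print(round(t_a, 2), round(t_b, 2))
```

Under this assumed formula the sign of t only records whether the cluster’s mean sits below or above the overall mean. The size of t reflects how far apart they are relative to the spread within the cluster.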

 

  9. Now we’ll add some more variables and see what happens. Do another two step cluster analysis, keeping everything the same except the variables. Get rid of p.fine.veg, but keep MED.HH.INC and P.coarse.veg. Also add P.SFDH, d2down, and Robb05. Then add “DESC.15” (that’s PRIZM 15 with names) to the “categorical variable” list. Under “plots” keep everything the same, except change “rank variables” from “by cluster” to “by variable.” Hence, we’ll be clustering based on both continuous and categorical variables. Now look at the output. As you’ll see, the number of clusters is five. Report what the ratio of distance measures and ratio of AIC changes are for that number of clusters (see the auto-clustering table). Next report what percentage of observations is in each class (cluster distribution table). Next look at the cluster profiles table. Again change the number of decimal places for the standard deviation cells so you can see them, instead of stars. However, you’ll note that you can’t really see all the variables there because the table is so wide, so you’ll need to pivot it. Double click on the frame around the table. That should bring up a new window that says SPSS Pivot Table. Then click Pivot>>transpose rows and columns, then File>>close. (If that window doesn’t open when you double click, then double click on the table frame and from the menu above it click Pivot>>transpose rows and columns.) Now screencapture that cluster profile table. Use this table to interpret the difference between clusters 2 and 5. You’ll note that they have very similar incomes but are different in many other ways. Explain those differences. Next, you’ll see a Frequencies table that is specific to the PRIZM 15 classes. Pivot it, screencapture it and explain what it’s saying. Next scroll down to the “within cluster percentage” plot. Screencapture it and explain what each bar means.
Based on this graph alone (using the names in the legend for guidance), what kind of areas would you say are characterized by cluster 1? Scroll down now to the “Within cluster variation” graph for P.coarse.veg and screencapture it. Describe which cluster has the highest coarse veg levels and which has the lowest. You can basically ignore the “categorical variablewise importance” plots and scroll down to the “Continuous Variablewise Importance” plots. Look at the plot for Cluster 1 (note that the last time these were organized by variable and not cluster). Screencapture it and describe what this plot seems to say about the characteristics of cluster 1. Make sure to discuss the relevance of negative versus positive t statistics. Does your assessment of cluster 1 appear consistent with what the PRIZM results above suggested about cluster 1? Then go down next to cluster 2. Report which variables are statistically significant and which are not, and how you know that. Based on what you’ve seen in this output, come up with a name (kind of like the PRIZM names) for each of your clusters.
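For a sense of the arithmetic behind a within-cluster percentage display for a categorical variable, here is a short Python sketch that tallies, for each cluster, what share of its observations falls in each category. The function name, the cluster memberships and the class names are hypothetical (they are not real PRIZM names), and the SPSS chart may be laid out differently.

```python
from collections import Counter, defaultdict

def within_cluster_percentages(cluster_ids, categories):
    """For each cluster, the percent of its observations in each category."""
    counts = defaultdict(Counter)
    for cluster, category in zip(cluster_ids, categories):
        counts[cluster][category] += 1
    return {
        cluster: {
            category: 100.0 * n / sum(tally.values())
            for category, n in tally.items()
        }
        for cluster, tally in counts.items()
    }

# Hypothetical memberships and class names for six block groups.
cluster_ids = [1, 1, 1, 1, 2, 2]
categories = ["Urban core", "Urban core", "Urban core",
              "Suburban", "Suburban", "Rural"]

percentages = within_cluster_percentages(cluster_ids, categories)
print(percentages)
```

The percentages within any one cluster add up to 100, so a cluster dominated by a single category stands out right away.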

 

  10. Now export the table to your geodatabase. If you have not already done so, rename the membership variable from the second analysis to cluster2, just as you named the first one cluster1. From the table interface (not the results interface) click File>>save as. Choose the output type as dBASE IV (dbf). Then click on the variables button. Click drop all, then check BKG.KEY, cluster1 and cluster2. Click continue and then save. Open ArcMap and load up your latest BG_GF_Census layer. Then load the dbf you just saved and do a tabular join using the BKG_KEY field. First plot out and screencapture (using categorical mapping) the cluster1 variable. Then do the same for the cluster2 variable, only this time, in the symbology window, type the name you gave each class under the “label” field. Then screencapture the map with the legend.
  11. Assemble your text and screencaptures and make a PDF to upload.