**Lab 8**

Cluster Analysis of Geographic Data

NR245

- Open up S Plus. Open the object browser with the button. Load the BG_GF_CENSUS2 table you used last in lab 6 (If you’re not sure which one it is, go back to Data_2006\NR245\download.mdb and download BG_GF_Census2 --this is the same as BG_GF_Census, but it has the stream density variable).
- Now we’ll try a simple cluster analysis. Click Statistics>>Cluster Analysis>>K means. You should now get an interface where you can choose the parameters for the model. We’ll start with a cluster analysis based on two variables: percent tree cover (P.coarse.veg) and population density (POP00.SQMI). Select both by control-clicking the variables you want. Now set the number of clusters to 4 and keep the number of Max iterations at 10. The iterations refer to how k means works: it classifies observations into groups iteratively, based on n-dimensional distance, by calculating the centroid of each group, assigning each observation to the group with the nearest centroid, and then recalculating the centroids. It must do this iteratively because it can’t know where the centroid of a group is until it knows who’s in the group, which is a bit of a paradox. So it calculates the centroids based on the current group membership (which might start off being quite wrong), reassigns observations to groups based on the new centroids, and so on. Later on you can play with changing this parameter. The more iterations, the more refined the group membership estimates become, which can become important when you have many dimensions.
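
The assign-then-recompute loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the routine S Plus runs: the function name `kmeans`, the random choice of starting centroids, and the toy data are all assumptions made for this example.

```python
import math
import random

def kmeans(points, k, max_iter=10, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen observations.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each observation joins the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # memberships stopped changing, so we are done early
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Made-up toy data: (percent tree cover, population density in arbitrary units).
toy_points = [(10, 1), (12, 2), (11, 1.5), (80, 9), (82, 10), (79, 9.5)]
labels, centers = kmeans(toy_points, k=2, max_iter=10)
print(labels)
print(centers)
```

Even if both starting centroids happen to fall in the same group, the memberships sort themselves out after a couple of passes, which is why the Max iterations setting matters more as the data get messier.
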
**[Q1] From the model output, say how many observations there are in each cluster. Also paste the small table showing the mean value by cluster of percent forest and population density into your results. Give a very brief narrative description of each class (e.g. “low tree cover and medium population density”).**

- Next, let’s try the “partitioning around medoids” method. This method uses “medoids” rather than centroids. A medoid is an actual observation chosen to represent its cluster. The method is more robust to outliers than k means because it minimizes a sum of dissimilarities to the medoids rather than a sum of squared distances to group means. Medoids are also better suited to ordinal or ratio-scaled variables, although we’re not doing that here. Most importantly, the method allows for some diagnostics that can help you determine the strength of the structure of each clustering you try. In particular, it can help you assess whether you’re using the right number of clusters and the right variables to define the clusters. This time we won’t save cluster memberships until we have a better idea of what makes a good cluster analysis. Before starting, however, we have to do a little data management. Open your BG.GF.Census2 table, right click in the header of the first blank column to the right, and click “insert column.” Under name put coarseveg, under fill expression put “P.coarseveg*100” (spelling may vary), and under column type put integer. Also, under H2ODENS, look for an NA value and if you have one, change it to 0 (otherwise, you won’t be able to append the cluster membership column in this step). We’re going to see if we can create a clustering that will differentiate block groups based on different dimensions of tree cover: the amount of tree cover, whether it’s from residential areas or riparian/park areas, and whether it’s from areas where there are lots of single family homes or not. Note that as we add more variables, we expect to need more classes to maintain fit, because there are more dimensional combinations to account for. For a given number of classes, we expect the fit to go down as we add variables, so we can’t just go based on fit alone. If we can add variables and have only minimal losses to fit, then that’s a really good sign. Now go to Statistics>>Cluster Analysis in S Plus and choose Partitioning Around Medoids. Shift-click to select coarseveg and H2Odens. Accept the defaults, except choose 3 as the number of clusters. Under the Plots tab check the cluster plot and silhouette plot. Take a look at the silhouette score on the graph. Now try the same thing for 4 through 10 clusters and
**[Q2] report which number of clusters yielded the highest silhouette score and what that score was. Interpret what the silhouette plot and clusplot are telling you about this clustering, making sure to address what signs you are looking for in both plots to indicate strong or weak clustering. How does the average silhouette score relate to the silhouette plot? Finally, do a screencapture of the silhouette plot and clusplot for the clustering with the highest silhouette score.** Now we’ll rerun this clustering, saving the cluster membership in your table. Go to the Results tab of the clustering interface and check “cluster membership”, saving it in BG_GF_Census2. Click Apply. Just to keep track of your saved results, open your table and rename the newly created cluster column at the far right to PAM1 (right click on the heading>>Properties).

- Now try adding more variables. Add MED.HH.INC and P.SFDH (percent single family detached homes). Run a bunch of clusterings with this set of variables using different cluster numbers between 3 and 6. When you find the best model, again check the “cluster membership” box and save the cluster output in your table (change the name of the heading to PAM2). Leave everything else as the default.
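
To make the “try several cluster numbers and compare silhouette scores” routine concrete, here is a rough Python sketch. It is not what S Plus does internally: `pam` below is a simplified swap search with a naive starting point (the first k observations) rather than the full partitioning around medoids procedure, and the function names and toy data are assumptions made for this example. The silhouette width of an observation is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster.

```python
import math

def total_cost(points, medoids):
    """Sum of distances from each observation to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    """Simplified medoid search: keep any swap that lowers the total cost."""
    medoids = list(range(k))  # assumption: start from the first k observations
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda c: math.dist(p, points[medoids[c]]))
              for p in points]
    return medoids, labels

def silhouette_widths(points, labels):
    """Silhouette width per observation (needs at least two clusters)."""
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        def mean_dist(c):
            others = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            return sum(math.dist(p, q) for q in others) / len(others) if others else None
        a = mean_dist(labels[i])
        if a is None:  # a cluster of one gets a width of 0 by convention
            widths.append(0.0)
            continue
        b = min(mean_dist(c) for c in clusters if c != labels[i])
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Made-up toy data with three obvious groups.
toy_points = [(5, 1), (6, 2), (7, 1), (6, 1),
              (40, 8), (42, 9), (41, 7), (43, 8),
              (85, 3), (86, 4), (84, 2), (87, 3)]
scores = {}
for k in range(2, 6):
    _, labels = pam(toy_points, k)
    widths = silhouette_widths(toy_points, labels)
    scores[k] = sum(widths) / len(widths)  # average silhouette width
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On this toy data the three-cluster solution has the highest average silhouette width, because splitting a tight group or merging two distant ones pushes individual widths toward zero.
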
**[Q3] What is the number of classes that maximizes the silhouette score (note that more than one cluster number may do this, in which case report one or all of them)? Present the silhouette score and screencapture the silhouette plot for that clustering.** If you want, feel free to save more cluster membership columns for other models.

- Let’s plot this out in ArcMap. Using the approach you learned in previous labs, export the BG.GF.Census2 table to a file, keeping only BG.KEY and the two cluster membership columns (Filter tab) and saving as a dbf. Then, in ArcGIS, join it to your Gwynns Falls block group layer (BG_GF_Census or BG_GF_LC) and plot out both cluster columns using unique values symbology (remember to click “add all values”). Use a high-contrast color palette. **Make a screencapture of each plot.**
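
If it helps to picture what the table join is doing, the sketch below mimics it in Python. ArcGIS does this through its own interface, so this is only a conceptual illustration: the function `join_by_key`, the key values, and the `AREA` field are hypothetical, while BG.KEY, PAM1, and PAM2 are the column names used in this lab.

```python
def join_by_key(layer_rows, table_rows, key="BG.KEY"):
    """Attach the table's fields to each layer row that shares the same key."""
    lookup = {row[key]: row for row in table_rows}
    extra_fields = [name for name in table_rows[0] if name != key]
    joined = []
    for row in layer_rows:
        match = lookup.get(row[key])
        merged = dict(row)
        for name in extra_fields:
            # Layer rows with no matching key keep the new fields, but empty.
            merged[name] = match[name] if match else None
        joined.append(merged)
    return joined

# Hypothetical block group records, for illustration only.
layer = [
    {"BG.KEY": "A1", "AREA": 1.2},
    {"BG.KEY": "A2", "AREA": 0.8},
    {"BG.KEY": "A3", "AREA": 2.5},
]
clusters = [
    {"BG.KEY": "A2", "PAM1": 3, "PAM2": 1},
    {"BG.KEY": "A1", "PAM1": 2, "PAM2": 4},
]
joined = join_by_key(layer, clusters)
for row in joined:
    print(row)
```

The row order of the exported table does not matter, only the key values do. A block group with no match ends up with empty cluster fields, which is worth checking for before you symbolize.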

- Now we’ll do clustering in SPSS using their snazzy “two step cluster analysis.” This method is better than K means for a number of reasons: 1) it can create clusters based on categorical and numeric data; 2) it can automatically select the number of clusters based on optimization if you want; 3) it is efficient with large data sets; 4) it uses multi-model inferential statistics (AIC or BIC) to select the best number of clusters. Start by downloading a new data set from the share drive, called BGforSPSS.sav, located in the Data_2006\NR245\SPSS folder (note that you’ll download it to your NR245 folder using Windows Explorer). Although it’s very similar to BG.GF.Census, it has some slight differences (in number of records) that SPSS is very sensitive to, so please use this file. Open PASW Statistics (SPSS) 18. On the “What would you like to do?” screen, make sure the Open an Existing Data Source radio button is checked and click OK. Then browse to your BGforSPSS.sav file. On the resulting screen go to Analyze>>Classify>>TwoStep Cluster. Note that very good support on using clustering in SPSS can be found at http://www2.chass.ncsu.edu/garson/PA765/cluster.htm.
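
For reference, AIC and BIC both score a model as -2 times its log-likelihood plus a penalty for the number of parameters: AIC adds 2 per parameter, while BIC adds ln(n) per parameter, where n is the number of cases. The Python sketch below uses these textbook definitions only; it is not SPSS’s internal procedure, and the log-likelihoods and parameter counts are invented to show how the two criteria can disagree.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike's Information Criterion (lower is better)."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_cases):
    """Bayesian Information Criterion (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_cases)

# Invented values for illustration: clusters -> (log-likelihood, parameters).
n_cases = 500
candidates = {
    2: (-1200.0, 8),
    3: (-1150.0, 12),
    4: (-1140.0, 16),
    5: (-1137.0, 20),
}
aic_scores = {k: aic(ll, p) for k, (ll, p) in candidates.items()}
bic_scores = {k: bic(ll, p, n_cases) for k, (ll, p) in candidates.items()}
best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print("AIC picks", best_aic, "clusters; BIC picks", best_bic)
```

With these made-up numbers AIC settles on four clusters and BIC on three, because BIC’s per-parameter penalty (about 6.2 when n is 500) outweighs the small gain in fit from the fourth cluster.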

- Let’s start with a simple clustering: cluster based on percent coarse veg, percent fine veg, and income (MED_HH_INC) by block group. Shift-click those variables in the left window and click the right arrow to bring them into the “continuous variables” box. Check Log-likelihood as the distance measure, choose “determine automatically” as the number of clusters (15 max), and choose Akaike’s Information Criterion (AIC) under the clustering criterion (this is less conservative than BIC and will generally result in more classes).

Click OK. Now let’s look at the output. Double click the box with the output that says “Model summary.” It will open a window. **[Q4] First, report the number of clusters and report the percent of cases and number of cases in each cluster (hover the mouse over the pie slices to get the number).** Then, in the right window, change “view” to predictor importance. **[Q5] Report which is the most important predictor of cluster membership.** Then, in the left window, change the “view” to “clusters” and click on the icon above for “copy visualization data”. **Then paste that visualization into your homework and [Q6] report the percent coarse veg for each cluster.** Next, click on one of the P.coarseveg boxes and **screencapture the histogram in the right window that results. Do the same for one of the MED.HH.Inc boxes. [Q7] Describe what these histograms are telling you.** Finally, find the cluster with the highest median income and click on its heading in the top row of the table in the left window. In the right window you should see a “cluster comparison.” **Paste that into your homework and [Q8] describe what it is telling you. For instance, what does the cluster comparison show you about the income of this cluster group?**

- Now we’ll add some more variables and see what happens. Do another two step cluster analysis. Do everything the same except change the variables. Get rid of p.fine.veg, but keep MED.HH.INC and P.coarse.veg. Also add P.SFDH, d2down, and Robb05. Then, add “DESC.15” (that’s PRIZM 15 with names) to the “categorical variable” list. Hence, we’ll be clustering based on continuous and categorical variables. Now look at the output. **[Q9] Report: a) the number of clusters, b) the most important predictor in terms of clustering, c) the average income of the highest income cluster, d) the number of cases in that cluster, and e) the average p.coarseveg value for that high income cluster.** Then, for that same cluster, click on the DESC.15 box in the left window under that cluster's column and **take a screencapture of the resulting graph in the right window. [Q10] Report which PRIZM group is most represented in this high income cluster and which is most represented for the entire sample of clusters (pay attention to the legend, which tells you the difference between light and dark bars).**

- **Assemble your text and screencaptures and make a PDF to upload.**
